Visual tracking is one of the most active research fields in vision-based technologies and has made substantial progress in recent decades. In 1981, B. D. Lucas and T. Kanade first adopted holistic templates for tracking. Subsequently, subspace-based tracking methods were widely used to better describe appearance changes. Many visual features, such as Haar-like features, histograms of oriented gradients (HOG), and covariance region descriptors, have since been developed for visual tracking. More recently, context information has been regarded as helpful when objects are partly or fully occluded.
Visual tracking has a variety of practical applications, such as human-computer interaction, motion analysis, activity recognition, surveillance, and medical imaging. The technology has also been applied in the construction industry. For example, visual tracking can be used to manage construction resources and to help managers quantify wasted resources in order to address inefficiency. Tracking moving objects on construction sites can also help prevent collisions and fall accidents. In particular, vision-based tracking has been widely used in earthmoving work to evaluate the productivity of equipment such as excavators, loaders, dozers, and backhoes.
Earthmoving is an important factor affecting the quality and cost of a project. Hydraulic excavators, the main equipment used in earthmoving, come in different sizes and can be used for digging foundations, drilling piles, and handling materials. Tracking excavators is therefore necessary for estimating working productivity. Although visual tracking algorithms have achieved promising performance on non-articulated equipment such as dozers, loaders, and trucks, no mature algorithm exists for tracking hydraulic excavators, because excavator operation is complex and the range of motion is too wide to predict. Some researchers have worked on tracking excavators. Chae and Yoshida applied RFID (Radio Frequency Identification) technology to identify hydraulic excavators in order to prevent collision accidents, and Azar et al. tracked excavators by painting markers on the excavator arms. However, both techniques (markers and RFID sensors) are time-consuming and costly.
To address these issues, this study introduces part-based 2D tracking methods for hydraulic excavators. First, three tracking algorithms, the SCM tracker, the KCF tracker, and the STC tracker, were selected because of their strong performance in benchmark studies from the computer vision community. These trackers were tested on multiple videos captured on real construction sites. The KCF tracker was found to be the most accurate and the STC tracker the most robust. These two trackers were used to create two multiple-object tracking methods (M-KCF and M-STC) for part-based tracking of hydraulic excavators. To seek further improvement, a third multiple-object method (M-K-S), which combines the KCF and STC trackers, was introduced. The M-KCF, M-STC, and M-K-S trackers were then compared and discussed. The results show that the part-based methods significantly improve tracking performance for excavators.
2. Related Work
This section first introduces recent research on 2D visual tracking methods. It then reviews state-of-the-art research on visual tracking in construction. Finally, it describes widely accepted evaluation metrics used in benchmarks to assess tracker performance.
2.1. 2D Visual Tracking Methods
In 1981, B. D. Lucas and T. Kanade first adopted holistic templates for tracking. In the search for better templates, many visual features, such as histograms of oriented gradients (HOG), Haar-like features, and covariance region descriptors, have been used for tracking. Subspace-based tracking methods have also been widely employed to describe appearance changes. Meanwhile, the sparse-representation-based algorithms proposed by Mei and Ling have been further improved. More recently, deep learning and other machine learning methods have been widely adopted and have shown promising performance when tracking occluded objects.
Most short-term, single-object, model-free trackers can be described within a common framework that breaks a tracker into five components: motion model, feature extractor, observation model, model updater, and ensemble post-processor. A tracking system is initialized with the position of the target's bounding box. The motion model then generates a set of candidate regions for prediction, and the feature extractor converts these candidate regions into features. The observation model estimates the likelihood that each candidate region is the target. Finally, the model updater updates the observation model, and the tracking result is output. A tracking system may include more than one tracker, in which case the ensemble post-processor combines the predictions of the individual trackers and provides the best estimate.
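The component breakdown above can be illustrated with a deliberately simple template tracker. The following is a minimal sketch, not any of the trackers evaluated in this paper: the function names, the raw-pixel features, the sum-of-squared-differences score, and the update rate are all assumptions chosen for illustration, and the ensemble post-processor is omitted because only one tracker is used.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), top-left origin
Image = List[List[float]]        # grayscale image as rows of pixel values


def motion_model(prev: Box, radius: int, img_w: int, img_h: int) -> List[Box]:
    """Motion model: shift the previous box within a square search radius."""
    x, y, w, h = prev
    candidates = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            cx, cy = x + dx, y + dy
            # Keep only candidates that lie fully inside the image.
            if cx >= 0 and cy >= 0 and cx + w <= img_w and cy + h <= img_h:
                candidates.append((cx, cy, w, h))
    return candidates


def extract_features(img: Image, box: Box) -> List[float]:
    """Feature extractor: raw pixel intensities of the patch, flattened."""
    x, y, w, h = box
    return [img[r][c] for r in range(y, y + h) for c in range(x, x + w)]


def observation_score(template: List[float], feat: List[float]) -> float:
    """Observation model: negative sum of squared differences (higher is better)."""
    return -sum((a - b) ** 2 for a, b in zip(template, feat))


def update_model(template: List[float], feat: List[float], rate: float = 0.1) -> List[float]:
    """Model updater: blend the template toward the newly observed features."""
    return [(1 - rate) * a + rate * b for a, b in zip(template, feat)]


def track(frames: List[Image], init_box: Box, radius: int = 3) -> List[Box]:
    """Run the four components frame by frame, starting from a given box."""
    template = extract_features(frames[0], init_box)
    boxes = [init_box]
    for img in frames[1:]:
        cands = motion_model(boxes[-1], radius, len(img[0]), len(img))
        feats = [extract_features(img, c) for c in cands]
        scores = [observation_score(template, f) for f in feats]
        best = max(range(len(cands)), key=scores.__getitem__)
        boxes.append(cands[best])
        template = update_model(template, feats[best])
    return boxes
```

Real trackers replace each function with something far stronger (for example, HOG features or a correlation filter as the observation model), but the control flow stays the same.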
2.2. Visual Tracking in Construction
Visual tracking technology has recently been applied in the construction industry to facilitate construction automation. For example, it has been used to assess pothole distress in pavements, identify cumulative trauma disorders in construction, recognize dirt-loading cycles in excavation, and manage the construction workforce in real time. Another essential application of tracking in construction is safety monitoring. The fatality rate on construction sites is high relative to the size of the workforce and compared with other industries. Visual tracking technologies help project managers improve the safety of workers operating at height. It is also feasible to locate workers and equipment in order to protect workers from potential collisions.
As important construction equipment, hydraulic excavators have attracted considerable interest in visual tracking. Some researchers have used RFID to track excavators. An RFID system consists of a reader and a tag. The battery-powered tag periodically transmits its unique ID, making the object identifiable, and the reader receives this ID. An excavator can therefore be tracked by attaching a tag to it. Marker-based methods have also been proposed to detect excavators in harsh construction environments. This technique requires painting different markers on the excavator arms; algorithms can then detect the arms and estimate their poses precisely by detecting the marker boundaries. Several effective libraries have been developed for marker-based tracking, such as ARToolKit and ARTag. However, both installing RFID tags and painting markers are costly, and many construction sites cannot adopt these technologies because of the inconvenience. It is therefore important to develop a visual tracking method that tracks excavators with high performance in real time.
2.3. Evaluation Criteria
How to evaluate tracker performance fairly remains an open problem in visual tracking. A sound evaluation system helps researchers understand a tracker's strengths and weaknesses. The popular evaluation metrics adopted by many benchmarks are introduced as follows:
・ The region overlap score measures the overlap between the region predicted by the tracker and the ground truth region, relative to the total area covered by both. It is defined as $S = |r_t \cap r_g| / |r_t \cup r_g|$, where $r_t$ is the tracked region, $r_g$ is the ground truth region, and $\cap$ and $\cup$ denote the intersection and union of the two areas, respectively.
・ The center error measures the Euclidean distance between the center of the manually annotated ground truth and the center of the tracked result. The average center location error over all frames is usually adopted to evaluate tracker performance.
・ The tracking length reflects the robustness of a tracker. It is calculated as the number of frames from the first frame to the frame in which the first failure occurs.
・ The failure rate is calculated as the probability of tracking failure per frame over the whole sequence. This metric requires manual re-initialization whenever a tracker loses its target.
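The four metrics listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: boxes are axis-aligned `(x, y, width, height)` tuples, a frame counts as a failure when its overlap score is at or below a hypothetical `fail_threshold` (zero by default), and the failure rate is computed from an already recorded run rather than by performing the re-initialization itself.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def overlap_score(a: Box, b: Box) -> float:
    """Region overlap: intersection area divided by union area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between the centers of two boxes, in pixels."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))


def tracking_length(pred: Sequence[Box], truth: Sequence[Box],
                    fail_threshold: float = 0.0) -> int:
    """Number of frames tracked before the first failure (assumed criterion)."""
    for i, (p, g) in enumerate(zip(pred, truth)):
        if overlap_score(p, g) <= fail_threshold:
            return i
    return len(pred)


def failure_rate(pred: Sequence[Box], truth: Sequence[Box],
                 fail_threshold: float = 0.0) -> float:
    """Fraction of frames counted as failures over the whole sequence."""
    failures = sum(1 for p, g in zip(pred, truth)
                   if overlap_score(p, g) <= fail_threshold)
    return failures / len(pred) if pred else 0.0
```

Averaging `overlap_score` and `center_error` over all frames of a sequence gives the average overlap score and average center location error used later in this paper.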
A single evaluation metric can hardly reflect robustness and accuracy at the same time, so these metrics are usually combined. Nawaz and Cavallaro proposed the Combined Tracking Performance Score (CoTPS) to obtain a comprehensive evaluation. In CoTPS, the accuracy score is calculated from the number of successfully tracked frames, while the failure information is calculated on the basis of the tracking length. Kristan et al. plotted accuracy and robustness in a single graph in order to decide which tracker performs better in both respects. Accuracy is reflected by the overlap score, while robustness is measured by the number of times the tracker loses the object during tracking.
3.1. Tracker Selection
In this paper, three trackers from the computer vision literature (KCF, SCM, and STC) were selected for the experiments, guided by existing benchmark studies. The KCF and SCM trackers were selected because they have shown promising performance in existing visual tracking benchmarks. In Wu's benchmark work, the Robust Object Tracking via Sparsity-based Collaborative Model (SCM) tracker was ranked first under occlusion, illumination variation, and background clutter, and second under scale variation. In the comparison by Kristan et al., trackers were evaluated using the accuracy-robustness graph. In this benchmark, the Kernelized Correlation Filter (KCF) tracker showed the best accuracy and the second-best overall performance. In addition, a very fast algorithm that employs spatio-temporal context information (STC) was used. The STC tracker builds a spatial context model between the object and its nearby background in a scene. This model is then used to update a spatio-temporal context model for the next frame, and the best result is predicted by maximizing the confidence map.
To assess the strengths and weaknesses of these single-object tracking algorithms in construction scenarios, the trackers were tested on construction sequences that include excavators, backhoes, trucks, and workers. The trackers were evaluated in terms of accuracy and robustness. For accuracy, the average overlap score and the center location error were used, because these two metrics are considered the easiest to compute and interpret and they describe the entire sequence. For robustness, the failure rate was used because of its minimal annotation requirement. The failure rate also describes a tracker's overall robustness better than the tracking length does. Part of the comparison results is shown in Table 1. According to the comparison, the KCF tracker is the most accurate, with the best overlap score and the lowest center error, while the STC tracker is the most robust, with the lowest failure rate.
3.2. Part-Based Tracking
The comparison shows that single-object tracking algorithms do not perform reliably when tracking excavators, especially during dirt-loading activities. This is because the excavator bucket rotates and moves quickly during operation. In general, an excavator has four main components to track: boom, dipper, bucket, and “house” (driving cab). An excavator model illustrating each component is shown in Figure 1. Single-object tracking algorithms usually lock onto the house of the excavator because this component has the largest area and moves slowly. Because the bucket moves fast, the ground truth tracking box changes quickly and is hard to predict. Therefore, two initial tracking boxes are adopted in this study, as shown in Figure 2. The first part is the “house” and grab rails, and the second part is the bucket and dipper. We find that together the two tracking boxes consistently recover the tracking box of the whole excavator.
Table 1. Partial comparison of trackers on construction sites.
Figure 1. Model of the excavator structure (CAT@5100B).
Figure 2. Example of initial positions of tracking boxes.
Based on the STC algorithm of Zhang et al., one more rectangle was added to represent the second target at initialization. Two sets of confidence maps and context prior models are therefore produced at the same time, so the algorithm learns two spatial context models, one per target. The maximum of each confidence map gives the location of the corresponding target. This two-object algorithm is called M-STC. Following the same concept, the M-KCF tracker was created from the KCF tracker. At the start of the KCF algorithm, two initial targets are defined in the first frame. For every frame, dense features are then extracted from the image to train the Gaussian kernel model, and each target's location in the next frame is automatically stored and visualized. The STC tracker was assessed as the more robust tracker in the comparison. It may therefore be preferable to use STC for the bucket-and-dipper part, which is hard to track because it moves quickly. Accordingly, the STC and KCF trackers were combined to track the two targets respectively; this combination is named M-K-S. For all three multiple-object trackers, each algorithm computes the coordinates of the two target tracking boxes. Based on these coordinates, additional code plots a larger rectangle that contains both targets. The performance of the multiple-object trackers can then be compared against manually annotated ground truth.
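The part-based scheme of this section, in which one tracker follows each part and a larger rectangle is drawn around both results, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: `PartBasedTracker`, `enclosing_box`, and the convention that each part tracker is a callable mapping a frame to an `(x, y, width, height)` box are assumptions introduced here, standing in for real KCF or STC trackers.

```python
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)
PartTracker = Callable[[Any], Box]       # hypothetical interface: frame -> box


def enclosing_box(boxes: Sequence[Box]) -> Box:
    """Smallest axis-aligned rectangle that contains every part box."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[0] + b[2] for b in boxes)
    y1 = max(b[1] + b[3] for b in boxes)
    return (x0, y0, x1 - x0, y1 - y0)


class PartBasedTracker:
    """Runs one independent tracker per part and merges the part boxes."""

    def __init__(self, part_trackers: Sequence[PartTracker]) -> None:
        self.part_trackers = list(part_trackers)

    def update(self, frame: Any) -> Tuple[List[Box], Box]:
        # Each part is tracked on its own; the parts share no state.
        parts = [tracker(frame) for tracker in self.part_trackers]
        return parts, enclosing_box(parts)
```

Under this interface, M-KCF and M-STC would correspond to passing two trackers of the same type, and M-K-S to passing one KCF-style tracker for the “house” and one STC-style tracker for the bucket and dipper. The merged box is what gets compared with the whole-excavator ground truth.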
4. Experimental Results
The experiments were run in Matlab R2014b on 64-bit Microsoft Windows 7 Enterprise. The hardware configuration included an Intel® i7-4720HQ CPU (central processing unit) @ 2.60 GHz, 16 GB of memory, and an NVIDIA® GeForce® GTX 965M GPU (graphics processing unit) with 2 GB of GDDR5 memory. Three sequences were used in this study; all show dirt loading at night, which means that tracking conditions such as motion blur, low resolution, and background clutter are challenging. The average overlap score and the center location error were used to evaluate the three single-object algorithms and the three multiple-object algorithms. The overlap score reflects the accuracy of a tracker, and the center error measures how closely the tracking boxes follow the ground truth boxes. Example tracking results are shown in Figure 3, and the tracking performance is summarized in Table 2.
Figure 3. Examples of tracking results in Frame 300 for the tested trackers. (a) KCF tracking result; (b) SCM tracking result; (c) STC tracking result; (d) M-STC tracking result; (e) M-KCF tracking result; (f) M-K-S tracking result.
Table 2. Tracking performance of the tested trackers.
The part-based algorithms are clearly more accurate than the three single-object algorithms. The mean of the average overlap scores of M-STC, M-KCF, and M-K-S is 0.86, while the mean for the remaining trackers is 0.57. The part-based algorithms also perform markedly better in center error, with 15.47 pixels on average, while the single-object algorithms have an average center error of 89.99 pixels. This indicates that dividing the excavator into two parts and tracking them separately at the same time substantially improves the tracking results. In this study, we created the M-K-S tracker, which combines STC and KCF. This tracker uses STC to track the fast-moving bucket part and the more accurate KCF to track the “house” part. The M-K-S tracker achieved the best performance among the six trackers, with an average overlap score of 0.93.
In this study, part-based 2D tracking methods were introduced to track hydraulic excavators. Three tracking algorithms (SCM, KCF, and STC) were selected based on their strong performance in benchmark studies. These trackers were tested and compared on construction videos. The KCF and STC trackers were then used to create part-based trackers for hydraulic excavators. Finally, all six trackers were tested on excavator videos, and the part-based methods performed better than the single-object algorithms.
This concept could also be used to track other equipment. The two-object algorithms can be extended to three, four, or more objects in order to track more complex equipment and activities in construction. In addition, the single-object trackers used in this study can be replaced with better-performing trackers, which would be expected to yield better results. The study has certain limitations. Because of limited space, the tracking time and robustness of the trackers, both of which are important in visual tracking, have not been considered. The more objects are tracked, the more time is spent. When the target is divided into parts, it is easier to lose a quickly moving part, which decreases robustness. Moreover, the part-based algorithms may not bring breakthroughs in handling occlusion, because they cannot exceed the ability of the underlying trackers.
 Mei, X. and Ling, H. (2011) Robust Visual Tracking and Vehicle Classification via Sparse Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 2259-2272. http://dx.doi.org/10.1109/TPAMI.2011.66
 Dalal, N. and Triggs, B. (2005) Histograms of Oriented Gradients for Human Detection. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR’2005, 886-893. http://dx.doi.org/10.1109/cvpr.2005.177
 Gong, J. and Caldas, C.H. (2011) An Object Recognition, Tracking, and Contextual Reasoning-Based Video Interpretation Method for Rapid Productivity Analysis of Construction Operations. Automation in Construction, 20, 1211-1226. http://dx.doi.org/10.1016/j.autcon.2011.05.005
 Weerasinghe, I.P.T. and Ruwanpura, J.Y. (2009) Automated Data Acquisition System to Assess Construction Worker Performance. Proceedings of 2009 Construction Research Congress, ASCE, Reston, VA, 11-20. http://dx.doi.org/10.1061/41020(339)7
 Park, M.W., Makhmalbaf, A. and Brilakis, I. (2011) Comparative Study of Vision Tracking Methods for Tracking of Construction Site Resources. Automation in Construction, 20, 905-915. http://dx.doi.org/10.1016/j.autcon.2011.03.007
 Han, S. and Lee, S. (2013) A Vision-Based Motion Capture and Recognition Framework for Behavior-Based Safety Management. Automation in Construction, 35, 131-141. http://dx.doi.org/10.1016/j.autcon.2013.05.001
 Chae, S. and Yoshida, T. (2010) Application of RFID Technology to Prevention of Collision Accident with Heavy Equipment. Automation in Construction, 19, 368-374. http://dx.doi.org/10.1016/j.autcon.2009.12.008
 Azar, E.R., Feng, C. and Kamat, V.R. (2015) Feasibility of In-Plane and Articulation Monitoring of Excavators arm using Planar Marker. Journal of Information Technology in Construction (ITcon), 20, 213-229.
 Zhong, W., Lu, H. and Yang, M.-H. (2014) Robust Object Tracking via Sparse Collaborative Appearance Model. IEEE Transactions on Image Processing, 23, 2356-2368. http://dx.doi.org/10.1109/TIP.2014.2313227
 Henriques, J.F., Caseiro, R., Martins, P. and Batista, J. (2014) High-Speed Tracking with Kernelized Correlation Filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 125-141.
 Wang, N., Shi, J., Yeung, D.Y. and Jia, J. (2015) Understanding and Diagnosing Visual Tracking Systems. Proceedings of the IEEE International Conference on Computer Vision, 3101-3109. http://dx.doi.org/10.1109/iccv.2015.355
 Koch, C., Jog, G. and Brilakis, I. (2013) Automated Pothole Distress Assessment Using Asphalt Pavement Video Data. Journal of Computing in Civil Engineering, 27, 370-378. http://dx.doi.org/10.1061/(asce)cp.1943-5487.0000232
 Kato, H. and Billinghurst, M. (1999) Marker Tracking and hmd Calibration for a Video-Based Augmented Reality Conferencing System. Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, 85-94. http://dx.doi.org/10.1109/IWAR.1999.803809
 Fiala, M. (2005) ARTag, a Fiducial Marker System Using Digital Techniques. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, Vol. 2, 590-596.
 Smeulders, A.W.M., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A. and Shah, M. (2013) Visual Tracking: An Experimental Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 1442-1468.
 Babenko, B., Yang, M.H. and Belongie, S. (2011) Robust Object Tracking with Online Multiple Instance Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 1619-1632. http://dx.doi.org/10.1109/TPAMI.2010.226
 Kristan, M., Pflugfelder, R., Leonardis, A., Matas, J., Cehovin, L., Nebehay, G., Vojir, T., Fernández, G., et al. (2014) The Visual Object Tracking vot2014 Challenge Results. In: ECCV2014 Workshops, Workshop on Visual Object Tracking Challenge, Volume 8926, 191-217.