This article proposes the high-speed and high-accuracy code clone detecting method. Code clone is a fragment of source code that is identical or similar to other portion of source code  . Code clones often reduce the maintainability of software. Suppose that there are two code fragments A and B, and fragment B is a clone of fragment A. If errors (bugs) are included in A, B must contain same errors and they must be removed at the same time when errors in A are removed. If the programmer does not recognize the existence of clone B, he or she may forget to revise them in clone B. This may cause the deterioration of software quality. It is often said that 10% to 20% of codes are duplicated code (code clones) in large-scale software   . If the software is large, finding all code clones are hard work.
From syntactical point of view, there are three types of code clone named TYPE-1, TYPE-2, and TYPE-3  . In TYPE-1, all parts of original and clones are identical. In TYPE-2, some differences such as the difference of identifier name and function name, and the value of constants exist but structure of code is identical. In TYPE-3, several statements are inserted or removed from original source code.
In order to detect these code clones, several methods are proposed in previous works. These methods are based on two principles; one is token-based method   and the other is tree-based method   . Token-based methods can detect TYPE-1 and TYPE-2 clones very fast but hard to detect TYPE-3 clones. This method is relatively fast and can be applied to large-scale software. On the other hand, tree-based methods can detect all types of code clone but require large computing resources (CPU time and memory).
In this article, we propose the new method that can detect all types of code clone and run relatively fast. Our method combines token-based method and tree-based method. By using token-based method that runs fast, some candidates of code clones are extracted. After the extraction of candidates, each candidate fragment is examined by using tree-based method if it is clone or not. By combining token-based method and tree-based method, our method can detect all types of code clone faster.
2. Definition of Code Clone
2.1. Definitions and Types of Code Clones
Figure 1 is a conceptual illustration of code clone. Let us consider one sample source code shown in Figure 1. The code fragment shown at the left-hand side is the original code. There are another two code fragments at the right-hand side of the source code illustrated by rectangles. These two fragments are assumed to be identical or similar to the original. Then these two fragments are code clones. Note that there are only two fragments in Figure 1, three or more code clones may exist in the source code. Suppose there is one bug in the original code shown by the red star in Figure 1. Then similar bugs may exist in another code
Figure 1. Conceptual illustration of code clone.
fragments of code clones. In order to remove all bugs in the source code, programmer has to find all bugs in other code clones. If he or she forgets to revise other bugs, the quality of software will reduce.
As the size of program increases, the number of code clones increases. Code clones are usually brought into original code by copy-and-paste operation. Finding all code clones in large complex program becomes hard. Therefore, detecting code clones and improving the structure of original code play an important role in software quality assurance.
Bellon et al. classified code clones into three types based on the features of clones  . They are:
・ Type-1 (Exact clone): Exact duplication of original part except white space, tab, carriage code and other coding style related characters.
・ Type-2 (Parameterized clone): Syntactically identical but some names of identifiers (variable names, function/method names etc.) and the values of constants are different in two code fragments.
・ Type-3 (Gap clone): Duplication with some insertion and/or deletion of statements.
2.2. Related Works of Clone Detection
Several methods to detect code clones are proposed in previous works. For example, Baker et al tried to detect clones by line-wise comparison of two files  . Furthermore, they introduced the parse tree and used it for detecting clones  . Currently, detecting methods are classified into three categories.
・ Text-based method (or line-based method): This method detects code clones by comparing two codes fragments line by line. This method runs fast and lightweight. But this method can detect only Type-1 clones. Some sophisticated method can detect Type-2 clones. This method is the fastest.
・ Token-based method: This method detects code clones by comparing two sequences of tokens (minimum unit of lexically meaningful sequence of characters). As tokens only represent the kind of elements in programming language, this method can detect Type-1 and Type-2 clones. But it requires sophisticated modification to detect Type-3 clones. This method is relatively faster and lightweight.
・ Tree-based method: This method detects code clones by comparing two abstract syntax tree (AST)  or other tree representation of source codes. In this method, a source code is tokenized and parsed firstly. Source code is converted into AST (or other tree-based representation) and compared the similarity of two (or more) trees. This method requires much computing resources (slow and large memory) but detect all types of clones.
Table 1 is a result of the comparison of previous methods.
2.3. Issues of Previous Works
As shown in Table 1, all previous methods have some advantages and disadvantages. Text-based and token-based methods can run relatively fast and do not require much computing resources. For example, Kamiya et al. reported that 40 seconds are required to find clones in 235,000 lines source code  . Baker showed only 12 seconds were necessary to the same source code  . But finding Type-3 clones is difficult by these two methods.
Our preliminaries investigation on some sample programs found that the distribution of the clone types is shown in Figure 2. This graph shows that approximately 97% of clones are Type-2 and Type-3, and Type-3 occupies about 40%. Therefore, ignoring Type-3 clones makes the quality of software worse. As
Table 1. Comparison of previous methods of code clone detection.
Figure 2. Ratio of code clone type.
you can easily imagine, detecting Type-1 and Type-2 is not so difficult and detecting Type-3 is hard.
With this investigation, we can also see that the number of exact clones is relatively small. This make sense since in most of the cases of code clone, we take a part of the code and change it to fit new need of the function which is near the cloned program in terms of service or computation to provide. In the process of modification, some identifiers may be renamed, some statements are inserted or removed, and some conditional expression may be changed. Therefore, difficulty of finding Type-3 clones is a serious drawback from the practical point of view.
On the other hand, tree-based methods can detect almost all clones including Type-3 clones that are hard to be detected by text or token-based method. But tree-based methods require much computing time and resources. Baxter  and Krinke  indicated that about 3 hours and 63 hours were necessary to find all code clones in approximately 115,000 lines codes. The reason why tree-based method requires much computing resources is basically the comparison time of two or more trees. When the number of nodes in the tree T is N, the computational time complexity of naive tree comparison is proportion to N6  . Therefore, when the number of nodes, which corresponds to the size of software, becomes 10 times larger, the computation time becomes 1 million times longer than original one. It means that the tree-based method is hard to apply to large-scale software.
3. Proposed Method
3.1. Overall of Proposed Method
Based on the above consideration, we propose our new code clone detection method that is based on the combination of token-based method and tree-based method. Token-based method runs faster but is not appropriate for detecting Type-3 clones. Tree-based method can find all types of clones but runs slower. The reason of large amount of computing time of tree-based method is its comparison time of trees. The larger the software becomes, the more the number of compared trees. Therefore, we use the token-based method to narrow down the number of trees to be compared. By reducing the number of trees to be compared, we can execute the tree-based method much faster. Figure 3 is the overall steps of proposed method. Proposed method consists of following steps.
1) Lexical analyzing source code and generate token sequence,
2) Applying token-based method to extract the candidates of code clones,
3) Generating abstract syntax trees (ASTs) of code clone candidates,
4) Comparing ASTs to fix code clones of all types.
In step (1), source code is converted into the sequence of tokens. Then some conversion will be done to the sequence of tokens to detect TYPE-2 clones. This conversion includes the replacement of specific tokens such as identifier and function/method name by special characters. Figure 4 is an example of conversion. In this example, all identifiers are replace by special character “$” and
Figure 3. Overall steps of proposed method.
constants are replaced by “#”. After the conversion, detection process will be executed. After arranging all tokens as shown in Figure 5, check the identical token and mark as “*”. Some code fragments, which have sufficient length of diagonal lines of Figure 5, are candidates of code clone. After the extraction of code clone candidates in step (2), these code fragments are converted into abstract syntax trees in step (3). Then these trees are compared to find Type-3 code clones by using tree distance in step (4).
3.2. Gap of Diagonal Line
Type-1 and Type-2 code clones draw the continuous diagonal lines when we represent them in a manner shown in Figure 5. But as Type-3 code clones have some insertions and/or deletions of statements, there are several gaps within the clone diagonal lines. Figure 6 shows the example of this gap diagonal line. In this case, original source code has several inserted and/or removed tokens. This will cause a gap as shown in Figure 6.
In order to detect Type-3 code clones, we have to detect the candidates with gap in a diagonal line. In order to detect this gap, we need to find a method to look for other part of code that might follow this gap.
Figure 7 shows an illustration of gap diagnose lines. In this case, there are three diagonal lines , , and . To find Type-3 code clones by merging several code clone fragments, we will use the following algorithms.
Step 1 Extracting diagonal lines (code clone candidates) with longer than predefined length.
Step 2 Let be a diagonal line whose starting point is and
Figure 4. Example of conversion.
ending points is .
Step 3 For all diagonal lines, if there is another diagonal line where and , merge two lines and and draw a diagonal line starting at and ending at . Note that and are predefined tolerable limit. The fragment corresponding this new diagonal line is a merged virtual diagonal line and will be a candidate of Type-3 code clone.
Step 4 When there are two or more lines whose starting points satisfies and , draw a line from to where is an ending point of virtual line where and
. The code fragment corresponding to the
Figure 5. Arranging token streams to find code clones.
Figure 6. Example of gapped diagonal line.
Figure 7. Illustration of gapped diagonal lines.
diagonal line to is a candidate of Type-3 code clone. Figure 8 illustrates the gap code clone that has virtual diagonal line. This virtual diagnose line represents the candidate of TYPE-3 code clone.
3.3. Applying Tree-Based Method
Exact and parameterized code clones (Type-1 and Type-2) can be found by token-based method. But gap clone (Type-3) cannot be identified by token-based method. In order to identify Type-3 code clone, we need further steps. They are 1) translating source code into abstract syntax tree (AST) or similar tree-based representation method for source code, 2) compute the difference (distance) of any pair of code clone candidates, and 3) identify Type-3 code clone by examination of distance. Transforming source code into AST is a well-known processing of language processor such as compiler. We will not mention this process any more.
In order to compute the distance of two trees, we use the tree edit distance (TED)  . In TED, several primitive operations are used to modify trees. For example, we can assume that following three operations are primitive operations for tree transformation: 1) insertion of node, 2) deletion of node, and 3) renaming of node. Let and be two trees. By applying above-mentioned three primitive operations for tree modification, we can always transform into . The trivial method of transforming into is that a) delete all nodes in and generate void (null) tree, b) insert all nodes of into void tree. Let and be the number of nodes in and respectively. Then maximum TED of and is (deleting all nodes of requires operations and inserting all nodes of requires operations). There may be another shorter sequence of operations that transform into . TED is defined as the minimum number of these primitive operators. For example, let us consider two trees and in Figure 9. The number of nodes in is 6
Figure 8. Virtual diagonal line.
Figure 9. Tree transformation.
and that of is 6. Therefore by using trivial method, can be converted into by 6 + 6 = 12 operations. This is the maximum length of transformation operations. However, can be converted by the following sequence: i) delete node “D” from tree , ii) insert node “G” between node “A” and “B” of tree , iii) rename node “F” of tree into “H”. Total number of primitive operation is 3 and this is the minimum number of operations for
1If strictly speaking, we have to prove that 3 is the minimum number of the sequence of transformation.
transforming into . Therefore TED of and is 31.
By applying TED and computing the distance of any two trees based on the established algorithms  , we can filter the Type-3 code clones.
4. Prototype Implementation
4.1. Overall of Prototype
We have implemented the code clone detection system which adopted proposed method. Following is a target program of experiment.
・ Implementation Language: Java
・ Program name: JDK 1.5.0
・ The number of files: 108
・ The number of lines: 33,128 lines
Proposed system is implemented under following environment.
・ OS: Window 7 Professional 64 bits
・ CPU: Intel Core i7 3.2 GHz
・ RAM: 6.0 GB (2.0 GB is used for executing proposed method)
Proposed system is implemented by Java. The total number of prototype code is 4333 lines.
4.2. Experimental Result
Figure 10 shows the result of token-based detection. In this experiment, 50 or longer sequence of tokens were selected as candidates of code clones. As test file contains more than 400,000 tokens, we divided all token sequence into 4000 tokens block. We can identify several clone candidates around the upper left part of this diagram. By using this diagram, we can extract Type-1 and Type-2 code clones easily.
After extracting Type-1 and Type-2 code clones, we furthermore try to extract the candidates of Type-3 (gap) clones. Threshold values of and mentioned in section 3.3 were set to 500. Result is shown in Figure 11. For example, line 121 to 210 of code B is a code clone of source code from line 113 to 202 of source code A.
4.3. Computing Time Comparison
Figure 12 shows the computing time of code clone detection. This graph shows
Figure 10. Result of token-based clone detection.
Figure 11. Result of the detection of gap code clone.
Figure 12. Computing time of three methods.
that the total computing time is heavily influenced by AST-based time. Approximately AST-based computing time occupies 80% to 90% of total computing time.
Let n be the number of tokens. Then complexity of token-based time is . And let m be the number of nodes in the tree. Then complexity of tree-based time is  . Therefore the complexity of total computing time is . Table 2 shows the comparison result of proposed system and other systems.
CCFinder is the fastest system. But it uses token-based method, therefore it is not easy to detect Type-3 code clone. Other two systems and the proposed system
Table 2. Comparison of computing time.
3The terms “precision rate” and “recall rate” are commonly used in the field of information retrieval and pattern recognition. Precision rate is the fraction of detected instance that are relevant, and recall rate is the fraction of relevant instance that are detected.
adopt tree-based method. Therefore, they can detect all types of code clones. This table shows that proposed system is three to four times faster than other two tree-based systems (DECKARD and CloneDR). Note that DECKARD uses same CPU clocks and memory size, and those of CloneDR are lower and small. Therefore comparison to CloneDR is not accurate. Clone detection computing does not include so many disk accesses. The computing time mainly depends on the CPU time. The CPU time is roughly in proportion to clock frequency. Therefore if CloneDR runs on the same clock frequency of DECKARD and proposed system, the computing time is approximately 6000 sec2. The reason why proposed system is faster than other two tree-based systems is the difference of the number of tree comparison. Our system narrows the candidates of code clones by token-based method.
4.4. Quality of Proposed Method
Currently there is no standard criterion of the quality of code clone detection method. One of the simplest quality measures is the ratio of the number of detected code clones and the number of all code clones. But this measure is virtually useless because we cannot count all code clones in large-scale software. Bellon  proposed new assessment metric. Select several sample code clones from the set of detected clones and evaluate whether they are clones or not. By using this metric, we can assess the precision rate of detection. But we cannot assess the recall rate by using this metric3. However, in the detection of code clone, precision rate is more important than recall rate. Low precision rate means that the system generates many noisy code fragments that are not clones. This causes the waste of time to check if they are clones or not. On the other hand, even if undetected code clones remain in the source code, serious negative effect to the quality of software will not occur. Therefore, we use the Bellon’s method for assessing the proposed method.
Figure 13 is an example of detected code clone. Left hand code contains 39 tokens and right hand code contains 46 tokens. The number of common tokens
Figure 13. Example of detected code clones.
in these two is 22 and the ratios of common tokens are 56% (left) and 48% (right). After converting these two codes into two ASTs, we compare the ratio of common nodes in ASTs. The number of nodes of left code is 77 and that of right code is 79. The number of common nodes of these two ASTs is 63. Therefore ratios of common nodes are 78% and 82% respectively. Higher common ratio (tokens and nodes) means the high similarity of two codes. Comparing the common tokens with common nodes, ratio of common nodes is higher than that of common tokens. This means that using tree-based method can detect code clones more precisely.
Total average ratio of common tokens in our experiment is 57% and that of common nodes is 78%. Bellon’s experiment says average ratio of precision of previous works is approximately 64%  . Therefore, we can conclude that our proposed method has sufficient precision ratio.
In this article, we proposed the new method of code clone detection which can detect all types of code clones. Our method combines token-based method and tree-based method. The former can run faster but cannot detect all types of code clones, the latter can detect all types of code clones but requires large computing resources (memory and CPU time). Therefore, applying tree-based methods to large-scale software is virtually impossible. We combine these two methods; token-based method is used to extract the candidates of code clones. Extracted candidates are transformed into abstract syntax trees (ASTs) representation and tree-based method is applied to these trees. By narrowing the candidates of code clones and reducing the number of comparison operations, computing time is reduced to reasonable time.
Experimental evaluation is conducted using sample files with approximately 35,000 lines source code. Proposed method is 3 to 4 times faster than conventional tools such as DECKARD and CloneDR, all of which adopt tree-based method. Detection accuracy is assessed using Bellon’s criterion. Proposed method keeps almost the same accuracy of conventional tools. Based on these evaluation results, we can conclude that the proposed method keeps the accuracy of detection and runs faster than conventional tools and therefore is useful for the improvement of code clone detection of large-scale software.