Scaling Token-Based Code Clone Detection
The Team
Hitesh Sajnani, Vaibhav Saini, Cristina Lopes
Abstract
We propose a new token-based approach for large-scale code clone detection. It is based on a filtering heuristic that reduces the number of token comparisons needed when two code blocks are compared. We also present a MapReduce-based parallel algorithm that uses the filtering heuristic and scales to thousands of projects. The filtering heuristic is generic and can also be used in conjunction with other token-based approaches. In that context, we demonstrate how it can increase the retrieval speed and decrease the memory usage of index-based approaches.
In our experiments on 36 open source Java projects, we found that: (i) filtering reduces token comparisons by a factor of 10, thereby increasing the speed of clone detection by a factor of 1.5; (ii) the speed-up and scale-up of the parallel approach using filtering are near-linear on a cluster of 2-32 nodes for 150-2800 projects; and (iii) filtering decreases the memory usage of the index-based approach by half and its search time by a factor of 5.
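To make the idea of a filtering heuristic concrete, the sketch below uses prefix filtering, a standard technique from set-similarity joins. It is an illustration under stated assumptions, not the implementation behind the tools on this page: the function names, the threshold parameter `theta`, and the definition of a clone as "shared tokens >= ceil(theta * size of the larger block)" are all hypothetical choices made for the example. The underlying observation is that if two blocks sorted in the same global order must share at least `t` elements, then their prefixes of length `len(block) - t + 1` must share at least one element, so pairs with disjoint prefixes can be rejected without a full comparison.

```python
import math
from collections import Counter


def to_sorted_elements(tokens):
    """Turn a bag of tokens into a sorted list of unique (token, occurrence) pairs.

    Numbering repeated tokens lets the bag be treated as a plain set.
    """
    counts = Counter(tokens)
    return sorted((tok, k) for tok, n in counts.items() for k in range(n))


def overlap(a, b):
    """Merge-style scan of two sorted lists.

    Returns (number of shared elements, number of element comparisons made).
    """
    i = j = shared = comparisons = 0
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] == b[j]:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared, comparisons


def is_clone(tokens_a, tokens_b, theta=0.8, use_filter=True):
    """Return (clone decision, comparisons made). All names here are hypothetical."""
    a = to_sorted_elements(tokens_a)
    b = to_sorted_elements(tokens_b)
    if not a or not b:
        return False, 0
    # Assumed clone criterion: overlap relative to the larger block.
    required = math.ceil(theta * max(len(a), len(b)))
    comparisons = 0
    if use_filter:
        # Prefix filter: compare only the short prefixes first.
        prefix_a = a[: max(0, len(a) - required + 1)]
        prefix_b = b[: max(0, len(b) - required + 1)]
        shared, comparisons = overlap(prefix_a, prefix_b)
        if shared == 0:
            return False, comparisons  # rejected without a full comparison
    shared, full_comparisons = overlap(a, b)
    return shared >= required, comparisons + full_comparisons
```

In this sketch the filter pays off on pairs that are not clones, which are typically the large majority of candidate pairs; pairs that pass the filter still go through the full comparison, so the clone decisions are the same with and without it.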
Replicating the Experiment
System Requirements
You need a machine with at least 12 GB of RAM running Ubuntu or Mac OS X, with Java 1.6 or higher installed. This much RAM is needed because the experiment involves very large subject systems. If you run the tool on smaller systems (see Tools below), this constraint does not apply.
Steps to replicate the experiment:
Tools
To run the tool on any arbitrary project, please follow the steps below:
A. Generating the input file of the project for which you want to detect clones
B. Running the clone detection tool on the generated input file
C. Generated output
Data
The table below describes all the subject systems and the corresponding output. This data was used to calculate and report the numbers in the paper. Column 1 has links to the source code of each subject system. Column 2 is the input file generated by running the INPUTGEN tool on the subject system. This input file is in turn used by NCCF and FCCD to compute clones. Column 3 has links to the computed clones. Since both tools produced exactly the same output, you will see only one file per subject system. Column 4 has the final analysis results: the tool's runtime and the total number of token comparisons performed. These numbers differ between the two tools, so you will see two analysis files per project, one produced by NCCF and the other by FCCD.