Scaling Token-Based Code Clone Detection

The Team
Hitesh Sajnani, Vaibhav Saini, Cristina Lopes


We propose a new token-based approach for large-scale code clone detection, built on a filtering heuristic that reduces the number of token comparisons needed when two code blocks are compared. We also present a MapReduce-based parallel algorithm that uses the filtering heuristic and scales to thousands of projects. The filtering heuristic is generic and can also be used in conjunction with other token-based approaches. In that context, we demonstrate how it can increase the retrieval speed and decrease the memory usage of index-based approaches.
In our experiments on 36 open source Java projects, we found that: (i) filtering reduces token comparisons by a factor of 10, thereby increasing the speed of clone detection by a factor of 1.5; (ii) the speed-up and scale-up of the parallel approach using filtering are near-linear on a cluster of 2-32 nodes for 150-2800 projects; and (iii) filtering halves the memory usage of the index-based approach and reduces its search time by a factor of 5.
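This page does not spell out the filtering heuristic itself. As a rough illustration only, the sketch below assumes it behaves like the prefix filtering used in set-similarity joins: each code block is treated as a set of tokens, two blocks count as clones when they share at least theta * max(size) tokens, and a pair is rejected early when short prefixes of the sorted token lists have nothing in common. The names required_overlap, passes_filter, is_clone_pair, and theta are hypothetical and are not taken from the tool.

```python
import math


def required_overlap(size_a, size_b, theta):
    """Minimum number of shared tokens for two blocks to count as clones."""
    return math.ceil(theta * max(size_a, size_b))


def passes_filter(a_sorted, b_sorted, overlap_needed):
    """Cheap pre-check that looks only at short prefixes of both blocks.

    If two sorted token sets share at least `overlap_needed` tokens, their
    prefixes of length len - overlap_needed + 1 must share at least one token.
    A pair with disjoint prefixes can therefore be skipped safely.
    """
    prefix_a = a_sorted[: len(a_sorted) - overlap_needed + 1]
    prefix_b = b_sorted[: len(b_sorted) - overlap_needed + 1]
    return not set(prefix_a).isdisjoint(prefix_b)


def is_clone_pair(tokens_a, tokens_b, theta=0.8):
    """Return True if the two token collections meet the overlap threshold."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return False
    needed = required_overlap(len(set_a), len(set_b), theta)
    # The smaller block cannot supply enough shared tokens.
    if needed > min(len(set_a), len(set_b)):
        return False
    # Both blocks must be sorted with the same global order (assumed: lexical).
    a_sorted, b_sorted = sorted(set_a), sorted(set_b)
    if not passes_filter(a_sorted, b_sorted, needed):
        return False  # rejected without comparing the full token sets
    # Only surviving candidates pay for the full comparison.
    return len(set_a & set_b) >= needed
```

With this design, most non-clone pairs are discarded after comparing a handful of prefix tokens, which is the kind of saving in token comparisons that the abstract reports. The real tool may order tokens and compute thresholds differently.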

Replicating the Experiment

System Requirements

You need a machine with at least 12 GB of RAM running Ubuntu or Mac OS X, with Java 1.6 or higher installed. The memory requirement is high because the experiment involves very large subject systems. If you run the tool on smaller systems (see Tools below), this constraint does not apply.

Steps to replicate the experiment:

  1. Click here to download the distribution (
  2. Unzip the distribution and navigate to dist/ using a terminal
  3. Execute the command ./ 1     Please note: this will run the experiment once. To run it n times, change the command to ./ n
You can see the generated output in the ./output folder. Files with the .txt extension contain the computed clones, and files with the .csv extension contain the performance analysis results.


In order to run the tool on any arbitrary project, please follow the steps below:

A. Generating the input file of the project for which you want to detect clones
  1. Click here to download input generator for the code clone detector (
  2. Unzip it and import the project ast into your Eclipse workspace.
  3. Run it as an "Eclipse Application". This should open another Eclipse instance, into which you will import the projects for which you want to generate the input file.
  4. After importing the project into the workspace of the new Eclipse instance, click on "Sample Menu" in the top menu bar and then click on "Sample command" to run. This should generate the output (the desired input file) at the path specified by the variable "outputdirPath".
  5. Please note that you will have to change the location of the output directory on line 61 of = "/Users/vaibhavsaini/Documents/codetime/repo/ast/output/"; to your desired output directory.
  6. The generated input file name will be of the format: <ProjectName>-clone-INPUT.txt. For example, if your project name is jython, then the generated input file name should be jython-clone-INPUT.txt

B. Running the clone detection tool on the generated input file
  1. Click here to download the CloneDetector (
  2. Unzip and navigate to tool/ using terminal
  3. Copy the input file generated above (<ProjectName>-clone-INPUT.txt) into the input/dataset directory.
  4. Open, and assign <ProjectName> as the value of the variable arrayname (line #5). For example, if your generated input file is jython-clone-INPUT.txt, line #5 should be arrayname=(jython)
  5. Execute the command ./
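
The naming convention used in sections A and B can be checked with a few lines of Python. This is only an illustrative sketch: the helper names input_file_name, project_name_from_input, and arrayname_line are hypothetical and are not part of the tool. Only the file name pattern and the arrayname=(...) form come from the steps above.

```python
SUFFIX = "-clone-INPUT.txt"


def input_file_name(project_name):
    """Build the input file name the generator is described as producing."""
    return project_name + SUFFIX


def project_name_from_input(file_name):
    """Recover <ProjectName> from a generated input file name."""
    if not file_name.endswith(SUFFIX):
        raise ValueError("not a generated input file: " + file_name)
    return file_name[: -len(SUFFIX)]


def arrayname_line(file_name):
    """Return the text that line #5 should contain for this input file."""
    return "arrayname=(" + project_name_from_input(file_name) + ")"
```

For example, arrayname_line("jython-clone-INPUT.txt") gives the jython setting described in step 4.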

C. Generated output
  1. The generated output will be in the ./output folder.
  2. Files with the .txt extension contain the computed clones, and files with the .csv extension contain the time taken to detect clones.
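
To separate the two kinds of output after a run, a small script along the following lines may help. It is a sketch under assumptions: the function names split_outputs and scan_output_dir are hypothetical, and nothing is assumed about the contents of the files, only the .txt and .csv extensions described above.

```python
from pathlib import Path


def split_outputs(file_names):
    """Split file names into (clone files, analysis files) by extension."""
    clones, analysis = [], []
    for name in sorted(file_names):
        suffix = Path(name).suffix.lower()
        if suffix == ".txt":
            clones.append(name)      # computed clones
        elif suffix == ".csv":
            analysis.append(name)    # timing / analysis results
    return clones, analysis


def scan_output_dir(output_dir="./output"):
    """Apply split_outputs to the files found in the output folder."""
    folder = Path(output_dir)
    if not folder.is_dir():
        return [], []
    return split_outputs(p.name for p in folder.iterdir() if p.is_file())
```

Calling scan_output_dir() from the tool/ directory returns the two lists for the ./output folder, or two empty lists if the folder does not exist yet.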


The table below describes all the subject systems and the corresponding output. This data was used to calculate and report the numbers in the paper. Column 1 has links to the source code of each subject system. Column 2 is the input file generated by running the INPUTGEN tool on the subject system. This input file is in turn used by NCCF and FCCD to compute clones. Column 3 has links to the computed clones. Since both tools produced exactly the same output, you will see only one file per subject system. Column 4 has the final analysis results: the runtime of the tool and the total number of token comparisons performed. These numbers differ between the two tools, so you will see two analysis files per project, one produced by NCCF and another by FCCD.
Subject System | Generated Input for Tool | Clones Detected | Analysis