Code Clone Detection

Project Description

Given the availability of large-scale source-code repositories, there have been a large number of applications for clone detection. Unfortunately, despite a decade of active research, there is a marked lack in clone detectors that scale to large software repositories. In particular for detecting near-miss clones where significant editing activities may take place in the cloned code.
We present SourcererCC, a token-based clone detector that targets the first three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone.
We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks, (1) an exhaustive benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250MLOC) using a standard workstation.

Tool Download and Usage

In order to run the tool please follow the steps below:

A. Generating the input file of the project for which you want to detect clones

Click here to download input generator for the code clone detector (ast.zip).
Unzip ast.zip and import the project ast in your eclipse workspace.
Run it as an "Eclipse Application". This should open another eclipse instance where you will import the projects for which you want to generate the input file.
After importing the project in the workspace of the new eclipse instance, click on the "Sample Menu" in the top menu bar and then click on the "Sample command" to run. This should generate the output (desired input file) in the path specified by variable "outputdirPath".
Please note that you will have to change the location of output directory on line 61 of SampleHandler.java.this.outputdirPath = "/Users/vaibhavsaini/Documents/codetime/repo/ast/output/"; to your desired output directory.
The generated input file name will be of the format: <ProjectName>-clone-INPUT.txt. For example, if your project name is jython, then the generated input file name should be jython-clone-INPUT.txt

B. Running the clone detection tool on the generated input file

Click here to download the CloneDetector (tool.zip).
Unzip tool.zip and navigate to tool/ using terminal
Copy the input file generated above (<ProjectName>-clone-INPUT.txt) into input/dataset directory.
Open cd.sh, and assign <ProjectName> as value to the variable arrayname (line #5). For example, If your generated input file is jython-clone-INPUT.txt, line #5 should be arrayname=(jython)
Execute the command ./cd.sh

C. Generated output

The generated output will be in the ./output folder.
Files with extension .txt will have the computed clones and the files with .csv extension will have the time taken to detect clones

D. Source Code
The source code of SourcererCC can be found here on github.

E. SourcererCC-I
SourcererCC-I is an interactive version of the tool integrated with Eclipse IDE to help developers instantly find clones during software development and maintenance.
A short video of Sourcerer-I in action can be found here and link to install the Eclipse plug-in is available here.

Reviewer

True Positives

False Positives

Judge 1

Judge 2

Judge 3

SourcererCC: Scaling Type-3 Clone Detection to Large Software Repositories

Project Description

Tool Download and Usage

Precision data as reported in the paper

Effectiveness of Filtering Heuristics (Figure 1 in paper)