Building hundreds of thousands of Java projects
Large repositories of source code for research tend to limit their utility to static analysis of the code, as they give no guarantees on whether the projects are compilable, much less runnable in any way. The immediate consequence of the lack of large compilable and runnable datasets is that research that requires such properties does not generalize beyond small benchmarks.
We present the Java Build Framework, a method and tool capable of automatically compiling a large percentage of the Java projects available in open source repositories like GitHub. Two elements are at its core: a very large repository of JAR files, and techniques for resolving compilation faults and dependencies.
We also provide a repository of 50,000 compilable Java projects. Each project in this dataset comes with references to all the dependencies required to compile it, the resulting bytecode, and the scripts with which the projects were built.
Access to the Source Code and Tutorials
The software can be found on GitHub here.
We also created an artifact in the form of a VirtualBox virtual machine, which provides quick access to the pipeline through a guided tutorial and can be found here (2.6 GB). The password is p
Important Disclosure
Please be aware that no special care was taken to analyze the source code for security threats. This means that when you run the executable class files or the build scripts that come with the projects, you are running unknown code. Act at your own discretion and be careful. At the very least, sandboxing through file system permissions and limited network access is highly recommended.
Access to 50K-C(ompilable)
Direct download links:
JBF Meta-data:
The files 50K-C_projects.tgz (source code) and 50K_buildresults.tgz (build meta-data) contain the projects and their build meta-data, respectively, under the same relative path. For example, the result of building projects/2/mvmn-Thue-in-java.zip is in builds/2/mvmn-Thue-in-java.
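The path correspondence described above can be sketched in a few lines of Python. This is only an illustration that assumes the layout is exactly as in the example (a projects/ prefix and a .zip suffix on one side, a builds/ prefix and no suffix on the other); the function name build_dir_for is hypothetical and not part of JBF.

```python
from pathlib import PurePosixPath


def build_dir_for(project_zip: str) -> str:
    """Map a project archive path to its build-results folder.

    Assumes the layout shown in the example: projects/<dir>/<name>.zip
    corresponds to builds/<dir>/<name>. Hypothetical helper, not part of JBF.
    """
    path = PurePosixPath(project_zip)
    if not path.parts or path.parts[0] != "projects" or path.suffix != ".zip":
        raise ValueError(f"not a project archive path: {project_zip!r}")
    # Keep the intermediate directories, swap the prefix, drop the ".zip".
    return str(PurePosixPath("builds", *path.parts[1:-1], path.stem))
```

For instance, build_dir_for("projects/2/mvmn-Thue-in-java.zip") gives the builds/2/mvmn-Thue-in-java folder from the example.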
Inside each project's build folder (in 50K_buildresults.tgz), for example, inside builds/2/mvmn-Thue-in-java, one can find:
  "build_method": "general_build_file",
  "create_build": true,
  "depends": [
    [
      null,
      null,
      null,
      false,
      "n/n4upgrade/n4upgrade_0.jar",
      false
    ]
  ],
  "file": "100/emagnus-ulurulib",
  "full_output": "",
  "has_own_build": false,
  "output": "Buildfile: /home/mondego/UCLA-50K/SourcererJBF/TBUILD/BUILD_30/build.xml\n\ninit:\n\nresolve:\n\ncompile:\n [javac] Compiling 30 source files to /home/mondego/UCLA-50K/SourcererJBF/TBUILD/BUILD_30/build\n\nBUILD SUCCESSFUL\nTotal time: 5 seconds\n",
  "path": "projects/100/emagnus-ulurulib.zip",
  "success": true,
  "timing": [
The field depends contains the list of dependencies of the project (paths are relative to the contents of 50K-C_jars.tgz). Some of these dependencies have true instead of false as their last element, for example:
  "depends": [
    [
      null,
      null,
      null,
      false,
      "project21/n4upgrade/n4upgrade_0.jar",
      true
    ]
  ],
which means that the jar was obtained from the project itself, not from 50K-C_jars.tgz.
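As an illustration of how the depends entries might be consumed, the following Python sketch separates jars that come from 50K-C_jars.tgz from jars that ship with the project itself. It assumes, based only on the examples shown here, that the jar path is the second-to-last element of each entry and the boolean flag is the last one; the meaning of the other elements is not documented here, so they are ignored. The function name split_dependencies is hypothetical.

```python
from typing import List, Tuple


def split_dependencies(build_result: dict) -> Tuple[List[str], List[str]]:
    """Return (jars from the jar repository, jars from the project itself).

    Assumption taken from the examples: in each "depends" entry the jar path
    is the second-to-last element and the last element is True when the jar
    was obtained from the project rather than from 50K-C_jars.tgz.
    """
    from_repository: List[str] = []
    from_project: List[str] = []
    for entry in build_result.get("depends", []):
        jar_path, in_project = entry[-2], entry[-1]
        if in_project:
            from_project.append(jar_path)
        else:
            from_repository.append(jar_path)
    return from_repository, from_project


# Sample record assembled from the two entries quoted in this document.
sample = {
    "depends": [
        [None, None, None, False, "n/n4upgrade/n4upgrade_0.jar", False],
        [None, None, None, False, "project21/n4upgrade/n4upgrade_0.jar", True],
    ]
}
repository_jars, project_jars = split_dependencies(sample)
```

Only the paths in the first list should be resolved against the contents of 50K-C_jars.tgz.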
Finally, if you do not want to traverse the entire build results to read every build-result.json file, project_details.json contains a concatenation of the build-result.json files for all the projects.
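The exact layout of project_details.json is not specified here, so the following Python sketch is written under a stated assumption: the file holds either JSON objects written one after another or a single JSON array of such objects. It uses the standard library's json.JSONDecoder.raw_decode to read one top-level value at a time; the function name iter_build_records is hypothetical, and a different layout (for example, comma-separated objects) would need a different reader.

```python
import json
from typing import Iterator


def iter_build_records(text: str) -> Iterator[dict]:
    """Yield build records from text holding concatenated JSON values.

    Assumed layouts: objects back to back (optionally separated by
    whitespace), or one JSON array of objects. Hypothetical helper.
    """
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        if text[pos].isspace():
            # raw_decode does not accept leading whitespace, so skip it here.
            pos += 1
            continue
        value, pos = decoder.raw_decode(text, pos)
        if isinstance(value, list):
            yield from value
        else:
            yield value


def successful_paths(text: str) -> list:
    """Collect the "path" of every record whose "success" field is true."""
    return [r["path"] for r in iter_build_records(text) if r.get("success")]
```

To use it on the real file, read project_details.json into a string first, for example with pathlib.Path("project_details.json").read_text(), and pass the result to successful_paths.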