AI for Code: IBM’s CodeNet Dataset Empowers AI to Tackle Programming Challenges

By Sepideh Seifzadeh posted Fri June 25, 2021 05:21 PM

Blog by: IBM Center for Open-source Data & AI Technologies (CODAIT) Team
Authors: Yiwen Li & Sepideh Seifzadeh

Software is eating the world. Software now permeates every part of our existence; Google services combine for 2 billion lines of code, and a modern vehicle contains around 100 million lines of code. It's a monumental challenge to create, debug, maintain, and update these complex software systems. 

A fast-growing discipline known as AI for Code aims to help software developers improve their productivity by automating the software engineering process. AI for Code researchers have been leveraging technologies like NLP and augmenting them with code analysis and compilation techniques to perform a myriad of practical tasks, such as code search, summarization, and completion, as well as code-to-code translation. The discipline isn't limited to academic research. Ruchir Puri, IBM Research's chief research scientist, discussed in a recent podcast how technologies from AI for Code are being used to modernize legacy software by helping to migrate monolithic applications to microservices for IBM's enterprise clients. To support this work, IBM’s AI research division has released a new dataset called Project CodeNet.
What is Project CodeNet?
Project CodeNet is a large-scale dataset with approximately 14 million code samples, totaling around 500 million lines of code in 55 different programming languages. Each sample is an intended solution to one of about 4,000 coding problems. CodeNet also provides sample input and output test sets for over 7 million code samples. The dataset contains problems, submissions, and metadata obtained by downloading submissions from two online judge websites: AIZU Online Judge and AtCoder.
The dataset is accompanied by a repository, where we provide a set of tools to aggregate code samples based on user criteria and to transform code samples into token sequences, simplified parse trees and other code graphs. A detailed discussion of Project CodeNet is available in this paper.
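As a rough illustration of what transforming a code sample into a token sequence means, the sketch below uses Python's standard `tokenize` module on a small Python snippet. This is not the tokenizer shipped in the CodeNet repository, which handles several languages; the function name `to_token_sequence` is hypothetical.

```python
import io
import tokenize

# Layout-only token types that carry no lexical content for this purpose.
SKIPPED_TYPES = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
    tokenize.COMMENT, tokenize.ENDMARKER, tokenize.ENCODING,
}

def to_token_sequence(source: str) -> list[str]:
    """Return the lexical tokens of a Python source string, dropping layout and comments."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in SKIPPED_TYPES]

sample = "n = int(input())\nprint(n * 2)  # double it\n"
print(to_token_sequence(sample))
```

A sequence like this is the usual input to sequence models, whereas parse trees and code graphs keep the structural relationships that a flat token list discards.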
Most importantly, Project CodeNet drives innovation in deep learning and machine learning models for code classification and code similarity. To expedite AI for Code research using graph neural networks, the CodeNet researchers also made available the simplified parse tree (SPT) representation of the code samples in the four benchmark datasets. Its creators describe Project CodeNet as a “very large scale, diverse, and high-quality dataset to accelerate the algorithmic advances in AI for Code.”
How Does Project CodeNet Help with Machine Learning Tasks?
Here are several examples of where machine learning models derived from CodeNet can help improve programming tasks:
First, programming language detection and translation. You can use the Project CodeNet dataset to build a deep learning model that detects the language of a piece of source code. This notebook showcases how to perform language classification using a Keras model in TensorFlow. In future releases of the dataset, we plan to support more use cases, for example by enriching the dataset so that data scientists and developers can create machine learning models that translate code from one programming language to another.
Such translation models would spare engineers much manual porting effort, and would help teams move old code to newer programming languages so that it works with modern development tools.
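To make the language detection task concrete, here is a minimal sketch that swaps the Keras model for a much simpler technique: a multinomial naive Bayes classifier over code tokens, written with the standard library only. The class name, the tokenizing regex, and the toy training snippets are all assumptions made for illustration; a real model would be trained on CodeNet samples.

```python
import math
import re
from collections import Counter, defaultdict

# Identifiers/keywords, or a single punctuation character (digits are ignored).
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")

def features(source: str) -> list[str]:
    return TOKEN_RE.findall(source)

class NaiveBayesLanguageDetector:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self) -> None:
        self.token_counts: dict[str, Counter] = defaultdict(Counter)
        self.doc_counts: Counter = Counter()
        self.vocab: set[str] = set()

    def fit(self, samples: list[tuple[str, str]]) -> None:
        for language, source in samples:
            tokens = features(source)
            self.token_counts[language].update(tokens)
            self.doc_counts[language] += 1
            self.vocab.update(tokens)

    def predict(self, source: str) -> str:
        tokens = features(source)
        total_docs = sum(self.doc_counts.values())
        best_language, best_score = "", -math.inf
        for language, counts in self.token_counts.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of every token.
            score = math.log(self.doc_counts[language] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_language, best_score = language, score
        return best_language

# Hypothetical training snippets standing in for labeled CodeNet samples.
TRAINING = [
    ("Python", "def main():\n    n = int(input())\n    print(n)\n"),
    ("Python", "import sys\nfor line in sys.stdin:\n    print(line.strip())\n"),
    ("C++", "#include <iostream>\nint main() { int n; std::cin >> n; "
            "std::cout << n << std::endl; return 0; }"),
    ("C++", "#include <vector>\nusing namespace std;\n"
            "int main() { vector<int> v; return 0; }"),
    ("Java", "import java.util.Scanner;\npublic class Main { "
             "public static void main(String[] args) { "
             "Scanner sc = new Scanner(System.in); "
             "System.out.println(sc.nextInt()); } }"),
    ("Java", "public class Main { public static void main(String[] args) { "
             "int n = 0; System.out.println(n); } }"),
]

detector = NaiveBayesLanguageDetector()
detector.fit(TRAINING)
print(detector.predict("def solve(x):\n    return x\nprint(solve(int(input())))"))
```

With only six training snippets this is a toy, but the same fit-and-predict shape carries over when the classifier is replaced by a neural model and the training set by the dataset's labeled submissions.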
Models derived from CodeNet could also help with code recommendation. Using clustering methods, users can build recommendation tools that auto-complete anything from a single line of code to a block of code, or even a full function.
Another use case is to train a masked language model (MLM) on source code. The goal is to infer the correct token for a masked-out token at an arbitrary position in the source code text. IBM researchers created a notebook that walks through this experiment.
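The masking step of the MLM task can be sketched in a few lines. The code below only prepares training examples (a masked token sequence plus the tokens to recover); it does not train a model and is not taken from the IBM notebook. The `[MASK]` placeholder, the function name `mask_tokens`, and the 15% default masking rate (a common choice in BERT-style training) are assumptions.

```python
import random

MASK = "[MASK]"  # assumed placeholder symbol

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (masked_tokens, labels), where labels maps each masked
    position to the original token the model is asked to predict.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, labels = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[position] = token
        else:
            masked.append(token)
    return masked, labels

tokens = ["for", "i", "in", "range", "(", "n", ")", ":", "print", "(", "i", ")"]
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=42)
print(masked)
print(labels)
```

During training, the model sees `masked` as input and is scored only on the positions listed in `labels`.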
What Makes Project CodeNet Outstanding?
Two features set Project CodeNet apart from related datasets.
First, beyond the sheer size of the dataset and the breadth of programming languages it covers, the code samples in Project CodeNet are annotated with a rich set of information, such as code size, memory footprint, CPU run time, and status, which indicates acceptance or an error type. Over 90% of the problems come with a problem description, which contains a concise problem statement and a specification of the input and output formats. When available, sample input and output are also extracted from the problem description and provided as part of the dataset. Users can execute the accepted code samples to extract additional metadata and to verify the correctness of outputs from generative AI models.
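Checking a solution against a problem's sample input and output could look roughly like the sketch below. It assumes the candidate is a Python program held in a string, and it compares outputs token by token after splitting on whitespace, which is a lenient rule chosen for illustration rather than the judging rule of either online judge. The function name `passes_sample` is hypothetical.

```python
import subprocess
import sys

def passes_sample(source: str, sample_input: str, expected_output: str,
                  timeout_s: float = 5.0) -> bool:
    """Run a Python solution on the sample input and compare stdout to the expected output."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],  # run the candidate in a child interpreter
            input=sample_input,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a timeout as a failed check
    if result.returncode != 0:
        return False  # runtime error or non-zero exit
    # Whitespace-insensitive comparison (an assumption, not the judges' rule).
    return result.stdout.split() == expected_output.split()

solution = "n = int(input())\nprint(n * 2)\n"
print(passes_sample(solution, "21\n", "42\n"))
```

Code produced by a generative model is untrusted, so in practice a check like this belongs inside a sandbox or container rather than on a developer machine.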
Second, Project CodeNet addresses the quality of the data samples. Many frequently used AI for Code datasets contain duplicate code samples, which can inflate performance metrics by as much as 100%. In addition, problem-submission style datasets from online judging systems can contain clusters of identical problems, which also skew performance metrics. In Project CodeNet, the researchers have identified near-duplicates and identical problem clusters for the benefit of users.
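One common way to flag near-duplicate code samples is to compare their bags of tokens with Jaccard similarity. The sketch below shows that idea on three toy snippets; it is not presented as the procedure the CodeNet researchers used, and the regex, the 0.9 threshold, and the function names are assumptions.

```python
import re
from collections import Counter

def token_bag(source: str) -> Counter:
    """Count identifiers, integer literals, and individual symbols, ignoring whitespace."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|\S", source))

def jaccard(a: Counter, b: Counter) -> float:
    """Multiset Jaccard similarity: size of the intersection over size of the union."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0

def near_duplicates(samples: dict[str, str], threshold: float = 0.9):
    """Return sorted name pairs whose similarity is at or above the threshold."""
    bags = {name: token_bag(src) for name, src in samples.items()}
    names = sorted(bags)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if jaccard(bags[x], bags[y]) >= threshold
    ]

samples = {
    "s1": "n = int(input())\nprint(n + 1)",
    "s2": "n  =  int( input() )\nprint( n + 1 )",  # same code, different spacing
    "s3": "import sys\nfor line in sys.stdin: print(line)",
}
print(near_duplicates(samples))
```

The all-pairs comparison is quadratic in the number of samples, so at the scale of millions of submissions it would need to be restricted, for example to submissions for the same problem.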
Comparison with related datasets
Dataset Statistics
Let's take a look at the dataset statistics. The dataset comprises 13,916,868 submissions, divided among 4,053 problems. Of the submissions, 53.6% (7,460,588) are accepted, 29.5% are marked as wrong answer, and the remainder were rejected for one of the other possible causes.
Percentage of submissions per status 
The data contains submissions in 55 different languages, although 95% of them are written in the six most common languages (C++, Python, Java, C, Ruby, and C#). C++ is the most common language, with 8,008,527 submissions (57% of the total), of which 4,353,049 are accepted. Two pie charts depict the distribution of submissions by status and by programming language in Project CodeNet.
Percentage of submissions per programming language
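The percentages quoted in this section follow directly from the raw counts, as the short calculation below shows. Only the four counts come from the text; the helper name `pct` is made up for the example.

```python
# Counts quoted in the dataset statistics.
total_submissions = 13_916_868
accepted_submissions = 7_460_588
cpp_submissions = 8_008_527
cpp_accepted = 4_353_049

def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(pct(accepted_submissions, total_submissions))  # accepted share of all submissions
print(pct(cpp_submissions, total_submissions))       # C++ share of all submissions
print(pct(cpp_accepted, cpp_submissions))            # accepted share of C++ submissions
```

The first value matches the 53.6% stated above. The second comes out near 57.5%, which the text gives as 57%, and the third shows that a little over half of the C++ submissions are accepted.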
The rich metadata and language diversity enable Project CodeNet to support a variety of use cases. The problem-submission relationship in CodeNet can be used for code search and clone detection. The code samples are labeled with their acceptance status, so we can explore AI techniques that distinguish correct code from problematic code. Project CodeNet’s metadata also enables tracking how a submission evolves from problematic to accepted, which could be used for exploring automatic code correction. Each code sample is labeled with its CPU run time and memory footprint, which can be used for regression studies and prediction. Project CodeNet may also be used for program translation, given its rich collection of programs written in different languages. One considerable challenge of neural machine translation is that model training depends on large parallel corpora; Project CodeNet covers a very rich set of languages with ample training instances.
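Tracking how a submission evolves from problematic to accepted can be sketched as a grouping exercise over the metadata. The example below pairs each rejected submission with the accepted submission that directly follows it from the same user, on the same problem, in the same language. The column names, the status strings, and the inline rows are hypothetical stand-ins for the real metadata files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical metadata rows; column names and values are assumptions.
METADATA_CSV = """submission_id,problem_id,user_id,date,language,status
s001,p00001,u1,1000,Python,Wrong Answer
s002,p00001,u1,1060,Python,Accepted
s003,p00001,u2,1100,C++,Accepted
s004,p00002,u1,1200,Python,Runtime Error
"""

def fix_pairs(csv_text: str) -> list[tuple[str, str]]:
    """Return (rejected_id, accepted_id) pairs of consecutive attempts by one user on one problem."""
    history = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["user_id"], row["problem_id"], row["language"])
        history[key].append(row)

    pairs = []
    for rows in history.values():
        rows.sort(key=lambda r: int(r["date"]))  # chronological order of attempts
        for before, after in zip(rows, rows[1:]):
            if before["status"] != "Accepted" and after["status"] == "Accepted":
                pairs.append((before["submission_id"], after["submission_id"]))
    return pairs

print(fix_pairs(METADATA_CSV))  # [('s001', 's002')]
```

Each pair is a candidate before-and-after example for automatic code correction: the difference between the two source files approximates the fix the user made.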
In summary, Project CodeNet is a first-of-its-kind, very large-scale, diverse, and high-quality dataset to accelerate algorithmic advances in AI for Code. The dataset is unique not only in its scale but also in the diversity of coding tasks it can help benchmark: from code similarity and classification for advances in code recommendation algorithms, and code translation between a large variety of programming languages, to advances in code performance improvement techniques and code quality.
The rich annotation of Project CodeNet enables research in code search, code completion, code-to-code translation, and a myriad of other use cases. We also extracted several language-specific datasets for benchmarking in Python, Java, and C++ to drive innovation in deep learning and machine learning models for code classification and code similarity. To expedite AI for Code research using graph neural networks, we also made available the simplified parse tree (SPT) representation of the code samples in the four benchmark datasets.
  1. Project CodeNet on DAX:
  2. Project CodeNet Github:
  3. VentureBeat's post on IBM Project CodeNet: 
  4. Forbes post on IBM Project CodeNet:
  5. AI News Featured IBM Project CodeNet: