Global AI and Data Science


AI for Code: Predict Code complexity using IBM’s CodeNet Dataset

By Sepideh Seifzadeh posted Tue October 05, 2021 01:30 PM


Authors: IBM Center for Open Source Data and AI Technologies (CODAIT)

1. Introduction

What if AI could help you write code one day? With the rise of computing since the 1900s, code and software are now everywhere in our lives. The numbers may surprise you: there are approximately 4.5 million lines of code on GitHub, 2 billion lines of code in Google's services, and as many as 100 million lines of code in a modern vehicle. This enormous amount of code challenges the traditional ways developers create, debug, maintain, and update complicated software systems. That's where AI for Code comes into the picture [1].
In recent years, a fast-growing discipline called AI for Code has aimed to help software engineers improve their productivity by automating parts of the code development process: code auto-completion, natural-language-to-code translation, code debugging, and more. In May 2021, IBM Research published a new dataset called CodeNet.
The dataset is hosted on Machine Learning eXchange (MLX). MLX is an open-source Data and AI asset catalog and execution engine, hosted in the LF AI & Data Foundation. It allows the upload, registration, execution, and deployment of AI pipelines, pipeline components, models, datasets, and notebooks [2], [3].
The Center for Open Source Data and AI Technologies (CODAIT) team at IBM has been working on a couple of use cases with the CodeNet dataset. The first is a language classification use case, in which we applied different ML models and also used AutoAI, a tool in IBM Cloud Pak for Data that builds automated end-to-end machine learning pipelines. AutoAI helps with data preparation, model selection, and hyperparameter optimization (HPO) in a drag-and-drop graphical user interface (GUI). Once a pipeline is generated, the end user can download it as code and tweak it further before deployment.
We will publish the work on language classification soon in a separate blog. In this blog, we cover the details of the code complexity prediction use case: which features were extracted from the data, how we generated the labels, and which models were used to assign a complexity class to each code snippet.
We are also working on a code translation task, to translate between different programming languages. There will be an upcoming blog on that use case as well; to learn more, you can refer to [4].

2. Code Complexity Use Case

In this section, we describe a use case for predicting a code snippet's time complexity, formally its Big O notation, which is the standard way to measure the runtime efficiency of code with respect to its input size.
Writing quality code is crucial not only for software engineers but also for data scientists, who need it to perform feature engineering efficiently. It is provably impossible for a single procedure to determine the exact time complexity of arbitrary code (a consequence of the halting problem). With CodeNet, we have the opportunity to approximate a solution to this problem using machine learning techniques.
Here is how we formulate the machine learning problem for the time complexity use case. The training dataset we created contains a vector representation of each code snippet, and we also generated labels for code complexity. The classifier takes the vector representation as input and predicts the time complexity class as the label.
We hope the solution we build could help data science platforms such as IBM Cloud Pak for Data (CPD) provide code quality estimation, helping data scientists deliver quality code with higher standards and confidence.

2.1 Data Preparation

The CodeNet dataset has around 14M code samples for roughly 4K programming problems. For this prototype, we work only with the Python code snippets, which amount to about 4M code files. The metadata describing each code submission is available in tabular format, and the code snippets are in text files. To process this partially unstructured data at scale, we designed the feature engineering pipeline using PySpark User Defined Functions (UDFs).
To derive the labels, we defined three classes: 1) sub-polynomial, 2) polynomial, and 3) above-polynomial. Commonly, algorithms fall into six classes: 1) near-constant, 2) linear, 3) log-linear, 4) polynomial, 5) exponential, and 6) factorial. Any algorithm that runs slower than polynomial time is considered "non-viable" or "inefficient", so there is no need to distinguish between exponential and factorial algorithms. For future work, we plan to split the polynomial class into a few more classes based on degree: 1) quadratic, 2) cubic, and 3) above-cubic.
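As a rough sketch, the six-way taxonomy can be collapsed into the three training labels with a simple lookup. The exact grouping of linear and log-linear under sub-polynomial is our assumption based on the class names above, not a statement of the team's actual labeling code:

```python
# Illustrative mapping from the six common complexity classes to the
# three training labels. Grouping linear and log-linear under
# "sub-polynomial" is an assumption for this sketch.
COMPLEXITY_TO_LABEL = {
    "near-constant": 0,  # sub-polynomial
    "linear": 0,         # sub-polynomial
    "log-linear": 0,     # sub-polynomial
    "polynomial": 1,     # polynomial
    "exponential": 2,    # above-polynomial
    "factorial": 2,      # above-polynomial
}

def to_label(complexity_class: str) -> int:
    """Return the 3-way training label for a fine-grained complexity class."""
    return COMPLEXITY_TO_LABEL[complexity_class]
```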

2.2 Feature Engineering

The goal of feature engineering is to find a good embedding representation of the code snippet. What makes a good embedding in this case? A good embedding should keep code snippets with similar code complexity close together in the embedding space. For example, code snippets with O(n) time complexity should stay close to each other. The distance between two vectors is measured by cosine similarity.
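For reference, cosine similarity between two embedding vectors can be computed with a few lines of standard Python:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors.

    Returns 1.0 for identical directions, 0.0 for orthogonal vectors.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```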
To experiment with different ideas, we sampled around 500K code snippets as our dataset for the feature engineering and modeling tasks. Here are the different approaches we tried for the code snippet representation:

  1. Manual Features:

The most straightforward way to represent a code snippet is to manually create features that capture the attributes relevant to time complexity estimation. The manual features we use are inspired by [5] and [6]:

  • Number of functions
  • Number of breaks
  • Number of loops
  • Conditional-Loop frequency
  • Loop-Conditional frequency
  • Loop-Loop frequency
  • Conditional-Conditional frequency
  • Nested loop depth
  • Recursion present
  • Number of variables
  • Number of ifs
  • Number of statements
  • Number of continue statements
  • Number of statements inside loops
  • Number of function calls inside loops
  • Number of return statements inside loops
It is clear that the above features can capture only some aspects of code complexity. For example, two nested loops do not always correspond to quadratic runtime. We chose to use the manual features as our baseline representation for the problem.
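A few of the manual features above can be extracted with Python's built-in ast module; the following is a minimal illustration of the idea, not the team's actual PySpark pipeline:

```python
import ast

def extract_features(source: str) -> dict:
    """Extract a handful of the manual features from Python source code."""
    tree = ast.parse(source)
    features = {
        "num_functions": 0,
        "num_loops": 0,
        "num_ifs": 0,
        "max_loop_depth": 0,
    }

    def visit(node, loop_depth=0):
        # Track loop nesting depth while counting node types of interest.
        if isinstance(node, (ast.For, ast.While)):
            loop_depth += 1
            features["num_loops"] += 1
            features["max_loop_depth"] = max(features["max_loop_depth"], loop_depth)
        elif isinstance(node, ast.FunctionDef):
            features["num_functions"] += 1
        elif isinstance(node, ast.If):
            features["num_ifs"] += 1
        for child in ast.iter_child_nodes(node):
            visit(child, loop_depth)

    visit(tree)
    return features
```

For example, a snippet with two nested for loops yields num_loops = 2 and max_loop_depth = 2, even though, as noted above, that alone does not guarantee quadratic runtime.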

 2. Graph Based Representation:

It is natural to represent a code snippet as a graph. Internally, the compiler represents the source code as an AST (Abstract Syntax Tree). Fig. 1 illustrates an AST and the corresponding source code.

Fig. 1. Convert Source To Graph Using AST
After the graph is defined, the next question is how to represent it as an embedding vector so that the model can process it as input. Inspired by the anonymous random walk idea [7], we developed an algorithm that samples the graph with anonymous random walks to generate an embedding representing the source code AST.
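The core of the anonymous random walk idea [7] is that a walk is described by the pattern of first occurrences of its nodes rather than by node identities; the empirical distribution over these patterns can serve as a graph embedding. A minimal sketch (this is our simplified reading of the idea, not the team's production algorithm):

```python
import random
from collections import Counter

def anonymize(walk):
    """Map a node walk to its anonymous form, e.g. [a, b, a, c] -> (0, 1, 0, 2)."""
    first_seen = {}
    return tuple(first_seen.setdefault(node, len(first_seen)) for node in walk)

def sample_anonymous_walks(adj, walk_len=4, n_walks=200, seed=0):
    """Sample random walks on a graph (adjacency dict: node -> neighbor list)
    and return the empirical distribution over anonymous walk patterns."""
    rng = random.Random(seed)
    nodes = list(adj)
    counts = Counter()
    for _ in range(n_walks):
        node = rng.choice(nodes)
        walk = [node]
        for _ in range(walk_len - 1):
            node = rng.choice(adj[node])
            walk.append(node)
        counts[anonymize(walk)] += 1
    total = sum(counts.values())
    return {pattern: c / total for pattern, c in counts.items()}
```

Because patterns are identity-free, two structurally similar ASTs with different node names produce similar distributions, which is the property we want from the embedding.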

 3. Graph Based Representation Extended

Another approach we tried is to concatenate the graph embedding with the manual feature vector to create the final embedding representing the graph. We observed improved performance and report the benchmark in the next section.
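The combination step itself is simple vector concatenation (function name and vector contents here are illustrative):

```python
def concat_embedding(graph_embedding, manual_features):
    """Final representation: graph embedding followed by the manual features."""
    return list(graph_embedding) + list(manual_features)
```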

2.3 Modelling

For the modeling, we first started with classic machine learning models. IBM AutoAI provides an automated, out-of-the-box way to run modeling experiments and compare benchmarks. Our first modeling experiment uses the baseline representation generated from the manual features. The LGBM model topped the leaderboard, with a holdout accuracy of about 0.70 and a weighted F1 score of 0.71.

Fig. 2. ROC Curve: Use manual created source code representation 

The second modeling experiment uses the graph representation generated by the anonymous random walk algorithm. Compared to the baseline, the model performs poorly when distinguishing the polynomial case (label 1). We believe the embedding generated was too complex for the tree-based model to generalize. The LGBM model has a holdout accuracy of 0.68 and an F1 score of 0.65.

Fig. 3. ROC Curve:  Anonymous Random Walk source code representation

With the concatenated embedding, we repeated the experiment. The LGBM model improved the performance, with a holdout accuracy and F1 score of 0.74 and 0.72, respectively.

Fig. 4. ROC Curve: Concatenated source representation

The next class of models we experimented with is deep neural networks (DNNs). We designed two different network structures:
  1. a simple three-layer fully connected DNN
  2. a 15-layer fully connected DNN with residual learning [8]
Both neural networks were trained with the concatenated embedding used in the previous AutoAI experiments, and they outperformed the traditional ML algorithms by a few points. Here is a summary of the model performance:
Table 1. Performance Comparison for Different Models: Holdout Accuracy of the 3 Layers Network and the 15 Layers Residual Network
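The residual structure of the deeper network can be sketched in a few lines of NumPy; this is a minimal forward-pass illustration of a skip connection, with random placeholder weights and an arbitrary layer width, not the trained architecture:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Two fully connected layers with a skip connection:
    y = relu(x + W2 @ relu(W1 @ x)), so each block learns a residual on x."""
    return relu(x + w2 @ relu(w1 @ x))

# Forward pass through a stack of residual blocks. The width (16) and
# the number of blocks (7) are placeholders for illustration only.
rng = np.random.default_rng(0)
dim = 16
x = rng.standard_normal(dim)
blocks = [(rng.standard_normal((dim, dim)) * 0.1,
           rng.standard_normal((dim, dim)) * 0.1) for _ in range(7)]
for w1, w2 in blocks:
    x = residual_block(x, w1, w2)
```

The skip connection lets gradients flow directly through the identity path, which is what makes networks of 15-plus layers trainable in practice [8].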

2.4 Next Steps

As our next step to expand this work, we plan to experiment with graph neural networks (GNNs). In this approach, the walk-based representation of the source code AST is not necessary: by converting the AST to geometric data and feeding it into the GNN directly, we hope the network will learn embeddings that represent the different node types in the AST and generalize better at recognizing code complexity. Another extension of our work is to generate more specific labels for the polynomial-class code snippets.

3. Code Competitions Held on CodeNet

3.1 Call To Action:

There is a session introducing the #CodeNet challenge, whose goal is to identify similar programs that solve the same problem.
Register and join the session on October 13, 2021 
- Register for session
- Call for Code Spot Challenge for CodeNet - Introduction 1
Project CodeNet on GitHub and on IBM Developer

3.2 The Challenge Synopsis:

Working in collaboration with Women in Data Science, we will:
  • Provide education on the CodeNet dataset, code representation, and NLP (Natural Language Processing)
  • Help you form or be part of a team
Working in collaboration with your team, you will:
  • Build your models and submit your predictions 
  • Join Crowdcast sessions to learn about NLP and code representation, and to network with other participants and mentors
We will announce the winning team in January 2022.