Open Source For You

CodeSport

In this month’s column, we discuss the topic of applying NLP techniques to source code analysis.


Natural language processing (NLP) and deep learning techniques are widely used in domains such as improving Web search results, conversational AI systems, analysing medical records, classifying medical images, and so on. An emerging area of application for deep learning and NLP is software development. There are a number of use cases for NLP in software development, and I will give a brief overview of a few of them in this month's column. For readers who would like to explore the topic further, I presented a tutorial on it at the PAKDD conference last month, and the slides are available at https://bit.ly/3vW1DCa.

Before we get started on the applications of NLP to code, let us first consider whether current NLP techniques can be applied to source code at all. Current NLP applications operate on human language text, and it is the naturalness of such text that makes it amenable to statistical NLP techniques. The main characteristic of natural language text is that the meaning of a word or phrase can often be inferred from its context, namely the surrounding words or phrases. This fact is well known in the NLP community and is often cited through Firth's dictum: "You shall know a word by the company it keeps." It forms the foundation of statistical NLP and lets us use the distributional semantics paradigm to build distributed representations (embeddings) of words, phrases, documents and so on. This leads us to the question: is there a similar naturalness in software/code that would make it amenable to statistical NLP techniques?
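To make the idea concrete, here is a minimal sketch of learning distributed representations of code tokens in the word2vec style. It assumes the gensim library, and the tiny corpus of tokenised snippets is purely hypothetical:

# Word2vec-style embeddings over code tokens (a sketch, assuming gensim).
from gensim.models import Word2Vec

# Each "sentence" is the token stream of one (hypothetical) code snippet.
corpus = [
    ["for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"],
    ["for", "(", "int", "j", "=", "0", ";", "j", "<", "len", ";", "j", "++", ")"],
    ["while", "(", "i", "<", "n", ")", "{", "i", "++", ";", "}"],
]

# The context window over surrounding tokens plays the role of
# "the company a word keeps" in Firth's observation.
model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, sg=1)

# Query the nearest neighbours of the token "i" in the learned space.
print(model.wv.most_similar("i", topn=3))

Real systems would tokenise or parse far larger corpora, but the distributional principle is the same: tokens that occur in similar contexts, such as loop index variables, end up with nearby vectors.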

While programming languages themselves are complex and offer a multitude of different features, when we analyse the actual programs that humans write in high level programming languages, we find them to be simple and often repetitive. Real world programs are therefore amenable to statistical NLP techniques because of this inherent repetitiveness in human-written software, a property known as the 'naturalness' of software. Given that human-written code exhibits naturalness, we can use standard NLP and deep learning methods on it to build software tools. But this brings us to the other question: since deep learning techniques typically require large amounts of data, how can we get such large data sets for code?

The past few years have seen the emergence of large, publicly available open source code repositories on GitHub, SourceForge and Bitbucket. This has given rise to what is known as 'Big Code', referring to the large amounts of code on which deep learning models can be trained. These two factors, the 'naturalness of software' and the availability of 'Big Code', have made it possible to apply deep learning (DL) techniques to building models of source code written in high level programming languages.

There are a number of use cases for applying NLP/DL techniques to source code. These include building smart software engineering tools for tasks such as automatically learning coding guidelines and naming conventions, automatically finding bugs in code (bug localisation), automatically fixing software bugs (program repair), mining API sequences that satisfy a natural language user query, searching for code fragments that satisfy a user query, automatic comment generation, automatic commit message generation from code diffs, code clone detection, and so on.

In the context of applying DL/NLP techniques to software engineering tasks, here are three important questions we need to answer first:

1. Is there an analogous standard NLP task for the software engineering task to which we want to apply deep learning techniques? For instance, consider the problem of software clone detection. This can be mapped to the standard NLP task of paraphrase detection or duplicate text detection. Once we map it to a standard NLP task, we can leverage the state-of-the-art models already built for that task in NLP. Similarly, we can model the problem of automatic comment generation as a machine translation task.

2. For each of these software engineering tasks, we need to consider how to generate training data. For example, consider the problem of automatic bug detection. Let us assume we have modelled it as a two-class text classification problem: given a code snippet, we label it as either a positive sample (correct code) or a negative sample (buggy code). A large amount of code is publicly available, and the assumption is that most of it is correct, which gives us plenty of positive samples. But how do we get negative samples, that is, examples of buggy code? One way is to have a human annotator go through the code samples and mark each piece of buggy code. However, this requires a lot of effort, does not scale, and open source repositories may simply not contain enough examples of buggy code. An alternative approach is to automatically create buggy code from correct code using simple code transformations. For example, consider detecting simple bugs where the arguments to a function are mistakenly swapped. Here is a small example of a swapped arguments bug:

void setDim(int x, int y) { }

int Xdim = …;
int Ydim = …;
setDim(Ydim, Xdim);

In the above example, the first and second arguments are mistakenly swapped at the point of invocation. If we want to create training data for classifying this kind of bug, we can take correct function calls and apply a simple program transformation that swaps the call arguments to produce buggy examples (a small sketch of such a transformation is shown after this list). Hence, it is possible to synthetically generate large amounts of training data.

3. The final question we need to answer when applying DL/NLP techniques to code is: how do we represent the code? There are well known methods of representing text, such as word embeddings (word2vec, GloVe), sentence embeddings, and contextual embeddings (ELMo, BERT). Can we represent code using similar techniques? Should code be treated as a textual stream of tokens? How do we represent dependency information between variables, the control flow graph and so on as part of the code representation? Unlike human text, where the information is contained solely in the text, programs carry additional information that can be inferred from their control flow and data flow, so this auxiliary information also needs to be captured when modelling source code. There has been considerable work on modelling source code using deep learning techniques, and a recent survey paper covering these techniques in detail is available at https://arxiv.org/pdf/2002.05442.pdf.
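Returning to the second question above, here is a minimal sketch of the kind of transformation that synthesises swapped-argument bugs from correct code. It is written in Python using the standard ast module (ast.unparse needs Python 3.9 or later); the snippet being transformed and its identifiers are purely illustrative:

# Generate "swapped argument" negative samples from correct code.
import ast

class SwapArgs(ast.NodeTransformer):
    # Swap the first two arguments of every call that has at least two.
    def visit_Call(self, node):
        self.generic_visit(node)
        if len(node.args) >= 2:
            node.args[0], node.args[1] = node.args[1], node.args[0]
        return node

correct = "set_dim(x_dim, y_dim)"                    # positive (correct) sample
tree = SwapArgs().visit(ast.parse(correct))
buggy = ast.unparse(ast.fix_missing_locations(tree))
print(buggy)                                          # set_dim(y_dim, x_dim), a negative sample

Running such a transformation over a large corpus of correct code yields as many negative samples as we need, without any manual annotation.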

Another thing we should keep in mind when applying DL techniques to source code analysis tasks is: what are the input and output domains of the task? For example, consider the problem of deep code search. The input is a user query in natural language, and the output is a code snippet in a high level programming language. The input text captures the high level intent of the programmer in natural language, while the output captures low level implementation details in code. Often, the surface level similarity (both lexical and syntactic) between the input and output in this task is negligible, so any effective method for deep code search needs to address this representational mismatch. One way of overcoming it is to project both the user query and the code snippet into a common vector space representation. This idea is well described in the paper 'Deep Code Search', available at https://guxd.github.io/papers/deepcs.pdf.
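Here is a minimal retrieval sketch of that idea (this is not the DeepCS implementation). In the real setting, a query encoder and a code encoder are trained jointly so that matching pairs land close together in the shared space; the toy hash-based bag-of-words encoder below is only a stand-in so that the sketch runs end to end:

# Ranking code snippets against a natural language query in a shared vector space.
import re
import numpy as np

def toy_encode(text, dim=64):
    # Stand-in for a trained neural encoder: hash-based bag of words.
    v = np.zeros(dim)
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        v[hash(tok) % dim] += 1.0
    return v

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def search(query, snippets, encode_query=toy_encode, encode_code=toy_encode, top_k=3):
    # Rank code snippets by similarity to the query in the common vector space.
    q = encode_query(query)
    scored = sorted(((cosine(q, encode_code(s)), s) for s in snippets), reverse=True)
    return scored[:top_k]

snippets = ["def read_file(path): return open(path).read()",
            "def sort_items(xs): return sorted(xs)"]
print(search("read a file from disk", snippets))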

One major challenge in applying probabilistic deep learning techniques to code is: how can we ensure correctness? For example, consider the problem of program repair, or automatic bug fixing. Given a faulty piece of code, we can build a deep learning model that generates the corrected code, modelling the problem along the lines of neural machine translation or grammatical error correction in NLP. However, in the case of source code, the fixed code needs to be correct in every token so that it can be compiled without errors and executed. Ensuring and verifying correctness is a major challenge in such settings. One way of working around this problem is to have an external oracle that can verify the correctness of the generated code. This idea is explored well in the paper 'DeepFix', available at http://www.iisc-seal.net/deepfix.

The DeepFix authors use an external oracle, namely the compiler itself, to check whether the generated code is acceptable. However, the compiler can check only lexical and syntactic correctness; logical or semantic correctness cannot be ensured by it, so this remains an open challenge in applying probabilistic deep learning techniques to code generation.
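Here is a minimal sketch of such a compile-and-check loop, assuming gcc is available on the system; the model object and its generate_fixes() method are hypothetical placeholders for whatever neural repair model is being used:

# Using the compiler as an external oracle to filter candidate fixes.
import os
import subprocess
import tempfile

def compiles(c_source):
    # Ask gcc to syntax-check the candidate; this is the external oracle.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".c", delete=False) as f:
        f.write(c_source)
        path = f.name
    try:
        result = subprocess.run(["gcc", "-fsyntax-only", path], capture_output=True)
        return result.returncode == 0
    finally:
        os.remove(path)

def repair(buggy_source, model, num_candidates=10):
    # The model proposes several candidate fixes; only those accepted by the
    # compiler oracle are kept. Semantic correctness is still not guaranteed,
    # as discussed above.
    candidates = model.generate_fixes(buggy_source, n=num_candidates)
    return [c for c in candidates if compiles(c)]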

In next month's column, we will get into some of the applications of deep learning techniques in compiler optimisations. Meanwhile, feel free to reach out to me over LinkedIn/email if you need any help in your coding interview preparation. If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com.

Wishing all our readers happy coding until next month! Stay healthy and stay safe.

Sandya Mannarswamy
