CodeSport
In this month’s column, we cover how machine learning and natural language processing can be applied to the area of programming.
In this column, we cover computer science interview questions and topics related to natural language processing (NLP) and machine learning (ML). NLP powers a variety of real-world applications. For instance, sentiment analysis is used to measure customer sentiment from product reviews. Natural language understanding (NLU) is used to make sense of human conversations, deciphering the intent of a communication and the entities involved in it. Natural language generation (NLG) is used to create human-like conversational replies. NLU and NLG are the foundations for designing conversational AI systems such as chatbots and virtual agents. NLP has been deployed in fields such as customer care, healthcare, finance and transportation.
In spite of the many real-world applications of NLP, it is quite surprising that NLP/ML techniques have not been applied extensively to software development. There has been some research in this direction, but little of it has reached real-world deployment. With the advent of deep learning, however, NLP/ML techniques are now being explored in software development as well. We will cover this topic over the next few columns, discussing how NLP/ML can be applied to: (a) programming, (b) compilers, and (c) software tools and applications.
Before we get into this topic, it would be good to define and differentiate two terms that often confuse newcomers: 'Learn to Compile' and 'Compile to Learn'. 'Compile to Learn' is all about how specialised/targeted compiler techniques can be used to improve the performance of NLP/ML applications. For instance, TensorFlow includes a domain-specific compiler called XLA that optimises the linear algebra routines at the heart of deep learning (DL) applications. This is a case of 'Compile to Learn', where compiler techniques are used to speed up ML applications. 'Learn to Compile' is the converse: the focus is on using NLP/ML techniques to improve programming and compilers themselves. For example, researchers have used ML techniques to drive better loop optimisations in compilers. Such techniques fall under 'Learn to Compile', since they sit inside the compiler and improve the performance of all kinds of software applications. It is important to understand the difference between these two terms. In the next couple of columns, we will focus on 'Learn to Compile'.
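To make 'Compile to Learn' concrete, here is a minimal sketch, assuming TensorFlow 2.x (2.5 or later), in which the jit_compile flag asks TensorFlow to compile a function with XLA; the toy dense computation is my own illustrative choice:

import tensorflow as tf

# A small dense computation, typical of DL workloads.
@tf.function(jit_compile=True)  # request XLA compilation of this function
def dense_layer(x, w, b):
    # XLA can fuse the matmul, add and relu into fewer device kernels.
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal((64, 128))
w = tf.random.normal((128, 256))
b = tf.zeros((256,))
y = dense_layer(x, w, b)  # the first call triggers XLA compilation
print(y.shape)  # (64, 256)

The ML application itself is unchanged; the compiler works behind the scenes to make it faster, which is exactly what 'Compile to Learn' means.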
We are all familiar with the term 'Big Data'. A related concept is 'Big Code'. In the last decade, a large amount of source code in different programming languages has become publicly available, thanks to open source development and the emergence of source code repositories such as GitHub. This has opened up large data sets of source code to which machine learning techniques can be applied. Discussions on code and software development in community question-answering forums such as Stack Overflow have made both code and the text surrounding it available for data mining. 'Big Code' is the idea of applying ML and NLP techniques to these huge volumes of publicly available source code and related textual artefacts such as documentation, code-related discussions and comments.
The first question we need to answer is the following: NLP/ML techniques were developed for natural language text, i.e., text produced by humans in different languages. With 'Big Code', however, we want to apply these techniques to source code in high-level languages such as C, C++, Python, Java, etc. How directly applicable are NLP/ML techniques to source code? To answer that question, we first need to identify the similarities and differences between source code and natural language. In other words, 'how natural' is source code in terms of human language characteristics?
Let us recall that NLP techniques are built on the foundation of the distributional hypothesis: the meaning of a word is inferred from the contexts in which it occurs. This idea is often associated with the linguist J.R. Firth, who observed that "you shall know a word by the company it keeps." In lay terms, words that are used in similar contexts have similar meanings. An example is given below.
S1: I went with my mother for her medical consultation.
S2: I accompanied my mother for her medical consultation.

Since the contexts in S1 and S2 are similar, we can infer that 'accompanied' and 'went with' have similar meanings. The distributional hypothesis forms the basis for the contextual representations at the core of modern NLP, namely word embeddings, sentence embeddings, etc. Given this significance, we need to know whether the distributional hypothesis holds for source code as well.
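To make the distributional hypothesis concrete, here is a minimal sketch in Python (the toy corpus, window size and helper names are my own choices) that builds co-occurrence vectors and compares words by the company they keep:

from collections import Counter
from math import sqrt

# Toy corpus: 'accompanied' and 'went' occur in similar contexts.
corpus = [
    "i went with my mother for her medical consultation",
    "i accompanied my mother for her medical consultation",
    "i went with my friend to the railway station",
    "i accompanied my friend to the railway station",
]

def context_vector(word, window=2):
    # Count the words appearing within `window` positions of `word`.
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != word)
    return counts

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in set(u) | set(v))
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Words used in similar contexts get similar vectors.
print(cosine(context_vector("went"), context_vector("accompanied")))  # high (~0.7)
print(cosine(context_vector("went"), context_vector("medical")))      # 0.0

Words that keep similar company end up with similar co-occurrence vectors; this is the same intuition that word embeddings scale up to large corpora.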
The key difference between natural language text and source code is that natural language is used for communication between humans, whereas source code communicates over two different channels. The first channel is between the software developer who writes the code and the other developers who read, maintain and modify it; this is human-to-human communication, just like natural language text. The second channel is between the developer and the machine on which the code is intended to be executed: source code is a means of specifying the computations that a computer must perform. Understanding the meaning and intent of source code on the human-to-human channel is similar in spirit to understanding natural language text, so source code exhibits statistical regularities much like those that standard NLP techniques exploit. And because source code must also convey instructions to a machine, it uses a more precise, non-ambiguous form of communication, and is hence less 'surprising' (more predictable) than human text. Together, these two properties make it feasible to analyse source code with the same set of NLP techniques that apply to natural language text.
If classical NLP techniques can be applied to source code, how do we model it? Traditional methods of modelling natural language text include n-gram language models, where a word is predicted from its context. For source code, we can identify three different types of models. Representational models of source code are similar to traditional language models; they can be used to predict tokens, code sequences and other properties of source code.
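In its simplest form, a representational model is an n-gram language model over code tokens. The sketch below (the whitespace tokenisation and the three-snippet corpus are simplifications of my own) learns bigram counts and predicts the most likely next token:

from collections import Counter, defaultdict

# A tiny 'Big Code' corpus: whitespace-separated token streams.
snippets = [
    "for i in range ( n ) :",
    "for j in range ( m ) :",
    "for item in items :",
]

# Train a bigram model: count which token follows which.
bigrams = defaultdict(Counter)
for snippet in snippets:
    tokens = snippet.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev][nxt] += 1

def predict_next(token):
    # Return the most frequent successor of `token` in the corpus.
    followers = bigrams[token]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("in"))     # 'range' (its most frequent successor)
print(predict_next("range"))  # '('

This is exactly how early code-completion studies exploited the low 'surprise' of source code: the next token is often highly predictable from a short context.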
Code generation models of source code are used to generate new source code, taking as input either natural language text or source code in some other programming language, or a combination of both. Consider a tool that can automatically translate C code to Python code. Or a tool that, given a natural language description of a function, generates the source code for that function in a high-level programming language. These are based on code generation models of source code.
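As a sketch of how such a model might be invoked in practice, the snippet below assumes the Hugging Face transformers library and a pre-trained text-to-code checkpoint; the model name Salesforce/codet5-base and the prompt are illustrative assumptions, not a tested recipe:

# Hypothetical sketch: generate code from a natural language description
# using a pre-trained seq2seq model. The model choice is an assumption;
# a production tool would use a model fine-tuned for NL-to-code tasks.
from transformers import pipeline

generator = pipeline("text2text-generation", model="Salesforce/codet5-base")
description = "return the sum of squares of a list of numbers"
result = generator(description, max_length=64)
print(result[0]["generated_text"])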
Pattern mining models of source code are used to infer latent properties or structure within the source code. They are similar to clustering techniques applied to natural language text to mine underlying patterns. These models can be used in source code analysis tools, bug prediction tools, etc.
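A bare-bones pattern mining sketch, assuming scikit-learn is available, treats snippets as plain text, vectorises them with TF-IDF and clusters them with k-means; real tools would use richer representations such as abstract syntax trees:

# Cluster code snippets by surface similarity (toy example of my own).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

snippets = [
    "for i in range(len(a)): total += a[i]",
    "for x in items: total += x",
    "with open(path) as f: data = f.read()",
    "with open(fname) as fh: text = fh.read()",
]

vectors = TfidfVectorizer(token_pattern=r"\w+").fit_transform(snippets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # the loop snippets and the file-reading snippets separate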
We can also classify source code models based on the type of input and output. The input can be source code in one high-level programming language and the output source code in another; a trans-compiler that translates code from one programming language to another is an example. Or the input can be natural language text and the output source code. The input can also be a combination of natural language text and source code, with source code as the output; a tool that generates source code from a natural language specification along with an optional code skeleton is an example. Or the input can be source code and the output natural language text; a tool that automatically creates comments or documentation for source code is an example.
We can also differentiate source code models based on the type of machine learning technique used: a model can be based on supervised learning, unsupervised learning or reinforcement learning. DL models of source code can be sequence-to-sequence models, transformers or graph neural networks. Different types of source code models suit different applications, such as source code analysis tools, compiler optimisations and automatic code generation. We will see examples of each of these types of models in next month's column.
There is an excellent paper that discusses how NLP/ML techniques can be used for source code modelling. Titled ‘A survey of machine learning for big code and naturalness’, it is available at https://arxiv.org/pdf/1709.06182. I would encourage our readers to take a look at this paper.
Feel free to reach out to me over LinkedIn/email if you need any help in your coding interview preparation. If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Wishing all our readers happy coding until next month! Stay healthy and stay safe.