
CODE SPORT


In this month's column, we cover how machine learning and natural language processing can be applied to the area of programming.

In this column, we cover computer science interview questions and topics related to natural language processing (NLP) and machine learning (ML). NLP has been applied to various real-world applications. For instance, sentiment analysis has been applied to measure customer sentiment from product reviews. Natural language understanding (NLU) has been used to understand human conversations, deciphering the intent of a communication and the entities involved in it. Natural language generation (NLG) has been used to create human-like conversational replies. NLU and NLG are the foundations for designing conversational AI systems like chatbots and virtual agents. NLP has been deployed in various fields such as customer care, health care, finance and transportation.

In spite of the many real-world applications of NLP, it is quite surprising that NLP/ML techniques have not been applied extensively to software development. There has been some research in this direction, but little of it has made its way into real-world deployments. With the advent of deep learning, however, NLP/ML techniques are now being explored in the area of software development. We will cover this topic over the next few columns, discussing how NLP/ML can be applied to: (a) programming, (b) compilers, and (c) software tools and applications.

Before we get into this topic, it would be good to define and differentiate two terms that confuse many laypersons: 'Learn to Compile' and 'Compile to Learn'. 'Compile to Learn' is all about how specialised/targeted compiler techniques can be used to improve the performance of NLP/ML applications. For instance, TensorFlow has developed a special compiler optimiser called XLA that targets the linear algebra routines which are often at the heart of deep learning (DL) applications (a minimal sketch of its use follows below). This is a case of 'Compile to Learn', where compiler techniques are used to improve ML applications. 'Learn to Compile' is completely different: here, the key focus is on how we can use NLP/ML techniques to improve programming and compilers. For example, researchers have used NLP/ML techniques to perform improved loop optimisations in compilers. These techniques are part of 'Learn to Compile', as they are used in compilers to improve the performance of all kinds of software applications. It is important to understand the difference between these two terms. In the next couple of columns, we will focus on the 'Learn to Compile' topic.
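As a concrete illustration of 'Compile to Learn', here is a minimal sketch that asks TensorFlow to compile a small function with XLA. The jit_compile flag is standard TensorFlow 2.x API; the function itself is an invented toy example.

import tensorflow as tf

# Requesting XLA compilation of the traced graph: XLA fuses the linear
# algebra operations below into optimised kernels.
@tf.function(jit_compile=True)
def dense_layer(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal((32, 128))
w = tf.random.normal((128, 64))
b = tf.zeros((64,))
y = dense_layer(x, w, b)  # the first call triggers XLA compilation
print(y.shape)            # (32, 64)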

We are all familiar with the term 'Big Data'. A related concept is 'Big Code'. In the last decade, a large amount of source code in different programming languages has become publicly available, thanks to open source development as well as the emergence of source code repositories such as GitHub. This has opened up large data sets of source code to which machine learning techniques can be applied. Discussions related to code and software development in community question answering forums such as Stack Overflow have also made both code and text about code artefacts available for data mining. 'Big Code' is the idea of applying ML and NLP techniques to these huge volumes of publicly available source code and related textual artefacts such as documentation, code related discussions and comments.

The first question we need to answer is the following. NLP/ML techniques have been applied to natural language text, i.e., text produced by humans in different languages. With 'Big Code', however, we are interested in applying these techniques to source code in high level languages such as C/C++/Python/Java. How directly applicable are NLP/ML techniques to source code? To answer that question, we first need to identify the similarities and differences between source code and natural language. In other words, 'how natural' is source code in terms of human language characteristics?

Let us recall that NLP techniques are built on the foundation of the distributional hypothesis: the meaning of a word is inferred from the context in which it occurs. This is captured by Firth's famous dictum that "a word is characterised by the company it keeps." In lay terms, words that are used in similar contexts have similar meanings. An example is given below.

S1: I went with my mother for her medical consultation.
S2: I accompanied my mother for her medical consultation.

Since the contexts in S1 and S2 are similar, we can infer that the words 'accompanied' and 'went with' have similar meanings. The distributional hypothesis forms the basis for the contextual embeddings and contextual representations that are at the core of NLP, namely word embeddings, sentence embeddings, etc. Given this significance, we need to know whether the distributional hypothesis would hold for source code as well.
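To see the distributional hypothesis in action, here is a toy sketch in Python; the two-sentence corpus (S1 and S2 above), the window size and the whitespace tokenisation are all simplifying assumptions made purely for illustration.

import numpy as np
from collections import defaultdict

corpus = [
    "i went with my mother for her medical consultation",
    "i accompanied my mother for her medical consultation",
]
window = 2  # how many neighbours on each side count as 'context'

vocab = sorted({w for s in corpus for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}
vectors = defaultdict(lambda: np.zeros(len(vocab)))

# Build co-occurrence vectors: for each word, count the words that
# appear within the context window around it.
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                vectors[w][index[words[j]]] += 1

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 'went' and 'accompanied' occur in similar contexts, so their
# co-occurrence vectors have a high cosine similarity.
print(cosine(vectors["went"], vectors["accompanied"]))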

The key difference between natural language text and source code is that natural language is used for communication between two humans, whereas source code serves as a medium of communication on two different channels. The first channel is between the software developer who writes the source code and the other developers who read/maintain/modify that source code; this is human to human communication, just like natural language text. The second channel is between the software writer and the machine on which the code is intended to be executed, since source code also specifies the computations that a computer must perform. Understanding the meaning of, and inferring intent from, source code on the human to human channel is similar in spirit to doing so for natural language text; hence source code exhibits the statistical properties on which standard NLP techniques rely. At the same time, because source code must communicate instructions to a machine, it uses a more precise, non-ambiguous form of expression and is therefore less 'surprising' than human text. Together, these two properties make it feasible to analyse source code with the same set of NLP techniques that are applicable to natural language text.

If classical NLP techniques can be applied to source code, how do we model source code? Traditional methods of modelling natural language text include n-gram language models, where the next word is predicted from its preceding context. In the case of source code, we can identify three different types of models. Representational models of source code are similar to traditional language models: they can be used to predict tokens/code sequences and properties of source code, as the sketch below shows.
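Here is a minimal sketch of such a representational model: a trigram model over code tokens that predicts the most likely next token. The tiny token 'corpus' is invented purely for illustration.

from collections import Counter, defaultdict

# Two token streams standing in for a 'Big Code' training corpus.
code_tokens = (
    "for i in range ( n ) : total = total + i".split()
    + "for j in range ( m ) : total = total + j".split()
)

# Count, for every two-token context, the tokens that follow it.
counts = defaultdict(Counter)
for a, b, c in zip(code_tokens, code_tokens[1:], code_tokens[2:]):
    counts[(a, b)][c] += 1

def predict(prev2, prev1):
    # Return the most frequent token seen after the (prev2, prev1) context.
    following = counts[(prev2, prev1)]
    return following.most_common(1)[0][0] if following else None

print(predict("in", "range"))  # '('
print(predict("total", "="))   # 'total'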

Code generation models of source code are used to generate new source code, taking as input either natural language text, source code in some other programming language, or a combination of both. Consider a tool that can automatically translate C code to Python code, or a tool that, given the natural language description of a function, can generate the source code for that function in a high-level programming language. These are based on code generation models of source code; a hedged sketch follows.
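As a hedged sketch of the second kind of tool, the snippet below prompts a pre-trained code language model with a natural language description and lets it complete the function body. The specific checkpoint named here (Salesforce/codegen-350M-mono on the Hugging Face hub) is simply one publicly available code model chosen for illustration, not something prescribed by this column.

from transformers import pipeline

# Load a pre-trained code generation model from the Hugging Face hub.
generator = pipeline("text-generation", model="Salesforce/codegen-350M-mono")

# A natural language description followed by a function signature; the
# model generates the body of the function.
prompt = "# Return the factorial of n\ndef factorial(n):"
completion = generator(prompt, max_new_tokens=40)[0]["generated_text"]
print(completion)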

Pattern mining models of source code are used to infer latent properties/structure within the source code. They are similar to clustering techniques applied to natural language text in order to mine underlying patterns. Such models can be used in source code analysis tools, bug prediction tools, etc. A minimal sketch is given below.
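Here is a minimal sketch of the pattern mining idea: vectorise code snippets by their token statistics and cluster them, exactly as one would cluster natural language documents. The snippets and the choice of two clusters are illustrative assumptions.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

snippets = [
    "for i in range(n): total += a[i]",
    "for j in range(m): s += b[j]",
    "try: f = open(path)\nexcept IOError: pass",
    "try: g = open(name)\nexcept IOError: pass",
]

# Represent each snippet as a TF-IDF vector over its tokens, then group
# the snippets into two clusters.
vectors = TfidfVectorizer().fit_transform(snippets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# The two loop snippets and the two file-handling snippets fall into
# separate clusters.
print(labels)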

We can also classify source code models based on the type of input and output. The input can be source code in one high level programming language and the output source code in another; a trans-compiler that translates source code from one PL to another is an example of this. Or, the input can be natural language text and the output source code. The input can also be a combination of natural language text and source code, with source code as the output; a tool that automatically generates source code from specifications written in natural language text, along with an optional code skeleton, is an example. Finally, the input can be source code and the output natural language text; a tool that automatically creates comments/documentation for source code is an example of this.

We can also differentiate source code models based on the type of machine learning technique used: a model can be based on supervised learning, unsupervised learning or reinforcement learning. We can have DL models of source code that are sequence-to-sequence models, transformers or graph neural networks. Different types of source code models are used for different applications such as source code analysis tools, compiler optimisations and automatic code generation. We will see examples of each of these types of models in next month's column.

There is an excellent paper that discusses how NLP/ML techniques can be used for source code modelling. Titled 'A survey of machine learning for big code and naturalness', it is available at https://arxiv.org/pdf/1709.06182. I would encourage our readers to take a look at this paper.

Feel free to reach out to me over LinkedIn/email if you need any help in your coding interview preparation. If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Wishing all our readers happy coding until next month! Stay healthy and stay safe.

Sandya Mannarswamy
