OpenSource For You

CODE SPORT

In this month’s column, we continue our discussion on natural language processing.

- Sandya Mannarswamy

For the past few months, we have been discussing information retrieval and natural language processing, as well as the algorithms associated with them. This month, we continue our discussion on natural language processing (NLP) and look at how NLP can be applied in the field of software engineering. Given one or more text documents, NLP techniques can be applied to extract information from them. The software engineering (SE) lifecycle gives rise to a number of textual documents, to which NLP can be applied.

So what are the software artifacts that arise in SE? During the requirements phase, a requirements document is an important textual artifact. This specifies the expected behaviour of the software product being designed, in terms of its functionality, user interface, performance, etc. It is important that the requirements being specified are clear and unambiguous, since during product delivery, customers would like to confirm that the delivered product meets all their specified requirements.

Having vague, ambiguous requirements can hamper requirement verification. So text analysis techniques can be applied to the requirements document to determine whether there are any ambiguous or vague statements. For instance, consider a statement like, “Servicing of user requests should be fast, and request waiting time should be low.” This statement is ambiguous since it is not clear what exactly the customer’s expectations of ‘fast service’ or ‘low waiting time’ may be. NLP tools can detect such ambiguous requirements. It is also important that there are no logical inconsistencies in the requirements. For instance, the requirements that “Login names should allow a maximum of 16 characters,” and that “The login database will have a field for login names which is 8 characters wide,” conflict with each other. While the user interface allows up to a maximum of 16 characters, the backend login database supports fewer characters, which is inconsistent with the earlier requirement. Though such inconsistent requirements are currently flagged by human inspection, it is possible to design text analysis tools to detect them.
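As an illustrative sketch of how such ambiguity detection might start, consider a simple keyword-based checker. The list of vague terms below is my own assumption for illustration; real tools would use much richer linguistic analysis:

```python
import re

# Illustrative (not exhaustive) list of terms that often signal vague requirements.
VAGUE_TERMS = {"fast", "slow", "low", "high", "efficient", "user-friendly", "adequate"}

def find_vague_terms(requirement: str) -> list[str]:
    """Return the vague terms found in a requirement sentence."""
    words = re.findall(r"[a-z-]+", requirement.lower())
    return [w for w in words if w in VAGUE_TERMS]

req = "Servicing of user requests should be fast, and request waiting time should be low."
print(find_vague_terms(req))  # ['fast', 'low']
```

A real requirements checker would also need part-of-speech and context analysis, since a word like ‘high’ may be perfectly precise in some sentences (‘the high byte’), but the sketch conveys the basic idea.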

The software design phase also produces a number of SE artifacts such as the design document, design models in the form of UML documents, etc, which can also be mined for information. Design documents can be analysed to generate automatic test cases in order to test the final product. During the development and maintenance phases, a number of textual artifacts are generated. Source code itself can be considered a textual document. Apart from source code, source code control system logs such as SVN/Git logs, Bugzilla defect reports, developers’ mailing lists, field reports, crash reports, etc, are the various SE artifacts to which text mining can be applied.

Various types of text analysis techniques can be applied to SE artifacts. One popular method is duplicate or similar document detection. This technique can be applied to detect duplicate bug reports in bug tracking systems. A variation of this technique can be applied to detect code clones and copy-and-paste snippets.
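As a rough sketch of duplicate bug report detection, one can compare the bag-of-words vectors of two reports using cosine similarity, and flag pairs scoring above a threshold as potential duplicates. The bug texts below are invented for illustration:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

bug1 = "application crashes when saving a file to disk"
bug2 = "app crashes on saving file to disk"    # likely duplicate of bug1
bug3 = "login page shows wrong error message"  # unrelated report
print(cosine_similarity(bug1, bug2) > cosine_similarity(bug1, bug3))  # True
```

Production systems typically weight terms by TF-IDF and normalise near-synonyms (‘app’ versus ‘application’), but the underlying similarity computation is much the same.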

Automatic summarisation is another popular technique in NLP. These techniques try to generate a summary of a given document by looking for the key points contained in it. There are two approaches to automatic summarisation. One is known as ‘extractive summarisation’, in which key phrases and sentences are extracted from the given document and put together to provide a summary of it. The other is ‘abstractive summarisation’, which builds an internal semantic representation of the given document, from which key concepts are extracted, and a summary is generated using natural language understanding.

The abstractive summarisation technique is close to how humans would summarise a given document. Typically, we would proceed by building a knowledge representation of the document in our minds and then using our own words to provide a summary of the key concepts. Abstractive summarisation is obviously more complex than extractive summarisation, but yields better summaries.
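A minimal sketch of extractive summarisation scores each sentence by the average document-wide frequency of its words and returns the top-scoring sentences. The scoring scheme here is a deliberately simple assumption; real summarisers use far more sophisticated sentence-ranking features:

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 1) -> str:
    """Pick the n top-scoring sentences as an extractive summary,
    scoring each sentence by the average corpus frequency of its words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(s):
        toks = re.findall(r"\w+", s.lower())
        return sum(freq[t] for t in toks) / (len(toks) or 1)
    chosen = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in chosen)

text = "NLP is useful. NLP is useful for software engineering. The weather is nice."
print(extractive_summary(text))  # NLP is useful.
```

Frequency-based sentence scoring of this kind dates back to the earliest summarisation systems; abstractive summarisers, by contrast, cannot be sketched this briefly, since they require a full language-understanding and generation pipeline.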

Coming to SE artifacts, automatic summarisation techniques can be applied to generate summaries of large bug reports. They can also be applied to generate high level comments for the methods contained in source code. In this case, each method can be treated as an independent document, and the high level comment associated with that method or function is nothing but a short summary of the method.

Another popular text analysis technique involves the use of language models, which enable us to predict what the next word in a particular sentence will be. This technique is typically used on optical character recognition (OCR) generated documents, where, due to OCR errors, the next word is not visible or gets lost, and hence the tool needs to make a best estimate of the word that may appear there. A similar need also arises in the case of speech recognition systems. In the case of poor speech quality, when a sentence is being transcribed by the speech recognition tool, a particular word may not be clear or could get lost in transmission. In such a case, the tool needs to predict what the missing word is and add it automatically.
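A word-level bigram model captures the essence of this prediction: count which words follow each word in a training corpus, and predict the most frequent follower. The corpus below is an invented toy example:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus: list[str]) -> dict:
    """For each word, count the words that follow it in the corpus."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model: dict, word: str) -> str:
    """Return the word most frequently seen after `word`."""
    followers = model.get(word.lower())
    return followers.most_common(1)[0][0] if followers else ""

corpus = [
    "the build failed on the test server",
    "the build passed on the staging server",
    "the build failed again",
]
model = train_bigram_model(corpus)
print(predict_next(model, "build"))  # failed
```

Real language models condition on longer histories (n-grams, or neural context windows) and smooth the counts to handle unseen word pairs, but the predict-the-likeliest-continuation idea is the same.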

Language modelling techniques can also be applied in integrated development environments (IDEs) to provide ‘auto-completion’ suggestions to developers. Note that in this case, the source code itself is being treated as text and analysed.

Classifying a set of documents into specific categories is another well-known text analysis technique. Consider a large number of news articles that need to be categorised based on their topic or genre, such as politics, business, sports, etc. A number of well-known text analysis techniques are available for document classification. Document classification techniques can also be applied to defect reports in SE, to determine the category to which a defect belongs. For instance, security related bug reports need to be prioritised. While people currently inspect bug reports, or search for specific key words in a bug category field in Bugzilla reports, in order to classify them, more robust and automated techniques are needed to classify defect reports in large scale open source projects. Text analysis techniques for document classification can be employed in such cases.
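As a sketch, a minimal multinomial Naive Bayes classifier (with add-one smoothing) can separate, say, security related reports from user-interface ones. The training reports and categories below are invented for illustration:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes text classifier with add-one smoothing."""
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        def log_prob(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            lp = math.log(self.label_counts[label] / sum(self.label_counts.values()))
            for w in doc.lower().split():
                lp += math.log((counts[w] + 1) / (total + len(self.vocab)))
            return lp
        return max(self.label_counts, key=log_prob)

docs = [
    "buffer overflow allows remote code execution",
    "privilege escalation via unchecked input",
    "button label is misaligned on the settings page",
    "font size too small in the about dialog",
]
labels = ["security", "security", "ui", "ui"]
clf = NaiveBayes().fit(docs, labels)
print(clf.predict("overflow in input parser allows code execution"))  # security
```

With only four training reports this is a toy, but the same model trained on a project’s labelled defect history is a common baseline for automatic bug triage.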

Another important need in the SE lifecycle is to trace source code back to its origin in the requirements document. If a feature ‘X’ is present in the source code, what is the requirement ‘Y’ in the requirements document which necessitated the development of this feature? This is known as traceability of source code to requirements. As source code evolves over time, maintaining traceability links automatically through tools is essential for scaling to large software projects. Text analysis techniques can be employed to connect a particular requirement from the requirements document to a feature in the source code, and hence automatically generate the traceability links.
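One simple (assumed) way to sketch traceability link recovery is to split source code identifiers into their constituent terms and match them against the words of each requirement. The requirement text and function names below are hypothetical:

```python
import re

def split_identifier(name: str) -> set[str]:
    """Split a camelCase or snake_case identifier into lowercase terms."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", name).replace("_", " ")
    return set(spaced.lower().split())

def trace_requirement(requirement: str, functions: list[str]) -> str:
    """Return the function whose identifier terms overlap most with the requirement."""
    req_terms = set(re.findall(r"\w+", requirement.lower()))
    return max(functions, key=lambda f: len(split_identifier(f) & req_terms))

req = "The system shall validate the user password at login"
funcs = ["validatePassword", "render_settings_page", "exportReport"]
print(trace_requirement(req, funcs))  # validatePassword
```

Research tools in this area typically use information retrieval models (vector space or topic models) over both the requirements and the code, rather than raw term overlap, but identifier splitting of this kind is a standard preprocessing step.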

We have now covered automatic summarisation techniques for generating summaries of bug reports and generating header level comments for methods. Another possible use for such techniques on SE artifacts is to enable the automatic generation of the user documentation associated with a software project. A number of text mining techniques have been employed to mine ‘Stack Overflow’ posts and developer mailing lists to generate automatic user documentation or FAQ documents for different software projects.

Just as inconsistency detection techniques can be applied to the requirements document, they can also be applied to source code comments. It is a general expectation that source code comments express the programmer’s intent. Hence, the code written by the developer and the comment associated with that piece of code should be consistent with each other. Consider the simple code sample shown below:
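A minimal Python sketch of such a comment/code inconsistency, built around the ‘instance_lock’ and ‘reset_hardware’ identifiers named in the text (the surrounding function is my own illustration), might look like this:

```python
import threading

instance_lock = threading.Lock()

def reset_hardware():
    """Reset the device state (stub for illustration)."""
    pass

def restart_device():
    # instance_lock must be held before calling reset_hardware().
    reset_hardware()  # BUG: the lock is never acquired, so code and comment disagree
```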

In the above code snippet, the developer has expressed, through a code comment, the intention that ‘instance_lock’ must be held before the function ‘reset_hardware’ is called. However, in the actual source code, the lock is not acquired before the call to ‘reset_hardware’ is made. This is a logical inconsistency, which can arise either because: (a) the comment is outdated with respect to the source code; or (b) the code is incorrect. Hence, flagging such errors is useful to the developer, who can fix either the comment or the code, depending on which is wrong.

My ‘must-read book’ for this month

This month’s book suggestion comes from one of our readers, Sharada, and her recommendation is very appropriate to the current column. She recommends an excellent resource for natural language processing: a book called ‘Speech and Language Processing: An Introduction to Natural Language Processing’ by Jurafsky and Martin. The book describes different algorithms for NLP techniques and can be used as an introduction to the subject. Thank you, Sharada, for your valuable recommendation.

If you have a favourite programming book or article that you think is a must-read for every programmer, please do send me a note with the book’s name, and a short write-up on why you think it is useful so I can mention it in the column. This would help many readers who want to improve their software skills.

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!

