SourceCodeAI — Deep Learning for Source Code — Why and How
Yet another NLP?
Source code AI has recently become a popular topic; more than ever before, companies are investing effort in that direction, trying to leverage it for their needs. The motivations are quite clear. First, since AI applications benefit from domain expertise, what is better than having your whole R&D organisation be experts in the field of interest? Second, given the many recent NLP breakthroughs, it seems such techniques can easily be applied to every textual domain out there, source code included. Summing it up, applying deep learning NLP techniques to source code looks like a no-brainer: just take one of the many pre-trained language model Transformers and use it for the task at hand. What could go wrong? But the reality, as usual, is more complicated. There is still a mile to go before deep learning can be applied successfully to source code, mainly because of the domain's many unique characteristics, which make naive transfer far less effective. Let's take a deeper look at a few of the most significant challenges of the source code domain.
Unique dictionary
While Python evangelists like to claim that 'reading a good Python program feels almost like reading English', the truth is that Python (and essentially every other programming language) differs in the tokens it is built from: there are two main token types, user-defined tokens (variable names, function names, etc.) and language built-ins (keywords like 'def', functions like 'len', or characters like '='). Neither will be correctly recognized by common deep learning NLP models, which are usually trained on regular English corpora, leading to many out-of-vocabulary scenarios that are known to significantly hurt such models' performance. A solution could be to use language models that are specialised for (specifically trained on) the source code domain, like code2vec, and later fine-tune them for the problem of interest. The next challenge is how to make them multi-lingual.
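To see the effect in practice, here is a minimal sketch (assuming the Hugging Face transformers library and a generic English BERT tokenizer; the example strings are made up) comparing how an English-trained tokenizer handles prose versus code:

```python
# Minimal sketch: compare how a general-purpose English tokenizer fragments
# source code identifiers versus plain English. Heavy fragmentation is a rough
# proxy for the out-of-vocabulary problem described above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

english = "reading a good program feels almost like reading english"
code = "def get_user_name(user_id): return db.fetch_one(user_id).name"

print(tokenizer.tokenize(english))  # mostly whole, in-vocabulary words
print(tokenizer.tokenize(code))     # identifiers shatter into many sub-pieces
```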
Unique dictionary per language
While in the general NLP world an English-based model can be enough for many applications (sentiment analysis, summarization, etc., given the dominance of English), for source code we will commonly prefer multi-lingual models that solve the same problem (say, defect detection) across a wide range of languages at once (not only Python, but also JavaScript, Java and others at the same time). This is a main difference between the academic and the commercial worlds: when publishing a paper, the motivation can be to prove a new concept, so targeting a single source code language may be enough, while in production we want to offer customers a solution with a minimal set of limitations, supporting the widest possible range of languages. Microsoft's CodeBERT and Salesforce's CodeT5 are examples in that direction, deliberately training multi-lingual language models (each supporting roughly six languages). The first issue with such solutions is that their language-specific sub-models are consistently better than the general ones (just try summarising a Python snippet with the general CodeT5 model and with the Python-optimised one). Another, more inherent, issue is that supporting around six languages is only a drop in the ocean; a quick look at GitHub's Linguist list of supported languages makes it clear this is not enough. And even if we optimistically assume such models transfer seamlessly to similar languages (say, C and C++, given that CodeBERT supports Go, which is assumed to be fairly similar), what about languages like YAML, XML and Clojure, whose syntax is so different that it is fair to assume the transfer won't hold? The solution can be to embrace less general language models, ones that are more optimised towards the problem of interest. The next challenge is how to provide the prediction context the model requires.
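As a hedged sketch of what using such a multi-lingual model looks like (the checkpoint name below is an assumption; swap in whichever general or language-specific CodeT5 checkpoint you actually use), summarising a snippet with Hugging Face transformers takes only a few lines:

```python
# Sketch: summarise a code snippet with a multi-lingual CodeT5 checkpoint.
# The checkpoint name is assumed; compare its output with a language-specific
# variant to see the gap discussed above.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5-base-multi-sum"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer(snippet, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=24)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```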
Sparse context
Unlike regular texts, which are meant to be read from beginning to end, code is more dynamic, more like a tower of Lego bricks, meant to be compiled and evaluated per case. Consider, for example, a simple object-oriented program with an inheritance structure: an abstract base class (Person) exposing an interface (print the person's job title), a set of implementations (a different print for employees and managers), and a function that applies the interface (print title) to an input list of base (Person) objects. Let's assume we want to train a model to summarise such functions. How should our model understand the program we just described? The required context (the print-title implementations) will most likely live somewhere else, far away in the class hierarchy or even in a different file or project. And even if we use a huge Transformer that sees a lot of context, the relevant context can sit in entirely different places, so the local context may not be enough for the problem we're trying to solve (and best practices like encapsulation and code reuse only increase the likelihood of this context sparsity). A solution could be to compile and merge all the relevant code snippets before applying the model. The next challenge is how to take the program's dynamic state into account.
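For concreteness, here is a minimal Python version of the program described above; summarising print_titles requires knowing what the concrete print_title implementations do, and that context may live in other classes or files:

```python
# The inheritance example from the text: the context a model needs in order to
# summarise print_titles() lives in the concrete subclasses, which may sit far
# away in the codebase.
from abc import ABC, abstractmethod


class Person(ABC):
    @abstractmethod
    def print_title(self) -> None:
        ...


class Employee(Person):
    def print_title(self) -> None:
        print("Employee")


class Manager(Person):
    def print_title(self) -> None:
        print("Manager")


def print_titles(people: list[Person]) -> None:
    # What does this function "do"? The answer depends on implementations
    # defined elsewhere.
    for person in people:
        person.print_title()


print_titles([Employee(), Manager()])
```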
Program state
Unlike texts, which have the same outcome no matter how we read them (probably excluding DnD quests), in code the outcome depends on the specific input we provide. Say we want to identify null-pointer scenarios. They can be static, always happening regardless of the input, caused by poor coding and therefore identifiable by just reading the code (which, by the way, is why static code analysis tools are quite good at finding these cases). But they can also be dynamic, caused by bad input combined with missing validations, and therefore much harder to spot by reading the code alone. The missing ingredient is the data flow graph: by understanding how data propagates through the program, a model can identify when specific code paths combined with specific data conditions become problematic. Such a view can be produced with tools like GitHub's CodeQL, which analyses the program's data flow. The issue is that this is not an easy task, especially given the multi-lingual requirement (CodeQL, for example, supports only around seven source code languages). At the same time, it can be the secret sauce behind a magically working solution. A classic matter of trade-off.
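A small illustrative sketch of the static versus dynamic distinction, with Python's None standing in for a null pointer (db.find is a hypothetical lookup that may return None):

```python
# Static vs. dynamic "null pointer" failures, using Python's None.

def static_failure():
    user = None
    return user.name  # always fails: visible to a purely static reader


def dynamic_failure(user_id, db):
    # db.find is a hypothetical lookup that may return None for unknown ids.
    user = db.find(user_id)
    # Fails only for specific inputs; tracking the flow input -> user ->
    # attribute access is what exposes the risk.
    return user.name
```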
Summing it up, the source code domain is quite challenging. At the same time, we should consider the 'naturalness hypothesis' of Allamanis et al., which argues that 'Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools'. NLP algorithms should be capable of handling source code tasks; it's up to us to make sure they see those tasks in the right way, enabling them to deal with them properly. So how do we successfully apply deep learning NLP techniques to this domain?
Problem understanding
The first common solution (which is important in general) is to properly understand the problem domain before trying to solve it. What are we trying to achieve? What are the top business requirements? What are the technical limitations? If multilingualism is not mandatory (for example, when specifically targeting Android or JavaScript apps), then relaxing that requirement simplifies development considerably, allowing us to use language-specific pre-trained models or even to train simple ones on our own. Concentrating on a specific language lets us use regular NLP methods, purposely overfitting to the target source code language. By properly understanding the problem domain, we can make simplifications that let common best practices solve our needs: a transition from super-complicated cross-lingual problems towards more standard NLP tasks. This by itself can be enough to ease the development cycles.
Input scope
Properly understanding the domain we're trying to solve is also important for choosing the right input. If we're trying to summarise a standalone function, then the function body may be enough. The same may hold when the function calls external functions with self-explanatory names. If that is not the case, we may consider providing the whole class as input, or at least the implementations of the relevant called functions. If our target is more flow-oriented, such as identifying areas at risk of SQL injection, then it makes sense to look not only at the specific ORM implementations (for example, the Hibernate code that interacts with the database) but also at the snippets leading to that path. The main issue is that this approach shifts the complexity from the model we are training to the supporting modules, the ones responsible for collecting all the relevant parts of the scope. We are only as good as those supporting modules are, which adds yet another requirement to our ecosystem.
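A minimal sketch of such a supporting module (the function and its behaviour are illustrative, not a real API): given a module's source and a target function name, collect the target's code plus the bodies of the locally defined functions it calls, using Python's ast module:

```python
# Illustrative "scope collector": return the target function's source together
# with the source of locally defined functions it calls (transitively).
import ast


def collect_scope(source: str, target: str) -> str:
    tree = ast.parse(source)
    functions = {
        node.name: node
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }
    collected, queue, seen = [], [target], set()
    while queue:
        name = queue.pop()
        if name in seen or name not in functions:
            continue
        seen.add(name)
        node = functions[name]
        collected.append(ast.get_source_segment(source, node))
        # Enqueue locally defined functions that this one calls.
        for sub in ast.walk(node):
            if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                queue.append(sub.func.id)
    return "\n\n".join(s for s in collected if s)


module = '''
def helper(x):
    return x * 2

def main(x):
    return helper(x) + 1
'''
print(collect_scope(module, "main"))  # main's body plus helper's body
```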
Data modeling
Eagle-eyed readers may have noticed that the main issues we presented concern the data, not the models themselves. The algorithms are just fine; the data they receive is the main issue. As before, by properly understanding the problem one can choose a data representation that better fits the need. Source code ASTs are commonly used to gain higher-level visibility into function interactions. Data flow graphs are commonly used to track how data propagates through the program (important for detecting scenarios like null pointers or SQL injection). Looking at system calls or opcodes instead of the plain code can implicitly enable training multi-lingual models (using a general interface that can be shared across languages). Code embeddings like code2vec enable a high-level understanding of a snippet; moreover, some of these embedding models are already multi-lingual, answering that need at the same time. In some cases we may choose to train a multi-lingual model on our own; then it's important to verify that all the relevant sub-populations are represented (keep in mind that sampling a code dataset, and specifically relying on GitHub for it, is not trivial; naive approaches can easily lead to serious hidden population biases).

Treating code as plain text can work for the simpler tasks. One of its main building blocks is the choice of input level: words, sub-words or even characters, ranging from the most language-specific (words) to the most general (characters). Spiral is an example of an open source project that tries to make word-based tokenization more general by normalizing different languages' coding styles (like camel-case naming); a small sketch of that idea follows below. Sub-words are a trade-off between character tokenization (where the model has to learn the dictionary words by itself) and word tokenization (where words are already the input), driven by a similar motivation: generating a corpus-specific vocabulary while keeping out-of-vocabulary scenarios rare (BPE and WordPiece are relevant example implementations). In some cases we may choose to keep only the parts of the code input that are most relevant to our case (like keeping only function names, or ignoring the language's built-in tokens). The issue is that this requires lexer-level understanding, which again adds another error-prone ingredient to our ecosystem.
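As a small sketch of the Spiral-style normalization mentioned above (a simple regex-based splitter, not the actual Spiral implementation):

```python
# Split camelCase, PascalCase and snake_case identifiers into lower-cased
# sub-tokens, so coding-style differences matter less across languages.
import re


def split_identifier(name: str) -> list[str]:
    parts = []
    for chunk in name.split("_"):
        # Uppercase runs (acronyms), capitalized words, lowercase runs, digits.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts if p]


print(split_identifier("getUserName"))      # ['get', 'user', 'name']
print(split_identifier("HTTPResponse"))     # ['http', 'response']
print(split_identifier("max_retry_count"))  # ['max', 'retry', 'count']
```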
What to watch ahead
The most obvious direction the software world will take is towards more language-agnostic source code models. This is being done by making the relevant models more specific to the problem domain, leveraging features that are unique to source code (like program and data flow), and by relying on less general Transformer implementations that don't treat source code as just another NLP domain but handle it in a more specialized way, taking the domain's characteristics into account (interestingly, this resembles the way image and NLP deep learning best practices are applied to the audio domain: not as-is, but by incorporating audio-specific concepts and tuning the architectures to better fit it). At the same time, since speed can be critical for source code applications (for example, when they are part of CI/CD systems with resource and latency limits), we will most likely see more and more lightweight implementations, whether lightweight Transformers or more general NLP architectures. Finally, as the source code domain requires its own labelling (the general NLP datasets are less relevant), and with the understanding that merely sampling GitHub is not enough, we will likely see more and more efforts towards generating labelled source code datasets. Exciting days are ahead.