Multi-token code suggestions using statistical language models

Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
DOI
10.7287/peerj.preprints.1597v1
Subject Areas
Data Mining and Machine Learning, Natural Language and Speech, Software Engineering
Keywords
naturalness, ngram, language models, atom text editor, code suggestion, code prediction, nlp
Copyright
© 2015 Santos et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Santos EA, Hindle A. 2015. Multi-token code suggestions using statistical language models. PeerJ PrePrints 3:e1597v1

Abstract

We present an application of the naturalness of software to provide multi-token code suggestions in GitHub's Atom text editor. We extended a simple n-gram prediction model using the "mean surprise" metric: the arithmetic mean of the surprisal of several successive single-token predictions. After an error-fraught evaluation, there is not enough evidence to conclude that our tool, Gamboge, significantly improves programmer productivity. We conclude by discussing several directions for future research in code suggestion and other applications of naturalness.
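The "mean surprise" of a multi-token suggestion can be sketched as follows. This is an illustrative reconstruction from the definition above, not the paper's implementation; the function names and example probabilities are assumptions.

```python
import math

def surprisal(probability):
    """Surprisal (self-information) of one predicted token, in bits."""
    return -math.log2(probability)

def mean_surprisal(token_probabilities):
    """Arithmetic mean of the surprisal of successive single-token
    predictions, as in the "mean surprise" metric described above."""
    return sum(surprisal(p) for p in token_probabilities) / len(token_probabilities)

# Hypothetical per-token probabilities an n-gram model might assign
# to a three-token suggestion; lower mean surprisal = more "natural".
probs = [0.5, 0.25, 0.125]
print(mean_surprisal(probs))  # (1 + 2 + 3) / 3 = 2.0 bits
```

A suggestion engine could rank candidate multi-token completions by ascending mean surprisal, preferring the sequences the language model finds least surprising.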

Author Comment

This is the paper submitted to my supervisor as part of my undergraduate directed studies. It is fraught with errors and rife with informal, non-academic language. That said, we believe the content to be informative regardless, especially the use of "mean surprise" and the numerous applications of NLP to software ("naturalness of software") that we have listed.