

University of Massachusetts


Statistical Analysis of Computer Programs


Billions of lines of source code have been written, many of which are freely available on the Internet. This code contains a wealth of implicit knowledge about how to write software that is easy to read, avoids common bugs, and uses popular libraries effectively.

We want to extract this implicit knowledge by analyzing source code text. To do this, we employ the same tools from machine learning and natural language processing that have been applied successfully to natural language text. After all, source code is also a means of human communication.

We present three new software engineering tools inspired by this insight:

  • Naturalize, a system that learns local coding conventions. It proposes revisions to names and to formatting so as to make code more consistent;

  • TASSAL, a system that summarizes code by automatically folding the blocks that are least informative according to a topic model;

  • HAGGIS, a system that learns locally recurring syntactic patterns, which we call idioms. HAGGIS accomplishes this using a nonparametric Bayesian tree substitution grammar, and is delicious with whisky sauce.
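The common thread behind these tools is scoring code with a statistical model trained on a corpus. As a minimal sketch of that idea (a toy illustration only, not the actual implementation of Naturalize), one can rank candidate identifier names by how probable the surrounding token sequence is under a simple bigram language model trained on existing code:

```python
from collections import defaultdict

def train_bigram(corpus_tokens):
    """Count bigram and unigram frequencies over a code token stream."""
    bigrams = defaultdict(int)
    unigrams = defaultdict(int)
    for a, b in zip(corpus_tokens, corpus_tokens[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1
    return bigrams, unigrams

def score(tokens, bigrams, unigrams, alpha=1.0):
    """Add-alpha smoothed product of bigram probabilities."""
    p = 1.0
    vocab = len(unigrams) or 1
    for a, b in zip(tokens, tokens[1:]):
        p *= (bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab)
    return p

# A tiny "corpus" in which loop counters are conventionally named i.
corpus = "for i in range ( n ) : total += i".split()
bigrams, unigrams = train_bigram(corpus)

# Rank two candidate namings of the loop variable by model score:
# the conventional name scores higher than the unusual one.
conventional = "for i in range ( n )".split()
unusual = "for q in range ( n )".split()
print(score(conventional, bigrams, unigrams) >
      score(unusual, bigrams, unigrams))  # → True
```

A real system would use a far larger corpus and a stronger model, and would only suggest a renaming when the score gap is large; but the ranking step looks essentially like this.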


Charles Sutton is a Reader (equivalent to Associate Professor) at the University of Edinburgh. He is interested in a broad range of applications of probabilistic machine learning, including NLP, analysis of computer systems, software engineering, sustainable energy, and exploratory data analysis.

Dr Sutton completed his PhD in 2008 at the University of Massachusetts Amherst, working with Andrew McCallum. He did postdoctoral research at the University of California, Berkeley, working with Michael I Jordan.

He is Deputy Director of the EPSRC Centre for Doctoral Training in Data Science at the University of Edinburgh.

Page last modified on December 20, 2014, at 11:29 AM