Automatically classifying source code using tree-based approaches

Article ID	Journal	Published Year	Pages	File Type
6853933	Data & Knowledge Engineering	2018	14 Pages	PDF

Abstract

We survey machine learning algorithms applied to different program representations, including software metrics, sequences, and tree structures. The approaches are evaluated by classifying 52,000 programs written in C into 104 target labels. The experiments show that tree-based classifiers substantially outperform metrics-based and sequence-based ones, and that the two proposed models, TBCNN + SVM and TBCNN + kNN, rank first and second among all classifiers. Pruning redundant abstract syntax tree (AST) branches not only reduces execution time substantially but also increases accuracy.
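
To make the pruning idea concrete, here is a minimal sketch of removing AST branches before a tree is passed to a classifier. The abstract does not say which branches the authors treat as redundant, so the `Node` class, the node kind names, and the choice to drop `TypeDecl` subtrees are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """A toy AST node: a kind label plus child nodes (hypothetical structure)."""
    kind: str
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, drop_kinds: Set[str]) -> Node:
    """Return a copy of the tree without any subtree rooted at a kind in drop_kinds.

    The root itself is always kept. Which kinds count as redundant is an
    assumption made for this sketch, not something taken from the paper.
    """
    kept = [prune(c, drop_kinds) for c in node.children if c.kind not in drop_kinds]
    return Node(node.kind, kept)


def size(node: Node) -> int:
    """Count the nodes in the tree."""
    return 1 + sum(size(c) for c in node.children)


# Toy AST for a C function such as: int main() { int x = 1; return x; }
tree = Node("FuncDef", [
    Node("Decl", [Node("TypeDecl", [Node("IdentifierType")])]),
    Node("Compound", [
        Node("Decl", [
            Node("TypeDecl", [Node("IdentifierType")]),
            Node("Constant"),
        ]),
        Node("Return", [Node("ID")]),
    ]),
])

# Assumption: type declaration subtrees carry little signal for classification.
pruned = prune(tree, {"TypeDecl"})
print(size(tree), size(pruned))
```

A smaller tree means fewer nodes for a tree-based model to process, which is consistent with the reported reduction in execution time. Whether accuracy also improves depends on which branches are removed.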

Keywords

Support vector machines (SVMs)