College of Engineering and Polymer Science
Senior Honors Project in Computer Science
Bachelor of Science in Computer Science
Source code comment classification is an important problem for future machine learning solutions, in particular for supervised solutions whose data labels are largely subjective and difficult to obtain. Many machine learning problems remain problems largely because of a lack of data. In machine learning solutions, it is better to have a large amount of mediocre data than a small amount of good data. While the mediocre data might not produce the best accuracy, it produces the best overall results because there is much more to learn from.
In this project, data was collected from code comments written by students in computer science classes. This data was then sorted using various tools in order to automate source code comment classification. Various data categorization and sorting methods were explored, ultimately resulting in a process in which the assigned letter grade was used as the sorting label. Using Python, the CommentLabeler and SortAndUnique tools were developed to automate the manual source code labeling process. State retention and error checking were also added to streamline the process further.
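The labeling process described above can be illustrated with a minimal Python sketch. This is not the project's code: the function names `sort_and_unique` and `label_comments`, the set of valid grades, and the JSON state file are all assumptions made for illustration. The sketch deduplicates and sorts comments, attaches an assigned letter grade as the label, rejects unknown grades, and optionally saves the result so that work can be resumed later.

```python
import json
from pathlib import Path

# Assumed grade scale; the project's actual label set is not stated in the abstract.
VALID_GRADES = {"A", "B", "C", "D", "F"}


def sort_and_unique(comments):
    """Return the distinct, non-empty comments in sorted order."""
    return sorted({c.strip() for c in comments if c.strip()})


def label_comments(comments, grade, state_path=None):
    """Label every distinct comment with one letter grade.

    If state_path is given, the labeled records are written to that file
    as JSON (a simple, hypothetical form of state retention).
    """
    # Error checking: refuse labels outside the assumed grade scale.
    if grade not in VALID_GRADES:
        raise ValueError(f"unknown letter grade: {grade!r}")

    labeled = [{"comment": c, "label": grade} for c in sort_and_unique(comments)]

    if state_path is not None:
        Path(state_path).write_text(json.dumps(labeled, indent=2), encoding="utf-8")
    return labeled


if __name__ == "__main__":
    sample = ["// loop over items", "// TODO fix", "// loop over items"]
    for record in label_comments(sample, "B"):
        print(record["label"], record["comment"])
```

Using the letter grade as the label trades precision for volume: every comment in a graded submission receives a label without manual review, which matches the abstract's emphasis on data quantity.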
The most important takeaway from this experience was that the amount of data is much more important than its quality. In fact, mediocre data will provide better results for machine learning because it leaves room for improvement and demonstrates that machine learning is a viable solution.
Dr. Michael L. Collard
Honors Faculty Advisor
Sutyak, Cole, "Source Code Comment Classification Artificial Intelligence" (2021). Williams Honors College, Honors Research Projects. 1308.