College of Engineering and Polymer Science

Date of Last Revision

2024-06-04 07:23:51


Computer Science

Honors Course


Number of Credits


Degree Name

Bachelor of Science in Computer Science

Date of Expected Graduation

Spring 2024


The purpose of this project was to compare tokenization methods, that is, methods of breaking a text into meaningful parts for use in natural language processing. The effectiveness of several commonly used tokenization methods was investigated, including morpheme tokenization, which takes into account the linguistic features of the language. In addition, I proposed and implemented a new technique that considers the capitalization pattern of a word during tokenization, so that the process can capture more natural language features. The methods were compared by using them in a sentiment analysis model on various datasets, including binary and multiclass classification datasets. This report summarizes these methods and the findings from the comparisons.
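
To make the capitalization idea concrete, the following is a minimal sketch of one way a tokenizer might record a word's capitalization pattern. It is an illustration only and is not taken from the project itself: the marker tokens `<upper>` and `<title>`, the function name `tokenize_with_case`, and the simple regular-expression word splitter are all assumptions.

```python
import re

# Hypothetical marker tokens; these names are assumptions for illustration.
CAP_UPPER = "<upper>"   # word written entirely in capitals, e.g. "GREAT"
CAP_TITLE = "<title>"   # word whose first letter is capitalized, e.g. "This"


def tokenize_with_case(text):
    """Split text into lowercased word and punctuation tokens.

    A marker token is emitted before each capitalized word, so the
    capitalization pattern survives lowercasing.
    """
    tokens = []
    # \w+ matches runs of word characters; [^\w\s] matches single punctuation marks.
    for word in re.findall(r"\w+|[^\w\s]", text):
        if len(word) > 1 and word.isupper():
            tokens.append(CAP_UPPER)
        elif word[0].isupper():
            tokens.append(CAP_TITLE)
        tokens.append(word.lower())
    return tokens


if __name__ == "__main__":
    print(tokenize_with_case("This movie was GREAT!"))
```

Under this sketch, "GREAT" and "great" share one vocabulary entry while the emphasis carried by the capitals is preserved as a separate token, which is the kind of signal a sentiment analysis model could use.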

Research Sponsor

Zhong-Hui Duan

First Reader

Michael L. Collard

Second Reader

Yingcai Xiao

Honors Faculty Advisor

Zhong-Hui Duan

Proprietary and/or Confidential Information



