College
College of Engineering and Polymer Science
Date of Last Revision
2024-06-04 07:23:51
Major
Computer Science
Honors Course
498
Number of Credits
3
Degree Name
Bachelor of Science in Computer Science
Date of Expected Graduation
Spring 2024
Abstract
The purpose of this project was to compare tokenization methods, that is, methods of breaking a text into meaningful parts for use in natural language processing. The effectiveness of several commonly used tokenization methods was investigated, including morpheme tokenization, which takes the linguistic features of the language into account. In addition, I proposed and implemented a new technique that considers the capitalization pattern of a word during tokenization, so that the process can capture more natural language features. The methods were compared by using each of them in a sentiment analysis model on various datasets, including binary classification and multiclass classification datasets. This report summarizes these methods and the findings from the comparisons.
Research Sponsor
Zhong-Hui Duan
First Reader
Michael L. Collard
Second Reader
Yingcai Xiao
Honors Faculty Advisor
Zhong-Hui Duan
Proprietary and/or Confidential Information
No
Recommended Citation
Culmer, Nathan, "A Comparison of Lexical Tokenization Methods" (2024). Williams Honors College, Honors Research Projects. 1831.
https://ideaexchange.uakron.edu/honors_research_projects/1831