REPRESENTING CONTEXT IN ABBREVIATION EXPANSION USING MACHINE LEARNING APPROACH

Trieu Thi Ly Ly, Nguyen Van Quy, Ninh Khanh Duy, Huynh Huu Hung, Dang Duy Thang



DOI: 10.15625/vap.2017.00096

Abstract


Text normalization is an essential problem in natural language processing applications, since input text often contains non-standard words such as abbreviations, numbers, and foreign words. This paper addresses the normalization of abbreviations in Vietnamese text when an abbreviation has several possible expansions. To disambiguate among the candidate expansions, we propose a machine learning approach in which the contextual information of the abbreviation is represented by one of two models: Bag-of-words or Doc2vec. Experiments with a Naïve Bayes classifier on a dataset of abbreviations that we collected show that the average rates of correct expansion for Bag-of-words and Doc2vec are 86.0% and 79.7%, respectively. The experimental results also show that contextual information plays an important role in the correct expansion of an abbreviation.
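
The Bag-of-words variant of the approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name NaiveBayesExpander, the whitespace tokenizer, the Laplace smoothing constant, and the tiny English training set for the abbreviation "dr" are all hypothetical choices made for illustration, and the paper's Vietnamese dataset is not reproduced here. The sketch counts the words surrounding an abbreviation and uses a multinomial Naïve Bayes classifier to pick the most probable expansion.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real word segmentation)."""
    return text.lower().split()


class NaiveBayesExpander:
    """Hypothetical multinomial Naive Bayes over bag-of-words context features."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant (assumed)
        self.class_counts = Counter()           # training sentences per expansion
        self.word_counts = defaultdict(Counter) # word frequencies per expansion
        self.vocab = set()

    def fit(self, contexts, expansions):
        for context, expansion in zip(contexts, expansions):
            tokens = tokenize(context)
            self.class_counts[expansion] += 1
            self.word_counts[expansion].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, context):
        # Words never seen in training carry no evidence, so they are skipped.
        tokens = [t for t in tokenize(context) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for t in tokens:
                # Smoothed log likelihood of each context word given the expansion.
                score += math.log((self.word_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for this sketch): sentences containing "dr".
train = [
    ("dr smith examined the patient at the hospital", "doctor"),
    ("the patient saw dr jones at the clinic", "doctor"),
    ("the house at 12 maple dr has a garden", "drive"),
    ("turn left onto ocean dr near the house", "drive"),
]
model = NaiveBayesExpander().fit(*zip(*train))

print(model.predict("dr lee treated the patient"))
print(model.predict("a garden house on maple dr"))
```

In this sketch the abbreviation itself appears equally often under both expansions, so the decision is driven entirely by the surrounding words, which mirrors the abstract's point about the role of context. A Doc2vec variant would replace the word counts with a learned dense vector for each context and would need a classifier suited to continuous features.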

Keywords


Text normalization, abbreviation expansion, context representation, Bag-of-words model, Doc2vec model, machine learning



Copyright (c) 2019 PROCEEDING of Publishing House for Science and Technology


