Published 28 FEB 2019 • vol 12 • no 2



Narongsak Chayangkoon, Kasetsart University, Thailand
Anongnart Srivihok, Kasetsart University, Thailand



Data preprocessing for text classification commonly relies on bag-of-words (BoW) representations. On large datasets, BoW produces very large, high-dimensional sparse vectors. The authors introduce a new data preprocessing method for feature reduction in short text classification that combines BoW with word embedding (WE) to address the weaknesses of BoW. The experiment consisted of four steps: 1) five datasets were selected from the data science community website Kaggle; 2) the new method was compared with five commonly used data preprocessing methods, four of which used the state of the art as their baseline while the fifth used BoW; the new method applied feature reduction to BoW to produce a new document term matrix dataset (NDTMD); 3) the authors generated classification models with three classifiers for text classification: support vector machine, logistic regression, and convolutional neural network; and 4) each classifier was applied to each preprocessed dataset and evaluated by feature reduction rate (FRR), accuracy, kappa, and running time. The results showed that the classification models performed best with NDTMD: the classifiers achieved the highest accuracy and kappa with the lowest running time. The new preprocessing method is suitable for short text classification and can also be applied to real social media data.
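To make the contrast concrete, the sketch below shows why embedding-based document vectors reduce features relative to BoW, and how a feature reduction rate (FRR) can be computed. This is a minimal illustration with toy documents and invented 2-dimensional "embeddings"; it does not reproduce the paper's NDTMD method, whose details are not given in the abstract.

```python
from collections import Counter

# Toy corpus of short texts (assumed for illustration only).
docs = ["cats like milk", "dogs like bones", "cats and dogs play"]

# Bag-of-words: one feature per vocabulary word -> sparse, high-dimensional.
vocab = sorted({w for d in docs for w in d.split()})
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

# Hypothetical 2-dimensional word embeddings; real ones would come from a
# model such as word2vec and typically have 100-300 dimensions.
emb = {w: [float(i % 3), float(len(w))] for i, w in enumerate(vocab)}

def doc_vector(doc, dim=2):
    """Average the embeddings of a document's words into one fixed-size vector."""
    words = doc.split()
    return [sum(emb[w][k] for w in words) / len(words) for k in range(dim)]

reduced = [doc_vector(d) for d in docs]

# Feature reduction rate: fraction of BoW features eliminated.
frr = 1 - len(reduced[0]) / len(vocab)
print(f"BoW features: {len(vocab)}, reduced features: {len(reduced[0])}, "
      f"FRR: {frr:.2f}")
```

The key point is that the reduced representation's size is fixed by the embedding dimension rather than by vocabulary size, so FRR grows as the corpus vocabulary grows.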



Feature Reduction, Word Embedding, Text Classification, Bag-of-Words






Chayangkoon, N., & Srivihok, A. (2019). Feature Reduction of Short Text Classification by using Bag of Words and Word Embedding. International Journal of Control and Automation (IJCA), ISSN: 2005-4297 (Print); 2207-6387 (Online), NADIA, 12(2), 1-16. doi: 10.14257/ijca.2019.12.2.01.
