FEATURES SELECTION METHOD BASED ON THE MEASUREMENT OF THE DISTANCE BETWEEN NORMAL DISTRIBUTIONS FOR CLASSIFICATION

Published 31 October 2021 • Vol. 149


Authors:


Byungju Shin, Department of Computer Engineering, Gachon University, South Korea
Minwoo Kim, Department of Computer Engineering, Gachon University, South Korea
Bohyun Wang, Department of Computer Engineering, Gachon University, South Korea
Joon S. Lim, Department of Computer Engineering, Gachon University, South Korea

Abstract:


Feature selection is an important technique that simplifies machine learning models, making them easier to understand, reducing learning time, and mitigating overfitting or underfitting. This paper presents a feature selection algorithm based on examining the similarity between sampled feature values across classification variables (classes). It rests on the premise that “the lower the similarity, the more useful the feature is for distinguishing classes.” The confidence intervals of normal distributions are used to measure similarity: the more the confidence intervals overlap, the higher the similarity; the less they overlap, the lower the similarity; and a feature with low similarity can serve as a basis for classification. We propose equations that apply this method. To confirm their usefulness, we used a colon cancer dataset with about 2,000 genes and performed comparative experiments against other feature selection algorithms: the Gini index (10 features), mRMR (10 features), and the relational matrix algorithm (7 features). An artificial neural network was used as the common machine learning algorithm, and validation was performed with the leave-one-out cross-validation method. In the experiments, selecting 10 features yielded 88.71% accuracy, better than the results of the Gini index (85.487%), mRMR (87.09%), and the relational matrix algorithm (87.09%). Furthermore, we conducted experiments on the iris, wine, glass, music emotion, seeds, and Japanese vowels datasets for multi-class classification problems. For wine, accuracy was 98.8% when all features were used, but rose to 99.4% after six features were removed. For music emotion, accuracy was 51.7% when all 54 features were used, but improved to 61.3% when 20 features were removed. For seeds, a slight improvement from 93.3% to 93.8% was observed when the number of features was reduced from 7 to 5. For iris, glass, and Japanese vowels, accuracy did not increase despite the removal of features. Thus, the method proposed in this paper can be used to select features easily and effectively in multi-class classification problems.

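To make the overlap measure concrete, the following Python sketch illustrates the idea described in the abstract. It is a minimal illustration under stated assumptions, not the paper's exact equations: it takes a 95% confidence interval (z = 1.96) around each class mean of a feature and uses the summed pairwise overlap of those intervals as the similarity score; the helper names (interval_overlap, feature_scores, select_features) are hypothetical.

import numpy as np

Z_95 = 1.96  # z-score for a 95% confidence interval (illustrative choice)

def interval_overlap(a, b):
    """Length of the overlap between intervals a = (lo, hi) and b = (lo, hi)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def feature_scores(X, y, z=Z_95):
    """Per-feature similarity: total overlap of the class-wise confidence
    intervals. Lower score = less overlap = more useful for classification."""
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        # Confidence interval of feature j's values within each class.
        intervals = []
        for c in classes:
            v = X[y == c, j]
            mu, sd = v.mean(), v.std(ddof=1)
            intervals.append((mu - z * sd, mu + z * sd))
        # Sum the overlap over every pair of classes.
        scores[j] = sum(interval_overlap(intervals[p], intervals[q])
                        for p in range(len(intervals))
                        for q in range(p + 1, len(intervals)))
    return scores

def select_features(X, y, k, z=Z_95):
    """Indices of the k features whose class intervals overlap the least."""
    return np.argsort(feature_scores(X, y, z))[:k]

Assuming X is an (n_samples, n_features) NumPy array and y a label vector, select_features(X, y, k=10) returns the ten lowest-overlap features, which could then be fed to a classifier such as the artificial neural network used in the experiments.
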
Keywords:


Feature Selection, Gaussian Distribution, Similarity, Distance, Classification

References:


[1] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J. Mach. Learn. Res., vol. 3, 2003.
[2] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning. Springer, 2013.
[3] M. L. Bermingham, R. Pong-Wong, A. Spiliopoulou, C. Hayward, I. Rudan, H. Campbell, A. F. Wright, J. F. Wilson, F. Agakov, P. Navarro, and C. S. Haley, “Application of high-dimensional feature selection: evaluation for genomic prediction in man,” Sci. Rep., vol. 5, 2015.
[4] X. Liu, A. Krishnan, and A. Mondry, “An entropy based gene selection method for cancer classification using microarray data,” BMC Bioinformatics, vol. 6, pp. 1–14, 2005.
[5] J. Li, H. Su, H. Chen, and B. W. Futscher, “Optimal search-based gene subset selection for gene array cancer classification,” IEEE Trans. Inf. Technol. Biomed., vol. 11, pp. 398–405, 2007.
[6] C. Gini, “Variabilità e Mutabilità. Contributo allo Studio delle Distribuzioni e delle Relazioni Statistiche” [Variability and mutability: contribution to the study of statistical distributions and relations], C. Cuppini, Bologna, 1912.
[7] G. Forman, “An extensive empirical study of feature selection metrics for text classification,” J. Mach. Learn. Res., vol. 3, pp. 1289–1305, 2003.
[8] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, pp. 1226–1238, 2005.
[9] A. Bhattacharyya, “On a measure of divergence between two statistical populations defined by their probability distributions,” Bull. Calcutta Math. Soc., vol. 35, pp. 99–109, 1943.
[10] C. C. Reyes-Aldasoro and A. Bhalerao, “The Bhattacharyya space for feature selection and its application to texture segmentation,” Pattern Recognit., vol. 39, pp. 812–826, 2006.
[11] B. Shin, B. Wang, and J. S. Lim, “Relational matrix algorithm for feature selection in a fuzzy neural network,” Basic Clin. Pharmacol. Toxicol., vol. 124, pp. 114–115, 2019.
[12] J. S. Lim, “Extracting minimized feature input and fuzzy rules using a fuzzy neural network and non-overlap area distribution measurement method,” Journal of the Korean Institute of Intelligent Systems (퍼지 및 지능 시스템학회 논문지), vol. 15, no. 5, pp. 599–604, 2005.
[13] J. S. Lim, “Finding features for real-time premature ventricular contraction detection using a fuzzy neural network system,” IEEE Trans. Neural Netw., vol. 20, pp. 522–527, 2009.
[14] J. S. Lim and S. Gupta, “Feature selection using weighted neuro-fuzzy membership functions,” The 2004 International Conference on Artificial Intelligence (IC-AI'04), vol. 1, pp. 1301–1315, 2004.
[15] J. W. Lim, B. J. Shin, and J. S. Lim, “A match count method (MCM) for feature selection with cancer datasets in a neuro-fuzzy system,” Special Issue on Engineering and Bio Science, Int. J. Pharma. and Bio. Sciences, vol. #, pp. 236–242, 2017.
[16] J. Jeyachidra and M. Punithavalli, “A comparative analysis of feature selection algorithms on classification of gene microarray dataset,” Information Communication and Embedded Systems, vol. #, pp. 1088–1093, 2013.
[17] J. S. Lim, “Extracting Wisconsin breast cancer prediction fuzzy rules using neural network with weighted fuzzy membership functions,” Korea Information Processing Society (한국정보처리학회), vol. 11B, pp. 717–722, 2004 (pISSN: 1598-284X).
[18] W. H. Wolberg, W. N. Street, D. M. Heisey, and O. L. Mangasarian, “Computerized breast cancer diagnosis and prognosis from fine needle aspirates,” Archives of Surgery, vol. 130, pp. 511–516, 1995.
[19] W. H. Wolberg, W. N. Street, and O. L. Mangasarian, “Machine learning techniques to diagnose breast cancer from fine-needle aspirates,” Cancer Letters, vol. 77, pp. 163–171, 1994.
[20] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, “Nuclear feature extraction for breast tumor diagnosis,” in Proc. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, vol. 1905, San Jose, CA, 1993, pp. 861–870.
[21] O. L. Mangasarian, W. N. Street, and W. H. Wolberg, “Breast cancer diagnosis and prognosis via linear programming,” Operations Research, vol. 43, no. 4, pp. 570–577, 1995.
[22] P. Zhong and M. Fukushima, “A regularized nonsmooth Newton method for multi-class support vector machines,” 2005. [Online]. Available: http://www-optima.amp.i.kyoto-u.ac.jp/~fuku/papers/ZhongFuku2-rev.pdf
[23] J. S. Lim, “Finding fuzzy rules for iris by neural network with weighted fuzzy membership function,” Int. J. Fuzzy Log. Intell. Syst., vol. 4, no. 2, pp. 211–216, 2004.

Citations:


APA:
Shin, B., Kim, M., Wang, B., & Lim, J. S. (2021). Features Selection Method Based on the Measurement of the Distance between Normal Distributions for Classification. International Journal of Advanced Science and Technology (IJAST), ISSN: 2005-4238 (Print); 2207-6360 (Online), NADIA, 149, 1–10. doi: 10.33832/ijast.2021.149.01.

MLA:
Shin, Byungju, et al. “Features Selection Method Based on the Measurement of the Distance between Normal Distributions for Classification.” International Journal of Advanced Science and Technology, ISSN: 2005-4238(Print); 2207-6360 (Online), NADIA, vol. 149, 2021, pp. 1-10. IJAST, http://article.nadiapub.com/IJAST/Vol149/1.html.

IEEE:
[1] B. Shin, M. Kim, B. Wang, and J. S. Lim, "Features Selection Method Based on the Measurement of the Distance between Normal Distributions for Classification." International Journal of Advanced Science and Technology (IJAST), ISSN: 2005-4238(Print); 2207-6360 (Online), NADIA, vol. 149, pp. 1-10, October 2021.