Surbhi Jain, Sarvesh Singh Rai


Text mining has been employed in a wide range of applications such as text summarization, text categorization, named entity extraction, and opinion and sentiment analysis. Text classification is the task of assigning predefined categories to free-text documents. Document clustering, by contrast, is an unsupervised learning technique that groups documents into relevant topics; each such group is known as a cluster. The major difficulty in both tasks is the high dimensionality of text data, which calls for efficient algorithms and is a great challenge for effective text categorization. In this paper, we discuss a text categorization method based on k-means clustering feature selection. K-means is a classical algorithm for data clustering in text mining, but it is seldom used for feature selection. For text data, the words that express the semantics of a class are usually good features. We use the k-means method to obtain several cluster centroids for each class, and then choose the high-frequency words in the centroids as the text features for categorization. The words extracted by k-means not only represent the clusters of each class well, but also have high quality for semantic expression. On standard text databases, regularized least-squares regression based on our feature selection method exhibits better performance than the original classifiers for text categorization.
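The feature selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the number of centroids per class, the number of words kept per centroid, or the term weighting, so `k`, `top_n`, the raw term-frequency vectors, the toy corpus, and all function names are assumptions made for the example.

```python
import random


def vectorize(docs, vocab):
    """Turn tokenized documents into term-frequency vectors over vocab."""
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for w in doc:
            vec[index[w]] += 1.0
        vectors.append(vec)
    return vectors


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def select_features(docs_by_class, k=2, top_n=2):
    """Cluster each class separately and keep the top_n heaviest words
    of every centroid (k and top_n are assumed, not from the paper)."""
    vocab = sorted({w for docs in docs_by_class.values() for d in docs for w in d})
    selected = set()
    for docs in docs_by_class.values():
        vectors = vectorize(docs, vocab)
        for centroid in kmeans(vectors, min(k, len(vectors))):
            # Rank words by centroid weight; break ties alphabetically.
            ranked = sorted(range(len(vocab)), key=lambda i: (-centroid[i], vocab[i]))
            selected.update(vocab[i] for i in ranked[:top_n] if centroid[i] > 0)
    return sorted(selected)


# Hypothetical toy corpus: two classes, three tokenized documents each.
corpus = {
    "sports": [["match", "goal", "team"], ["team", "goal", "score"],
               ["coach", "team", "match"]],
    "finance": [["stock", "market", "bank"], ["bank", "loan", "market"],
                ["stock", "price", "market"]],
}
features = select_features(corpus, k=2, top_n=2)
print(features)
```

The selected words would then define the reduced feature space on which a classifier such as regularized least-squares regression is trained; that step is not shown here.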


Feature selection, k-means clustering, feature clustering, regularized least-squares regression.






