Improving Hypertext Classification Systems through WordNet-based Feature Abstraction

Jun-Ho Roh, Han-joon Kim, Jae-Young Chang


This paper presents a novel feature engineering technique that can improve conventional machine learning-based text classification systems. To categorize hypertext web documents effectively, the proposed method extends the initial set of features by using hyperlink relationships. Web documents are connected to each other through hyperlinks, and in many cases hyperlinks exist between highly related documents. Such hyperlink relationships can be used to enhance the quality of the features that make up classification models. The basic idea of the proposed method is to generate an abstracted concept feature that consists of a few raw feature words; to do so, the method computes the semantic similarity between a target document and its neighbor documents by using the hierarchical relationships in the WordNet ontology. When classification models are built, the abstracted concept features are treated in the same way as the raw features, and they can contribute substantially to more accurate models. Through extensive experiments with the Web-KB test collection, we show that the proposed method outperforms conventional ones.
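
The basic idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the similarity measure, the threshold, or the way concept features are named, so all of these are assumptions made for illustration. The sketch replaces WordNet with a small hand-built hypernym map (`HYPERNYM`, hypothetical), scores word pairs with a simple path-based similarity, and adds the lowest common ancestor of each sufficiently similar target/neighbor word pair as an extra feature of the target document.

```python
from itertools import product

# Hypothetical toy taxonomy (child -> parent). It stands in for WordNet
# hypernym links; a real system would query WordNet instead.
HYPERNYM = {
    "professor": "educator",
    "lecturer": "educator",
    "educator": "person",
    "student": "person",
    "person": "entity",
    "course": "education",
    "seminar": "education",
    "education": "entity",
}
NODES = set(HYPERNYM) | set(HYPERNYM.values())


def path_to_root(word):
    """Return the chain of nodes from `word` up to the taxonomy root."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def common_ancestor(a, b):
    """Return (lowest common ancestor, edge count between a and b through it)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    for steps_a, node in enumerate(path_a):
        if node in path_b:
            return node, steps_a + path_b.index(node)
    return None, None


def path_similarity(a, b):
    """Assumed measure: 1 / (1 + path length); 0.0 if no common ancestor."""
    _, dist = common_ancestor(a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def extend_features(target_words, neighbor_words, threshold=0.3):
    """Append an abstracted concept feature for each similar word pair.

    `threshold` and the "concept:" prefix are illustrative assumptions.
    """
    features = list(target_words)
    for t, n in product(target_words, neighbor_words):
        # Skip identical words and words missing from the toy taxonomy.
        if t == n or t not in NODES or n not in NODES:
            continue
        if path_similarity(t, n) >= threshold:
            concept = "concept:" + common_ancestor(t, n)[0]
            if concept not in features:
                features.append(concept)
    return features


target = ["professor", "course", "homepage"]   # words of the target page
neighbor = ["lecturer", "seminar", "student"]  # words of a hyperlinked page
print(extend_features(target, neighbor))
```

In this sketch the concept features sit in the same list as the raw words, mirroring the statement that abstracted features are treated like raw features when the classifier is trained.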





