A Semantic Text Model with Wikipedia-based Concept Space

Han-Joon Kim, Jae-Young Chang


Current text mining techniques are limited because conventional text representation models cannot express the semantic or conceptual information in documents written in natural language. Conventional models, including the vector space model, Boolean model, statistical model, and tensor space model, represent documents as bags of words. They describe a document only by the literal terms used for indexing and by frequency-based weights for those terms; that is, they ignore the semantic information, sequential order, and structural information of terms. Most text mining techniques have been developed on the assumption that documents are represented with such ‘bag-of-words’ models. In the big data era, however, a new text representation paradigm is needed, one that can analyse huge volumes of textual documents more precisely. Our text model treats the ‘concept’ as an independent space on an equal footing with the ‘term’ and ‘document’ spaces used in the vector space model, and it expresses the relatedness among the three spaces. To build the concept space, we use Wikipedia articles, each of which defines a single concept. Consequently, a document collection is represented as a third-order tensor carrying semantic information, and we call the proposed model the text cuboid model. Through experiments on the popular 20NewsGroup document corpus, we demonstrate the superiority of the proposed text model in document clustering and concept clustering.
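To make the document, term, and concept spaces concrete, the following is a minimal sketch of a third-order tensor over a toy collection. It is not the authors' implementation: the abstract does not say how cell values are weighted, so the sketch assumes, purely for illustration, that cell (d, t, c) is the frequency of term t in document d multiplied by the frequency of t in the text defining concept c. The names `build_cuboid` and `doc_concept_profile`, the toy documents, and the two concept texts are all hypothetical.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_cuboid(docs: list[str], concepts: dict[str, str]):
    """Build a document x term x concept tensor as nested lists.

    ASSUMPTION: cuboid[d][t][c] = tf(term t in doc d) * tf(term t in the
    text defining concept c). The weighting in the paper may differ.
    """
    doc_counts = [Counter(tokenize(d)) for d in docs]
    concept_names = sorted(concepts)
    concept_counts = [Counter(tokenize(concepts[n])) for n in concept_names]
    # The term axis covers every term that occurs in at least one document.
    terms = sorted(set().union(*doc_counts))
    cuboid = [
        [[dc[t] * cc[t] for cc in concept_counts] for t in terms]
        for dc in doc_counts
    ]
    return cuboid, terms, concept_names


def doc_concept_profile(cuboid):
    """Sum over the term axis to get one concept-weight vector per document."""
    return [[sum(col) for col in zip(*doc_slice)] for doc_slice in cuboid]


# Toy data: two documents and two hypothetical concept definitions.
docs = [
    "apple releases new phone",
    "apple pie recipe with fruit",
]
concepts = {
    "Apple Inc.": "apple is a technology company that makes the phone",
    "Apple (fruit)": "the apple is a fruit used in pie",
}

cuboid, terms, concept_names = build_cuboid(docs, concepts)
profile = doc_concept_profile(cuboid)

for doc, weights in zip(docs, profile):
    print(doc, dict(zip(concept_names, weights)))
```

In a plain bag-of-words matrix both toy documents share the term "apple" and nothing distinguishes its two senses. In this sketch, the concept axis separates them: summing each document's slice over the term axis gives the first document more weight on the company concept and the second more weight on the fruit concept.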





