Web Site Keyword Selection Method by Considering Semantic Similarity Based on Word2Vec

Donghun Lee, Kwanho Kim


Extracting keywords that represent a document is important because the keywords can drive automated services such as document search, classification, and recommendation, and can convey a document's content quickly. However, keyword extraction from web site documents based on word frequency, or on graph algorithms over word co-occurrence, has two problems: the web page structure can contribute various words that are unrelated to the topic, and the limited performance of Korean tokenizers makes it difficult to extract semantic keywords. In this paper, we propose a method that selects candidate keywords based on semantic similarity, addressing both the failure to extract semantic keywords and the poor accuracy of Korean tokenizer analysis. A filtering process then removes inconsistent keywords to produce the final semantic keywords. Experiments on real web pages of small businesses show that the proposed method improves performance by 34.52% over a keyword selection technique based on statistical similarity. These results confirm that considering semantic similarity between words and removing inconsistent keywords improves the extraction of keywords from documents.
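The two steps named in the abstract, candidate selection by semantic similarity followed by filtering of inconsistent keywords, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy three-dimensional vectors stand in for trained Word2Vec embeddings, and the seed word, the function names `select_candidates` and `filter_inconsistent`, the `top_n` cutoff, and the mean-similarity threshold are all assumptions made for illustration.

```python
import math

# Toy embeddings standing in for trained Word2Vec vectors (hypothetical values).
VECTORS = {
    "coffee":    [0.90, 0.10, 0.00],
    "espresso":  [0.80, 0.20, 0.00],
    "latte":     [0.85, 0.15, 0.05],
    "bakery":    [0.60, 0.40, 0.10],
    "login":     [0.00, 0.10, 0.90],  # page-structure word, off topic
    "copyright": [0.05, 0.00, 0.95],  # page-structure word, off topic
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_candidates(seed_words, vectors, top_n=4):
    """Rank non-seed words by their best similarity to any seed word."""
    scores = {}
    for word, vec in vectors.items():
        if word in seed_words:
            continue
        scores[word] = max(cosine(vec, vectors[s]) for s in seed_words)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


def filter_inconsistent(keywords, vectors, threshold=0.5):
    """Drop keywords whose mean similarity to the other keywords is low."""
    kept = []
    for word in keywords:
        others = [k for k in keywords if k != word]
        if not others:
            kept.append(word)
            continue
        mean_sim = sum(cosine(vectors[word], vectors[k]) for k in others) / len(others)
        if mean_sim >= threshold:
            kept.append(word)
    return kept


seeds = ["coffee"]  # assumed starting keyword, e.g. from a frequency-based pass
candidates = select_candidates(seeds, VECTORS)
final = filter_inconsistent(seeds + candidates, VECTORS)
print(candidates)
print(final)
```

In this toy setup the off-topic word "copyright" enters the candidate list only because `top_n` is generous, and the filtering step removes it because it is dissimilar to every other keyword. With real data the vectors would come from a Word2Vec model trained on tokenized web pages, and the threshold would need tuning.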


