Practical Datasets for Similarity Measures and Their Threshold Values

Byoungju Yang, Junho Shim

Abstract


In the e-business domain, where the volume of data objects is large, measuring similarity to find identical or similar objects is important. Doing so essentially requires comparing the features of objects pair by pair, so the computation takes longer as the amount of data grows. Recent studies have proposed various algorithms that perform this computation efficiently, and most of them demonstrate their performance advantage through empirical tests on a few data sets. In this paper, we introduce those data sets and present their characteristics together with the meaningful threshold values inherent in each data set. This analysis of practical data sets with respect to their threshold values may serve as a reference baseline for future experiments with newly developed algorithms.
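
To make the pairwise cost and the role of the threshold concrete, the following is a minimal sketch of a naive all-pairs similarity search. It is not the method of this paper or of any algorithm it surveys. Cosine similarity, the sparse dictionary representation, the function names `cosine` and `similar_pairs`, and the toy data are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_pairs(objects, threshold):
    """Naive all-pairs search: n objects require n*(n-1)/2 comparisons."""
    result = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(objects.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            result.append((id_a, id_b, score))
    return result


# Hypothetical toy data, not taken from any data set discussed in the paper.
objects = {
    "a": {"x": 1.0, "y": 1.0},
    "b": {"x": 1.0, "y": 1.0, "z": 0.1},
    "c": {"z": 1.0},
}
pairs = similar_pairs(objects, threshold=0.9)
```

With a high threshold only the near-duplicate pair is returned, while a low threshold admits weakly related pairs as well. This is why the threshold values that are meaningful for a given data set matter when comparing algorithms.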



