Academic Journal of Computing & Information Science, 2023, 6(2); doi: 10.25236/AJCIS.2023.060208.
Jieru Zhang
Beijing Xinfeng Aerospace Equipment Co. Ltd, Intelligent Equipment and Technology Research Laboratory, Beijing, China
The inner probabilistic properties of big data have a great impact on the performance of pattern recognition systems. Jaccard similarity (JS) is one of the most popular statistical metrics for calculating the similarity of objects in the feature extraction process. This paper combines JS with a probabilistic distribution model to explore the effect of the inner properties of big data. It derives the generalized form of JS for the probabilistic model and determines how to calculate JS for power-law and exponential distributions. Experimental observations show that a power-law distribution has a higher JS than the corresponding exponential distribution, which suggests that the power-law probabilistic structure is the more efficient of the two. The original normalized data in the MNIST database exhibit a more power-law-like distribution, whereas the randomly translated data exhibit a more exponential-like distribution. The MNIST data with the power-law-like property have a higher JS and are more efficient than the translated data. These observations provide possible guidelines for efficient information coding and processing methods.
Jaccard similarity, Power-law distribution, Exponential distribution, Efficiency analysis
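For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of Jaccard similarity in its classical set form, together with the well-known weighted (min/max) extension to non-negative vectors. It is not the generalized probabilistic form derived in the paper, which is not reproduced on this page; the function names and the convention for empty inputs are assumptions made for illustration only.

```python
from typing import Iterable, Sequence


def jaccard_similarity(a: Iterable, b: Iterable) -> float:
    """Classical Jaccard similarity |A ∩ B| / |A ∪ B| of two finite sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def weighted_jaccard(x: Sequence[float], y: Sequence[float]) -> float:
    """Weighted Jaccard: sum of element-wise minima over sum of maxima.

    For 0/1 vectors this reduces to the classical set form above.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("entries must be non-negative")
    denominator = sum(max(p, q) for p, q in zip(x, y))
    if denominator == 0:
        # Assumed convention: two all-zero vectors are treated as identical.
        return 1.0
    return sum(min(p, q) for p, q in zip(x, y)) / denominator
```

As a usage example, `jaccard_similarity({1, 2, 3}, {2, 3, 4})` shares two of four distinct elements, and `weighted_jaccard([1, 0, 2], [1, 1, 1])` compares a minima sum of 2 with a maxima sum of 4.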
Jieru Zhang. Efficiency Analysis of Jaccard Similarity in Probabilistic Distribution Model. Academic Journal of Computing & Information Science (2023), Vol. 6, Issue 2: 53-63. https://doi.org/10.25236/AJCIS.2023.060208.