Efficiency Analysis of Jaccard Similarity in Probabilistic Distribution Model

Jieru Zhang

doi:10.25236/AJCIS.2023.060208

Academic Journal of Computing & Information Science, 2023, 6(2); doi: 10.25236/AJCIS.2023.060208.

Author(s)

Jieru Zhang

Corresponding Author:

Jieru Zhang

Affiliation(s)

Beijing Xinfeng Aerospace Equipment Co. Ltd, Intelligent Equipment and Technology Research Laboratory, Beijing, China

Abstract

The intrinsic probabilistic properties of big data strongly affect the performance of pattern recognition systems. Jaccard similarity (JS) is one of the most popular statistical metrics for calculating the similarity of objects during feature extraction. This paper combines JS with a probabilistic distribution model to explore the effect of these intrinsic properties. It derives a generalized form of JS for the probabilistic model and determines how to calculate JS for power-law and exponential distributions. Experimental observations show that a power-law distribution has a higher JS than the corresponding exponential distribution, which indicates that the power-law probabilistic structure is the more efficient of the two. The original normalized data in the MNIST database exhibit a more power-law-like distribution, whereas the randomly translated data exhibit a more exponential-like distribution. The MNIST data with the power-law-like property have a higher JS and are more efficient than the translated data. These observations provide possible guidelines for efficient information coding and processing methods.
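
For readers unfamiliar with the metric, the following is a minimal sketch of Jaccard similarity, not the paper's derivation. It shows the classic set form and, as an assumption, the common min/max (weighted) generalization to non-negative vectors; the generalized form derived in the paper is not given on this page and may differ. The function names, the sample size, and the distribution parameters are all illustrative.

```python
import random


def jaccard_sets(a, b):
    """Classic Jaccard similarity |A ∩ B| / |A ∪ B| for two finite sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


def jaccard_weighted(x, y):
    """Min/max generalization for non-negative vectors of equal length.

    Assumption: this is the widely used weighted Jaccard form,
    sum(min(x_i, y_i)) / sum(max(x_i, y_i)), not necessarily the paper's.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("vectors must be non-negative")
    denom = sum(max(p, q) for p, q in zip(x, y))
    if denom == 0:
        return 1.0  # both vectors are all zeros
    return sum(min(p, q) for p, q in zip(x, y)) / denom


if __name__ == "__main__":
    # Illustrative only: compare two independent samples per distribution.
    # The parameters (alpha = 2.0, rate = 1.0, n = 784) are arbitrary choices.
    rng = random.Random(0)
    n = 784
    pl_a = [rng.paretovariate(2.0) for _ in range(n)]
    pl_b = [rng.paretovariate(2.0) for _ in range(n)]
    ex_a = [rng.expovariate(1.0) for _ in range(n)]
    ex_b = [rng.expovariate(1.0) for _ in range(n)]
    print("power-law samples:  ", jaccard_weighted(pl_a, pl_b))
    print("exponential samples:", jaccard_weighted(ex_a, ex_b))
```

The demo at the bottom only illustrates how the function is called on samples from the two distribution families. It does not reproduce the paper's experiments, and its output should not be read as confirming or contradicting the reported results.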

Keywords

Jaccard similarity, Power-law distribution, Exponential distribution, Efficiency analysis

Cite This Paper

Jieru Zhang. Efficiency Analysis of Jaccard Similarity in Probabilistic Distribution Model. Academic Journal of Computing & Information Science (2023), Vol. 6, Issue 2: 53-63. https://doi.org/10.25236/AJCIS.2023.060208.
