Similarity and Distance

Jaccard similarity

$$ JSim(C_1, C_2) = {|C_1 \cap C_2| \over |C_1 \cup C_2|} $$

Jaccard distance

$$ JDist(X, Y) = 1 - JSim(X, Y) $$
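
The two formulas above can be sketched in Python as follows. The function names are illustrative, and treating two empty sets as identical (similarity 1) is an assumption made here, since the formula itself is undefined in that case.

```python
def jaccard_similarity(c1, c2):
    """Return |C1 ∩ C2| / |C1 ∪ C2|, treating both inputs as sets."""
    a, b = set(c1), set(c2)
    if not a and not b:
        # Assumption: two empty sets count as identical (0/0 is undefined).
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(x, y):
    """Return 1 - JSim(X, Y)."""
    return 1.0 - jaccard_similarity(x, y)


# {2, 3} is shared, {1, 2, 3, 4} is the union: 2 / 4
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))    # 0.5
```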


Euclidean distance

Two-dimensional formula

$$ p = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}, |X| = \sqrt{x_2^2 + y_2^2} $$

where $p$ is the Euclidean distance between the points $(x_2, y_2)$ and $(x_1, y_1)$, and $|X|$ is the Euclidean distance from $(x_2, y_2)$ to the origin.

Three-dimensional formula

$$ p = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2-z_1)^2}, |X| = \sqrt{x_2^2 + y_2^2 + z_2^2} $$
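
Both formulas follow the same pattern, so one function can cover any number of dimensions. A minimal sketch, with illustrative function names that are not from any particular library:

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def norm(p):
    """Euclidean distance from p to the origin, written |X| above."""
    return euclidean_distance(p, [0] * len(p))


print(euclidean_distance((0, 0), (3, 4)))        # 5.0
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0
print(norm((1, 2, 2)))                           # 3.0
```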


Cosine Similarity

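The Pearson correlation coefficient is the cosine similarity computed after both vectors have been centered (each vector's mean subtracted from its components). In essence, the Pearson correlation coefficient is a refinement of cosine similarity for the case where some dimension values are missing.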

$$ \cos(d_1, d_2) = {d_1 \cdot d_2 \over ||d_1|| \, ||d_2||} $$

$ \begin{array}{l} d_1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0) \\ d_2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2) \\ d_1 \cdot d_2 = 3 \times 1 + 2 \times 0 + 0 \times 0 + 5 \times 0 + 0 \times 0 + 0 \times 0 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 2 = 5 \\ ||d_1|| = \sqrt{3^2 + 2^2 + 0^2 + 5^2 + 0^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = \sqrt{42} \approx 6.481 \\ ||d_2|| = \sqrt{1^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2} = \sqrt{6} \approx 2.449 \\ \cos(d_1, d_2) = {5 \over 6.481 \times 2.449} \approx 0.3150 \\ \end{array} $

Cosine distance

$$ Dist(X, Y) = 1 - cos(X, Y) $$
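
The cosine similarity and distance formulas, together with the "center first" view of the Pearson coefficient, can be sketched as below. The function names are illustrative, and raising an error for a zero vector is an assumption made here, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product divided by the product of the two Euclidean norms."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    if n1 == 0 or n2 == 0:
        # Assumption: refuse zero vectors instead of returning a default.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (n1 * n2)


def cosine_distance(x, y):
    """Return 1 - cos(X, Y)."""
    return 1.0 - cosine_similarity(x, y)


def pearson(x, y):
    """Cosine similarity of the two vectors after subtracting their means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```

Computing the example directly gives 5 / sqrt(42 * 6), about 0.3150.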


simhash distance


Hamming distance

The number of positions at which two bit vectors differ, i.e. the count of 1 bits in their XOR. For example: $p_1 = 10101, p_2 = 10011$

$$ d(p_1, p_2) = 2 $$
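
A minimal sketch of the XOR-and-count idea, assuming the bit vectors are given as equal-length strings of `0` and `1` (the function name is illustrative):

```python
def hamming_distance(p1, p2):
    """Count the differing positions of two equal-length bit strings."""
    if len(p1) != len(p2):
        raise ValueError("bit vectors must have the same length")
    # XOR leaves a 1 exactly where the two inputs disagree.
    return bin(int(p1, 2) ^ int(p2, 2)).count("1")


# 10101 XOR 10011 = 00110, which contains two 1 bits.
print(hamming_distance("10101", "10011"))  # 2
```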


Edit distance for strings

Example: x = abcde ; y = bcduve.

Turn x into y by deleting a, then inserting u and v after d.

Edit distance = 3.

A common distance measure for comparing DNA sequences (gene sequencing).
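
The example above uses only insertions and deletions. Under that assumption (no substitutions, which is not the same as Levenshtein distance in general), the edit distance equals `len(x) + len(y) - 2 * LCS(x, y)`, where LCS is the length of the longest common subsequence. A sketch with illustrative function names:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(x, y):
    """Insert/delete-only edit distance (assumption: no substitutions)."""
    # Characters outside the LCS are deleted from x or inserted from y.
    return len(x) + len(y) - 2 * lcs_length(x, y)


# LCS of "abcde" and "bcduve" is "bcde" (length 4): 5 + 6 - 2 * 4
print(edit_distance("abcde", "bcduve"))  # 3
```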


minhash

$$ Pr(h(S_1)=h(S_2)) \approx Sim(S_1, S_2) $$

Here $h$ is the hash function, $Pr$ denotes probability, and $Sim$ is the Jaccard similarity.
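
The relation above suggests estimating similarity as the fraction of hash functions on which two sets have the same minimum hash value. The sketch below is one possible illustration, not a reference implementation: it assumes salted BLAKE2b from the standard library as a stand-in for a family of independent hash functions, and the names `minhash_signature` and `estimate_similarity` are hypothetical.

```python
import hashlib


def _hash(item, seed):
    """64-bit hash of item; each seed selects a different hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(
        repr(item).encode("utf-8"), digest_size=8, salt=salt
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=256):
    """For each hash function, keep the minimum hash value over the set."""
    items = list(items)
    if not items:
        raise ValueError("cannot compute a signature for an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_similarity(sig1, sig2):
    """Fraction of signature positions on which the two sets agree."""
    if len(sig1) != len(sig2):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)


s1 = set(range(0, 100))
s2 = set(range(50, 150))  # true Jaccard similarity: 50 / 150
print(estimate_similarity(minhash_signature(s1), minhash_signature(s2)))
```

More hash functions give a tighter estimate at the cost of longer signatures. The value 256 is an arbitrary choice for this sketch.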


LSH(Locality Sensitive Hashing)

