Why do we calculate cosine similarities using tf-idf weightings?
Suppose we are trying to measure the similarity between 2 similar documents:
Document A: "a b c d"
Document B: "a b c e"
This corresponds to the term-frequency matrix
       a  b  c  d  e
    A: 1  1  1  1  0
    B: 1  1  1  0  1
where the cosine similarity on the raw vectors is the dot product of the two vectors A and B, divided by the product of their magnitudes:
3/4 = (1*1 + 1*1 + 1*1 + 1*0 + 1*0) / (sqrt(4) * sqrt(4)).
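For concreteness, here is a minimal Python sketch of that calculation (the variable and function names are mine, not from any library); it reproduces the 3/4 figure:

```python
import math

# Raw term-frequency vectors for documents A ("a b c d") and B ("a b c e"),
# over the vocabulary [a, b, c, d, e].
doc_a = [1, 1, 1, 1, 0]
doc_b = [1, 1, 1, 0, 1]

def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|)
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(doc_a, doc_b))  # 3 / (2 * 2) = 0.75
```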
But when we apply the inverse document frequency transformation, multiplying each term in the matrix by log(N / df_i), where N is the number of documents in the matrix (here, 2) and df_i is the number of documents in which term i is present, we get the tf-idf matrix
       a  b  c  d       e
    A: 0  0  0  log(2)  0
    B: 0  0  0  0       log(2)
since "a" appears in both documents, has inverse-document-frequency value of 0. same "b" , "c". meanwhile, "d" in document a, not in document b, multiplied log(2/1). "e" in document b, not in document a, multiplied log(2/1).
The cosine similarity between these 2 vectors is 0, suggesting that they are two totally different documents, which is obviously incorrect. For these 2 documents to be considered similar to each other using tf-idf weightings, we would need a third document C in the matrix that is vastly different from documents A and B.
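Reusing doc_a, doc_b, and cosine_similarity from the sketch above, the log(N / df_i) weighting zeroes out every shared term, and the similarity collapses to 0:

```python
N = 2                                  # number of documents
df = [2, 2, 2, 1, 1]                   # document frequency of each term a..e
idf = [math.log(N / d) for d in df]    # [0, 0, 0, log 2, log 2]

tfidf_a = [tf * w for tf, w in zip(doc_a, idf)]  # [0, 0, 0, log 2, 0]
tfidf_b = [tf * w for tf, w in zip(doc_b, idf)]  # [0, 0, 0, 0, log 2]

print(cosine_similarity(tfidf_a, tfidf_b))  # 0.0
```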
Thus, I am wondering whether and/or why we would use tf-idf weightings in combination with the cosine similarity metric to compare highly similar documents. None of the tutorials or StackOverflow questions I've read have been able to answer this question.
This post discusses similar failings of tf-idf weights with cosine similarities, but offers no guidance on how to address them.
EDIT: It turns out the guidance I was looking for was in the comments of that blog post. It recommends using the formula
    1 + log(N / (n_i + 1))
as the inverse document frequency transformation instead. This keeps the weights of terms that appear in every document close to their original weights, while inflating the weights of terms not present in many documents by a greater degree. It is interesting that this formula is not more prominently found in posts about tf-idf.
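Assuming the formula parses as 1 + log(N / (n_i + 1)), a quick continuation of the earlier sketch shows the effect: the shared terms keep a nonzero weight, so the similarity no longer collapses to 0.

```python
# Smoothed idf: ~0.59 for terms in both documents, 1.0 for the unique terms.
idf_smooth = [1 + math.log(N / (n + 1)) for n in df]

tfidf_a = [tf * w for tf, w in zip(doc_a, idf_smooth)]
tfidf_b = [tf * w for tf, w in zip(doc_b, idf_smooth)]

print(cosine_similarity(tfidf_a, tfidf_b))  # ~0.51 -- no longer "totally different"
```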
since "a" appears in both documents, has inverse-document-frequency value of 0
This is where you have made an error in using inverse document frequency (idf). Idf is meant to be computed over a large collection of documents (not across just 2 documents), the purpose being to predict the importance of term overlaps in document pairs.
You would expect common terms, such as 'the', 'a', etc., to overlap across document pairs. Should they be contributing to the similarity score? No.
That is precisely the reason why the vector components are multiplied by the idf factor: to dampen or boost a particular term overlap (a component of the form a_i * b_i being added to the numerator in the cosine-similarity sum).
Now consider that you have a collection of computer science journals. Would you believe that the overlap of terms such as 'computer' and 'science' across a document pair should be considered important? No. And that is indeed what happens, because the idf of these terms is considerably low in this collection.
What do you think will happen if you extend the collection to scientific articles of every discipline? In that collection, the idf value of the word 'computer' will no longer be low. And that makes sense, because in that general collection, you would think that two documents are similar enough if they are on the same topic: computer science.
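To make this concrete, here is a small, self-contained sketch with an entirely made-up toy corpus: the idf of 'computer' is 0 within a pure computer-science collection, but becomes informative once articles from other disciplines are added.

```python
import math

def idf(term, corpus):
    # log(N / df): N documents in the collection, df of them containing the term.
    n = len(corpus)
    df = sum(1 for doc in corpus if term in doc.split())
    return math.log(n / df)

cs_journals = [
    "computer science algorithms",
    "computer architecture science",
    "computer networks science",
    "computer graphics science",
]
all_science = cs_journals + [
    "cell biology genetics",
    "organic chemistry reactions",
    "quantum physics particles",
    "plate tectonics geology",
]

print(idf("computer", cs_journals))  # 0.0 -- overlap on 'computer' adds nothing
print(idf("computer", all_science))  # log(8/4) = log 2 ~ 0.69 -- now it matters
```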