nlp - Similarity between two text documents
I'm looking at working on an NLP project, in any language (though Python would be my preference).
I want to write a program that takes two documents and determines how similar they are.
As I'm new to this, a quick Google search didn't point me to much. Do you know of any references (websites, textbooks, journal articles) that cover this subject and could help me?
Thanks
The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See esp. Introduction to Information Retrieval, which is free and available online.
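For intuition, cosine similarity is just the dot product of two vectors divided by the product of their norms. A minimal NumPy sketch (the term weights below are made-up toy values, not real TF-IDF scores):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy term-weight vectors for two short documents (illustrative values only)
doc1 = np.array([1.0, 2.0, 0.0, 1.0])
doc2 = np.array([1.0, 1.0, 1.0, 0.0])
print(cosine_similarity(doc1, doc2))  # ~0.7071
```

Because the value depends only on the angle between the vectors, documents of very different lengths can still score as highly similar.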
TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since the vectorizer returns normalized tf-idf
pairwise_similarity = tfidf * tfidf.T
or, if the documents are plain strings,
>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf * tfidf.T).A
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
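Once you have the pairwise matrix, a common follow-up is picking, for a given document, its closest neighbour in the corpus. A sketch using plain NumPy on a similarity matrix with the same values as above (the variable names are just for illustration):

```python
import numpy as np

# pairwise cosine-similarity matrix (values copied from the example above)
pairwise = np.array([[1.        , 0.25082859, 0.39482963, 0.        ],
                     [0.25082859, 1.        , 0.22057609, 0.        ],
                     [0.39482963, 0.22057609, 1.        , 0.26264139],
                     [0.        , 0.        , 0.26264139, 1.        ]])

# mask the diagonal first: every document is trivially most similar to itself
np.fill_diagonal(pairwise, -np.inf)  # note: mutates the array in place

most_similar_to_doc0 = pairwise[0].argmax()
print(most_similar_to_doc0)  # prints 2: "Never compare an apple to an orange"
```

Masking the diagonal before taking the argmax is the step people most often forget; without it every row's maximum is the document itself.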
though Gensim may have more options for this kind of task.
See also this question.
[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]