nlp - Similarity between two text documents


I am looking at working on an NLP project, in any language (though Python would be my preference).

I want to write a program that takes two documents and determines how similar they are.

As I am fairly new to this, a quick Google search did not point me to much. Do you know of any references (websites, textbooks, journal articles) that cover this subject and would be of help to me?

Thanks!

The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See especially Introduction to Information Retrieval, which is free and available online.
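To make the idea concrete, here is a minimal from-scratch sketch using only the standard library. It is an illustration under stated assumptions, not how any particular package does it: tokenization is a plain lowercase whitespace split, the IDF uses one common smoothed variant, and the helper names `tfidf_vectors` and `cosine_similarity` are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (tf * smoothed idf)."""
    # Assumption: lowercase + whitespace split is good enough as a tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf, so terms found in every document keep a non-zero weight.
        vectors.append({t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                        for t in tf})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


docs = ["the cat sat on the mat",
        "the cat lay on the rug",
        "stock prices fell sharply today"]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # several shared terms: between 0 and 1
print(cosine_similarity(vecs[0], vecs[2]))  # no shared terms: 0.0
```

Because all weights are non-negative, the result always lies between 0 (no terms in common) and 1 (same direction in term space).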

TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as

    from sklearn.feature_extraction.text import TfidfVectorizer

    # text_files is a list of paths to the documents
    documents = [open(f).read() for f in text_files]
    tfidf = TfidfVectorizer().fit_transform(documents)
    # no need to normalize, since the vectorizer returns normalized tf-idf
    pairwise_similarity = tfidf * tfidf.T

or, if the documents are plain strings,

    >>> vect = TfidfVectorizer(min_df=1)
    >>> tfidf = vect.fit_transform(["I'd like an apple",
    ...                             "An apple a day keeps the doctor away",
    ...                             "Never compare an apple to an orange",
    ...                             "I prefer scikit-learn to Orange"])
    >>> (tfidf * tfidf.T).toarray()
    array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
           [ 0.25082859,  1.        ,  0.22057609,  0.        ],
           [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
           [ 0.        ,  0.        ,  0.26264139,  1.        ]])

though Gensim may have more options for this kind of task.
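Entry (i, j) of the matrix printed above is the similarity between documents i and j, so the matrix is symmetric and its diagonal is all ones. As a small sketch of how it might be read, the snippet below copies those numbers into a plain list of lists and looks up, for each document, the most similar other document. The helper name `most_similar` is made up for this example.

```python
# Pairwise similarity matrix, copied from the output shown above.
similarity = [
    [1.0,        0.25082859, 0.39482963, 0.0],
    [0.25082859, 1.0,        0.22057609, 0.0],
    [0.39482963, 0.22057609, 1.0,        0.26264139],
    [0.0,        0.0,        0.26264139, 1.0],
]


def most_similar(matrix, i):
    """Index of the document most similar to document i, ignoring i itself."""
    candidates = [j for j in range(len(matrix)) if j != i]
    return max(candidates, key=lambda j: matrix[i][j])


for i in range(len(similarity)):
    j = most_similar(similarity, i)
    print(f"doc {i} is closest to doc {j} (similarity {similarity[i][j]:.3f})")
```

The diagonal has to be skipped, because every document is trivially most similar to itself.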

See also this question.

[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]

