nlp - Use Python to print sentences belonging to the most common words in a document
I have a text document. Using regex and nltk, I find the top 5 most common words in the document. I have to print out the sentences these words belong to; how do I do that? Further, I want to extend this to finding the common words in multiple documents and returning their respective sentences.
import nltk
import re

document_text = open('test.txt', 'r')
text_string = document_text.read().lower()
# return words whose number of characters is in the range [3-15]
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
fdist = nltk.FreqDist(match_pattern)  # creates a frequency distribution
most_common = fdist.max()             # returns a single element
top_five = fdist.most_common(5)       # returns a list of (word, freq) tuples
list_5 = [word for (word, freq) in fdist.most_common(5)]
print(top_five)
print(list_5)
output:
[('you', 8), ('tuples', 8), ('the', 5), ('are', 5), ('pard', 5)] ['you', 'tuples', 'the', 'are', 'pard']
The output shows the most commonly occurring words. I have to print the sentences these words belong to; how do I do that?
Although it doesn't account for special characters at word boundaries the way your code does, the following is a starting point:
for sentence in text_string.split('.'):
    if set(list_5) & set(sentence.split(' ')):
        print(sentence)
We first iterate over the sentences, assuming each sentence ends with a '.' and that the '.' character appears nowhere else in the text. Afterwards, we print the sentence if the intersection of the set of words in list_5 with the set of words in the sentence is not empty.
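To extend this to multiple documents, the same idea can be applied per file: count words across all documents with a single counter, split each document into sentences, then look up each top word's sentences in each document. A minimal sketch using only the standard library (the top_words_with_sentences helper and the docs dictionary are hypothetical names, not from the original post):

```python
import re
from collections import Counter

def top_words_with_sentences(documents, n=5):
    """Map each of the n most common words across all documents to the
    sentences (per document) that contain it.
    `documents` is a hypothetical mapping of name -> raw text."""
    counts = Counter()
    sentences = {}  # document name -> list of its sentences
    for name, text in documents.items():
        text = text.lower()
        # same word pattern as above: 3-15 lowercase letters
        counts.update(re.findall(r'\b[a-z]{3,15}\b', text))
        # same naive assumption as above: sentences end with '.'
        sentences[name] = [s.strip() for s in text.split('.') if s.strip()]
    result = {}
    for word, _freq in counts.most_common(n):
        result[word] = {
            name: [s for s in sents if word in s.split()]
            for name, sents in sentences.items()
        }
    return result

docs = {
    'a.txt': 'The cat sat. The cat ran.',
    'b.txt': 'A dog ran. The dog sat down.',
}
print(top_words_with_sentences(docs, n=2))
```

In practice you would read each file's contents into the dictionary first (e.g. with open() in a loop over filenames); the nested dictionary result keeps the sentences grouped per document, which matches the "respective sentences" part of the question.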