nlp - Use Python to print sentences belonging to the most common words in a document
I have a text document. Using regex and nltk, I find the top 5 most common words in the document. I have to print out the sentences these words belong to; how do I do that? Further, I want to extend this to finding the common words in multiple documents and returning their respective sentences.
import nltk
import re

document_text = open('test.txt', 'r')
text_string = document_text.read().lower()
# return words whose number of characters is in the range [3-15]
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
fdist = nltk.FreqDist(match_pattern)  # creates a frequency distribution
most_common = fdist.max()             # returns a single element
top_five = fdist.most_common(5)       # returns a list of (word, freq) tuples
list_5 = [word for (word, freq) in fdist.most_common(5)]
print(top_five)
print(list_5)
output:
[('you', 8), ('tuples', 8), ('the', 5), ('are', 5), ('pard', 5)] ['you', 'tuples', 'the', 'are', 'pard']
The output shows the most commonly occurring words. I have to print the sentences these words belong to; how do I do that?
Although it doesn't account for special characters at word boundaries the way your code does, the following is a starting point:
for sentence in text_string.split('.'):
    if set(list_5) & set(sentence.split(' ')):
        print(sentence)
We first iterate over the sentences, assuming each sentence ends with a '.' and that the '.' character appears nowhere else in the text. Afterwards, we print the sentence if the intersection of the set of words in list_5 with the set of words in the sentence is not empty.
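To extend this to multiple documents, the same idea can be applied per file: count words across all documents with a single counter, split each document into sentences, then look up each top word's sentences in each document. A minimal sketch using only the standard library (the top_words_with_sentences helper and the docs dictionary are hypothetical names, not from the original post):

```python
import re
from collections import Counter

def top_words_with_sentences(documents, n=5):
    """Map each of the n most common words across all documents to the
    sentences (per document) that contain it.
    `documents` is a hypothetical mapping of name -> raw text."""
    counts = Counter()
    sentences = {}  # document name -> list of its sentences
    for name, text in documents.items():
        text = text.lower()
        # same word pattern as above: 3-15 lowercase letters
        counts.update(re.findall(r'\b[a-z]{3,15}\b', text))
        # same naive assumption as above: sentences end with '.'
        sentences[name] = [s.strip() for s in text.split('.') if s.strip()]
    result = {}
    for word, _freq in counts.most_common(n):
        result[word] = {
            name: [s for s in sents if word in s.split()]
            for name, sents in sentences.items()
        }
    return result

docs = {
    'a.txt': 'The cat sat. The cat ran.',
    'b.txt': 'A dog ran. The dog sat down.',
}
print(top_words_with_sentences(docs, n=2))
```

In practice you would read each file's contents into the dictionary first (e.g. with open() in a loop over filenames); the nested dictionary result keeps the sentences grouped per document, which matches the "respective sentences" part of the question.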