python - More efficient way to get various token count stats from array and list

May 15, 2010

I'm classifying a list of email texts (stored in CSV format) as spam, but before I can do this, I want to get some simple count stats from the output. I used CountVectorizer from sklearn as a first step and implemented it with the following code:

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer

    # import data from csv
    spam = pd.read_csv('spam.csv')
    spam['spam'] = np.where(spam['spam'] == 'spam', 1, 0)

    # split data (assuming 'text' holds the messages and 'spam' the 0/1 label set above)
    x_train, x_test, y_train, y_test = train_test_split(
        spam['text'], spam['spam'], random_state=0)

    # convert 'features' to numeric, then to a matrix or a list
    cv = CountVectorizer()
    x_traincv = cv.fit_transform(x_train)
    a = x_traincv.toarray()
    a_list = cv.inverse_transform(a)

The output, stored as a matrix (named 'a') or as a list of arrays (named 'a_list'), looks like this:

    [array(['do', 'i', 'off', 'text', 'where', 'you'], dtype='<U32'),
     array(['ages', 'will', 'did', 'driving', 'have', 'hello', 'hi', 'hol',
            'in', 'its', 'just', 'mate', 'message', 'nice', 'off', 'roads',
            'say', 'sent', 'so', 'started', 'stay'], dtype='<U32'),
     ...
     array(['biz', 'for', 'free', 'is', '1991', 'network', 'operator',
            'service', 'the', 'visit'], dtype='<U32')]

But I found it a little difficult to get simple count stats from these outputs, such as the longest/shortest token, the average length of tokens, etc. How can I get these simple count stats from the matrix or list output that I generated?

You can load the tokens, token counts, and token lengths into a new pandas DataFrame, then run your custom queries against it.

Here is a simple example on a toy data set.

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["dog cat fish", "dog cat cat", "fish bird walrus monkey", "bird lizard"]

    cv = CountVectorizer()
    cv_fit = cv.fit_transform(texts)

    # pair each token with its total count across all documents
    # https://stackoverflow.com/a/16078639/2491761
    # (get_feature_names_out() requires scikit-learn >= 1.0;
    #  older releases used get_feature_names())
    tokens_and_counts = zip(cv.get_feature_names_out(),
                            np.asarray(cv_fit.sum(axis=0)).ravel())

    df = pd.DataFrame(list(tokens_and_counts), columns=['token', 'count'])

    df['length'] = df.token.str.len()  # https://stackoverflow.com/a/29869577/2491761

    # all tokens with length equal to the min token length:
    df.loc[df['length'] == df['length'].min(), 'token']

    # all tokens with length equal to the max token length:
    df.loc[df['length'] == df['length'].max(), 'token']

    # all tokens with length less than the mean token length:
    df.loc[df['length'] < df['length'].mean(), 'token']

    # all tokens with length more than 1 standard deviation above the mean:
    df.loc[df['length'] > df['length'].mean() + df['length'].std(), 'token']

This can be extended if you want to do queries based on counts.
