R - Compare the bag of words in two documents and find the matching words and their frequency in the second document


I have calculated the bag of words for 'yelp.csv', 'yelpp.csv', and 'yelpn.csv', and created a matrix of each dataset's word frequencies. Now I want to compare the bag of words of yelp with that of yelpn, check how many words in yelp appear in yelpn along with their frequency, and store the result in a variable as a matrix. I want to do the same for yelpp. yelp contains both positive and negative text; yelpp is positive and yelpn is negative. Can anyone complete the code? I don't know whether this code is relevant, but I hope so.

library(tm)    # removeNumbers, removePunctuation, English stopword list
library(tau)   # remove_stopwords (assumed to come from this package)
library(qdap)  # bag_o_words, freq_terms

getwd()
setwd("/users/ash/rprojects/exc")
getwd()

# all (positive and negative)
df <- read.csv("yelp.csv", header = TRUE, quote = "\"", stringsAsFactors = TRUE,
               strip.white = TRUE)
df
dfd <- as.character(df[, 2])
dfd
df2 <- as.character(df[, 1])
df2
words <- readLines(system.file("stopwords", "english.dat", package = "tm"))
s <- remove_stopwords(dfd, words, lines = TRUE)
s
print(paste("****stopwords removed successfully****"))
n <- removeNumbers(s)
n
t <- removePunctuation(n, preserve_intra_word_dashes = FALSE)
t

# pos
dfp <- read.csv("yelpp.csv", header = TRUE, quote = "\"", stringsAsFactors = TRUE,
                strip.white = TRUE)
dfp
dfdp <- as.character(dfp[, 2])
dfdp
df2p <- as.character(dfp[, 1])
df2p
wordsp <- readLines(system.file("stopwords", "english.dat", package = "tm"))
sp <- remove_stopwords(dfdp, words, lines = TRUE)
sp
print(paste("****stopwords removed successfully****"))
np <- removeNumbers(sp)
np
tp <- removePunctuation(np, preserve_intra_word_dashes = FALSE)
tp

# neg
dfn <- read.csv("yelpn.csv", header = TRUE, quote = "\"", stringsAsFactors = TRUE,
                strip.white = TRUE)
dfn
dfdn <- as.character(dfn[, 2])
dfdn
df2n <- as.character(dfn[, 1])
df2n
wordsn <- readLines(system.file("stopwords", "english.dat", package = "tm"))
sn <- remove_stopwords(dfdn, words, lines = TRUE)
sn
print(paste("****stopwords removed successfully****"))
nn <- removeNumbers(sn)
nn
tn <- removePunctuation(nn, preserve_intra_word_dashes = FALSE)
tn

# bag
b <- bag_o_words(t, apostrophe.remove = TRUE)
b
b.mat = as.matrix(b)
b.mat
bp <- bag_o_words(tp, apostrophe.remove = TRUE)
bp
bp.mat = as.matrix(bp)
bp.mat
bn <- bag_o_words(tn, apostrophe.remove = TRUE)
bn
bn.mat = as.matrix(bn)
bn.mat

# frequent terms
frequent_terms <- freq_terms(b.mat, 2000)
frequent_terms
frequent_termsp <- freq_terms(tp, 2000)
frequent_termsp
frequent_termsn <- freq_terms(tn, 2000)
frequent_termsn

I'm taking the text for the example corpora from the Wikipedia article on text mining. The tm package and the findFreqTerms and agrep functions are the main points of this approach.

agrep

Searches for approximate matches to pattern (the first argument) within each element of the string x (the second argument) using the generalized Levenshtein edit distance (the minimal, possibly weighted, number of insertions, deletions and substitutions needed to transform one string into another).
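As a small illustration of what approximate matching means here, the sketch below runs agrep on a made-up term vector (the `terms` vector is hypothetical, not taken from the corpora). Because agrep looks for the pattern within each element, a short term can also match a longer word that contains it, which is different from an exact whole-word comparison.

```r
# Hypothetical term vector, made up for illustration
terms <- c("data", "database", "text", "mining")

# Approximate matching with the default max.distance (0.1).
# The pattern is searched for within each element, so "data"
# is also found inside "database".
approx_hits <- agrep("data", terms, value = TRUE)

# Exact whole-word comparison, for contrast
exact_hits <- terms[terms == "data"]

print(approx_hits)
print(exact_hits)
```

If only exact whole-word matches should count, a plain comparison or `intersect()` is probably the safer choice than agrep.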

Approach steps:

texts -> corpora -> data cleaning -> findFreqTerms -> compare with the other document-term matrix

library(tm)

c1 <- Corpus(VectorSource("Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning"))

c2 <- Corpus(VectorSource("Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output"))

c3 <- Corpus(VectorSource("Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities)"))

# Data cleaning and transformation
c1 <- tm_map(c1, content_transformer(tolower))
c2 <- tm_map(c2, content_transformer(tolower))
c3 <- tm_map(c3, content_transformer(tolower))

c1 <- tm_map(c1, removePunctuation)
c1 <- tm_map(c1, removeNumbers)
c1 <- tm_map(c1, removeWords, stopwords("english"))
c1 <- tm_map(c1, stripWhitespace)

c2 <- tm_map(c2, removePunctuation)
c2 <- tm_map(c2, removeNumbers)
c2 <- tm_map(c2, removeWords, stopwords("english"))
c2 <- tm_map(c2, stripWhitespace)

c3 <- tm_map(c3, removePunctuation)
c3 <- tm_map(c3, removeNumbers)
c3 <- tm_map(c3, removeWords, stopwords("english"))
c3 <- tm_map(c3, stripWhitespace)

# One document-term matrix per corpus
dtm1 <- DocumentTermMatrix(c1, control = list(weighting = weightTfIdf, stopwords = TRUE))
dtm2 <- DocumentTermMatrix(c2, control = list(weighting = weightTfIdf, stopwords = TRUE))
dtm3 <- DocumentTermMatrix(c3, control = list(weighting = weightTfIdf, stopwords = TRUE))

# Terms of each corpus
ft1 <- findFreqTerms(dtm1)
ft2 <- findFreqTerms(dtm2)
ft3 <- findFreqTerms(dtm3)

# Similarity between c1 and c2
common.c1c2 <- data.frame(term = character(0), freq = integer(0))
for (t in ft1) {
  # Indices of the terms in ft2 that approximately match t
  find <- agrep(t, ft2)
  if (length(find) != 0) {
    common.c1c2 <- rbind(common.c1c2, data.frame(term = t, freq = length(find)))
  }
}
# Note: the loop can be replaced with apply-family functions if it takes too long on large text

common.c1c2 contains the words common to corpus 1 and corpus 2, along with their frequency (the number of terms in ft2 that approximately match each term).

> common.c1c2
      term freq
1     also    1
2     data    2
3  derived    1
4 deriving    1
5   mining    1
6  pattern    1
7 patterns    1
8  process    1
9     text    1

> ft1
 [1] "also"        "analytics"   "data"        "derived"     "deriving"    "devising"    "equivalent"
 [8] "highquality" "information" "learning"    "means"       "mining"      "pattern"     "patterns"
[15] "process"     "referred"    "roughly"     "statistical" "text"        "trends"      "typically"

> ft2
 [1] "addition"       "along"          "data"           "database"       "derived"        "deriving"
 [7] "evaluation"     "features"       "finally"        "input"          "insertion"      "interpretation"
[13] "involves"       "linguistic"     "mining"         "others"         "output"         "parsing"
[19] "patterns"       "process"        "removal"        "structured"     "structuring"    "subsequent"
[25] "text"           "usually"        "within"
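The note in the code mentions that the loop can be replaced with an apply-family function. A minimal sketch of that idea with `vapply` follows. The `ft1` and `ft2` vectors here are small stand-ins built from a few of the terms printed above, not the full results, and `match_counts` and `common` are names chosen for this illustration.

```r
# Small stand-ins for ft1 and ft2 (a subset of the terms printed above)
ft1 <- c("analytics", "data", "mining", "text")
ft2 <- c("data", "database", "mining", "output", "text")

# For each term in ft1, count how many terms in ft2 approximately match it
match_counts <- vapply(ft1, function(term) length(agrep(term, ft2)), integer(1))

# Build the result in one step and keep only terms with at least one match
common <- data.frame(term = names(match_counts),
                     freq = unname(match_counts),
                     stringsAsFactors = FALSE)
common <- common[common$freq > 0, ]
rownames(common) <- NULL

print(common)
```

This avoids growing a data frame with `rbind` inside a loop, which tends to get slow as the number of terms increases.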

This solution is not the most efficient one, but I hope it helps.
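If what is needed is exact matches together with how often each shared word actually occurs in the second document, as the question asks, a base R sketch along these lines might be a starting point. The two text vectors are made-up stand-ins for already-cleaned yelp and yelpn text, and `count_words` is a hypothetical helper, not part of tm or qdap.

```r
# Hypothetical, already-cleaned texts standing in for yelp and yelpn
all_reviews <- c("great food great service", "slow service bad food")
neg_reviews <- c("bad food", "slow slow service")

# Hypothetical helper: split on whitespace and count each word
count_words <- function(texts) {
  words <- unlist(strsplit(tolower(texts), "\\s+"))
  table(words[nzchar(words)])
}

freq_all <- count_words(all_reviews)
freq_neg <- count_words(neg_reviews)

# Words that appear in both, with their frequency in the second document
shared <- intersect(names(freq_all), names(freq_neg))
shared_freq <- data.frame(
  term = shared,
  freq_in_neg = as.integer(freq_neg[shared]),
  stringsAsFactors = FALSE
)

print(shared_freq)
```

The same comparison could be repeated with the positive text to get the corresponding table for yelpp.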

