R - Compare the bag of words in two documents and find the matching words and their frequency in the second document
I have calculated the bag of words for 'yelp.csv', 'yelpp.csv' and 'yelpn.csv', and created a matrix of each individual dataset's word frequencies. Now I want to compare the bag of words of yelp with that of yelpn, check how many words in yelp appear in yelpn and with what frequency, and store the result in a variable as a matrix. Then I want to do the same for yelpp. yelp contains both positive and negative; yelpp is positive only and yelpn is negative only. Can anyone complete the code? I don't know whether this code is relevant, but I hope so.
library(tm)    # removeNumbers, removePunctuation, English stopword list
library(tau)   # remove_stopwords (assumed source package)
library(qdap)  # bag_o_words, freq_terms (assumed source package)

getwd()
setwd("/users/ash/rprojects/exc")
getwd()

# all (positive and negative)
df <- read.csv("yelp.csv", header = TRUE, quote = "\"", stringsAsFactors = TRUE, strip.white = TRUE)
df
dfd <- as.character(df[, 2])
dfd
df2 <- as.character(df[, 1])
df2
words <- readLines(system.file("stopwords", "english.dat", package = "tm"))
s <- remove_stopwords(dfd, words, lines = TRUE)
s
print(paste("****stopwords removed successfully****"))
n <- removeNumbers(s)
n
t <- removePunctuation(n, preserve_intra_word_dashes = FALSE)
t

# pos
dfp <- read.csv("yelpp.csv", header = TRUE, quote = "\"", stringsAsFactors = TRUE, strip.white = TRUE)
dfp
dfdp <- as.character(dfp[, 2])
dfdp
df2p <- as.character(dfp[, 1])
df2p
wordsp <- readLines(system.file("stopwords", "english.dat", package = "tm"))
sp <- remove_stopwords(dfdp, words, lines = TRUE)
sp
print(paste("****stopwords removed successfully****"))
np <- removeNumbers(sp)
np
tp <- removePunctuation(np, preserve_intra_word_dashes = FALSE)
tp

# neg
dfn <- read.csv("yelpn.csv", header = TRUE, quote = "\"", stringsAsFactors = TRUE, strip.white = TRUE)
dfn
dfdn <- as.character(dfn[, 2])
dfdn
df2n <- as.character(dfn[, 1])
df2n
wordsn <- readLines(system.file("stopwords", "english.dat", package = "tm"))
sn <- remove_stopwords(dfdn, words, lines = TRUE)
sn
print(paste("****stopwords removed successfully****"))
nn <- removeNumbers(sn)
nn
tn <- removePunctuation(nn, preserve_intra_word_dashes = FALSE)
tn

# bag
b <- bag_o_words(t, apostrophe.remove = TRUE)
b
b.mat = as.matrix(b)
b.mat
bp <- bag_o_words(tp, apostrophe.remove = TRUE)
bp
bp.mat = as.matrix(bp)
bp.mat
bn <- bag_o_words(tn, apostrophe.remove = TRUE)
bn
bn.mat = as.matrix(bn)
bn.mat

# frequent terms
frequent_terms <- freq_terms(b.mat, 2000)
frequent_terms
frequent_termsp <- freq_terms(tp, 2000)
frequent_termsp
frequent_termsn <- freq_terms(tn, 2000)
frequent_termsn
I'm taking the example text corpora from the wiki article on text mining. The tm package, findFreqTerms and the agrep function are the main points of this approach.
agrep searches for approximate matches to pattern (the first argument) within each element of the string x (the second argument) using the generalized Levenshtein edit distance (the minimal, possibly weighted, number of insertions, deletions and substitutions needed to transform one string into another).
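As a quick illustration of that behaviour, here is a minimal sketch on a small made-up term vector (`terms` is a hypothetical name, not part of the original code). It only uses the `value` and `max.distance` arguments of base R's `agrep`.

```r
# Hypothetical vector of terms to search in (not from the Yelp data)
terms <- c("data", "database", "derived", "mining")

# Default call returns the indices of elements that approximately match
idx <- agrep("data", terms)

# value = TRUE returns the matching strings instead of their indices
hits <- agrep("data", terms, value = TRUE)

# max.distance = 1 allows one edit, so the misspelling still finds "mining"
fuzzy <- agrep("minning", terms, max.distance = 1, value = TRUE)

print(idx)
print(hits)
print(fuzzy)
```

Note that agrep matches inside each element, so "data" also hits "database". If only whole-word matches are wanted, exact comparison with `%in%` may be a better fit.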
Steps of the approach:
texts -> corpora -> data cleaning -> findFreqTerms -> compare with the other document-term matrix
library(tm)

c1 <- Corpus(VectorSource("Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning"))
c2 <- Corpus(VectorSource("Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output"))
c3 <- Corpus(VectorSource("Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities)"))

# Data cleaning and transformation
c1 <- tm_map(c1, content_transformer(tolower))
c2 <- tm_map(c2, content_transformer(tolower))
c3 <- tm_map(c3, content_transformer(tolower))

c1 <- tm_map(c1, removePunctuation)
c1 <- tm_map(c1, removeNumbers)
c1 <- tm_map(c1, removeWords, stopwords("english"))
c1 <- tm_map(c1, stripWhitespace)

c2 <- tm_map(c2, removePunctuation)
c2 <- tm_map(c2, removeNumbers)
c2 <- tm_map(c2, removeWords, stopwords("english"))
c2 <- tm_map(c2, stripWhitespace)

c3 <- tm_map(c3, removePunctuation)
c3 <- tm_map(c3, removeNumbers)
c3 <- tm_map(c3, removeWords, stopwords("english"))
c3 <- tm_map(c3, stripWhitespace)

# One document-term matrix per corpus
dtm1 <- DocumentTermMatrix(c1, control = list(weighting = weightTfIdf, stopwords = TRUE))
dtm2 <- DocumentTermMatrix(c2, control = list(weighting = weightTfIdf, stopwords = TRUE))
dtm3 <- DocumentTermMatrix(c3, control = list(weighting = weightTfIdf, stopwords = TRUE))

ft1 <- findFreqTerms(dtm1)
ft2 <- findFreqTerms(dtm2)
ft3 <- findFreqTerms(dtm3)

# Similarity between c1 and c2:
# for each term of c1, count its approximate matches among the terms of c2
common.c1c2 <- data.frame(term = character(0), freq = integer(0))
for (t in ft1) {
  find <- agrep(t, ft2)
  if (length(find) != 0) {
    common.c1c2 <- rbind(common.c1c2, data.frame(term = t, freq = length(find)))
  }
}
# Note: this loop can be replaced with apply-family functions if it takes too long on large texts

common.c1c2 contains the common words between corpus 1 and corpus 2, with their frequency:
> common.c1c2
      term freq
1     also    1
2     data    2
3  derived    1
4 deriving    1
5   mining    1
6  pattern    1
7 patterns    1
8  process    1
9     text    1

> ft1
 [1] "also"        "analytics"   "data"        "derived"     "deriving"    "devising"    "equivalent"
 [8] "highquality" "information" "learning"    "means"       "mining"      "pattern"     "patterns"
[15] "process"     "referred"    "roughly"     "statistical" "text"        "trends"      "typically"

> ft2
 [1] "addition"       "along"          "data"           "database"       "derived"        "deriving"
 [7] "evaluation"     "features"       "finally"        "input"          "insertion"      "interpretation"
[13] "involves"       "linguistic"     "mining"         "others"         "output"         "parsing"
[19] "patterns"       "process"        "removal"        "structured"     "structuring"    "subsequent"
[25] "text"           "usually"        "within"

This solution is not the most efficient one, but I hope it helps.
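The note about replacing the loop with apply-family functions could look roughly like the sketch below. It is only an illustration: `ft1` and `ft2` are redefined here as short hypothetical vectors so the block runs on its own (in the answer they come from findFreqTerms), and `match_counts` and `common` are names introduced for this example.

```r
# Hypothetical term vectors; in the answer these come from findFreqTerms()
ft1 <- c("data", "derived", "mining", "trends")
ft2 <- c("data", "database", "derived", "mining", "output")

# For each term in ft1, count how many terms in ft2 approximately match it.
# vapply returns a named integer vector (names are the ft1 terms).
match_counts <- vapply(ft1, function(term) length(agrep(term, ft2)), integer(1))

# Build the result table once instead of growing it with rbind in a loop
common <- data.frame(term = names(match_counts),
                     freq = unname(match_counts),
                     stringsAsFactors = FALSE)

# Keep only the terms that matched at least once
common <- common[common$freq > 0, ]
rownames(common) <- NULL

print(common)
```

Building the data frame in one step avoids the repeated copying that rbind inside a loop causes. Keep in mind that freq here, as in the loop version, counts approximate matches among the second document's terms and is not the number of times the word occurs in that document's text.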