Text mining - Taking a random sample of a structural topic model object list
I would like to take a random sample of a structural topic model (stm) object list for experimentation before running on the full sample.
My current solution is the following:
    library(stm)
    library(quanteda)

    # Use data available in the stm package
    df <- gadarian

    # Convert the response vector to a corpus
    myCorpus <- corpus(df$open.ended.response)

    # Convert to a document-feature matrix
    dfm <- dfm(myCorpus, remove = c(stopwords("english")), ngrams = 1L, stem = FALSE,
               remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE)

    # Use the quanteda converter to turn our dfm into an stm object
    stmObject <- convert(dfm, to = "stm", docvars = docvars(myCorpus))

    # Work on a smaller sample for experimentation
    set.seed(10)
    small.df.rows <- sample(1:nrow(df), 10)

    # Subsample the data
    stmdf.sm <- list(documents = stmObject$documents[small.df.rows],
                     vocab = stmObject$vocab)   # dtm
    df.sm <- df[small.df.rows, ]                # meta-data

    # Preprocess
    ## out <- prepDocuments(stmObject$documents, stmObject$vocab, stmObject$meta,
    ##                      lower.thresh = 5)

    # Create the prevalence variable and place it in the global environment
    ## treatment <- df.sm$treatment

    # Run
    ## stmFit.sm <- stm(out$documents, out$vocab, K = 0, prevalence = ~ treatment,
    ##                  max.em.its = 150, init.type = "Spectral", seed = 300,
    ##                  verbose = TRUE, ngroups = 5)
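One variant of the subsampling step I have considered is to subset the dfm itself by the same row indices before converting, so the stm list and the separately kept covariate data frame describe the same documents. This is only a sketch, not something I have tested thoroughly; dfm.sm and stmObject.sm are names I am introducing here:

    # Sketch: restrict the dfm to the sampled documents, then convert,
    # keeping the covariates in a separate data frame with the same rows.
    set.seed(10)
    small.df.rows <- sample(seq_len(nrow(df)), 10)
    dfm.sm <- dfm[small.df.rows, ]               # dfm restricted to the sampled documents
    stmObject.sm <- convert(dfm.sm, to = "stm")  # stm-style list for the subsample
    df.sm <- df[small.df.rows, ]                 # covariates kept as a separate object
    # Terms that no longer occur in the subsample can be dropped later
    # with prepDocuments().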
I know one can create held-out data in stm by adding the covariates (df in the example above) to the dfm object; however, I prefer to keep them (the corpus and the covariates) as separate objects for post-estimation calculations.
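To make that concrete, here is a hedged sketch of how I imagine the fitting step on the subsample, with the covariate data frame entering the model only through stm()'s data argument (K = 3 and lower.thresh = 1 are arbitrary toy values for 10 documents):

    # Toy fitting step on the subsample; df.sm stays a separate data frame
    # and is only passed in via the data argument.
    out.sm <- prepDocuments(stmdf.sm$documents, stmdf.sm$vocab,
                            meta = df.sm, lower.thresh = 1)
    stmFit.sm <- stm(out.sm$documents, out.sm$vocab, K = 3,
                     prevalence = ~ treatment, data = out.sm$meta,
                     max.em.its = 150, init.type = "Spectral", seed = 300)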
I would appreciate any advice.
PS: The background to this question is that my stm estimations crash, and I have been unable to identify the reason. One issue might be the way I am creating the random samples (sometimes stm works, sometimes it does not). Or the issue might be converting the document-term matrix of the tm package into an stm corpus object with the following function: readCorpus(dtm, type = "slam"). I have filed a separate and more detailed issue report on GitHub (https://github.com/bstewart/stm/issues/89). If you have additional advice on that, I would appreciate it a lot too. DS
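For completeness, the tm-to-stm path referred to above, as I understand it (dtm here is assumed to be a tm::DocumentTermMatrix; the prepDocuments() step is my own addition, since documents that end up empty after preprocessing seem to be one common cause of failures):

    # tm's DocumentTermMatrix is a slam simple_triplet_matrix internally,
    # hence type = "slam" for the converter.
    stm.in <- readCorpus(dtm, type = "slam")
    # Drop rare terms and any documents left empty before fitting.
    prep <- prepDocuments(stm.in$documents, stm.in$vocab, lower.thresh = 5)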