r - Modify combination of LaTeX-generated indices into a useful data frame -

April 15, 2014

Several of my documents, output as PDFs through RStudio, .Rnw scripts, knitr, and LaTeX, have lengthy indices. readLines can read each document's index (a "file.ind" file) into R, and I have combined them into a single character vector.

The vector looks like the one below, where the PDF document indents \subitem terms (which I call 'secondary' terms) under an \item term (which I call 'primary' terms). The combined vector has more than 20 primary terms (such as data and statistics) and 2,000 secondary terms (such as 'distribution, bell curve, normal, 20'). A primary term can have anywhere from 5 to 50+ secondary terms.

    \item data
    \subitem distribution, bell curve, normal, 20
    \subitem absolute number, 111
    \subitem arithmetic mean, 21
    \subitem big data, 137
    \subitem binary, 110
    \subitem categorical, 130
    \item statistics
    \subitem count, 53
    \subitem data, 53, 129
    \subitem data, missing, 135
    \subitem digits, 53

The programming challenge is to "fill in" the primary term until the next primary term begins, then fill in the second primary term until the next one begins, and so on. Put differently, how can R create an object that looks like these two columns?

    primary     secondary
    data        distribution, bell curve, normal, 20
    data        absolute number, 111
    data        arithmetic mean, 21
    data        big data, 137
    data        binary, 110
    data        categorical, 130
    statistics  count, 53
    statistics  data, 53, 129
    statistics  data, missing, 135
    statistics  digits, 53

My goal is to save the modified, combined indices in Excel so that I can better standardize naming conventions, detect missing terms, fix misspellings, and more.

Thank you for any guidance.

I think the following should work (not fully tested, but the pieces were tested). You can read the input file line by line and handle two types of input:

The line starts with \item
- In this case, record the new primary label, but don't write to the data frame
The line starts with \subitem
- In this case, write a row to the data frame using the latest primary and the secondary from the current line

    df <- data.frame(primary = character(),
                     secondary = character(),
                     stringsAsFactors = FALSE)

    # replace 'filepath' with the actual path to your file
    con <- file(filepath, "r")
    primary <- NA
    secondary <- NA
    while (TRUE) {
        line <- readLines(con, n = 1)
        if (length(line) == 0) {
            break
        }
        if (substr(line, 2, 5) == 'item') {
            # new primary term: remember it, but don't add a row yet
            primary <- gsub("\\\\item\\s+(.*)", "\\1", line)
        }
        else if (substr(line, 2, 8) == 'subitem') {
            # secondary term: append one row pairing it with the latest primary
            secondary <- gsub("\\\\subitem\\s+(.*)", "\\1", line)
            df <- rbind(df, data.frame(primary = primary,
                                       secondary = secondary,
                                       stringsAsFactors = FALSE))
        }
        else {
            print(paste0("unexpected input: ", line))
        }
    }
    close(con)

Note that if you don't know what the actual path to your file is, you can find the file in Windows Explorer and copy the path out. On Linux you can grep for the file, or find its location and type pwd.
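The path can also be looked up from inside R. The sketch below is only an illustration: the folder `index_demo` and the file `report.ind` are made-up names created in a temporary directory so the example runs on its own. In an interactive session, `file.choose()` would open a file picker instead.

```r
# Hypothetical setup: a temporary folder holding one small index file,
# created here only so the example is self-contained.
project_dir <- file.path(tempdir(), "index_demo")
dir.create(project_dir, showWarnings = FALSE)
writeLines(c("\\item data", "\\subitem binary, 110"),
           file.path(project_dir, "report.ind"))

# List every .ind file under the folder, returning full paths.
ind_files <- list.files(project_dir, pattern = "\\.ind$",
                        full.names = TRUE, recursive = TRUE)

# Use the first match as the path for file(filepath, "r").
filepath <- normalizePath(ind_files[1])
print(filepath)
```

In a real project, `project_dir` would point at the folder where knitr and LaTeX wrote their output.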

Update:

Based on your comments, it sounds like the data you showed us may already be inside a character vector. You can simplify what I had above to the following:

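    df <- data.frame(primary = character(),
                     secondary = character(),
                     stringsAsFactors = FALSE)
    primary <- NA
    secondary <- NA
    for (i in seq_along(indxall)) {
        line <- indxall[i]
        if (substr(line, 2, 5) == 'item') {
            # new primary term: remember it, but don't add a row yet
            primary <- gsub("\\\\item\\s+(.*)", "\\1", line)
        }
        else if (substr(line, 2, 8) == 'subitem') {
            # secondary term: append one row pairing it with the latest primary
            secondary <- gsub("\\\\subitem\\s+(.*)", "\\1", line)
            df <- rbind(df, data.frame(primary = primary,
                                       secondary = secondary,
                                       stringsAsFactors = FALSE))
        }
        else {
            print(paste0("unexpected input: ", line))
        }
    }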
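With a couple of thousand entries, growing a data frame row by row can get slow, so a vectorized variant may be worth a look. This is a sketch of a different technique (a running count of \item lines used as a group index), not the loop above. The sample `indxall` and the output file name `combined_index.csv` are assumptions made for illustration.

```r
# Hypothetical sample; in practice indxall is the combined index vector.
indxall <- c("\\item data",
             "\\subitem distribution, bell curve, normal, 20",
             "\\subitem absolute number, 111",
             "\\item statistics",
             "\\subitem count, 53")

lines <- trimws(indxall)  # tolerate leading or trailing whitespace

is_primary   <- startsWith(lines, "\\item")
is_secondary <- startsWith(lines, "\\subitem")

# Strip the LaTeX command to get the primary labels, in order of appearance.
primary_terms <- sub("^\\\\item\\s+", "", lines[is_primary])

# For every line, count how many \item lines have been seen so far.
group <- cumsum(is_primary)

# The leading NA covers any \subitem that appears before the first \item.
df <- data.frame(
  primary   = c(NA, primary_terms)[group[is_secondary] + 1],
  secondary = sub("^\\\\subitem\\s+", "", lines[is_secondary]),
  stringsAsFactors = FALSE
)
print(df)

# Save as CSV, which Excel can open (written to a temp folder here).
out <- file.path(tempdir(), "combined_index.csv")
write.csv(df, out, row.names = FALSE)
```

Lines that are neither \item nor \subitem are silently skipped here, whereas the loop prints them as unexpected input.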
