r - Modify combination of LaTeX-generated indices into a useful data frame
Several of my documents that output PDFs through RStudio, .Rnw scripts, knitr, and LaTeX have lengthy indices. readLines can read each document's index (a "file.ind" file) into R, and I have combined them into a single character vector.
The vector looks like the sample below, where the PDF document indents \subitem terms (which I call 'secondary' terms) under an \item term (which I call 'primary' terms). The combined vector has more than 20 primary terms (such as data and statistics) and 2,000 secondary terms (such as 'distribution, bell curve, normal, 20'). A primary term can have from 5 to 50+ secondary terms.
    \item data
    \subitem distribution, bell curve, normal, 20
    \subitem absolute number, 111
    \subitem arithmetic mean, 21
    \subitem big data, 137
    \subitem binary, 110
    \subitem categorical, 130
    \item statistics
    \subitem count, 53
    \subitem data, 53, 129
    \subitem data, missing, 135
    \subitem digits, 53
The programming challenge is to "fill in" the primary term until the next primary term begins, then fill in the second primary term until the next one begins, and so on. Or, put differently, how can R create an object that looks like these two columns?
    primary      secondary
    data         distribution, bell curve, normal, 20
    data         absolute number, 111
    data         arithmetic mean, 21
    data         big data, 137
    data         binary, 110
    data         categorical, 130
    statistics   count, 53
    statistics   data, 53, 129
    statistics   data, missing, 135
    statistics   digits, 53
My goal is to save the modified, combined indices in Excel so that I can more easily standardize naming conventions, detect missing terms, fix misspellings, and more.
Thank you for any guidance.
I think the following should work (not fully tested, although the pieces were tested). You can read the input file line by line and handle two types of input:
- The line starts with \item: in this case, record the new primary label, but don't write to the data frame.
- The line starts with \subitem: in this case, write a row to the data frame using the latest primary and the secondary from the current line.
    df <- data.frame(primary = character(), secondary = character(),
                     stringsAsFactors = FALSE)

    # replace 'filepath' with the actual path to your file
    con <- file(filepath, "r")
    primary <- NA
    secondary <- NA

    while (TRUE) {
        line <- readLines(con, n = 1)
        if (length(line) == 0) {
            break
        }
        line <- trimws(line)  # in case the .ind lines are indented
        if (substr(line, 2, 5) == 'item') {
            # new primary term: remember it, but write nothing yet
            primary <- gsub("\\\\item\\s+(.*)", "\\1", line)
        } else if (substr(line, 2, 8) == 'subitem') {
            secondary <- gsub("\\\\subitem\\s+(.*)", "\\1", line)
            # append a new row instead of overwriting the whole column
            df[nrow(df) + 1, ] <- list(primary, secondary)
        } else {
            print(paste0("unexpected input: ", line))
        }
    }
    close(con)
Note: if you don't know what the actual path to your file is, find the file in Windows Explorer and copy the path out. On Linux you can grep for the file, or find its location and type pwd.
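If hunting for the path by hand is awkward, R can also locate the files itself. The following is only a sketch, assuming the index files end in .ind and live somewhere below one project folder. The name project_dir is a hypothetical placeholder; here it points at a temporary folder with two dummy index files so that the example runs on its own.

```r
# Hypothetical project folder; a temp folder with two dummy index files stands in for it
project_dir <- file.path(tempdir(), "ind_demo")
dir.create(file.path(project_dir, "doc1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "doc2"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("\\item data", "\\subitem binary, 110"),
           file.path(project_dir, "doc1", "file.ind"))
writeLines(c("\\item statistics", "\\subitem count, 53"),
           file.path(project_dir, "doc2", "file.ind"))

# Find every .ind file below the folder and return full paths
ind_files <- list.files(project_dir, pattern = "\\.ind$",
                        recursive = TRUE, full.names = TRUE)

# Read each file and combine all lines into a single character vector
indxall <- unlist(lapply(ind_files, readLines), use.names = FALSE)
```

For a single file in an interactive session, file.choose() opens a file picker and returns the chosen path, which may be simpler still.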
Update:
Based on your comments, it sounds like the data you showed may already be inside a character vector. In that case you can simplify what I had above to the following:
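    df <- data.frame(primary = character(), secondary = character(),
                     stringsAsFactors = FALSE)
    primary <- NA
    secondary <- NA

    for (i in seq_along(indxall)) {
        line <- trimws(indxall[i])  # in case the lines are indented
        if (substr(line, 2, 5) == 'item') {
            # new primary term: remember it, but write nothing yet
            primary <- gsub("\\\\item\\s+(.*)", "\\1", line)
        } else if (substr(line, 2, 8) == 'subitem') {
            secondary <- gsub("\\\\subitem\\s+(.*)", "\\1", line)
            # append a new row instead of overwriting the whole column
            df[nrow(df) + 1, ] <- list(primary, secondary)
        } else {
            print(paste0("unexpected input: ", line))
        }
    }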
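With roughly 2,000 secondary terms, growing a data frame one row at a time can get slow. The same "fill in" idea can probably be done without a loop. This is a sketch rather than a tested drop-in: it swaps the loop for cumsum() over a logical vector, uses a shortened copy of the sample from the question, and assumes every \subitem is preceded by at least one \item.

```r
# Shortened sample from the question; real .ind lines may be indented
indxall <- c(
  "\\item data",
  "  \\subitem distribution, bell curve, normal, 20",
  "  \\subitem absolute number, 111",
  "\\item statistics",
  "  \\subitem count, 53",
  "  \\subitem data, 53, 129"
)

lines   <- trimws(indxall)
is_item <- grepl("^\\\\item\\s", lines)     # primary lines
is_sub  <- grepl("^\\\\subitem\\s", lines)  # secondary lines

# Strip the LaTeX command, leaving only the term text
labels <- sub("^\\\\(sub)?item\\s+", "", lines)

# cumsum() counts the \item lines seen so far, so each \subitem
# carries the position of the most recent \item among the primaries
group <- cumsum(is_item)

# Assumes no \subitem appears before the first \item (group >= 1)
df <- data.frame(
  primary   = labels[is_item][group[is_sub]],
  secondary = labels[is_sub],
  stringsAsFactors = FALSE
)
```

From there, write.csv(df, "indices.csv", row.names = FALSE) would give a file that Excel can open; the file name is only an example.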