encoding - Working with log files in R
I have a .log file that has an inconsistent data format.
The data looks like this, stored as "Little-endian UTF-16 Unicode" text:
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
[xyz 1000 t1]:1
2017-06-22 01:15:17.945 nothing 'd': 989
[case] in: [id: 1010]33
[case] in: [id: 2010]8
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
323133.....238813 76378 989899 000000000000
Now, I have several log files that follow this kind of pattern. I have tried scan() and read.table(), but neither returns the data in the format I expect.
The data format I am expecting looks like this:
date                     string
2017-06-21 00:00:30.483  start thing
But I have these lines multiple times in the log files:
[case] in: [id: 1010]33
[case] in: [id: 2010]8
and this:
323133.....238813 76378 989899 000000000000
What is the best way to approach a solution? Thanks!
Just a raw sketch (ignoring the time part of the timestamp and the column names) using base R, without performance optimisation (like using data.table::fread or the package lubridate):
# Sample log content, one entry per line
log.data <- "2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
[xyz 1000 t1]:1
2017-06-22 01:15:17.945 nothing 'd': 989
[case] in: [id: 1010]33
[case] in: [id: 2010]8
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
323133.....238813 76378 989899 000000000000"
# Read each line as a single-column row (the newline is the only separator)
log <- read.csv(text = log.data, sep = "\n", header = FALSE)
# Lines that do not start with a date become NA
log$timestamp <- as.Date(log[, 1])
This results in:
> log
                                            V1  timestamp
1         2017-06-21 00:00:56.400 else happens 2017-06-21
2                              [xyz 1000 t1]:1       <NA>
3     2017-06-22 01:15:17.945 nothing 'd': 989 2017-06-22
4                      [case] in: [id: 1010]33       <NA>
5                       [case] in: [id: 2010]8       <NA>
6          2017-06-21 00:00:30.483 start thing 2017-06-21
7         2017-06-21 00:00:56.400 else happens 2017-06-21
8          2017-06-21 00:00:30.483 start thing 2017-06-21
9         2017-06-21 00:00:56.400 else happens 2017-06-21
10         2017-06-21 00:00:30.483 start thing 2017-06-21
11        2017-06-21 00:00:56.400 else happens 2017-06-21
12 323133.....238813 76378 989899 000000000000       <NA>
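To get closer to the date/string layout asked for in the question, the lines could also be split with a regular expression. This is only a minimal sketch in base R, assuming that every real log entry starts with a timestamp of the form YYYY-MM-DD HH:MM:SS.mmm. The names parse_log_lines and ts_pattern and the column names date and string are made up for illustration. Lines without a leading timestamp keep an NA date, so they can be dropped or inspected separately.

```r
# Hypothetical helper: split raw log lines into a timestamp and a message.
# Assumption: real entries start with "YYYY-MM-DD HH:MM:SS.mmm ".
parse_log_lines <- function(lines) {
  ts_pattern <- "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d+) (.*)$"
  has_ts <- grepl(ts_pattern, lines, perl = TRUE)

  # Timestamp text for matching lines, NA for everything else
  ts_text <- ifelse(has_ts, sub(ts_pattern, "\\1", lines, perl = TRUE), NA_character_)
  # Message: the remainder after the timestamp, or the whole line otherwise
  msg <- ifelse(has_ts, sub(ts_pattern, "\\2", lines, perl = TRUE), lines)

  data.frame(
    date   = as.POSIXct(ts_text, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC"),
    string = msg,
    stringsAsFactors = FALSE
  )
}

# A few lines taken from the sample data in the question
raw <- c(
  "2017-06-21 00:00:30.483 start thing",
  "2017-06-21 00:00:56.400 else happens",
  "[xyz 1000 t1]:1",
  "323133.....238813 76378 989899 000000000000"
)
parsed <- parse_log_lines(raw)

# Keep only the entries that carry a timestamp
entries <- parsed[!is.na(parsed$date), ]
```

Unlike as.Date(), the as.POSIXct() call keeps the time part, including fractional seconds via %OS. The time zone "UTC" is an assumption, since the log itself does not state one.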
Update 1:
Since it turned out that the log file uses UTF-16 little-endian file encoding (checked with the file command of Linux/OSX in the terminal), you have to add the file encoding to read.csv to let R convert the file content correctly during reading:
log <- read.csv(file = "my.log", sep = "\n", header = FALSE, fileEncoding = "UTF-16LE", encoding = "UTF-8")
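An alternative that avoids read.csv altogether is to open a connection with an explicit encoding and read the file with readLines(). The following is a minimal sketch, assuming the file really is UTF-16LE. The helper name read_utf16_log is hypothetical, and the temporary file only stands in for the real log.

```r
# Hypothetical helper: read a UTF-16LE text file into a character vector.
read_utf16_log <- function(path) {
  # The connection re-encodes the bytes while reading
  con <- file(path, open = "r", encoding = "UTF-16LE")
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Stand-in for the real log: write two sample lines as UTF-16LE
tmp <- tempfile(fileext = ".log")
out <- file(tmp, open = "w", encoding = "UTF-16LE")
writeLines(c("2017-06-21 00:00:30.483 start thing", "[xyz 1000 t1]:1"), out)
close(out)

log_lines <- read_utf16_log(tmp)
```

The result is a plain character vector with one element per log line, which can then be filtered or split without going through a CSV parser.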