encoding - Working with log files in R


I have a .log file that has an inconsistent data format.

The data looks like this, stored as "Little-endian UTF-16 Unicode" text:

2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
     [xyz 1000 t1]:1
2017-06-22 01:15:17.945 nothing 'd': 989
     [case] in: [id: 1010]33
     [case] in: [id: 2010]8
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
 323133.....238813   76378    989899 000000000000

Now, I have several log files that follow this kind of pattern. I have tried scan() and read.table(), but neither returns the data in the format I expect.

The data format I am expecting looks like this:

date                          string
2017-06-21 00:00:30.483       start thing

But I have these lines multiple times in the log files:

 [case] in: [id: 1010]33
 [case] in: [id: 2010]8

And this:

323133.....238813   76378    989899 000000000000 

What is the best way to approach a solution? Thanks!

Just a raw sketch (ignoring the time part of the timestamp and the column names) using base R without performance optimisation (like using data.table::fread or the package lubridate):

log.data <- "2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
     [xyz 1000 t1]:1
2017-06-22 01:15:17.945 nothing 'd': 989
     [case] in: [id: 1010]33
     [case] in: [id: 2010]8
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
2017-06-21 00:00:30.483 start thing
2017-06-21 00:00:56.400 else happens
 323133.....238813   76378    989899 000000000000"

# sep = "\n" makes every line a single field, so each log line becomes one row
log <- read.csv(text = log.data, sep = "\n", header = FALSE)
# lines that do not start with a date become NA
log$timestamp <- as.Date(log[,1])

This results in:

> log
                                                 V1  timestamp
1              2017-06-21 00:00:56.400 else happens 2017-06-21
2                                   [xyz 1000 t1]:1       <NA>
3          2017-06-22 01:15:17.945 nothing 'd': 989 2017-06-22
4                           [case] in: [id: 1010]33       <NA>
5                            [case] in: [id: 2010]8       <NA>
6               2017-06-21 00:00:30.483 start thing 2017-06-21
7              2017-06-21 00:00:56.400 else happens 2017-06-21
8               2017-06-21 00:00:30.483 start thing 2017-06-21
9              2017-06-21 00:00:56.400 else happens 2017-06-21
10              2017-06-21 00:00:30.483 start thing 2017-06-21
11             2017-06-21 00:00:56.400 else happens 2017-06-21
12 323133.....238813   76378    989899 000000000000       <NA>
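The sketch above keeps each line whole and only extracts the date. To get closer to the two-column layout asked for in the question (date and string), one possible approach is to split each line with a regular expression. This is only a sketch under assumptions: the pattern assumes every event line starts with a timestamp of the form YYYY-MM-DD HH:MM:SS.mmm followed by a single space, the time zone is set to UTC arbitrarily, and the names lines, pattern, has_ts, parsed and events are made up for illustration.

```r
# A few lines taken from the log excerpt in the question.
lines <- c(
  "2017-06-21 00:00:30.483 start thing",
  "2017-06-21 00:00:56.400 else happens",
  "     [xyz 1000 t1]:1",
  "2017-06-22 01:15:17.945 nothing 'd': 989",
  "     [case] in: [id: 1010]33",
  "323133.....238813   76378    989899 000000000000"
)

# Assumed line layout: timestamp with milliseconds, one space, free text.
pattern <- "^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\\.[0-9]{3}) (.*)$"
has_ts  <- grepl(pattern, lines)

parsed <- data.frame(
  # timestamp text for event lines, NA for everything else
  date   = ifelse(has_ts, sub(pattern, "\\1", lines), NA_character_),
  # message text for event lines, the trimmed raw line otherwise
  string = ifelse(has_ts, sub(pattern, "\\2", lines), trimws(lines)),
  stringsAsFactors = FALSE
)

# %OS also reads the fractional seconds; the UTC time zone is an assumption.
parsed$date <- as.POSIXct(parsed$date, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Keep only the lines that carry a timestamp.
events <- parsed[has_ts, ]
```

Whether the lines without a timestamp should be dropped, kept with an NA date, or attached to the preceding event depends on what they mean in your log files, so treat the last step as optional.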

Update 1:

Since it turns out that the log file uses the UTF-16 little-endian file encoding (checked with the file command of Linux/OSX in a terminal), you have to add the file encoding to read.csv to let R convert the file content correctly during reading:

log <- read.csv(file = "my.log", sep = "\n", header = FALSE, fileEncoding = "UTF-16LE", encoding = "UTF-8")
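If you prefer to work with the raw lines instead of a data frame, the same conversion can probably be done by declaring the encoding on a file connection and calling readLines. The following is a minimal sketch, not a tested recipe for your files: it writes a small hypothetical sample to a temporary file (standing in for "my.log") in UTF-16LE and reads it back. It assumes that the iconv implementation on your platform supports UTF-16LE.

```r
# Hypothetical sample content; a temporary file stands in for "my.log".
sample_lines <- c("2017-06-21 00:00:30.483 start thing",
                  "     [case] in: [id: 1010]33")
tmp <- tempfile(fileext = ".log")

# Write the sample as UTF-16LE so that there is something to read back.
out <- file(tmp, open = "w", encoding = "UTF-16LE")
writeLines(sample_lines, out)
close(out)

# Read it back, declaring the encoding on the connection.
con <- file(tmp, open = "r", encoding = "UTF-16LE")
log_lines <- readLines(con)
close(con)
unlink(tmp)
```

The resulting character vector has one element per log line and can then be filtered or split with grepl and sub.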
