r - Odd behaviour of data.table's update on non-equi self-join -
while preparing answer question dplyr or data.table calculate time series aggregations in r noticed different results depending on whether table updated in place or returned new object. also, different result when change order of columns in non-equi join conditions.
currently, don't have explanation this, perhaps due major misunderstanding on side or simple coding error.
please, note question asking particularly explanations of observed behaviour of
data.tablejoins. if have alternative solutions underlying problem, please, feel free post answer original question.
original question , working answer
the original question how count number of hospitalizations occuring in 365 days before hospitalization (including actual one) each patient using these data:
library(data.table) # version 1.10.4 (cran) or 1.10.5 (devel built 2017-08-19) dt0 <- data.table( patient.id = c(1l, 2l, 1l, 1l, 2l, 2l, 2l), hospitalization.date = as.date(c("2013/10/15", "2014/10/15", "2015/7/16", "2016/1/7", "2015/12/20", "2015/12/25", "2016/2/10"))) setorder(dt0, patient.id, hospitalization.date) dt0 patient.id hospitalization.date 1: 1 2013-10-15 2: 1 2015-07-16 3: 1 2016-01-07 4: 2 2014-10-15 5: 2 2015-12-20 6: 2 2015-12-25 7: 2 2016-02-10
the code below gives expected answer (additional helper column added here clarity)
# add helper columns dt0[, start.date := hospitalization.date - 365][ , end.date := hospitalization.date][] dt0 patient.id hospitalization.date start.date end.date 1: 1 2013-10-15 2012-10-15 2013-10-15 2: 1 2015-07-16 2014-07-16 2015-07-16 3: 1 2016-01-07 2015-01-07 2016-01-07 4: 2 2014-10-15 2013-10-15 2014-10-15 5: 2 2015-12-20 2014-12-20 2015-12-20 6: 2 2015-12-25 2014-12-25 2015-12-25 7: 2 2016-02-10 2015-02-10 2016-02-10
result <- dt0[dt0, on = c("patient.id", "hospitalization.date>=start.date", "hospitalization.date<=end.date"), .(hospitalizations.last.year = .n), = .eachi][] result patient.id hospitalization.date hospitalization.date hospitalizations.last.year 1: 1 2012-10-15 2013-10-15 1 2: 1 2014-07-16 2015-07-16 1 3: 1 2015-01-07 2016-01-07 2 4: 2 2013-10-15 2014-10-15 1 5: 2 2014-12-20 2015-12-20 1 6: 2 2014-12-25 2015-12-25 2 7: 2 2015-02-10 2016-02-10 3
except renamed , duplicated column names (which left comparison).
for patient.id == 2, result in last row 3 because patient hospitalized on 2016-02-10 third time since 2015-02-10.
update on join in place
result new data.table object occupies additional memory. tried update original data.table object in place using:
# use copy of dt0 can safely modified dt <- copy(dt0) dt[dt, on = c("patient.id", "hospitalization.date>=start.date", "hospitalization.date<=end.date"), hospitalizations.last.year := .n, = .eachi] dt patient.id hospitalization.date start.date end.date hospitalizations.last.year 1: 1 2013-10-15 2012-10-15 2013-10-15 1 2: 1 2015-07-16 2014-07-16 2015-07-16 2 3: 1 2016-01-07 2015-01-07 2016-01-07 2 4: 2 2014-10-15 2013-10-15 2014-10-15 1 5: 2 2015-12-20 2014-12-20 2015-12-20 3 6: 2 2015-12-25 2014-12-25 2015-12-25 3 7: 2 2016-02-10 2015-02-10 2016-02-10 3
dthas been updated in place rows 5 , 6 show 3 hospitalizations instead of 1 or 2, resp. seems total number of hospitalizations within last period returned each of rows.
change order of columns in conditions.
also order of columns in non-equi join conditions matter, in self-join:
result <- dt0[dt0, on = c("patient.id", "start.date<=hospitalization.date", "end.date>=hospitalization.date"), .(hospitalizations.last.year = .n), = .eachi][] result my expectation "start.date<=hospitalization.date" equivalent "hospitalization.date>=start.date" (note < , > switched) result
patient.id start.date end.date hospitalizations.last.year 1: 1 2013-10-15 2013-10-15 1 2: 1 2015-07-16 2015-07-16 2 3: 1 2016-01-07 2016-01-07 1 4: 2 2014-10-15 2014-10-15 1 5: 2 2015-12-20 2015-12-20 3 6: 2 2015-12-25 2015-12-25 2 7: 2 2016-02-10 2016-02-10 1
is different. seems number of forthcoming hospitalizations being counted
interestingly, update in place return same result (except of column names):
# use copy of dt0 can safely modified dt <- copy(dt0) dt[dt, on = c("patient.id", "start.date<=hospitalization.date", "end.date>=hospitalization.date"), hospitalizations.last.year := .n, = .eachi] dt patient.id hospitalization.date start.date end.date hospitalizations.last.year 1: 1 2013-10-15 2012-10-15 2013-10-15 1 2: 1 2015-07-16 2014-07-16 2015-07-16 2 3: 1 2016-01-07 2015-01-07 2016-01-07 1 4: 2 2014-10-15 2013-10-15 2014-10-15 1 5: 2 2015-12-20 2014-12-20 2015-12-20 3 6: 2 2015-12-25 2014-12-25 2015-12-25 2 7: 2 2016-02-10 2015-02-10 2016-02-10 1
related
there potentially related question led issue reported on github.
there answer arun concerning usage of x. prefix non-equi joins.
the grouping by=.eachi means "by each i" not "by each x".
# readability / sanity dt = copy(dt0) setnames(dt, "hospitalization.date", "h.date") z = dt[dt, on = .(patient.id, h.date >= start.date, h.date <= end.date), .(x.h.date, patient.id, i.start.date, i.end.date, g = .grp, .n) , by=.eachi][, utils:::tail.default(.sd, 6)] x.h.date patient.id i.start.date i.end.date g n 1: 2013-10-15 1 2012-10-15 2013-10-15 1 1 * 2: 2015-07-16 1 2014-07-16 2015-07-16 2 1 3: 2015-07-16 1 2015-01-07 2016-01-07 3 2 * 4: 2016-01-07 1 2015-01-07 2016-01-07 3 2 * 5: 2014-10-15 2 2013-10-15 2014-10-15 4 1 * 6: 2015-12-20 2 2014-12-20 2015-12-20 5 1 7: 2015-12-20 2 2014-12-25 2015-12-25 6 2 8: 2015-12-25 2 2014-12-25 2015-12-25 6 2 9: 2015-12-20 2 2015-02-10 2016-02-10 7 3 * 10: 2015-12-25 2 2015-02-10 2016-02-10 7 3 * 11: 2016-02-10 2 2015-02-10 2016-02-10 7 3 * for patient 1, groups are
.(start.date = 2012-10-15, end.date = 2013-10-15), count of 1.(start.date = 2014-07-16, end.date = 2015-07-16), count of 1.(start.date = 2015-01-07, end.date = 2016-01-07), count of 2
it luck there both 7 groups in join , 7 rows in original table.
for tougher issue, i'll borrow example notes:
beware multiple matches in update join. when there multiple matches, update join apparently use last one. unfortunately, done silently. try:
a = data.table(id = c(1l, 1l, 2l, 3l, na_integer_), t = c(1l, 2l, 1l, 2l, na_integer_), x = 11:15) b = data.table(id = 1:2, y = c(11l, 15l)) b[a, on=.(id), x := i.x, verbose = true ][] # calculated ad hoc index in 0 secs # starting bmerge ...done in 0.02 secs # detected j uses these columns: x,i.x # assigning 3 row subset of 2 rows # id y x # 1: 1 11 12 # 2: 2 15 13with verbose on, see helpful message assignment “to 3 row subset of 2 rows.”
-- modified "quick r tutorial", section "updating in join"
in op's case, verbose=true not offer such message, unfortunately.
dt[dt, on = .(patient.id, h.date >= start.date, h.date <= end.date), n := .n, = .eachi, verbose=true] # non-equi join operators detected ... # forder took ... 0.01 secs # generating group lengths ... done in 0 secs # generating non-equi group ids ... done in 0 secs # found 1 non-equi group(s) ... # starting bmerge ...done in 0.02 secs # detected j uses these columns: <none> # lapply optimization on, j unchanged '.n' # making each group , running j (gforce false) ... # memcpy contiguous groups took 0.000s 7 groups # eval(j) took 0.000s 7 calls # 0.01 secs however, can see last row per x group contain value op sees. i've manually marked these asterisks above. alternately, mark them z[, mrk := replace(rep(0, .n), .n, 1), by=x.h.date].
for reference, update join here is...
dt[, n := .sd[.sd, on = .(patient.id, h.date >= start.date, h.date <= end.date), .n, by=.eachi]$n ] patient.id hospitalization.date start.date end.date h.date n 1: 1 2013-10-15 2012-10-15 2013-10-15 2013-10-15 1 2: 1 2015-07-16 2014-07-16 2015-07-16 2015-07-16 1 3: 1 2016-01-07 2015-01-07 2016-01-07 2016-01-07 2 4: 2 2014-10-15 2013-10-15 2014-10-15 2014-10-15 1 5: 2 2015-12-20 2014-12-20 2015-12-20 2015-12-20 1 6: 2 2015-12-25 2014-12-25 2015-12-25 2015-12-25 2 7: 2 2016-02-10 2015-02-10 2016-02-10 2016-02-10 3 this correct/idiomatic way handle case, of adding columns x based on looking each row of x in table , computing summary of result:
x[, v := dt2[.sd, on=, j, by=.eachi]$v1 ]
Comments
Post a Comment