R - Comparing between groups in a grouped data frame
I am trying to perform a comparison between items in subsequent groups in a data frame. I guess this is pretty easy when you know what you are doing...
My data set can be represented as follows:
    set.seed(1)
    data <- data.frame(
      date = c(rep('2015-02-01', 15), rep('2015-02-02', 16), rep('2015-02-03', 15)),
      id = as.character(c(1005 + sample.int(10, 15, replace = TRUE),
                          1005 + sample.int(10, 16, replace = TRUE),
                          1005 + sample.int(10, 15, replace = TRUE)))
    )
Which yields a data frame that looks like:
    date        id
    1/02/2015   1008
    1/02/2015   1009
    1/02/2015   1011
    1/02/2015   1015
    1/02/2015   1008
    1/02/2015   1014
    1/02/2015   1015
    1/02/2015   1012
    1/02/2015   1012
    1/02/2015   1006
    1/02/2015   1008
    1/02/2015   1007
    1/02/2015   1012
    1/02/2015   1009
    1/02/2015   1013
    2/02/2015   1010
    2/02/2015   1013
    2/02/2015   1015
    2/02/2015   1009
    2/02/2015   1013
    2/02/2015   1015
    2/02/2015   1008
    2/02/2015   1012
    2/02/2015   1007
    2/02/2015   1008
    2/02/2015   1009
    2/02/2015   1006
    2/02/2015   1009
    2/02/2015   1014
    2/02/2015   1009
    2/02/2015   1010
    3/02/2015   1011
    3/02/2015   1010
    3/02/2015   1007
    3/02/2015   1014
    3/02/2015   1012
    3/02/2015   1013
    3/02/2015   1007
    3/02/2015   1013
    3/02/2015   1010
I then want to group the data by date (group_by) and filter out duplicates (distinct) before comparing between groups. I want to determine, day by day, which new IDs are added and which IDs leave. So day 1 and day 2 would be compared to determine the IDs in day 2 that were not in day 1 and the IDs in day 1 that are not present in day 2, then the same comparisons would be made between day 2 and day 3, etc.
The comparison can be done using anti_join (dplyr), but I don't know how to reference individual groups in the dataset.
My attempt (or one of my attempts) looks like:
    data %>%
      group_by(date) %>%
      distinct(id) %>%
      do(lost = anti_join(., lag(.), by = "id"))
But of course this does not work; I just get:
    Error in anti_join_impl(x, y, by$x, by$y) :
      Can't join on 'id' x 'id' because of incompatible types (factor / logical)
Is what I am attempting even possible, or should I be looking at writing a clunky function to do it?
Just add the argument stringsAsFactors = FALSE to your data frame. This will make your code run, although I am not sure whether the outputted result is the one you are looking for. To view the whole result, pipe it to data.frame and see whether it is what you are looking for. Hope this helps.
    library(dplyr)

    set.seed(1)
    data <- data.frame(
      date = c(rep('2015-02-01', 15), rep('2015-02-02', 16), rep('2015-02-03', 15)),
      id = as.character(c(1005 + sample.int(10, 15, replace = TRUE),
                          1005 + sample.int(10, 16, replace = TRUE),
                          1005 + sample.int(10, 15, replace = TRUE))),
      stringsAsFactors = FALSE  # keep id as character rather than factor
    )

    data %>%
      group_by(date) %>%
      distinct(id) %>%
      do(lost = anti_join(., lag(.), by = "id")) %>%
      data.frame()
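If the goal is the set of IDs gained and lost from one day to the next, here is a minimal base R sketch of that comparison. It is not the dplyr approach from the question: it swaps anti_join for setdiff on per-day vectors of unique IDs, and it assumes the dates are ISO-formatted strings so that alphabetical order is also chronological order. The names ids_by_day and changes are my own, chosen for illustration.

```r
set.seed(1)
data <- data.frame(
  date = c(rep("2015-02-01", 15), rep("2015-02-02", 16), rep("2015-02-03", 15)),
  id = as.character(c(1005 + sample.int(10, 15, replace = TRUE),
                      1005 + sample.int(10, 16, replace = TRUE),
                      1005 + sample.int(10, 15, replace = TRUE))),
  stringsAsFactors = FALSE
)

# One vector of unique IDs per day; split() orders the days alphabetically,
# which is chronological for ISO dates (assumption stated above).
ids_by_day <- lapply(split(data$id, data$date), unique)
days <- names(ids_by_day)

# Compare each day with the day before it.
changes <- lapply(seq_along(days)[-1], function(i) {
  previous <- ids_by_day[[i - 1]]
  current  <- ids_by_day[[i]]
  list(
    from  = days[i - 1],
    to    = days[i],
    added = setdiff(current, previous),  # in the later day, not in the earlier one
    lost  = setdiff(previous, current)   # in the earlier day, not in the later one
  )
})
names(changes) <- paste(days[-length(days)], days[-1], sep = " -> ")

str(changes)
```

Each element of changes covers one pair of consecutive days, so the day 1 vs. day 2 and day 2 vs. day 3 comparisons come out as separate entries. The same thing could presumably be done with anti_join on two per-day subsets; setdiff is just the shortest way to express it for a single column.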