Complex sum function by grouping two variables within a dataframe in R -
for below dataframe entry:
df <- data.frame(major=c("math","math","math","mlsp","mlsp","mlsp","biol","biol","biol","pshy","pshy","pshy"), age=(c(2,3,4,2,3,4,2,3,4,2,3,4)), mjr_1=c("biol","pshy","agbu","math","pshy",0,"mlsp","math",0,0,"math","mlsp"), trf_mjr_1=(c(7,2,2,3,2,0,3,2,0,0,2,2)), mjr_2=c("pshy","biol",0,"biol","math",0,"math","pshy",0,0,"mlsp","biol"), trf_mjr_2=(c(3,1,0,2,1,0,2,4,0,0,1,2)), mjr_3=c(0,0,0,0,"biol",0,0,0,0,0,0,0), trf_mjr_3=(c(0,0,0,0,1,0,0,0,0,0,0,0))) we following dataframe:
major age mjr_1 trf_mjr_1 mjr_2 trf_mjr_2 mjr_3 trf_mjr_3 1 math 2 biol 7 pshy 3 0 0 2 math 3 pshy 2 biol 1 0 0 3 math 4 agbu 2 0 0 0 0 4 mlsp 2 math 3 biol 2 0 0 5 mlsp 3 pshy 2 math 1 biol 1 6 mlsp 4 0 0 0 0 0 0 7 biol 2 mlsp 3 math 2 0 0 8 biol 3 math 2 pshy 4 0 0 9 biol 4 0 0 0 0 0 0 10 pshy 2 0 0 0 0 0 0 11 pshy 3 math 2 mlsp 1 0 0 12 pshy 4 mlsp 2 biol 2 0 0 well as, need have below output table having sum function entitled “trf_in - transferred in” column major adds number of each transferred majors (trf_mjr_1, trf_mjr_2, etc...) in appropriate grouping category of (major , age) in first 2 columns; although “major” category change according mjr_1, mjr_2, etc... shown below.
i appreciate avoid multiple “merge” or “ddply” functions since actual file big , has many variables..
major age trf_in_sum 1: math 2 5 2: math 3 5 3: math 4 0 4: mlsp 2 3 5: mlsp 3 1 6: mlsp 4 2 7: biol 2 9 8: biol 3 2 9: biol 4 2 10: pshy 2 3 11: pshy 3 8 12: pshy 4 0 **13: agbu 4 2** explanation of output table:
row1: math major age 2: trf_in = "3" trf_mjr_1 in row(4) having mjr_1= math , age =2 + trf_in = "2" trf_mjr_2 in row(7) having mjr_2= math , age =2 row2: math major age 3 : trf_in = "1" trf_mjr_2 in row(5) having mjr_2= math , age =3 + trf_in = "2" trf_mjr_1 in row(8) having mjr_1= math , age =3 + trf_in = "2" trf_mjr_1 in row(11) having mjr_1= math , age =3
the melt() function data.table package can reshape multiple measure columns simultaneously required here.
variant 1:
library(data.table) # reshape wide long format melt(setdt(df), id.vars = c("major", "age"), measure.vars = patterns("^mjr_", "^trf_mjr_"))[ # omit null entries value1 != "0" & value2 != 0l][ # aggregate , .(trf_in_sum = sum(value2)), keyby = .(major = value1, age)][ # right join first 2 columns of wide data set df[, 1:2], on = c("major", "age")][ # replace na 0 is.na(trf_in_sum), trf_in_sum := 0l][] major age trf_in_sum 1: math 2 5 2: math 3 5 3: math 4 0 4: mlsp 2 3 5: mlsp 3 1 6: mlsp 4 2 7: biol 2 9 8: biol 3 2 9: biol 4 2 10: pshy 2 3 11: pshy 3 8 12: pshy 4 0
edit 1: right joining first 2 columns of original (wide) data, result has same number , order of rows. na indicating missing data replaced 0.
caveat: pointed out op, value in mjrwhich not included in majorcolumn won't appear in result, e.g., agbu. not recommended.
variant 2:
this using melt() before follows frank's approach using cj() enhances using factors maintain given order of major column , "prettifies" result. note, handy forcats package used.
library(data.table) library(forcats) setdt(df)[ # make sure factor levels in order of occurence , major := fct_inorder(major)][ # reshape wide long multiple measures columns , melt(.sd, measure.vars = patterns("^mjr", "^trf"), value.name = c("mjr", "trf"))][ # omit null entries mjr != "0"][ # unify factor levels levels of major in lead , c("major", "mjr") := fct_unify(.(major, factor(mjr)))][ # use cross join create combinations of mjr , age, # right join results cj(mjr, age, unique = true), on = .(mjr = v1, age = v2), # aggregate join parameters .(trf_in_sum = sum(trf, na.rm = true)), = .eachi] mjr age trf_in_sum 1: math 2 5 2: math 3 5 3: math 4 0 4: mlsp 2 3 5: mlsp 3 1 6: mlsp 4 2 7: biol 2 9 8: biol 3 2 9: biol 4 2 10: pshy 2 3 11: pshy 3 8 12: pshy 4 0 13: agbu 2 0 14: agbu 3 0 15: agbu 4 0
now, result includes agbu while showing mjr , age combinations , preserving original order in major.
however, still might not perfect in case there entry in major doesn't appear in of mjr columns. cover case, full join, i.e., merge() all = true suit better.
Comments
Post a Comment