Complex sum function by grouping two variables within a dataframe in R -

July 15, 2015

for below dataframe entry:

df <- data.frame(major=c("math","math","math","mlsp","mlsp","mlsp","biol","biol","biol","pshy","pshy","pshy"), age=(c(2,3,4,2,3,4,2,3,4,2,3,4)),                   mjr_1=c("biol","pshy","agbu","math","pshy",0,"mlsp","math",0,0,"math","mlsp"), trf_mjr_1=(c(7,2,2,3,2,0,3,2,0,0,2,2)),                  mjr_2=c("pshy","biol",0,"biol","math",0,"math","pshy",0,0,"mlsp","biol"), trf_mjr_2=(c(3,1,0,2,1,0,2,4,0,0,1,2)),                  mjr_3=c(0,0,0,0,"biol",0,0,0,0,0,0,0), trf_mjr_3=(c(0,0,0,0,1,0,0,0,0,0,0,0)))

we following dataframe:

   major  age mjr_1 trf_mjr_1 mjr_2 trf_mjr_2 mjr_3 trf_mjr_3 1   math   2  biol         7  pshy         3     0         0 2   math   3  pshy         2  biol         1     0         0 3   math   4  agbu         2     0         0     0         0 4   mlsp   2  math         3  biol         2     0         0 5   mlsp   3  pshy         2  math         1  biol         1 6   mlsp   4     0         0     0         0     0         0 7   biol   2  mlsp         3  math         2     0         0 8   biol   3  math         2  pshy         4     0         0 9   biol   4     0         0     0         0     0         0 10  pshy   2     0         0     0         0     0         0 11  pshy   3  math         2  mlsp         1     0         0 12  pshy   4  mlsp         2  biol         2     0         0

well as, need have below output table having sum function entitled “trf_in - transferred in” column major adds number of each transferred majors (trf_mjr_1, trf_mjr_2, etc...) in appropriate grouping category of (major , age) in first 2 columns; although “major” category change according mjr_1, mjr_2, etc... shown below.

i appreciate avoid multiple “merge” or “ddply” functions since actual file big , has many variables..

   major age trf_in_sum  1:  math   2          5  2:  math   3          5  3:  math   4          0  4:  mlsp   2          3  5:  mlsp   3          1  6:  mlsp   4          2  7:  biol   2          9  8:  biol   3          2  9:  biol   4          2 10:  pshy   2          3 11:  pshy   3          8 12:  pshy   4          0     **13:  agbu   4          2**

explanation of output table:

row1: math major age 2:   trf_in = "3" trf_mjr_1 in row(4) having mjr_1= math , age =2  +  trf_in = "2"  trf_mjr_2 in row(7) having mjr_2= math , age =2  row2: math major age 3 :   trf_in = "1" trf_mjr_2 in row(5) having mjr_2= math , age =3  +  trf_in = "2"  trf_mjr_1 in row(8) having mjr_1= math , age =3 + trf_in = "2"  trf_mjr_1 in row(11) having mjr_1= math , age =3

the melt() function data.table package can reshape multiple measure columns simultaneously required here.

variant 1:

library(data.table) # reshape wide long format melt(setdt(df), id.vars = c("major", "age"),       measure.vars = patterns("^mjr_", "^trf_mjr_"))[        # omit null entries        value1 != "0" & value2 != 0l][          # aggregate          , .(trf_in_sum = sum(value2)),           keyby = .(major = value1, age)][            # right join first 2 columns of wide data set            df[, 1:2], on = c("major", "age")][              # replace na 0              is.na(trf_in_sum), trf_in_sum := 0l][]

    major age trf_in_sum  1:  math   2          5  2:  math   3          5  3:  math   4          0  4:  mlsp   2          3  5:  mlsp   3          1  6:  mlsp   4          2  7:  biol   2          9  8:  biol   3          2  9:  biol   4          2 10:  pshy   2          3 11:  pshy   3          8 12:  pshy   4          0

edit 1: right joining first 2 columns of original (wide) data, result has same number , order of rows. na indicating missing data replaced 0.

caveat: pointed out op, value in mjrwhich not included in majorcolumn won't appear in result, e.g., agbu. not recommended.

variant 2:

this using melt() before follows frank's approach using cj() enhances using factors maintain given order of major column , "prettifies" result. note, handy forcats package used.

library(data.table) library(forcats) setdt(df)[   # make sure factor levels in order of occurence   , major := fct_inorder(major)][     # reshape wide long multiple measures columns      , melt(.sd, measure.vars = patterns("^mjr", "^trf"),             value.name = c("mjr", "trf"))][              # omit null entries              mjr != "0"][                # unify factor levels levels of major in lead                , c("major", "mjr") := fct_unify(.(major, factor(mjr)))][                  # use cross join create combinations of mjr , age,                  # right join results                   cj(mjr, age, unique = true), on = .(mjr = v1, age = v2),                   # aggregate join parameters                  .(trf_in_sum = sum(trf, na.rm = true)), = .eachi]

     mjr age trf_in_sum  1: math   2          5  2: math   3          5  3: math   4          0  4: mlsp   2          3  5: mlsp   3          1  6: mlsp   4          2  7: biol   2          9  8: biol   3          2  9: biol   4          2 10: pshy   2          3 11: pshy   3          8 12: pshy   4          0 13: agbu   2          0 14: agbu   3          0 15: agbu   4          0

now, result includes agbu while showing mjr , age combinations , preserving original order in major.

however, still might not perfect in case there entry in major doesn't appear in of mjr columns. cover case, full join, i.e., merge() all = true suit better.

Search This Blog

Force Net