dataframe - Complex Search in R -
this unusual , difficult question has perplexed me number of days , hope explain correctly. have 2 databases i.e. data-frames in r, first approx 90,000 rows , record of every race-horse in uk. contains numerous fields, , importantly name of each horse , sire; 1 record per horse first database, sample , fields. second database contains on one-million rows , history of every race horse has taken part in on last ten years i.e. races has run or call 'appearances', contains name, date, track etc..; 1 record per appearance.second database, sample , fields
what attempting write few lines of code - not loop - provide me total number of every appearance made siblings of particular horse i.e. 1 grand total. first step easy - finding siblings i.e. horses common sire - , can see below (n.b findsire own function says , finds sire of horse referencing same dataframe. have simplified code clarity)
testhorse <- "save bees" siblings <- which(findsire(testhorse) == horses$sire) sibsname <- horses[sibs,1]
the produces sibsname 636 names long (snippet below), although average horse have 50 or siblings. construct loop , search second 'appearances' data-frame , individually match sibling names , total appearances of siblings combined. however, know if avoid loop - , time associated - , write few lines of code achieve same end i.e. search 636 horses in appearances database , calculate times each appears in database , total of these appearances, or put way, how many races have siblings of "save bees" taken part in. in advance.
[1] "abdication " "aberdonian " "acclamatory " "accolation " ..... [636]
using dplyr
, calling "first database" horses
, "second database" races
:
library(dplyr) test_horse = "save bees" select(horses, name, sire) %>% filter(sire == sire[name == tolower(test_horse)]) %>% inner_join(races, c("name" = "selection_name")) %>% summarize(horse = test_horse, sibling_group_races = n())
i making assumption want number of appearances of sibling group include appearances of test horse - omit them instead add , name != tolower(test_horse)
filter()
command.
as haven't shared data reproducibly, cannot test code. if have additional problems not able solve them unless share data reproducibly. ycw's comment has helpful link doing - encourage edit question include either (a) code simulate small sample of data, or (b) use dput()
on small sample of data share few rows in copy/pasteable format.
the code above querying 1 horse @ time - if intend use simpler create table each row represents sibling group , contains number of races. reference table instead of calculating on fly every time. this:
sibling_appearances = left_join(horses, races, = c("name" = "selection_name")) %>% group_by(sire) %>% summarize(offspring_appearances = n())
Comments
Post a Comment