dataframe - Complex Search in R -

January 15, 2013

this unusual , difficult question has perplexed me number of days , hope explain correctly. have 2 databases i.e. data-frames in r, first approx 90,000 rows , record of every race-horse in uk. contains numerous fields, , importantly name of each horse , sire; 1 record per horse first database, sample , fields. second database contains on one-million rows , history of every race horse has taken part in on last ten years i.e. races has run or call 'appearances', contains name, date, track etc..; 1 record per appearance.second database, sample , fields

what attempting write few lines of code - not loop - provide me total number of every appearance made siblings of particular horse i.e. 1 grand total. first step easy - finding siblings i.e. horses common sire - , can see below (n.b findsire own function says , finds sire of horse referencing same dataframe. have simplified code clarity)

 testhorse <- "save bees"   siblings <- which(findsire(testhorse) == horses$sire)   sibsname <- horses[sibs,1]

the produces sibsname 636 names long (snippet below), although average horse have 50 or siblings. construct loop , search second 'appearances' data-frame , individually match sibling names , total appearances of siblings combined. however, know if avoid loop - , time associated - , write few lines of code achieve same end i.e. search 636 horses in appearances database , calculate times each appears in database , total of these appearances, or put way, how many races have siblings of "save bees" taken part in. in advance.

 [1] "abdication "         "aberdonian "         "acclamatory "        "accolation "  ..... [636]

using dplyr, calling "first database" horses , "second database" races:

library(dplyr) test_horse = "save bees"  select(horses, name, sire) %>%   filter(sire == sire[name == tolower(test_horse)]) %>%   inner_join(races, c("name" = "selection_name")) %>%   summarize(horse = test_horse, sibling_group_races = n())

i making assumption want number of appearances of sibling group include appearances of test horse - omit them instead add , name != tolower(test_horse) filter() command.

as haven't shared data reproducibly, cannot test code. if have additional problems not able solve them unless share data reproducibly. ycw's comment has helpful link doing - encourage edit question include either (a) code simulate small sample of data, or (b) use dput() on small sample of data share few rows in copy/pasteable format.

the code above querying 1 horse @ time - if intend use simpler create table each row represents sibling group , contains number of races. reference table instead of calculating on fly every time. this:

sibling_appearances =    left_join(horses, races, = c("name" = "selection_name")) %>%   group_by(sire) %>%   summarize(offspring_appearances = n())

Search This Blog

Force Net

dataframe - Complex Search in R -

Comments

Post a Comment

Popular posts from this blog

ubuntu - PHP script to find files of certain extensions in a directory, returns populated array when run in browser, but empty array when run from terminal -

php - How can i create a user dashboard -

javascript - How to detect toggling of the fullscreen-toolbar in jQuery Mobile? -