pandas - How to make a Python loop faster to run a pairwise association test
I have a list of patient IDs with drug names, and a list of patient IDs with disease names. I want to find the most indicative drug for each disease.
To find it, I want to run a Fisher exact test and get a p-value for each disease/drug pair. The loop runs very slowly: it takes more than 10 hours. Is there a way to make the loop more efficient, or a better way to solve this association problem?
My loop:

    import numpy as np
    import pandas as pd
    from scipy.stats import fisher_exact

    most_indicative_medication = {}
    rx_list = list(meps_meds.rxname.unique())
    disease_list = list(meps_base_data.columns.values)[8:]

    for i in disease_list:
        print(i)
        rx_dict = {}
        for j in rx_list:
            # Unique (patient, disease status, drug) rows
            subset = base[['id', i, 'rxname']].drop_duplicates()
            # Flag the rows that belong to drug j
            subset[j] = subset['rxname'] == j
            # Keep only rows with a definite disease status
            subset = subset.loc[subset[i].isin(['yes', 'no'])]
            subset = subset[[i, j]]
            tab = pd.crosstab(subset[i], subset[j])
            if len(tab.columns) == 2:
                rx_dict[j] = fisher_exact(tab)[1]
            else:
                rx_dict[j] = np.nan
        # Drug with the smallest p-value for disease i
        most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)
You need multiprocessing/multithreading. I have added it to your code:
    import numpy as np
    import pandas as pd
    from scipy.stats import fisher_exact
    from multiprocessing.dummy import Pool as ThreadPool

    most_indicative_medication = {}
    rx_list = list(meps_meds.rxname.unique())
    disease_list = list(meps_base_data.columns.values)[8:]

    def run_pairwise(i):
        # Test every drug against disease i and record the best one
        print(i)
        rx_dict = {}
        for j in rx_list:
            subset = base[['id', i, 'rxname']].drop_duplicates()
            subset[j] = subset['rxname'] == j
            subset = subset.loc[subset[i].isin(['yes', 'no'])]
            subset = subset[[i, j]]
            tab = pd.crosstab(subset[i], subset[j])
            if len(tab.columns) == 2:
                rx_dict[j] = fisher_exact(tab)[1]
            else:
                rx_dict[j] = np.nan
        most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)

    # Run one disease per worker, three workers at a time
    pool = ThreadPool(3)
    pairwise_test_results = pool.map(run_pairwise, disease_list)
    pool.close()
    pool.join()
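Two details of the pool version are worth a small illustration. `run_pairwise` returns nothing, so `pairwise_test_results` ends up as a list of `None`, and `min` over a dict that contains NaN p-values can pick the wrong key because every comparison against NaN is false. The sketch below shows one way to return results from the worker and skip NaNs. The `toy` p-values, `best_key`, and `run_one` are made up for the example and are not part of the original code. Also note that `multiprocessing.dummy` provides a thread-based pool, so the speed-up for CPU-bound work may be limited.

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool


def best_key(scores):
    """Return the key with the smallest non-NaN value, or None if there is none."""
    valid = {k: v for k, v in scores.items() if v == v}  # NaN is not equal to itself
    return min(valid, key=valid.get) if valid else None


def run_one(item):
    """Worker: return (disease, best drug) instead of writing to a shared dict."""
    name, scores = item
    return name, best_key(scores)


# Hypothetical p-values per disease, standing in for the real rx_dict contents.
toy = {
    "diabetes": {"rare_drug": float("nan"), "aspirin": 0.4, "metformin": 0.001},
    "asthma": {"albuterol": 0.0005, "aspirin": 0.9},
}

pool = ThreadPool(3)
results = dict(pool.map(run_one, list(toy.items())))
pool.close()
pool.join()
print(results)
```

Collecting the return values of `pool.map` keeps the worker free of shared state, which also makes it easier to switch to a process-based pool later if threads do not help.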
Notes: http://chriskiehl.com/article/parallelism-in-one-line/
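Independently of parallelism, much of the running time probably comes from repeating `drop_duplicates()` and `pd.crosstab` for every disease/drug pair. A minimal sketch of an alternative, assuming the same column layout as in the question (`id`, `rxname`, and disease columns holding `'yes'`/`'no'`): deduplicate once per disease, get the yes/no row counts for all drugs from a single crosstab, and build each 2x2 table from those counts. The function name `drug_pvalues` and the toy `base` frame are invented for the example. It counts unique (patient, drug) rows, as the original loop does.

```python
import pandas as pd
from scipy.stats import fisher_exact


def drug_pvalues(base, disease, id_col="id", drug_col="rxname"):
    """Fisher exact p-value of every drug against one disease column."""
    # Deduplicate and filter once per disease, not once per (disease, drug) pair.
    rows = base[[id_col, disease, drug_col]].drop_duplicates()
    rows = rows[rows[disease].isin(["yes", "no"])]
    # A single crosstab yields the yes/no row counts for every drug.
    counts = pd.crosstab(rows[drug_col], rows[disease])
    counts = counts.reindex(columns=["yes", "no"], fill_value=0)
    yes_total = int(counts["yes"].sum())
    no_total = int(counts["no"].sum())
    pvals = {}
    for drug, yes_j, no_j in zip(counts.index, counts["yes"], counts["no"]):
        # Rows: disease yes / no. Columns: drug j / any other drug.
        table = [[int(yes_j), yes_total - int(yes_j)],
                 [int(no_j), no_total - int(no_j)]]
        pvals[drug] = fisher_exact(table)[1]
    return pd.Series(pvals, dtype=float)


# Toy data, purely illustrative.
base = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 4, 5, 6],
    "diabetes": ["yes", "yes", "yes", "yes", "no", "no", "no", "yes"],
    "rxname": ["metformin", "aspirin", "metformin", "metformin",
               "aspirin", "aspirin", "statin", "metformin"],
})

pvals = drug_pvalues(base, "diabetes")
print(pvals.idxmin())
```

Drugs that never appear among the filtered rows are simply absent from the result, which plays the role of the `np.nan` entries in the original loop, and `idxmin()` ignores NaN by default.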