python - smf.ols summary and metrics for three-class classification
I use scikit-learn for modelling purposes and have no experience in R. However, I had a specific request for forward selective logistic regression, and I am trying to use statsmodels.formula.api.ols for three-class classification. I found and modified a function that I think is working, but I can't be sure because I can't interpret the output.
Any advice would be appreciated from anyone familiar with statsmodels, particularly with using statsmodels together with pandas.
I have 2 main issues:
First, I can't print the summary table, which seems to be the basis of the results formatting class. The error is:
ValueError: shapes (18,3) and (18,3) not aligned: 3 (dim 1) != 18 (dim 0)
I assume this is related to using ols as a classifier, but it also doesn't work when I restrict the data to 2 classes. Other methods and attributes, such as pvalues and rsquared, return similar errors. I can't dig into the structure of summary() and can't find any examples in the documentation. Examples would be appreciated.
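For reference, the message itself has the form NumPy uses when two 18x3 arrays are passed to np.dot without a transpose. The sketch below only reproduces that message with placeholder arrays; it is not a trace of what statsmodels does internally, and the array names are made up for illustration.

```python
import numpy as np

# Two placeholder arrays with the shapes named in the error message.
a = np.zeros((18, 3))
b = np.zeros((18, 3))

try:
    np.dot(a, b)          # inner dimensions are 3 and 18, so this fails
    message = None
except ValueError as err:
    message = str(err)

print(message)

# With a transpose the inner dimensions agree: (3, 18) x (18, 3) -> (3, 3).
ok = np.dot(a.T, b)
print(ok.shape)
```

This suggests, without proving it, that some step in the results code expects a one-dimensional response and receives an 18x3 one instead.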
Second, and interestingly, the params attribute contains meaningful output. However, I can't interpret it, because it's organized in columns 0, 1, 2. Obviously, there are 3 resultant equations, and there should be 3 parameters for each feature to correspond, but I can't tell which column refers to which test (e.g. normal vs. positive, normal vs. negative, negative vs. positive).
                   0          1          2
Intercept   0.268715   0.036415   0.694869
feature1   -0.019223  -0.015703   0.034926
feature3    0.023013   0.061053  -0.084067
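The numbers in this table are consistent with three one-vs-rest regressions rather than pairwise tests: each column looks like an ordinary least-squares fit of a 0/1 indicator for one class, with the classes in alphabetical order (0 = negative, 1 = normal, 2 = positive). The sketch below checks that reading with plain NumPy on the sample data from this post. It assumes the string response was expanded into one indicator column per class; the variable names are illustrative only.

```python
import numpy as np

# Sample data from the post; the class labels simply cycle through the rows.
label_list = ["normal", "positive", "negative"] * 6
feature1 = np.array([3, 6, 2, 3, 5, 6, 8, 5, 3, 2, 4, 3, 2, 3, 5, 2, 1, 1], dtype=float)
feature3 = np.array([6, 7, 4, 5, 3, 7, 6, 6, 8, 5, 3, 3, 9, 5, 6, 7, 6, 8], dtype=float)

labels = np.array(label_list)
classes = sorted(set(label_list))  # alphabetical order of the class names

# One 0/1 indicator column per class (the assumed expansion of the response).
Y = np.column_stack([(labels == c).astype(float) for c in classes])

# Design matrix: intercept, feature1, feature3 (the two selected features).
X = np.column_stack([np.ones(len(labels)), feature1, feature3])

# Least-squares fit of all three indicator columns at once.
params, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(classes)
for name, row in zip(["Intercept", "feature1", "feature3"], params):
    print(name, np.round(row, 6))
```

If this reading is right, each coefficient describes how a feature shifts the fitted value for "this class vs. everything else". Because the three indicators sum to 1 in every row, the intercepts sum to 1 and each feature's coefficients sum to 0 across the columns, which also holds for the table above.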
For completeness, model.model.formula also contains meaningful output.
Here's some sample data that I put in an Excel doc; please use it for testing:
classname  feature1  feature2  feature3
normal            3         3         6
positive          6         1         7
negative          2         2         4
normal            3         2         5
positive          5         4         3
negative          6         4         7
normal            8         1         6
positive          5         6         6
negative          3         3         8
normal            2         7         5
positive          4         2         3
negative          3         9         3
normal            2         5         9
positive          3         1         5
negative          5         2         6
normal            2         4         7
positive          1         2         6
negative          1         2         8
And here's my code, simplified:

    import pandas as pd
    import statsmodels.formula.api as smf

    def forward_selected(df, response):
        # Candidate predictors: every column except the response.
        remaining = set(df.columns)
        remaining.remove(response)
        print(df.head())
        selected = []
        current_score, best_new_score = 0.0, 0.0
        while remaining and current_score == best_new_score:
            scores_with_candidates = []
            for candidate in remaining:
                formula = "{} ~ {} + 1".format(response, ' + '.join(selected + [candidate]))
                # Note: the fitted results object itself is stored as the "score".
                score = smf.ols(formula, df).fit()
                scores_with_candidates.append((score, candidate))
            scores_with_candidates.sort()
            best_new_score, best_candidate = scores_with_candidates.pop()
            if current_score < best_new_score:
                remaining.remove(best_candidate)
                selected.append(best_candidate)
                print(best_candidate)
                current_score = best_new_score
        # Refit with the selected predictors and return the fitted model.
        formula = "{} ~ {} + 1".format(response, ' + '.join(selected))
        model = smf.ols(formula, df).fit()
        return model

    def main():  # infile, feature_names
        df_raw = pd.read_excel('sampledata.xlsx')  # infile
        model = forward_selected(df_raw, 'classname')
        print(model.params)
        print(model.model.formula)
        print(model.summary())

    if __name__ == '__main__':
        main()
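The loop ranks candidates by sorting on the score, so the score needs to be a single comparable number. Below is a minimal sketch of the same forward-selection idea with a scalar criterion. It makes two assumptions that are not in the post: adjusted R² is used as the score, and the response is a 0/1 indicator for one class ("positive" here). The helper names adjusted_r2 and forward_select are hypothetical, and the fit uses NumPy least squares rather than statsmodels.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of a least-squares fit; X must already contain the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - (ss_res / (n - k)) / (ss_tot / (n - 1))

def forward_select(features, y):
    """Greedy forward selection. features maps a name to a 1-D array."""
    remaining = set(features)
    selected = []
    best_score = 0.0  # adjusted R^2 of the intercept-only model
    n = len(y)
    while remaining:
        scored = []
        for name in sorted(remaining):
            cols = [np.ones(n)] + [features[f] for f in selected + [name]]
            scored.append((adjusted_r2(np.column_stack(cols), y), name))
        score, name = max(scored)   # best candidate in this round
        if score <= best_score:     # stop when nothing improves the score
            break
        remaining.remove(name)
        selected.append(name)
        best_score = score
    return selected, best_score

# Sample data from the post; labels cycle normal, positive, negative.
labels = np.array(["normal", "positive", "negative"] * 6)
features = {
    "feature1": np.array([3, 6, 2, 3, 5, 6, 8, 5, 3, 2, 4, 3, 2, 3, 5, 2, 1, 1], dtype=float),
    "feature2": np.array([3, 1, 2, 2, 4, 4, 1, 6, 3, 7, 2, 9, 5, 1, 2, 4, 2, 2], dtype=float),
    "feature3": np.array([6, 7, 4, 5, 3, 7, 6, 6, 8, 5, 3, 3, 9, 5, 6, 7, 6, 8], dtype=float),
}
y = (labels == "positive").astype(float)  # assumed one-vs-rest target

selected, score = forward_select(features, y)
print(selected, round(score, 4))
```

Any other scalar criterion, such as AIC with the comparison reversed, would slot into the same loop.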
Any advice on how to dig into the model and retrieve metrics (AIC, rsquared, pvalues) would be appreciated.
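One possible workaround, sketched under the assumption that a separate one-vs-rest fit per class is acceptable: give ols a single numeric 0/1 response, so the result has the usual one-dimensional shape and the scalar metrics are available. The column name target and the choice of feature1 and feature3 as predictors are illustrative, taken from the params output in this post rather than from any recommended procedure.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Sample data from the post; labels cycle normal, positive, negative.
df = pd.DataFrame({
    "classname": ["normal", "positive", "negative"] * 6,
    "feature1": [3, 6, 2, 3, 5, 6, 8, 5, 3, 2, 4, 3, 2, 3, 5, 2, 1, 1],
    "feature2": [3, 1, 2, 2, 4, 4, 1, 6, 3, 7, 2, 9, 5, 1, 2, 4, 2, 2],
    "feature3": [6, 7, 4, 5, 3, 7, 6, 6, 8, 5, 3, 3, 9, 5, 6, 7, 6, 8],
})

results = {}
for label in sorted(df["classname"].unique()):
    # Hypothetical 0/1 column: 1 for this class, 0 for the other two.
    data = df.assign(target=(df["classname"] == label).astype(int))
    results[label] = smf.ols("target ~ feature1 + feature3", data=data).fit()

for label, res in results.items():
    print(label, "rsquared:", round(res.rsquared, 4), "aic:", round(res.aic, 2))
    print(res.pvalues)
```

Each entry in results is an ordinary single-response fit, so res.summary() should print as usual for it. Note that this is still a linear model on a 0/1 target, not a logistic regression.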