python - Matplotlib error: 'height' must be length 5 or scalar


I am attempting to plot the output of my script as a 2-series bar graph using matplotlib in Python 2.7.

My script prints 'msg', which results in the following output:

knn: 90.000000 (0.322734)

lda: 83.641395 (0.721210)

cart: 92.600996 (0.399870)

nb: 29.214167 (1.743959)

random forest: 92.617598 (0.323824)

After the code outputs the results of 'msg' and attempts to plot the results as a 2-series bar graph using matplotlib, the following error is returned:

Traceback (most recent call last):
  File "comparison.py", line 113, in <module>
    label='mean')
  File "c:\users\scot\anaconda2\lib\site-packages\matplotlib\pyplot.py", line 2650, in bar
    **kwargs)
  File "c:\users\scot\anaconda2\lib\site-packages\matplotlib\__init__.py", line 1818, in inner
    return func(ax, *args, **kwargs)
  File "c:\users\scot\anaconda2\lib\site-packages\matplotlib\axes\_axes.py", line 2038, in bar
    "must be length %d or scalar" % nbars)
ValueError: incompatible sizes: argument 'height' must be length 5 or scalar

I'm not sure how to fix this; I think it may be due to the values of the results being float values. Any help is appreciated. Here is the code:

# modules
import pandas
import numpy
import os
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from matplotlib import style
plt.rcdefaults()
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_recall_curve, average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from scipy.stats import ttest_ind, ttest_ind_from_stats
from scipy.special import stdtr
from sklearn.svm import SVC
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
import warnings

# load kdd dataset
data_set = "nsl-kdd/kddtest+.arff"
os.system("cls")

print "loading: ", data_set

with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    names = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'su_attempted', 'num_root', 'num_file_creations',
             'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate',
             'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'class',
             'dst_host_srv_rerror_rate']

    dataset = pandas.read_csv(data_set, names=names)

    # encode any non-numeric columns as integers
    for column in dataset.columns:
        if dataset[column].dtype == type(object):
            le = LabelEncoder()
            dataset[column] = le.fit_transform(dataset[column])

    array = dataset.values
    x = array[:, 0:40]
    y = array[:, 40]

    # split-out validation dataset
    validation_size = 0.20
    seed = 7
    x_train, x_validation, y_train, y_validation = cross_validation.train_test_split(
        x, y, test_size=validation_size, random_state=seed)

    # test options and evaluation metric
    num_folds = 10
    num_instances = len(x_train)
    seed = 10
    scoring = 'accuracy'

    # algorithms
    models = []
    models.append(('knn', KNeighborsClassifier()))
    models.append(('lda', LinearDiscriminantAnalysis()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('nb', GaussianNB()))
    models.append(('random forest', RandomForestClassifier()))
    # models.append(('lr', LogisticRegression()))

    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
        cv_results = cross_validation.cross_val_score(
            model, x_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean() * 100, cv_results.std()
                               * 100)  # multiplying by 100 to show percentage
        print(msg)
        # print cv_results * 100 # plots the values that make up the average

    print ("\n")

    # perform a t-test on each pairing of models
    for i in range(len(results) - 1):
        for j in range(i, len(results)):
            t, p = ttest_ind(results[i], results[j], equal_var=False)
            print("t_test between {} & {}: t value = {}, p value = {}".format(
                names[i], names[j], t, p))
            print("\n")

    plt.style.use('ggplot')
    n_groups = 5

    # create plot
    fig, ax = plt.subplots()
    index = numpy.arange(n_groups)
    bar_width = 0.35
    opacity = 0.8

    rects1 = plt.bar(index, cv_results, bar_width,
                     alpha=opacity,
                     #  color='b',
                     label='mean')  # line 113

    rects2 = plt.bar(index + bar_width, cv_results.std(), bar_width,
                     alpha=opacity,
                     color='g',
                     label='standard_d')

    plt.xlabel('models')
    plt.ylabel('percentage')
    plt.title('all model performance')
    plt.xticks(index + bar_width, (names))
    plt.legend()

    plt.tight_layout()
    plt.show()

EDIT:

Printing cv_results gives the following, to 7 or 8 decimal places:

[ 90.48146099  90.48146099  89.42999447  89.5960155   90.03873824   89.9833979   89.9833979   89.76203652  90.09407858  90.14941893]
[ 83.34255672  84.94742667  82.2910902   83.78527947  84.3386829   83.9513005   82.78915329  84.06198118  83.39789707  83.50857775]
[ 93.1931378   92.69507471  91.92030991  92.52905368  92.69507471   92.41837299  92.58439402  92.25235196  92.19701162  92.14167128]
[ 29.05368013  26.89540675  31.54399557  28.22357499  29.27504151   27.94687327  33.20420587  28.99833979  28.55561704  28.44493636]
[ 93.35915883  93.02711677  92.25235196  91.69894853  93.02711677   92.63973437  92.58439402  92.14167128  92.47371334  92.69507471]
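Note that each of these arrays holds 10 values (one per cross-validation fold), while index = numpy.arange(n_groups) has only 5 entries, which is exactly the size mismatch the traceback reports. A minimal sketch of the mismatch, using random numbers as stand-ins for the fold scores:

```python
import numpy

# stand-in for the five printed cv_results arrays: 10 fold scores per model
results = [numpy.random.rand(10) * 100 for _ in range(5)]
index = numpy.arange(5)  # one bar position per model

# after the loop, cv_results still holds the LAST model's 10 fold scores,
# so plt.bar(index, cv_results, ...) pairs 5 positions with 10 heights
cv_results = results[-1]
print(len(index))        # 5 bar positions
print(len(cv_results))   # 10 heights -- the incompatible sizes

# reducing each fold array to a single mean gives one height per bar
mean_results = [res.mean() for res in results]
print(len(mean_results))
```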

If you want to plot the means of cv_results, you need to calculate the means with .mean(), and likewise use .std() in the second plot.

Also, you go through the process of appending each model's cv_results to results, but when it comes to plotting, it seems you are still using cv_results, which is going to be the cv_results of the last model accessed in the loop.

It looks like results is a list containing 5 numpy arrays. So, loop over that list, calculate the mean of each array, and use that to plot the barplot:

mean_results = [res.mean() for res in results]

rects1 = plt.bar(index, mean_results, bar_width,
                 alpha=opacity,
                 #  color='b',
                 label='mean')
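The second bar series needs the same treatment: cv_results.std() is a single scalar taken from the last model only. A sketch of collecting the standard deviations per model as well, with random arrays standing in for the real fold scores:

```python
import numpy

# stand-ins for the five fold-score arrays held in `results`
results = [numpy.random.rand(10) * 100 for _ in range(5)]

mean_results = [res.mean() for res in results]  # heights for rects1
std_results = [res.std() for res in results]    # heights for rects2

# both lists now hold one value per model, matching numpy.arange(5),
# so they are valid 'height' arguments for the two plt.bar calls:
# rects2 = plt.bar(index + bar_width, std_results, bar_width,
#                  alpha=opacity, color='g', label='standard_d')
print(len(mean_results), len(std_results))
```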

Alternatively, append cv_results.mean() to a list during the original loop, and use that list to make the bar plot.
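That alternative could look like the following inside the evaluation loop (a sketch only: a random array stands in for the cross_validation.cross_val_score call):

```python
import numpy

model_names = ['knn', 'lda', 'cart', 'nb', 'random forest']
mean_results = []
std_results = []

for name in model_names:
    # stand-in for: cv_results = cross_validation.cross_val_score(...)
    cv_results = numpy.random.rand(10) * 100

    mean_results.append(cv_results.mean())  # one scalar per model
    std_results.append(cv_results.std())

# five scalars each -- valid 'height' arguments for the two plt.bar calls
print(len(mean_results), len(std_results))
```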

