
I have the following DataFrame:

new_df = 

BankNum   | ID    | Labels

0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low

I am using scikit-learn's SVC to predict the Labels 'High', 'Mod' and 'Low'. I do this as follows:

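# Imports assumed; the original post does not show them
import numpy as np
from sklearn import metrics, svm
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder

# Strip the dash so the bank number can be used as a numeric feature
new_df['BankNum'] = new_df['BankNum'].map(lambda x: x.replace('-', ''))
new_df['BankNum'] = new_df.BankNum.astype(np.float128)

columns = ['BankNum', 'ID']
le = LabelEncoder()
new_df['ID'] = le.fit_transform(new_df.ID)

# Refitting the encoder maps Labels in sorted order: High=0, Low=1, Mod=2
new_df['Labels'] = le.fit_transform(new_df.Labels)

X_train, X_test, y_train, y_test = train_test_split(new_df[columns], new_df.Labels, test_size=0.2, random_state=42)

clf = svm.SVC(gamma=0.001, C=100., probability=True, random_state=42)

scores = cross_val_score(clf, X_train, y_train, cv=8)
print("Cross Validation Score: ")
print(scores.mean())

clf.fit(X_train, y_train)

predicted = clf.predict(X_test)
print("Accuracy: ")
print(np.mean(predicted == y_test))
print(metrics.classification_report(y_test, predicted))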

I have two questions:

1.) For the classification report, I get the following output:

               precision    recall  f1-score   support

          0       0.00      0.00      0.00      4780
          1       0.94      1.00      0.97    104719
          2       0.00      0.00      0.00      1425

avg / total       0.89      0.94      0.92    110924

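Why is the precision for labels 0 and 2 equal to 0.00? Could this be caused by class imbalance? There are roughly 80893 High labels, 11798 Mod labels and 279608 Low labels. Or is SVM simply not a good model for this problem?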
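
The figures in the report are numerically consistent with a classifier that assigns every test sample to label 1. That is one reading of the numbers, not something the post confirms. A minimal sketch in plain Python, using only the support counts from the report and assuming that all-ones prediction, reproduces the rows:

```python
# Support counts copied from the classification report above.
support = {0: 4780, 1: 104719, 2: 1425}
total = sum(support.values())

# Assumption (not confirmed by the post): every sample is predicted as label 1.
predicted_label = 1

def prf(label):
    """Precision, recall and F1 for one label under the all-ones assumption."""
    tp = support[label] if label == predicted_label else 0
    n_predicted = total if label == predicted_label else 0
    # scikit-learn reports 0.0 when a label is never predicted (the 0/0 case).
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / support[label]
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

rows = {label: prf(label) for label in support}

# Support-weighted averages, as in the "avg / total" row.
weighted = tuple(
    sum(rows[label][i] * support[label] for label in support) / total
    for i in range(3)
)

for label, (p, r, f) in rows.items():
    print(label, round(p, 2), round(r, 2), round(f, 2), support[label])
print("avg / total", tuple(round(v, 2) for v in weighted), total)
```

If that reading holds, the 0.00 rows describe labels that were never predicted at all, rather than labels that were predicted and mostly wrong.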

2.) I would like to get a confidence score for each prediction. I googled it and found the following:

p = clf.predict_proba(X_test)
auc = AUC(y_test, p[:, 1])
print("SVM AUC", auc)

But I get the error:

raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported
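
The message suggests that the AUC function received a target with three classes together with a single score column, whereas a ROC AUC over one score column needs a binary target. A minimal sketch in plain Python, assuming a one-vs-rest reading in which label 1 is the positive class. `binary_auc` is a hypothetical helper written for illustration, not a scikit-learn function, and the toy arrays are made up:

```python
def binary_auc(is_positive, scores):
    """ROC AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Pairwise comparison is quadratic: fine for a toy example, slow on large test sets.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up stand-ins for y_test and the p[:, 1] column from the post.
y_toy = [0, 1, 1, 2, 1, 0]
p_col1 = [0.30, 0.80, 0.60, 0.40, 0.35, 0.10]

# One-vs-rest: label 1 against everything else.
is_label_1 = [y == 1 for y in y_toy]
auc_label_1 = binary_auc(is_label_1, p_col1)
print("AUC, label 1 vs rest:", auc_label_1)
```

Repeating the same binarisation for labels 0 and 2, each with its own probability column, would give one AUC per class.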

How can I get a confidence measure for each prediction, and how should I interpret it? Many thanks!!
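
One common reading of a per-prediction confidence is the largest class probability in each row returned by `clf.predict_proba`, whose columns follow the order of `clf.classes_`. A minimal sketch in plain Python, with a made-up probability matrix standing in for that output and a hypothetical helper `top_class_confidence`:

```python
def top_class_confidence(proba_rows, classes):
    """Return (predicted class, confidence) per row, where confidence = max probability."""
    results = []
    for row in proba_rows:
        best = max(range(len(row)), key=row.__getitem__)
        results.append((classes[best], row[best]))
    return results

# Made-up rows; real ones would come from clf.predict_proba(X_test).
classes = [0, 1, 2]  # encoded labels, in the column order of clf.classes_
proba = [
    [0.05, 0.90, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.20, 0.70],
]

for label, conf in top_class_confidence(proba, classes):
    print(label, conf)
```

A value near 1.0 means the model places almost all probability on one label, while a value near 1/3 with three labels means it can barely separate them. With `SVC(probability=True)` the probabilities come from a calibration step fitted by internal cross-validation, so the most probable class can occasionally differ from what `clf.predict` returns.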
