I am working on a binary classification problem where the response (bad) rate is less than 1%. The predictors include a set of nominal categorical variables and continuous variables.
Initially, I tried an oversampling technique (SMOTE) to balance the two classes. A logistic regression fitted on the oversampled dataset gave good overall accuracy but a very high false-positive rate.
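For reference, a minimal sketch of that SMOTE step, assuming the imbalanced-learn package and that X and y hold the predictors and target split out of the full dataset (names assumed here):

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Oversample the minority (bad) class so the two classes are balanced.
# X and y are the predictor frame and target Series (assumed names).
sm = SMOTE(random_state=0)
X_res, y_res = sm.fit_resample(X, y)
logreg = LogisticRegression().fit(X_res, y_res)

Since the predictors include nominal categoricals, the mixed-type variant SMOTENC may be a better fit than plain SMOTE, which interpolates all features numerically.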
I now plan to undersample instead and run multiple logistic regression models. The basic Python code I am working with is below. I need guidance on combining the results of these multiple logistic regression models into one.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Build one model per iteration; the range sets the number of models required
for i in range(10):
    # Sample the goods; `good` is a pandas DataFrame containing the goods
    sample_good = good.sample(n=300, replace=True)
    # Sample the bads; `bad` is a pandas DataFrame containing the bads.
    # There are only 100 bads in the dataset.
    sample_bad = bad.sample(n=100, replace=True)
    # Concatenate the good and bad samples (DataFrame.append is deprecated)
    sample = pd.concat([sample_good, sample_bad])
    X = sample.loc[:, sample.columns != 'y']
    y = sample['y']  # a Series, so fit() gets a 1-d target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    y_pred = logreg.predict(X_test)
    print('Accuracy of logistic regression classifier on test set: '
          '{:.2f}'.format(logreg.score(X_test, y_test)))
The for loop above runs 10 times and builds 10 different models. I need guidance on combining these 10 models into a single one. I have read about available techniques such as bagging, but in this case, because the response rate is so low, every sample I create needs to contain all of the bads.
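To illustrate what combining could look like, here is a minimal bagging-style sketch that averages the predicted probabilities of the 10 models (soft voting). It reuses the `good` and `bad` DataFrames from above; `X_holdout` is a hypothetical frame of unseen rows with the same predictor columns, and the 0.5 cutoff is only illustrative:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Fit one model per balanced sample and keep all of them in a list.
models = []
for i in range(10):
    sample = pd.concat([good.sample(n=300, replace=True),
                        bad.sample(n=100, replace=True)])
    X = sample.loc[:, sample.columns != 'y']
    y = sample['y']
    models.append(LogisticRegression().fit(X, y))

# Soft voting: average P(bad) across the models, then threshold.
# Assumes y is coded 1 = bad; X_holdout is a hypothetical unseen frame,
# and the 0.5 cutoff should really be tuned on validation data.
avg_prob = np.mean([m.predict_proba(X_holdout)[:, 1] for m in models], axis=0)
ensemble_pred = (avg_prob >= 0.5).astype(int)

Averaging probabilities rather than hard votes keeps the combined score continuous, so the decision threshold can be tuned to trade off the false-positive rate.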