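I am trying to run a linear regression with cross-validation in pyspark. There is one thing I don't understand: cross-validation always picks the model with the smallest parameter values as the best model.
I downloaded the data from https://vincentarelbundock.github.io/Rdatasets/datasets.html (the dataset named SLID).
It looks like this:
I removed the first column as well as the sex and language columns, dropped the rows containing NA, and renamed the columns. In the end, the data looks like this:
Then, this is my code: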
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import LinearRegression
from pyspark.sql.functions import col
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
spark=SparkSession.builder.getOrCreate()
#please change it to your own path
df = spark.read.csv("/home/feng/Downloads/SLID.csv",header=True)
df1=df
df1=df1.select(*(col(c).cast("float").alias(c) for c in df1.columns))
df1=df1.withColumnRenamed('x2','label')
assembler = VectorAssembler(
    inputCols=['x1','y1'],
    outputCol="features")
output = assembler.transform(df1)
output1=output.select(output.label,output.features)
output2=output1.randomSplit([0.3,0.7])
training=output2[0]
testing=output2[1]
lr = LinearRegression(maxIter=10, regParam=0.01)
paramGrid = ParamGridBuilder() \
    .addGrid(lr.maxIter, [1,2,5,10,20]) \
    .addGrid(lr.regParam, [0.05,0.1, 0.3,0.5,0.7]) \
    .addGrid(lr.elasticNetParam, [0, 0.5, 1])\
    .build()
crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(predictionCol="prediction",labelCol="label",
                                                        metricName="rmse"),
                          numFolds=5) # use 3+ folds in practice
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
a=cvModel.bestModel.extractParamMap()
for keys, values in a.items():
    print(keys)
    print(values)
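Looking only at bestModel hides how close the other candidates were, so it may help to list the score of every combination. Below is a minimal sketch of how that could be done. `rank_candidates` is a hypothetical helper, not part of pyspark, and it assumes that the fitted model's `avgMetrics` list is in the same order as `paramGrid`, which is my understanding of how CrossValidator stores it. The demo numbers are made up and are not results from my data.

```python
def rank_candidates(param_maps, avg_metrics, larger_is_better=False):
    """Pair each parameter map with its mean cross-validation metric, best first.

    param_maps  : list of dicts, e.g. the list returned by ParamGridBuilder.build()
    avg_metrics : list of floats in the same order, e.g. cvModel.avgMetrics
    """
    rows = []
    for params, metric in zip(param_maps, avg_metrics):
        # pyspark Param objects carry a .name attribute; plain strings are kept as-is
        readable = {getattr(p, "name", p): v for p, v in params.items()}
        rows.append((metric, readable))
    # sorted() is stable, so candidates with equal metrics keep their grid order
    return sorted(rows, key=lambda row: row[0], reverse=larger_is_better)


# Made-up RMSE values, only to show the shape of the output
demo_maps = [{"regParam": 0.05}, {"regParam": 0.1}, {"regParam": 0.3}]
demo_rmse = [7.2, 7.1, 7.4]
for metric, params in rank_candidates(demo_maps, demo_rmse):
    print(metric, params)
```

With the code above, the call would be `rank_candidates(paramGrid, cvModel.avgMetrics)`. Since the evaluator uses rmse, smaller is better and the default `larger_is_better=False` applies.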
Printing the parameter map of the best model gives this result:
LinearRegression_a0560f768ad8__labelCol
label
LinearRegression_a0560f768ad8__aggregationDepth
2
LinearRegression_a0560f768ad8__epsilon
1.35
LinearRegression_a0560f768ad8__standardization
True
LinearRegression_a0560f768ad8__maxIter
1
LinearRegression_a0560f768ad8__regParam
0.1
LinearRegression_a0560f768ad8__loss
squaredError
LinearRegression_a0560f768ad8__predictionCol
prediction
LinearRegression_a0560f768ad8__solver
auto
LinearRegression_a0560f768ad8__tol
1e-06
LinearRegression_a0560f768ad8__featuresCol
features
LinearRegression_a0560f768ad8__elasticNetParam
0.0
LinearRegression_a0560f768ad8__fitIntercept
True
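This result is just one example. I have actually tried many different sets of parameter values. As you can see in paramGrid, I vary three parameters: maxIter, regParam and elasticNetParam. No matter which values I put in the grid, the algorithm reports the model with the smallest parameter values as the best one. I don't think that is right, but I don't know why it happens.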
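One thing I could check is whether several candidates end up with exactly the same RMSE. As far as I can tell from the pyspark source, CrossValidator picks the winner with `np.argmin` (or `np.argmax` when larger is better), and both return the first index on a tie, so a tie would always go to whichever combination comes first in the grid. The sketch below mimics that selection rule in plain Python. `pick_best` is a hypothetical name and the numbers are made up for illustration.

```python
def pick_best(avg_metrics, larger_is_better=False):
    """Return the index of the best metric.

    On ties the earliest index wins, which matches the behaviour of
    numpy's argmin and argmax.
    """
    chooser = max if larger_is_better else min
    return chooser(range(len(avg_metrics)), key=lambda i: avg_metrics[i])


# Made-up RMSE values: candidates 0, 1 and 2 are exactly tied
tied_rmse = [7.10, 7.10, 7.10, 7.35]
print(pick_best(tied_rmse))
```

If the real `cvModel.avgMetrics` showed ties like this, the "best" model would simply be the first tied entry of the grid rather than a genuinely better one.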
Can anyone explain this to me, tell me what I am doing wrong, and show me how to fix it?


