python - Generate a "best fit" slope from a pandas df and populate a new column

Question

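I am trying to add a slope calculation over individual subsets of two fields in a dataframe, and have that slope value applied to every row in each subset. (I have used the "slope" function in Excel before, although I am not married to that exact algorithm.) The "desired_output" field is the output I expect. Subsets are distinguished by the "strike_order" column: each subset starts at 1 and has no particular highest value.

"IV" is the y value and "strike" is the x value.

Any help would be much appreciated, as I don't even know where to begin...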

import pandas
df = pandas.DataFrame([[1200,1,.4,0.005],[1210,2,.35,0.005],[1220,3,.3,0.005],
[1230,4,.25,0.005],[1200,1,.4,0.003],[1210,2,.37,.003]],columns=
["strike","strike_order","IV","desired_output"])
df

    strike  strike_order    IV  desired_output
0   1200        1         0.40    0.005
1   1210        2         0.35    0.005
2   1220        3         0.30    0.005
3   1230        4         0.25    0.005
4   1200        1         0.40    0.003
5   1210        2         0.37    0.003

If this isn't a well-asked question, please let me know and I will try to improve it.

score 1 · Accepted Answer

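You can use numpy's least squares. The line equation y = mx + c can be rewritten as y = Ap, where A = [[x 1]] and p = [[m], [c]]. Then use lstsq to solve for p, so we need to build A by adding a column of ones to df.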

import numpy as np
df['ones'] = 1
A = df[['strike', 'ones']]
y = df['IV']
# rcond=None uses the machine-precision-based cutoff for small singular values
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
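As a cross-check on the least-squares fit, the slope of a simple linear regression also has a closed form, the covariance of x and y divided by the variance of x, which as far as I know is the same quantity Excel's SLOPE function returns. The sketch below rebuilds the question's sample data so it runs on its own and compares the two results. The names `m_closed`, `m_lstsq` and `c_lstsq` are my own, and like the snippet above it fits the whole frame rather than each subset.

```python
import numpy as np
import pandas as pd

# Sample data from the question (desired_output column omitted)
df = pd.DataFrame(
    [[1200, 1, .4], [1210, 2, .35], [1220, 3, .3],
     [1230, 4, .25], [1200, 1, .4], [1210, 2, .37]],
    columns=["strike", "strike_order", "IV"],
)

x = df["strike"].to_numpy(dtype=float)
y = df["IV"].to_numpy(dtype=float)

# Closed form: sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x)) ** 2)
dx = x - x.mean()
m_closed = (dx * (y - y.mean())).sum() / (dx ** 2).sum()

# Same fit via least squares on the design matrix [x, 1]
A = np.column_stack([x, np.ones_like(x)])
m_lstsq, c_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]

print(m_closed, m_lstsq)
```

Both values should agree to within floating-point error, and the slope is negative because IV falls as strike rises.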

Alternatively, you can use the regression models in scikit-learn's linear_model.

You can verify the result by plotting the data as a scatter plot and drawing the fitted line over it.

import matplotlib.pyplot as plt
plt.scatter(df['strike'],df['IV'],color='r',marker='d')
x = df['strike']
#plug x in the equation y=mx+c
y_line = c + m * x
# plot the fitted line (y_line), not the raw y values
plt.plot(x,y_line)
plt.xlabel('Strike')
plt.ylabel('IV')
plt.show()

The resulting plot looks like this

score 0 · Accepted Answer

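@Scott This works, except that it gives subset values of 0 and 1, and every subsequent subset gets the value 2. I added an extra condition at the beginning, plus a very clunky "seed" value to stop it from looking for row -1.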

    seed = df.loc[0, "date_exp"]
    #seed ="08/11/200015/06/2001C"
    #print(seed)
    subset_counter = 0
    for index, row in df.iterrows():
        if row['date_exp'] == seed:
            # rows sharing the first row's date_exp form subset 0
            df.loc[index, 'subset'] = 0
        elif row["strike_order"] == 1:
            # a new subset starts: number it one above the previous row's subset
            subset_counter = 1 + df.loc[index-1, 'subset']
            df.loc[index, 'subset'] = subset_counter
        else:
            # otherwise carry the previous row's subset forward
            df.loc[index, 'subset'] = df.loc[index-1, 'subset']

    df['subset'] = df['subset'].astype(int)

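This now does exactly what I want, although I think the seed value is clunky and I would rather use something like "if row == 0". But it's Friday and this works.

Cheers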

score 0 · Accepted Answer

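Try this.

First create a subset column by iterating over the dataframe, using a strike_order value that resets to 1 as the boundary between subsets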

#create subset column
subset_counter = 0
for index, row in df.iterrows():
    if row["strike_order"] == 1:
      df.loc[index,'subset'] = subset_counter
      subset_counter += 1
    else:
      df.loc[index,'subset'] = df.loc[index-1,'subset']

df['subset'] = df['subset'].astype(int)
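The same subset labels can probably be produced without the explicit loop. Since every subset starts with strike_order == 1, a cumulative sum over that boolean marks the boundaries. This is only a sketch on the question's sample data, and it assumes the rows are already ordered so that each subset is contiguous.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "strike": [1200, 1210, 1220, 1230, 1200, 1210],
    "strike_order": [1, 2, 3, 4, 1, 2],
    "IV": [.4, .35, .3, .25, .4, .37],
})

# True on the first row of each subset; cumsum counts how many subsets
# have started so far, and subtracting 1 makes the labels start at 0
df["subset"] = (df["strike_order"] == 1).cumsum() - 1

print(df["subset"].tolist())  # [0, 0, 0, 0, 1, 1]
```

Because this never looks at the previous row, it needs no special case for the first row.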

Then use groupby to run a linear regression on each subset

# run linear regression on subsets of the dataframe using groupby
from sklearn import linear_model
model = linear_model.LinearRegression()
for (group, df_gp) in df.groupby('subset'):
    X=df_gp[['strike']]
    y=df_gp.IV
    model.fit(X,y)
    # coef_ is a one-element array here (single feature), so take its scalar
    df.loc[df.subset == group, 'slope'] = model.coef_[0]

df

   strike  strike_order    IV  desired_output  subset  slope
0    1200             1  0.40           0.005       0 -0.005
1    1210             2  0.35           0.005       0 -0.005
2    1220             3  0.30           0.005       0 -0.005
3    1230             4  0.25           0.005       0 -0.005
4    1200             1  0.40           0.003       1 -0.003
5    1210             2  0.37           0.003       1 -0.003
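If scikit-learn is not available, a similar per-subset slope can be sketched with numpy alone. The version below swaps in `np.polyfit` with degree 1 in place of `LinearRegression` and maps each subset's slope back onto its rows. The helper name `fit_slope` is hypothetical, and the subset column is built with a cumulative sum on the assumption that each subset is contiguous and starts at strike_order == 1.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "strike": [1200, 1210, 1220, 1230, 1200, 1210],
    "strike_order": [1, 2, 3, 4, 1, 2],
    "IV": [.4, .35, .3, .25, .4, .37],
})
df["subset"] = (df["strike_order"] == 1).cumsum() - 1


def fit_slope(group):
    """Hypothetical helper: least-squares slope of IV against strike."""
    x = group["strike"].to_numpy(dtype=float)
    y = group["IV"].to_numpy(dtype=float)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


# One slope per subset, then broadcast to every row of that subset
slopes = {key: fit_slope(group) for key, group in df.groupby("subset")}
df["slope"] = df["subset"].map(slopes)

print(df)
```

On the sample data this should reproduce the slopes in the table above, up to floating-point rounding. Note that they are negative, whereas the question's desired_output column lists the magnitudes.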
