I have a data frame in which each row looks like this:
Pandas(Index=0, a=array([0.78420993, 0.61972316, 0.46183716, 0.48915005, 0.77913277,
0.06024269, 0.81624765, 0.88517468, 0.13920925, 0.1065294 ]), b=array([0.77951759, 0.66244447, 0.9437135 , 0.96207391, 0.78377241,
0.70583409, 0.29280931, 0.81172593, 0.10672919, 0.77779754]))
Pandas(Index=1, a=array([0.41687065, 0.42237762, 0.55709948, 0.90720451, 0.01000812,
0.82060544, 0.74241916, 0.62584166, 0.07328964, 0.66590757]), b=array([0.81446728, 0.10341285, 0.7268244 , 0.25971413, 0.0834643 ,
0.2000305 , 0.86025538, 0.79984722, 0.71790545, 0.44250095]))
Pandas(Index=2, a=array([0.42444138, 0.00104965, 0.51756438, 0.61498706, 0.05810544,
0.84679377, 0.33802997, 0.66426461, 0.52571041, 0.19639958]), b=array([0.17096899, 0.06315944, 0.31988078, 0.07621797, 0.84541187,
0.56356022, 0.00256211, 0.69829937, 0.69287602, 0.124942 ]))
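What I want to do next is take the "a" and "b" columns and create a new data frame in which each row holds the successive elements of their arrays. So it would look like: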
a b
0.78420993 0.41687065
0.61972316 0.77951759
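I'm currently using a combination of Koalas and pandas with PySpark in a Databricks notebook, and I plan to run this on a Spark cluster. Any advice on how to achieve this (keeping performance in mind) would be very helpful!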
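
To make the goal concrete, here is a minimal pandas-only sketch of the reshaping I have in mind. It uses a small made-up frame rather than my real data, and it assumes that element i of "a" should be paired with element i of "b" and that both arrays in a row have the same length. I don't know whether this approach carries over well to Koalas or to a Spark cluster.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame: each cell holds a NumPy array (values are made up).
df = pd.DataFrame({
    "a": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
    "b": [np.array([1.1, 1.2, 1.3]), np.array([1.4, 1.5, 1.6])],
})

# Assumption: a[i] pairs with b[i], and both arrays in a row have equal length.
# Concatenating the per-row arrays end to end flattens each column in row order.
flat = pd.DataFrame({
    "a": np.concatenate(df["a"].to_numpy()),
    "b": np.concatenate(df["b"].to_numpy()),
})

print(flat)
```

On a Spark DataFrame, I believe `pyspark.sql.functions.arrays_zip` followed by `pyspark.sql.functions.explode` could express the same pairing, but I have not tried it on this data.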