apache-spark - Transposing rows and columns without using pandas

Question

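I have a DataFrame with only two columns. I am trying to turn the values of one column into column headers and the values of the other column into the values under those headers. I tried pivot and similar approaches, but could not get it to work.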

df_pivot_test = sc.parallelize([('a',1), ('b',1), ('c',2), ('d',2), ('e',10)]).toDF(["id","score"])

id  score
a   1
b   1
c   2
d   2
e   10

I am trying to convert it to:

a   b   c   d   e
1   1   2   2   10

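Any ideas on how to do this? We could achieve it by converting to a pandas DataFrame, but I do not want to use .toPandas(): we have billions of rows, so we would run into memory issues.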

score 1 · Accepted Answer

You can get the result you want with pivot and groupBy.

Try this method:

from pyspark.sql.functions import expr, lit

# with literal value in groupby clause

df_pivot_test.groupBy(lit(1)).pivot("id").agg(expr("first(score)")).drop("1").show()

(or)

# without any column in groupby clause
df_pivot_test.groupBy().pivot("id").agg(expr("first(score)")).show()

Result:

+---+---+---+---+---+
|  a|  b|  c|  d|  e|
+---+---+---+---+---+
|  1|  1|  2|  2| 10|
+---+---+---+---+---+
