createDataFrame
此答案演示了如何使用、create_df
和来创建 PySpark DataFrame toDF
。
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["first_name", "age"])
df.show()
+----------+---+
|first_name|age|
+----------+---+
| joe| 34|
| luisa| 22|
+----------+---+
您还可以传递createDataFrame
RDD 和模式以更精确地构造 DataFrame:
from pyspark.sql import Row
from pyspark.sql.types import *
rdd = spark.sparkContext.parallelize([
Row(name='Allie', age=2),
Row(name='Sara', age=33),
Row(name='Grace', age=31)])
schema = schema = StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
create_df
从我的Quinn项目中可以实现两全其美 - 它简洁且完全描述性:
from pyspark.sql.types import *
from quinn.extensions import *
df = spark.create_df(
[("jose", "a"), ("li", "b"), ("sam", "c")],
[("name", StringType(), True), ("blah", StringType(), True)]
)
df.show()
+----+----+
|name|blah|
+----+----+
|jose| a|
| li| b|
| sam| c|
+----+----+
toDF
与其他方法相比没有任何优势:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
Row(name='Allie', age=2),
Row(name='Sara', age=33),
Row(name='Grace', age=31)])
df = rdd.toDF()
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+