pyspark - Use of "\" when reading a DataFrame

Question

# File location and type
file_location = "/FileStore/tables/FileName.csv"
file_type = "csv"

#CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

This is generic code for reading data from a CSV file. In this code, what is the use of ".option("inferSchema", infer_schema)", and what does "\" do?

score 1 · Accepted Answer

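A backslash at the end of a line is a line continuation: the line that follows is treated as part of the current line. In your case, the 5 lines are treated as a single line.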
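To see the continuation behaviour without a Spark session, here is a minimal sketch in plain Python. `FakeReader` is a hypothetical stand-in I made up for illustration (it is not part of PySpark); it only records the options it is given so the three spellings can be compared.

```python
class FakeReader:
    """Hypothetical stand-in for a reader object; it only records options."""

    def __init__(self):
        self.options = {}

    def option(self, key, value):
        self.options[key] = value
        return self  # returning self is what makes chaining possible


# Backslash continuation: three physical lines, one logical line.
with_backslash = FakeReader() \
    .option("header", "true") \
    .option("sep", ",")

# The same statement written on a single line.
single_line = FakeReader().option("header", "true").option("sep", ",")

# Parentheses also let a statement span lines, with no backslashes needed.
with_parens = (
    FakeReader()
    .option("header", "true")
    .option("sep", ",")
)

print(with_backslash.options == single_line.options == with_parens.options)
```

All three forms build the same object. The parenthesized form is often preferred in Python because a stray space after a backslash is a syntax error.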

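As for why the quotation marks are needed: anything you put inside quotes is treated as a string. "header", "inferSchema" and the others are option names that Spark recognizes, so you need to pass them as strings, as they are.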

This answer may help you: https://stackoverflow.com/a/56933052/6633728

score 0 · Accepted Answer

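The backslash "\" is used at the end of a line to indicate that the code after it is considered part of the same line. This is mostly done with long statements that would otherwise stretch across one very long line.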

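inferSchema is used to infer the data types of the columns in the DataFrame. If we set inferSchema to true, Spark reads through the data while loading it in order to infer each column's data type, which costs an extra pass over the data.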
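As a rough illustration of what "inferring types by reading the data" means, here is a toy sketch in plain Python. It is not Spark's actual algorithm; the `infer_type` helper, the sample data and the type names it returns are assumptions made only for this example.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of int, double, string that fits every value."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for v in values:
                caster(v)  # raises ValueError if v does not parse
            return name
        except ValueError:
            continue
    return "string"


# Hypothetical sample file contents; the first row is the header.
sample = "id,price,name\n1,9.5,apple\n2,12,pear\n"
rows = list(csv.reader(io.StringIO(sample)))
header, data = rows[0], rows[1:]

# Without inference every value is just a string, e.g. "1" and "9.5".
columns = list(zip(*data))

# Inference has to look at every value in a column before deciding.
inferred = {name: infer_type(col) for name, col in zip(header, columns)}
print(inferred)
```

The point of the sketch is that the type of a column can only be decided after looking at its values, which is why enabling inference means extra reading compared with treating every column as a string.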
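
Quoted strings ("") are used with the .option function, which adds parameters when reading a file. Many parameters can be added using the option function, such as header, inferSchema, sep, schema, etc.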

pyspark.sql.DataFrameReader.csv

You can refer to the link above for further help.
