
I am trying to scrape an XML file and create a dataframe from the tags in the XML file. I am working on Databricks using PySpark.

The XML file:

<?xml version="1.0" encoding="UTF-8"?>
<note>
  <shorttitle>shorttitle_1</shorttitle>
  <shorttitle>shorttitle_2</shorttitle>
  <shorttitle>shorttitle_3</shorttitle>
  <shorttitle>shorttitle_4</shorttitle>
</note>

My code seems to scrape the XML from the page and create a list from the tags, but when I create my dataframe and try to pass in said list, I just get a dataframe full of null values.

Code:

from pyspark.sql.types import *
from pyspark.sql.functions import *
import requests
from bs4 import BeautifulSoup


res = requests.get("http://files.fakeaddress.com/files01.xml")
soup = BeautifulSoup(res.content,'html.parser')
short_title = soup.find_all('shorttitle')[0:2]

field = [StructField("Short_Title",StringType(), True)]

schema = StructType(field)

df = spark.createDataFrame(short_title, schema)

Output:

+-----------+
|Short_Title|
+-----------+
|       null|
|       null|
+-----------+

Desired output:

+-------------+
|Short_Title  |
+-------------+
|shorttitle_1 |
|shorttitle_2 |
+-------------+

2 Answers


You can do this with Apache Spark's XML processing via the Databricks spark-xml package. Below is a sample code snippet for the same; it assumes the XML file has first been copied onto HDFS or local storage.

from pyspark.sql.types import StructType, StructField, ArrayType, StringType
from pyspark.sql.functions import explode

# With rowTag "note", the repeated <shorttitle> children of each <note>
# are read into a single array column, so the field name must match the tag.
schema = StructType([
    StructField("shorttitle", ArrayType(StringType()), True)
])

df = (spark.read
      .format("xml")                # requires the spark-xml package on the cluster
      .option("rowTag", "note")
      .schema(schema)
      .load("files01.xml"))

# One row per title
df.select(explode("shorttitle").alias("Short_Title")).show()
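
With the example file this prints all four short titles, one per row; keep just the first two (for example with .limit(2)) to match the desired output above.
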
answered 2020-01-30T15:45:39.433

You can use the Spark-XML package, which creates a Spark Dataframe directly from your XML file(s) without any further hassle. It only becomes more complicated when you have nested keys in your XML file.
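
A minimal sketch of what that looks like, assuming the package is attached to the cluster and using the files01.xml file from the question, with the schema left to spark-xml to infer:

# Read the XML directly; spark-xml infers the schema when none is supplied.
df = (spark.read
      .format("xml")
      .option("rowTag", "note")
      .load("files01.xml"))

df.printSchema()  # the repeated <shorttitle> tag is inferred as array<string>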

Installation of the package on your Databricks cluster is fairly straightforward using the Maven coordinates the project provides. However, I am uncertain whether the package is still being actively maintained.
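
For reference, the coordinates take the form com.databricks:spark-xml_2.12:<version>; pick the build whose Scala version matches your cluster's Spark runtime, and double-check the exact coordinates against the project's own documentation.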

answered 2020-01-30T12:51:55.283