apache-spark - Spark saveAsTable with LOCATION at the root of an s3 bucket causes NullPointerException

Question

I am using Spark 3.0.1 and my partitioned table is stored in s3. The problem is described below.

Table creation

Create table root_table_test_spark_3_0_1 (
    id string,
    name string
)
USING PARQUET
PARTITIONED BY (id)
LOCATION  's3a://MY_BUCKET_NAME/'

Code that causes the NullPointerException on the second run

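import org.apache.spark.sql.SaveMode
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

// Assumed definition; the original post does not show this class.
case class MinimalObject(id: String, name: String)

Seq(MinimalObject("id_1", "name_1"), MinimalObject("id_2", "name_2"))
      .toDS()
      .write
      .partitionBy("id")
      .mode(SaveMode.Append)
      .saveAsTable("root_table_test_spark_3_0_1")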

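When the Hive metastore is empty everything works, but the problem occurs when Spark runs getCustomPartitionLocations in InsertIntoHadoopFsRelationCommand (for example, on the second run).

It actually calls the following method from org.apache.hadoop.fs.Path: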

/** Adds a suffix to the final name in the path.*/
public Path suffix(String suffix) {
    return new Path(getParent(), getName()+suffix);
}

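But getParent() returns null when we are at the root, which results in a NullPointerException. The only option I am currently considering is to override this method to do something like: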

/** Adds a suffix to the final name in the path.*/
public Path suffix(String suffix) {
    return (isRoot()) ? new Path(uri.getScheme(), uri.getAuthority(), suffix) : new Path(getParent(), getName()+suffix);
}
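
To make the intended behavior concrete without Hadoop on the classpath, here is a minimal stand-alone sketch of the same root-safe logic. It uses java.net.URI in place of org.apache.hadoop.fs.Path, and the class name RootSafeSuffix and its methods are hypothetical, not part of Hadoop or Spark. It assumes the suffix starts with "/", as in the partition suffix "/id=id" used in the reproduction.

```java
import java.net.URI;

/** Hypothetical stand-in for a root-safe suffix; not part of Hadoop or Spark. */
public class RootSafeSuffix {

    /** Returns true when the URI has no path component below the bucket root. */
    public static boolean isRoot(URI location) {
        String path = location.getPath();
        return path == null || path.isEmpty() || path.equals("/");
    }

    /**
     * Appends a suffix such as "/id=id_1" to a table location.
     * At the root there is no parent or final name to split off, so the
     * result is built from the scheme and authority only.
     */
    public static String suffix(URI location, String suffix) {
        String base = location.getScheme() + "://" + location.getAuthority();
        if (isRoot(location)) {
            return base + suffix;
        }
        // Drop a trailing slash so "some_table/" and "some_table" behave the same.
        String path = location.getPath();
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return base + path + suffix;
    }

    public static void main(String[] args) {
        // Root location: the case that fails in Path.suffix.
        System.out.println(suffix(URI.create("s3a://MY_BUCKET_NAME/"), "/id=id_1"));
        // Non-root location: the case that already works.
        System.out.println(suffix(URI.create("s3a://MY_BUCKET_NAME/some_table"), "/id=id_1"));
    }
}
```

This only illustrates the branching on the root case; an actual fix would have to live in Path.suffix itself, which is exactly what my runtime does not let me change.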

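Has anyone run into problems when the LOCATION of a Spark Hive table is at the root level? Is there any workaround? Are there any known open issues?

My runtime does not allow me to override the Path class and fix the suffix method, and I cannot move my data out of the bucket's root because it has been there for 2 years.

The problem arises because I am migrating from Spark 2.1.0 to Spark 3.0.1, and the check for custom partition locations appeared in Spark 2.2.0 (https://github.com/apache/spark/pull/16460).

The whole context helps to understand the problem, but you can reproduce it easily with: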

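import org.apache.hadoop.fs.Path
val path: Path = new Path("s3a://MY_BUCKET_NAME/")
// getParent() is null for the root path, so suffix() throws NullPointerException
println(path.suffix("/id=id"))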

FYI, the hadoop-common version is 2.7.4. The full stack trace follows:

NullPointerException
    at org.apache.hadoop.fs.Path.<init>(Path.java:104)
    at org.apache.hadoop.fs.Path.<init>(Path.java:93)
    at org.apache.hadoop.fs.Path.suffix(Path.java:361)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:262)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.getCustomPartitionLocations(InsertIntoHadoopFsRelationCommand.scala:260)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:107)
    at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:575)
    at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:218)
    at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:166)

Thanks

score 1 · Accepted Answer

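Looks like a case where the Spark code calls Path.suffix("something") and, because the root path has no parent, an NPE is triggered.

Long-term fix

1. File a JIRA on issues.apache.org against HADOOP; provide a patch with a test that fixes suffix() so it degrades properly when called on the root path. Best for everyone.
2. Don't use the root path as the destination of a table.
3. Do both of these.

Option #2 should avoid other surprises about how tables get created, committed, and so on. Some code may fail because an attempt to delete the root of the path (here "s3a://some-bucket") won't delete the root, will it?

Put differently: root directories have "odd" semantics everywhere. Most of the time you don't notice this on a local FS because you never try to use / as a destination of work, and so never get surprised that rm -rf / is different from rm -rf /subdir, etc. Spark, Hive, etc. were never written to use / as a destination of work, so you get to see the failures.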
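
For new tables, option #2 can be enforced with a small pre-flight check before a LOCATION is handed to Spark. The following is only a sketch under assumptions: the class TableLocationGuard and its methods are hypothetical helpers, not part of Spark or Hadoop, and it treats a location as a root when its URI path is empty or "/".

```java
import java.net.URI;

/** Hypothetical pre-flight check; not part of Spark or Hadoop. */
public class TableLocationGuard {

    /** True for locations such as "s3a://some-bucket" or "s3a://some-bucket/". */
    public static boolean isBucketRoot(String location) {
        String path = URI.create(location).getPath();
        return path == null || path.isEmpty() || path.equals("/");
    }

    /** Returns the location unchanged, or throws if it is a bucket root. */
    public static String requireNonRoot(String location) {
        if (isBucketRoot(location)) {
            throw new IllegalArgumentException(
                "Table location must not be a bucket root: " + location);
        }
        return location;
    }

    public static void main(String[] args) {
        // A subdirectory passes through untouched.
        System.out.println(requireNonRoot("s3a://some-bucket/tables/my_table"));
        // A bucket root is reported instead of failing later inside a write.
        System.out.println(isBucketRoot("s3a://some-bucket/"));
    }
}
```

It does nothing for a table that already lives at the root; it only keeps new tables from ending up there.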
