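We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2 and CentOS 7.2. We wrote some new code using the Spark DataFrame/Dataset API, but noticed incorrect join results after writing the data to Windows Azure Storage Blob (the default HDFS location) and then reading it back. I have been able to reproduce the problem with the following snippet, run on the cluster.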
case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)
val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS
dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
Output:
+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+
+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+
+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|        0|      1|  1.0|
+-----+---------+-----+---------+-------+-----+
This is correct. However, after writing the data out and reading it back, we see this:
dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")
val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]
dims2.show
cent2.show
dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
Output:
+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+
+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+
+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|     null|   null| null|
+-----+---------+-----+---------+-------+-----+
However, using the RDD API produces the correct result:
dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row => (row.dimension, row) ) ).take(5)
res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0))))
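We tried changing the output format to ORC instead of Parquet, but we saw the same result. Running Spark 2.0 locally rather than on the cluster does not show the problem. Running Spark in local mode on the master node of the Hadoop cluster also works. We only see the problem when running on YARN.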
This also seems very similar to this issue: https://issues.apache.org/jira/browse/SPARK-10896
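In case it helps narrow this down, here is a minimal sketch of how the failing join could be inspected further. It is meant for spark-shell on the cluster (where `spark` is predefined) and reuses the Parquet paths written above. The idea that the small inputs are planned as a broadcast join, and that disabling broadcast changes the plan, is an assumption on my part, not something I have confirmed as the cause.

```scala
// Diagnostic sketch for spark-shell (Spark 2.0); `spark` is the predefined SparkSession.
import spark.implicits._

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

// Check that the join key has the same type on both sides after the round trip.
dims2.printSchema()
cent2.printSchema()

// Print the logical and physical plans to see which join strategy is chosen.
val joined = dims2.join(cent2, dims2("dimension") === cent2("dimension"))
joined.explain(true)

// Assumption: the small inputs are planned as a broadcast join.
// A threshold of -1 disables broadcast joins, so a join built after this
// setting should be planned with a different strategy.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
val joinedNoBroadcast = dims2.join(cent2, dims2("dimension") === cent2("dimension"))
joinedNoBroadcast.explain(true)
joinedNoBroadcast.show()
```

If the second `show` returns the matching row while the first does not, that would point at the join strategy rather than at the Parquet or ORC files themselves. If both are wrong, the plans printed by `explain(true)` can at least be compared against the ones produced in local mode.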