0

我们有一个在 Hadoop 2.7.2、Centos 7.2 上运行 Apache Spark 2.0 的集群。我们使用 Spark DataFrame/DataSet API 编写了一些新代码,但在将数据写入 Windows Azure 存储 Blob(默认 HDFS 位置)然后读取数据后,我们注意到连接结果不正确。我已经能够使用在集群上运行的以下代码片段来复制该问题。

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show

输出

+-----+---------+-----+                                                         
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|        0|      1|  1.0|
+-----+---------+-----+---------+-------+-----+

哪个是对的。然而,在写入和读取数据之后,我们看到了这个

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show

输出

+-----+---------+-----+                                                         
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|     null|   null| null|
+-----+---------+-----+---------+-------+-----+

但是,使用 RDD API 会产生正确的结果

dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row => (row.dimension, row) ) ).take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0))))

我们尝试将输出格式更改为 ORC 而不是 parquet,但我们看到了相同的结果。在本地而非集群上运行 Spark 2.0 不会出现此问题。在 Hadoop 集群的主节点上以本地模式运行 spark 也可以。只有在 YARN 上运行时,我们才会看到这个问题。

这似乎也与此问题非常相似:https ://issues.apache.org/jira/browse/SPARK-10896

4

1 回答 1

0

此问题已由https://issues.apache.org/jira/browse/SPARK-17806中提交的拉取请求修复

于 2016-10-17T15:13:03.953 回答