val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000)).toDF("id","name","city","credit_score","credit_limit")

val changedDF= Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000)).toDF("id","name","city","creditscore","credit_limit")

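The two DataFrames above have the same schema, and I want to find the ids whose values have changed in the second DataFrame (changedDF). I tried Spark's except() function, but it gives me two rows. id is the common column between the two DataFrames.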

changedDF.except(originalDF).show
+---+------+------+-----------+------------+
| id|  name|  city|creditscore|credit_limit|
+---+------+------+-----------+------------+
|  4|Joshua|cochin|        612|       85000|
|  2| sunil| noida|        650|       90000|
+---+------+------+-----------+------------+

But I only want the common ids whose values have changed, like this:

+---+------+------+-----------+------------+
| id|  name|  city|creditscore|credit_limit|
+---+------+------+-----------+------------+
|  2| sunil| noida|        650|       90000|
+---+------+------+-----------+------------+

Is there a way to find only the common ids whose data has changed? Can anyone suggest an approach I could follow to achieve this?

1 Answer

You can inner join the DataFrames, which gives you only the rows with common ids.

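import org.apache.spark.sql.functions.col

// Keep originalDF rows whose id also exists in changedDF,
// then drop the rows that are identical in changedDF.
originalDF.alias("a").join(changedDF.alias("b"), col("a.id") === col("b.id"), "inner")
  .select("a.*")
  .except(changedDF)
  .show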

This returns the common ids whose values changed, shown with their values from originalDF:

+---+-----+-----+------------+------------+
| id| name| city|credit_score|credit_limit|
+---+-----+-----+------------+------------+
|  2|sunil|noida|         600|       80000|
+---+-----+-----+------------+------------+
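
The output above carries the old values (600, 80000), while the question asks for the changed ones (650, 90000). A minimal sketch of one way to get the changedDF side instead, assuming a local SparkSession: take the rows of changedDF that are not in originalDF, then keep only those whose id also exists in originalDF with a left semi join. The name changedCommon is only illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("changed-ids").getOrCreate()
import spark.implicits._

val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000))
  .toDF("id","name","city","credit_score","credit_limit")
val changedDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000))
  .toDF("id","name","city","creditscore","credit_limit")

// except() leaves the changedDF rows that have no identical row in originalDF (ids 2 and 4).
// The left semi join then keeps only rows whose id is present in originalDF,
// without adding any originalDF columns to the result.
val changedCommon = changedDF
  .except(originalDF)
  .join(originalDF.select("id"), Seq("id"), "left_semi")

changedCommon.show()
```

With the sample data this should leave only the row for id 2, with the values from changedDF. Note that except() compares columns by position, so the differing column names (credit_score vs creditscore) do not affect the result.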
answered 2020-02-26T12:59:48.733