scala - Apache Spark: update a row in an RDD or Dataset based on another row

Question

I am trying to figure out how to update some rows based on another row.

For example, I have some data like

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
2, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

I want to update users in the same city to the same groupId (either 1 or 2)

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
1, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

How can I achieve this in my RDD or Dataset?

So, for the sake of completeness, what if the Id is a String? Will dense rank not work?

For example:

Id | username | ratings | city
--------------------------------
a, philip, 2.0, montreal, ...
b, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...

So the result looks like this:

grade | username | ratings | city
--------------------------------
a, philip, 2.0, montreal, ...
a, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...

score 2 · Accepted Answer

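A clean way to do this is to use dense_rank() from the Window functions. It enumerates the unique values in the column your Window is ordered by. Because city is a String column, the ranks increase in alphabetical order.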

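import org.apache.spark.sql.functions.dense_rank
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // enables the $"city" column syntax

val df = spark.createDataFrame(Seq(
  (1, "philip", 2.0, "montreal"),
  (2, "john", 4.0, "montreal"),
  (3, "charles", 2.0, "texas"))).toDF("Id", "username", "rating", "city")

// Order by city only; an unpartitioned window pulls all rows into one partition.
val w = Window.orderBy($"city")
// dense_rank() leaves no gaps after ties, so the groups are numbered 1, 2, ...
// (rank() would give 1, 1, 3 here).
df.withColumn("id", dense_rank().over(w)).show()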

+---+--------+------+--------+
| id|username|rating|    city|
+---+--------+------+--------+
|  1|  philip|   2.0|montreal|
|  1|    john|   4.0|montreal|
|  2| charles|   2.0|   texas|
+---+--------+------+--------+
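
For the String Id case raised in the question, dense_rank() still works, because it ranks by city and not by Id. It yields integers, though, not the "a" / "c" labels shown in the desired output. One possible way to get those labels is a min aggregate over a window partitioned by city. This is only a sketch, assuming the group label should be the smallest Id (in string order) within each city. The names byCity and grouped and the local SparkSession setup are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-id-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("a", "philip", 2.0, "montreal"),
  ("b", "john", 4.0, "montreal"),
  ("c", "charles", 2.0, "texas")).toDF("Id", "username", "rating", "city")

// Assumption: the group label is the smallest Id within each city.
val byCity = Window.partitionBy($"city")

// Replace every row's Id with the minimum Id of its city.
val grouped = df.withColumn("Id", min($"Id").over(byCity))

grouped.show()
```

With the sample rows, philip and john should both end up with Id "a", while charles keeps "c". The same expression applies unchanged to numeric Ids, where it would produce 1, 1, 3 for the first example.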

score 0 · Accepted Answer

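Try:

import org.apache.spark.sql.functions.monotonically_increasing_id

// Give each distinct city an id, then join it back onto the rows.
// The generated ids are unique but not guaranteed to be consecutive.
df.select("city").distinct.withColumn("id", monotonically_increasing_id()).join(df.drop("id"), Seq("city"))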
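
Since the question also mentions RDDs, the same "one id per distinct city, then join" idea can be sketched without DataFrames. This is only an illustration, assuming the rows are plain tuples of (id, username, rating, city) and that 1-based ids in alphabetical city order are acceptable. The names users, cityIds and regrouped are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-group-id-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Rows as (id, username, rating, city) tuples.
val users = sc.parallelize(Seq(
  (1L, "philip", 2.0, "montreal"),
  (2L, "john", 4.0, "montreal"),
  (3L, "charles", 2.0, "texas")))

// One (city, index) pair per distinct city; sorting first keeps the numbering deterministic.
val cityIds = users.map(_._4).distinct().sortBy(city => city).zipWithIndex()

// Key each row by city, join in the city's index, and overwrite the id (1-based).
val regrouped = users.keyBy(_._4).join(cityIds).values.map {
  case (user, idx) => user.copy(_1 = idx + 1)
}

regrouped.collect().foreach(println)
```

Unlike monotonically_increasing_id(), zipWithIndex() assigns consecutive indices, at the cost of an extra Spark job when the RDD has more than one partition.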
