I am trying to do some market-basket analysis with Spark MLlib on this dataset:
Purchase_ID Category Furnisher Value
1 , A , 1 , 7
1 , B , 2 , 7
2 , A , 1 , 1
3 , C , 2 , 4
3 , A , 1 , 4
3 , D , 3 , 4
4 , D , 3 , 10
4 , A , 1 , 10
5 , E , 1 , 8
5 , B , 3 , 8
5 , A , 1 , 8
6 , A , 1 , 3
6 , B , 1 , 3
6 , C , 5 , 3
7 , D , 3 , 4
7 , A , 1 , 4
The transaction value (Value) is repeated on every row of a given Purchase_ID. All I want is to return the categories of the top 3 purchases by value. Basically, I want to end up with this dataset:
D,A
E,B,A
A,B
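To make the expected output concrete, the transformation can be checked with plain Scala collections first (a sketch over the sample rows above; the Furnisher column is dropped since it plays no role here):

```scala
// (Purchase_ID, Category, Value) for each sample row above.
val rows = Seq(
  (1, "A", 7), (1, "B", 7), (2, "A", 1),
  (3, "C", 4), (3, "A", 4), (3, "D", 4),
  (4, "D", 10), (4, "A", 10),
  (5, "E", 8), (5, "B", 8), (5, "A", 8),
  (6, "A", 3), (6, "B", 3), (6, "C", 3),
  (7, "D", 4), (7, "A", 4))

// Group rows into baskets (Value is the same within a basket),
// sort baskets by Value descending, keep 3, emit the category lists.
val top3 = rows.groupBy(_._1).toSeq
  .map { case (_, rs) => (rs.head._3, rs.map(_._2)) }
  .sortBy { case (value, _) => -value }
  .take(3)
  .map { case (_, cats) => cats.mkString(",") }

top3.foreach(println)  // D,A / E,B,A / A,B
```

The three surviving baskets are purchases 4, 5 and 1 (values 10, 8 and 7), which is exactly the desired dataset.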
To do this, I am trying the following code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val data = sc.textFile("PATH")
case class Transactions(Purchase_ID: String, Category: String, Furnisher: String, Value: String)
def csvToMyClass(line: String) = {
  val split = line.split(',').map(_.trim)  // trim the blanks around each " , "
  Transactions(split(0), split(1), split(2), split(3))
}
val df = data.map(csvToMyClass)
  .toDF("Purchase_ID", "Category", "Furnisher", "Value")
// Rank whole purchases by Value (cast to int: every field is parsed as a string),
// keep the top 3, and collect each purchase's categories.
val result = df
  .withColumn("rank", dense_rank().over(Window.orderBy(col("Value").cast("int").desc)))
  .where(col("rank") <= 3)
  .groupBy("Purchase_ID")
  .agg(collect_list("Category").as("Categories"))
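As a sanity check on what dense_rank should produce here, its tie behavior can be sketched with plain Scala collections (`denseRank` below is a hypothetical helper for illustration, not a Spark API): equal values share a rank, and the next distinct value takes the next consecutive rank.

```scala
// dense_rank, descending: equal values share a rank; the next distinct
// value gets the next consecutive rank (no gaps, unlike rank()).
def denseRank(values: Seq[Int]): Map[Int, Int] =
  values.distinct.sorted(Ordering[Int].reverse).zipWithIndex
    .map { case (v, i) => v -> (i + 1) }.toMap

// One Value per Purchase_ID 1..7, taken from the sample data above.
val basketValues = Seq(7, 1, 4, 10, 8, 3, 4)
val ranks = basketValues.map(denseRank(basketValues))
println(ranks)  // purchases 4, 5 and 1 get ranks 1, 2 and 3
```

Note that purchases 3 and 7 (both with Value 4) share one rank, so `rank <= 3` could keep more than three baskets if there were ties near the cutoff.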
The window specification is the part I am not sure about: partitioning by Category ranks rows within each category, while what I need is a single ranking of whole purchases by their Value. Is the dense_rank approach above the right way to do this?
Thanks a lot!