
I am trying to do some market basket analysis with Spark MLlib on this dataset:

Purchase_ID,Category,Furnisher,Value
1,A,1,7
1,B,2,7
2,A,1,1
3,C,2,4
3,A,1,4
3,D,3,4
4,D,3,10
4,A,1,10
5,E,1,8
5,B,3,8
5,A,1,8
6,A,1,3
6,B,1,3
6,C,5,3
7,D,3,4
7,A,1,4

The transaction value (Value) is the same for every row of a given Purchase_ID. All I want is to return the categories of the 3 purchases with the highest Value. Basically, I want to get back this dataset:

D,A
E,B,A
A,B

To do this, I am trying the following code:

val data = sc.textFile("PATH")

case class Transactions(Purchase_ID: String, Category: String, Furnisher: String, Value: String)

def csvToMyClass(line: String) = {
  val split = line.split(',')
  Transactions(split(0), split(1), split(2), split(3))
}

val df = data.map(csvToMyClass)
  .toDF("Purchase_ID", "Category", "Furnisher", "Value")
  .(select("Purchase_ID","Category") FROM (SELECT "Purchase_ID","Category", dense_rank() over (PARTITION BY "Category" ORDER BY "Value" DESC) as rank) tmp WHERE rank <= 3)
  .distinct()

The ranking function is not right...
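In plain English, the logic I am after is: rank whole purchases by their Value (cast to a number), keep the purchases with rank <= 3, and collect the categories of each one. My best guess at expressing that with the DataFrame API (assuming Spark 2.x with a `SparkSession` named `spark`, and reusing the `csvToMyClass` parser from above) is something like:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, collect_list}

val spark = SparkSession.builder.appName("TopPurchases").getOrCreate()
import spark.implicits._

val df = spark.sparkContext.textFile("PATH")
  .map(csvToMyClass)
  .toDF()

// Value is parsed as a String, so cast it before ordering
// ("7" > "10" lexicographically, but 10 > 7 numerically).
// A global dense_rank (no PARTITION BY) works here because every
// row of a purchase carries the same Value; note that a window
// without a partition moves all rows to one partition, which is
// fine for small data.
val ranked = df
  .withColumn("rank",
    dense_rank().over(Window.orderBy(col("Value").cast("int").desc)))
  .filter(col("rank") <= 3)

// One row per purchase, with its categories collected into a list
val result = ranked
  .groupBy("Purchase_ID")
  .agg(collect_list("Category").as("Categories"))

result.show()
```

Is this roughly the right shape, or is there a cleaner way to rank by purchase rather than by category?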

Does anyone know how to solve this?

Thanks a lot!
