apache-spark - How to do quantile discretization in Spark?

Question

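I want to discretize an RDD[Float] into 10 quantile buckets without using Spark.ML, so I need to compute the 10th percentile, 20th percentile, ..., 80th percentile, and 90th percentile.

The dataset is very large and cannot be collected to the driver!

Is there an efficient algorithm for this?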

score 0 · Accepted Answer

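If you are using Spark 2.0 or later, this functionality is already provided. You have to convert your RDD[Float] to a DataFrame, then use approxQuantile(String col, double[] probabilities, double relativeError) from DataFrameStatFunctions. The documentation says:

This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in "Space-efficient Online Computation of Quantile Summaries" by Greenwald and Khanna.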
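
As a rough illustration of the answer above, here is a minimal Scala sketch that computes the nine decile cut points with approxQuantile and then assigns each value to one of 10 buckets without collecting the data. The sample data, the column name "value", the relative error of 0.01, the local master setting, and the bucket-assignment rule are all assumptions made for the example, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

object QuantileBins {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would use the cluster's master.
    val spark = SparkSession.builder()
      .appName("quantile-bins")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the large RDD[Float] from the question.
    val scores = spark.sparkContext.parallelize((1 to 1000).map(_.toFloat))

    // Convert the RDD to a single-column DataFrame (column name is an assumption).
    val df = scores.toDF("value")

    // Probabilities 0.1, 0.2, ..., 0.9 give the 10th through 90th percentiles.
    val probabilities = (1 to 9).map(_ / 10.0).toArray

    // relativeError = 0.01 is an assumed trade-off; 0.0 requests exact quantiles,
    // which can be very expensive on large data.
    val cutPoints: Array[Double] = df.stat.approxQuantile("value", probabilities, 0.01)

    // Only the nine cut points reach the driver; broadcast them to the executors.
    val cuts = spark.sparkContext.broadcast(cutPoints)

    // Assumed bucketing rule: bucket index = number of cut points strictly below the value (0 to 9).
    val binned = scores.map(v => (v, cuts.value.count(_ < v)))

    binned.take(5).foreach(println)
    spark.stop()
  }
}
```

Only the array of cut points is returned to the driver, so the dataset itself stays distributed. A smaller relativeError gives tighter quantile estimates at the cost of more memory and time.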
