python - PySpark pairwise distance between rows

Question

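I'm currently working with PySpark and would like to know whether there is a way to compute pairwise distances between rows. For example, given a dataset like this: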

+--------------------+------------+--------+-------+-------+
|             product| Mitsubishi | Toyota | Tesla | Honda |
+--------------------+------------+--------+-------+-------+
|Mitsubishi          |           0|     0.8|    0.2|      0|
|Toyota              |           0|       0|      0|      0|  
|Tesla               |         0.1|     0.4|      0|    0.3|
|Honda               |           0|     0.5|    0.1|      0|
+--------------------+------------+--------+-------+-------+

I'm asking because in pandas I used this code with sklearn:

from sklearn.metrics import pairwise_distances
array = df1_corr.drop(columns=['new_product_1']).values
correlation = pairwise_distances(array, array, metric='correlation')

What about PySpark? Is there a built-in pairwise_distance, or something equivalent in Spark ML?

score 0 · Accepted Answer

One way to solve this is with a pandas_udf. Here are a good read and an example similar to your scenario:

https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

https://towardsdatascience.com/scalable-python-code-with-pandas-udfs-a-data-science-application-dd515a628896
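As a rough illustration of the pandas-based approach, here is a minimal sketch, not a definitive implementation. It assumes the whole table fits in memory on a single executor, that the key column is named product (as in the table in the question) and holds unique values, and that every other column is numeric. The helper names correlation_distances and pairwise_on_spark are hypothetical. The distance is computed with NumPy as 1 minus the Pearson correlation, which is the definition behind the 'correlation' metric; rows with zero variance (such as Toyota in the sample) come out as NaN. The Spark wiring uses mapInPandas (Spark 3.0+) rather than a scalar pandas_udf, because every row has to be visible at once.

```python
import numpy as np
import pandas as pd


def correlation_distances(pdf: pd.DataFrame, key: str = "product") -> pd.DataFrame:
    """Return 1 - Pearson correlation for every pair of rows in `pdf`."""
    values = pdf.drop(columns=[key]).to_numpy(dtype=float)
    # Zero-variance rows have an undefined correlation; silence the
    # resulting floating-point warnings and let them come out as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        dist = 1.0 - np.corrcoef(values)
    labels = pdf[key].tolist()
    out = pd.DataFrame(dist, columns=labels)
    out.insert(0, key, labels)
    return out


def pairwise_on_spark(sdf, key: str = "product"):
    """Hypothetical wiring: apply correlation_distances to a Spark DataFrame."""
    # Imported lazily so the pandas part above can be used without Spark.
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    # Sort the labels so the output schema has a deterministic column order.
    labels = sorted(row[key] for row in sdf.select(key).collect())
    schema = StructType(
        [StructField(key, StringType())]
        + [StructField(name, DoubleType()) for name in labels]
    )

    def run(batches):
        # Gather every Arrow batch of the single partition before computing,
        # and sort the rows the same way as the schema columns.
        pdf = pd.concat(list(batches), ignore_index=True).sort_values(key)
        yield correlation_distances(pdf, key)

    # coalesce(1) pulls all rows into one partition (assumes a small table).
    return sdf.coalesce(1).mapInPandas(run, schema)


# Local check on the sample data from the question (pandas only).
sample = pd.DataFrame(
    {
        "product": ["Mitsubishi", "Toyota", "Tesla", "Honda"],
        "Mitsubishi": [0.0, 0.0, 0.1, 0.0],
        "Toyota": [0.8, 0.0, 0.4, 0.5],
        "Tesla": [0.2, 0.0, 0.0, 0.1],
        "Honda": [0.0, 0.0, 0.3, 0.0],
    }
)
result = correlation_distances(sample)
print(result)
```

On a cluster, the assumed call would be pairwise_on_spark(df) with df being the Spark DataFrame from the question. Because the sketch funnels everything through one partition, it only suits tables with a modest number of rows; a larger dataset would need a different strategy, such as a self cross join.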
