r - How to build a vocabulary-based tf-idf data frame in sparklyr

Question

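I need to build a tf-idf matrix/data frame in sparklyr that uses the terms/words as column names instead of indices. I went with ft_count_vectorizer because it stores the vocabulary. But after computing tf-idf I'm stuck: I can't map the terms to their tf-idf values. Any help would be appreciated. Here's what I tried.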

tf_idf<-cleantext %>%
  ft_tokenizer("Summary", "tokenized") %>%
  ft_stop_words_remover(input.col = "tokenized", output.col = "clean_words",
                        ml_default_stop_words(sc,language = ("english"))) %>%
  ft_count_vectorizer(input_col = "clean_words",output_col="tffeatures")%>%
  ft_idf(input_col="tffeatures",output_col="tfidffeatures")

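tf_idf is a spark_tbl that also contains clean_words (the vocabulary) and the tfidf features. Both come back as list columns. I need to use the tfidf features as the values, with the clean_words terms as the column headers. What's the best way to do that?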

score 0 · Accepted Answer

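While something like this is technically possible, it doesn't have many practical applications. Apache Spark isn't optimized for execution plans over wide data, such as the ones you can generate by expanding a vectorized column.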

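If you want to go ahead anyway, you'll have to extract the vocabulary retained by the CountVectorizer. One possible approach is to use ML Pipelines (see my answer to "How to train a ML model in sparklyr and predict new values on another dataframe?" for a detailed explanation).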

Using the transformers you already have, you can define a Pipeline and fit a PipelineModel:

model <- ml_pipeline(
  ft_tokenizer(sc, "Summary", "tokenized"),
  # stop_words is passed by name so it is not matched to another argument
  ft_stop_words_remover(sc, input_col = "tokenized",
                        output_col = "clean_words",
                        stop_words = ml_default_stop_words(sc, language = "english")),
  # output_col must match the input_col of ft_idf below
  ft_count_vectorizer(sc, input_col = "clean_words",
                      output_col = "tffeatures"),
  ft_idf(sc, input_col = "tffeatures", output_col = "tfidffeatures")
) %>% ml_fit(cleantext)

Then retrieve the CountVectorizerModel and extract the vocabulary:

vocabulary <- ml_stage(model, "count_vectorizer")$vocabulary %>% unlist()
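The vocabulary is what ties a vector position back to a term: Spark reports vector indices 0-based, so index i corresponds to vocabulary[i + 1] in R. Here is a minimal base-R sketch of that mapping. The vocabulary and the sparse row are made-up values for illustration, not output from the pipeline above.

```r
# Hypothetical vocabulary, in the order a CountVectorizerModel would store it
vocabulary <- c("spark", "data", "frame", "tfidf")

# Hypothetical sparse tf-idf row: 0-based indices (as Spark reports them) and values
indices <- c(0L, 3L)
values  <- c(0.41, 1.10)

# Expand to a dense vector as long as the vocabulary
dense <- numeric(length(vocabulary))
dense[indices + 1L] <- values   # shift by one because R is 1-based

# Attach the terms as names, so each value is labelled with its word
named_row <- setNames(dense, vocabulary)
print(named_row)
```

Terms that do not occur in the document simply keep a value of 0.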

Finally, transform the data, apply sdf_separate_column, and select the columns of interest:

ml_transform(model, cleantext) %>% 
  sdf_separate_column("tfidffeatures", vocabulary) %>% 
  select(one_of(vocabulary))
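Since wide tables are the weak spot here, it may help to expand only the terms you actually need. As I read the sdf_separate_column documentation, `into` can also be a list mapping column names to 1-based vector positions; treat that as an assumption to check against your sparklyr version. The sketch below builds such a list in base R. The vocabulary and `terms_of_interest` are hypothetical.

```r
# Hypothetical vocabulary; in practice this comes from the fitted CountVectorizerModel
vocabulary <- c("spark", "data", "frame", "tfidf")

# Hypothetical terms we want as columns; "missing" is not in the vocabulary
terms_of_interest <- c("tfidf", "data", "missing")

# 1-based positions of the terms in the vocabulary (NA if a term is absent)
positions <- match(terms_of_interest, vocabulary)
keep <- !is.na(positions)

# Named list: column name -> 1-based vector index; absent terms are dropped
into <- setNames(as.list(positions[keep]), terms_of_interest[keep])
str(into)

# Assumed usage (needs a live Spark connection, so left commented out):
# ml_transform(model, cleantext) %>%
#   sdf_separate_column("tfidffeatures", into = into)
```

This keeps the number of generated columns equal to the number of requested terms rather than the full vocabulary size.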
