machine-learning - How to classify text with Knime

Question

I'm trying to classify some data using KNIME with the KNIME Labs deep learning plugin.

I have about 16,000 products in my DB, but I know the category of only about 700 of them.

I'm trying to classify as many as possible using some DM (data mining) technique. I've downloaded some plugins for KNIME, so now I have some deep learning tools as well as some text tools.

Here is my workflow, I'll use it to explain what I'm doing:

I transform the product name into a vector and feed that in. Then I train a DL4J learner with DeepMLP. (I don't really understand it all; it was the one that I thought gave me the best results.) Then I try to apply the model to the same data set.

I thought I would get a result with the predicted classes. Instead I'm getting a column with output_activations that seems to hold a pair of doubles. When I sort by this column, related rows end up close to each other. But I was expecting to get the classes.

Here is a screenshot of the result table, where you can see the output alongside the input.

In the column selection it uses just converted_document, and des_categoria is selected as the Label Column (learner node config). In the Predictor node I checked "Append SoftMax Predicted Label?".

nom_produto is the text column that I'm trying to use to predict the des_categoria column, which is the product category.

I'm a real newbie at DM and DL. Any help with solving what I'm trying to do would be awesome. Also feel free to suggest some learning material about what I'm attempting to achieve.

PS: I also tried to apply it to the unclassified data (17,000 products), but I got the same result.

score 2 · Accepted Answer

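I won't answer with a workflow, because it is not going to be a simple one. However, be sure to look at the text mining examples on the KNIME server, namely the ones that use the bag-of-words approach.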

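The task

Mapping products to categories should be a straightforward data mining task, because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train, though, you may need more than 700 instances to learn from.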

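Some resources

Here are some resources, only the first of which is truly specialized in text mining:

(1) Introduction to Information Retrieval, in particular chapter 13;
(2) Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10); also do not forget the chapter on similarity (chapter 6);
(3) Machine Learning with R has the advantage of being accessible (chapter 4 provides an example of text classification with R code).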

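Preprocessing

First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, after you have transformed the product labels with Strings to Document:

Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter; however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
be careful not to use either of the following without first testing its impact: N Chars Filter (g may be a useful word) and Number Filter (numbers may indicate quantities, which may be useful for classification).

If you run into trouble with the relevant nodes (e.g. Punctuation Erasure can be surprisingly tricky because of the tokenizer), you can always apply String Manipulation before converting with Strings to Document.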
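For readers who want to see the same preprocessing ideas outside KNIME, here is a minimal Python sketch. It is not what the KNIME nodes do internally; the function name `preprocess` and the `QUASI_STOP_WORDS` set are hypothetical, and stemming is left out to keep the example dependency-free.

```python
import re

# Hypothetical quasi-stop words; the real list depends on your product data.
QUASI_STOP_WORDS = {"product"}

def preprocess(label, stop_words=QUASI_STOP_WORDS):
    """Lowercase, erase punctuation, and drop quasi-stop words from a product label."""
    lowered = label.lower()                      # analogous to Case Convert
    no_punct = re.sub(r"[^\w\s]", " ", lowered)  # analogous to Punctuation Erasure
    tokens = no_punct.split()
    # Analogous to Dictionary Filter. Numbers and one-letter tokens such as "g"
    # are kept on purpose, since they may carry quantity information.
    return [t for t in tokens if t not in stop_words]

print(preprocess("Product: Dark Chocolate, 100 g"))
```

Note that neither short tokens nor numbers are filtered here, which mirrors the caution above about N Chars Filter and Number Filter.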

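Keep it short and simple: the lookup table

You could build a lookup table based on the 700 training instances. The book Data Mining Techniques, as well as resource (2), presents this approach in some detail. If any model performs worse than the lookup table, you should abandon that model.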
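One possible reading of the lookup-table baseline, sketched in Python: key the table on the normalized product name and store the most frequent known category. The names `build_lookup` and `classify` and the sample rows are hypothetical; keying on individual tokens instead of whole names would be another option.

```python
from collections import Counter, defaultdict

def build_lookup(labeled):
    """Map each normalized product name to its most frequent known category."""
    votes = defaultdict(Counter)
    for name, category in labeled:
        votes[name.strip().lower()][category] += 1
    return {name: counts.most_common(1)[0][0] for name, counts in votes.items()}

def classify(name, table, default=None):
    """Return the category for an exact (normalized) match, else the default."""
    return table.get(name.strip().lower(), default)

# Hypothetical rows standing in for the 700 labeled products.
labeled = [("Dark Chocolate 100 g", "sweets"),
           ("dark chocolate 100 g", "sweets"),
           ("Olive Oil 1 L", "oils")]
table = build_lookup(labeled)
print(classify("OLIVE OIL 1 L", table))      # exact match after normalization
print(classify("Sunflower Oil 1 L", table))  # unseen name falls back to the default
```

The share of held-out labeled products that this table classifies correctly gives the baseline any other model has to beat.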

What next?

If either the lookup table or k-nn works well for you, there is nothing else to add.
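The answer does not spell out how k-nn would be set up for product names. As a rough illustration only, the sketch below assumes Jaccard similarity over token sets and a simple majority vote; `jaccard`, `knn_predict` and the sample rows are hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_predict(name, labeled, k=3):
    """Vote among the k labeled products whose token sets are most similar."""
    query = set(name.lower().split())
    scored = sorted(
        ((jaccard(query, set(n.lower().split())), cat) for n, cat in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(cat for _, cat in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled products.
labeled = [("dark chocolate 100 g", "sweets"),
           ("milk chocolate 200 g", "sweets"),
           ("olive oil 1 l", "oils"),
           ("sunflower oil 1 l", "oils")]
print(knn_predict("white chocolate 100 g", labeled, k=3))
```

Other similarity measures (for example cosine similarity on term-frequency vectors) would fit the same structure; the choice should be tested against the lookup-table baseline.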

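If either of these methods fails, you may want to analyze the exact cases in which it fails. In addition, the training set may be too small, so you could manually classify another few hundred or few thousand instances.

If, after increasing the training set size, you are still dealing with a bad model, you can try the bag-of-words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag-of-words approach and Naive Bayes, but you will find the resources above useful for that purpose.

One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).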
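To make the role of Laplace smoothing concrete, here is a small multinomial Naive Bayes over bags of words, written in plain Python. It is a sketch under stated assumptions, not the KNIME node or the e1071 implementation; `train_nb`, `predict_nb` and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled):
    """Collect class counts, per-class token counts, and the vocabulary."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, category in labeled:
        class_counts[category] += 1
        token_counts[category].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab

def predict_nb(tokens, model, alpha=1.0):
    """Return the class with the highest log-posterior; alpha=1 is Laplace smoothing."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for category, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(token_counts[category].values()) + alpha * len(vocab)
        for t in tokens:
            if t in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[category][t] + alpha) / denom)
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical tokenized training products.
labeled = [(["dark", "chocolate"], "sweets"),
           (["milk", "chocolate"], "sweets"),
           (["olive", "oil"], "oils")]
model = train_nb(labeled)
print(predict_nb(["white", "chocolate"], model))
```

Without the `alpha` term, a token that never occurred in a class would give that class a probability of zero, so a product such as "olive chocolate" would score zero for both classes; smoothing keeps every class comparable.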
