cluster-analysis - Heavily unbalanced/skewed data clusters

Question

I am facing some issues with my k-means clustering results in Alteryx. I am trying to do topic modelling on a data set of around 5000 text descriptions. After cleaning and parsing the data and removing stop words and common words, I created a Document Term Matrix of 20 words and around 5000 documents.

After running K-Means Clustering in Alteryx, no matter how many clusters I specify, every cluster but one contains only 1 document, and the remaining cluster contains all the rest. For example:

2 Clusters

Cluster 1: 19 words
Cluster 2: 1 word

3 Clusters

Cluster 1: 18 words
Cluster 2: 1 word
Cluster 3: 1 word

5 Clusters

Cluster 1: 16 words
Cluster 2: 1 word
Cluster 3: 1 word
Cluster 4: 1 word
Cluster 5: 1 word

This happens no matter how many clusters I specify. I am looking for help understanding whether these results mean my data has problems or whether I did not use the correct settings.

Thanks in advance!

score 0 · Accepted Answer

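Did you look at your data after preprocessing?

It is likely that many documents are now empty, or contain only a single word.

Once the common words have been removed, there is not much left.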
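One way to check this outside Alteryx is to count how many terms survive in each document. The snippet below is only a rough sketch in plain Python: the function name `row_sparsity_report` and the toy matrix are made up for illustration, and it assumes the Document Term Matrix has been exported as a list of rows (one row per document, one count per retained word).

```python
from collections import Counter


def row_sparsity_report(dtm):
    """Summarise how many distinct terms each document still contains.

    `dtm` is assumed to be a list of rows, one per document, where each
    row holds the count of every retained term (hypothetical layout).
    """
    # Number of terms with a non-zero count in each document.
    nonzero_per_doc = [sum(1 for count in row if count > 0) for row in dtm]
    histogram = Counter(nonzero_per_doc)
    return {
        "documents": len(dtm),
        "empty": histogram.get(0, 0),
        "single_term": histogram.get(1, 0),
        "two_or_more_terms": sum(n for k, n in histogram.items() if k >= 2),
    }


# Toy data for illustration only: 6 documents x 4 retained terms.
toy_dtm = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 2, 0, 0],
    [1, 1, 0, 3],
    [0, 0, 0, 0],
]

report = row_sparsity_report(toy_dtm)
print(report)
```

If most documents fall into the "empty" or "single_term" buckets, k-means has almost nothing to separate them on, which would be consistent with one huge cluster plus a few singletons.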
