python - NLTK custom categorized corpus not reading files

Question

I created my own corpus, similar to the movie_reviews corpus in nltk (categorized by neg|pos).

The neg and pos folders contain txt files.

Code:

from nltk.corpus import CategorizedPlaintextCorpusReader

mr = CategorizedPlaintextCorpusReader(r'C:\mycorpus', r'(?!\.).*\.txt',
        cat_pattern=r'(neg|pos)/.*')

When I try to read the files or interact with any one of them, I can't.

For example, len(mr.categories()) runs but returns nothing:

>>>

I have read several documents and questions about custom categorized corpora, but I still can't get one to work.

Full code:

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

mr = CategorizedPlaintextCorpusReader(r'C:\mycorpus', r'(?!\.).*\.txt',
        cat_pattern=r'(neg|pos)/.*')

len(mr.categories())

Ultimately I want to run a Naive Bayes algorithm on my data, but I can't read the contents.

Paths: C:\mycorpus\pos

C:\mycorpus\neg

The pos folder contains a file "cv.txt", and the neg folder contains "example.txt".

score 3 · Accepted Answer

I'm on Linux, and the following modification of your code (with toy corpus files) works correctly for me:

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

import os


mr = CategorizedPlaintextCorpusReader(
    '/home/ely/programming/nltk-test/mycorpus',
    r'(?!\.).*\.txt',
    cat_pattern=os.path.join(r'(neg|pos)', '.*')
)

print(len(mr.categories()))

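This suggests that the problem is the use of / as the file system delimiter in the cat_pattern string when you are on a Windows system.

Using os.path.join as in my example, or pathlib if you are on Python 3, would be a good way to solve it: the code becomes OS-agnostic, and you don't get tripped up by regex escape slashes mixing with file system delimiters.

In fact, you can use this approach for every file system delimiter that appears in an argument string. It is generally good practice: it keeps the code portable and avoids strange string-munging tech debt.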
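
One caveat when a path separator is joined into a regular expression: on Windows the separator is a backslash, which is also the regex escape character. The sketch below is an illustration only and is not part of the code above. It uses the standard-library modules posixpath and ntpath (the implementations behind os.path on POSIX and on Windows) to show what the join produces on each platform; build_cat_pattern is a hypothetical helper that escapes the separator first. Whether NLTK matches cat_pattern against file IDs written with / or with the native separator is not established in this thread, so it is worth printing mr.fileids() before settling on a pattern.

```python
import ntpath      # the implementation behind os.path on Windows
import posixpath   # the implementation behind os.path on Linux/macOS
import re


def build_cat_pattern(sep):
    """Hypothetical helper: category regex for a given path separator."""
    # re.escape makes a backslash separator match a literal backslash
    # instead of acting as the regex escape character.
    return r'(neg|pos)' + re.escape(sep) + r'.*'


# What os.path.join(r'(neg|pos)', '.*') evaluates to on each platform.
posix_joined = posixpath.join(r'(neg|pos)', '.*')
windows_joined = ntpath.join(r'(neg|pos)', '.*')

print(repr(posix_joined))
print(repr(windows_joined))  # here the regex engine reads "\." as a literal dot

# Escaped variants, matched against paths written with each separator.
posix_match = re.fullmatch(build_cat_pattern('/'), 'pos/cv.txt')
windows_match = re.fullmatch(build_cat_pattern('\\'), 'neg\\example.txt')
print(posix_match.group(1), windows_match.group(1))
```

If the separator is spliced in without escaping, the Windows form of the pattern no longer describes "category folder, separator, file name", which is the kind of mix-up between regex escapes and path delimiters mentioned above.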

score 1 · Accepted Answer

It seems to me that this part of your code is odd:

cat_pattern=r'(neg|pos)/.*'

because you are on an MS-DOS-based system (Windows, I guess), where folder containment is written with \ rather than / (or else I have misunderstood).
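
If the relative paths being matched do use backslashes, one option is to normalize them to forward slashes before matching, so the pattern from the question can stay as written. This is a minimal sketch using only the standard library; category_of is a hypothetical helper for illustration and is not an NLTK function.

```python
import re
from pathlib import PureWindowsPath

# The pattern from the question, unchanged.
CAT_PATTERN = re.compile(r'(neg|pos)/.*')


def category_of(relative_path):
    """Hypothetical helper: category for a corpus-relative path, or None."""
    # PureWindowsPath accepts both "\" and "/" as separators on any OS;
    # as_posix() renders the path with forward slashes only.
    normalized = PureWindowsPath(relative_path).as_posix()
    match = CAT_PATTERN.match(normalized)
    return match.group(1) if match else None


for path in (r'pos\cv.txt', 'neg/example.txt', 'notes.txt'):
    print(path, '->', category_of(path))
```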
