lucene - How to view a Lucene index

Question

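I am trying to learn how Lucene works and what a Lucene index contains. Basically, I want to see how data is represented inside a Lucene index.

I am using lucene-core 8.6.0 as a dependency.

Below is my very basic Lucene code: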

    private Document create(File file) throws IOException {
        Document document = new Document();

        Field field = new Field("contents", new FileReader(file), TextField.TYPE_NOT_STORED);
        Field fieldPath = new Field("path", file.getAbsolutePath(), TextField.TYPE_STORED);
        Field fieldName = new Field("name", file.getName(), TextField.TYPE_STORED);

        document.add(field);
        document.add(fieldPath);
        document.add(fieldName);

        // Create the analyzer
        Analyzer analyzer = new StandardAnalyzer();

        // Create the IndexWriter, passing in the analyzer
        Path indexPath = Files.createTempDirectory("tempIndex");
        Directory directory = FSDirectory.open(indexPath);
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, indexWriterConfig);
        iwriter.addDocument(document);
        iwriter.close();
        return document;
    }

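Note: I understand the idea behind Lucene, the inverted index, but I do not understand how the Lucene library applies this concept, or how it creates its files so that searching is easy and practical.

I tried LIMO, but it did not help. Even though I gave the index location in web.xml, it does not work.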

score 4 · Accepted Answer

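If you want a good introductory code example that uses a current version of Lucene (building an index and then using it), you can start with the basic demo. The source code for the demo is available on GitHub.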

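If you want to explore your index data once it has been created, you can use Luke. If you have not used it before: to run Luke, download the binary release from the main download page. Unzip the file, navigate to the luke directory, and run the relevant script (luke.bat or luke.sh).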

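(The only version of the LIMO tool I could find is this one on SourceForge. Given that it dates from 2007, it is almost certainly no longer compatible with the latest Lucene index files. Maybe there is a more recent version somewhere.)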

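If you want an overview of the files in a typical Lucene index, you can start here.

Many specific questions can be answered by reading the API documentation for the relevant packages and classes.

Personally, I also find the Solr and ElasticSearch documentation very useful for explaining specific concepts, which often relate directly to Lucene.

Beyond that, I do not worry much about how Lucene manages its internal index data structures. Instead, I focus on the different types of analyzers and queries that can be used to access that data.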

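Update: The SimpleTextCodec

It is now a few months later, but here is another way to explore Lucene index data: SimpleTextCodec. The standard Lucene codecs (which determine how data is written to and read from index files) use binary formats and are therefore not human-readable. You cannot simply open an index file and see what is in it.

However, if you change the codec to SimpleTextCodec, Lucene creates plain-text index files in which you can see the structure more clearly.

This codec is purely for informational and educational purposes and should not be used in production.

To use the codec, you first need to include the relevant dependency, for example: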

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-codecs</artifactId>
    <version>8.7.0</version>
</dependency>

Now you can use this new codec as follows:

iwc.setCodec(new SimpleTextCodec());

So, for example:

final String indexPath = "/path/to/index_dir";
final String docsPath = "/path/to/inputs_dir";
final Path docDir = Paths.get(docsPath);
Directory dir = FSDirectory.open(Paths.get(indexPath));
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
System.out.println(iwc.getCodec().getName());
try ( IndexWriter writer = new IndexWriter(dir, iwc)) {
    // read documents, and write index data:
    indexDocs(writer, docDir);
}

You can now freely inspect the generated index files in a text editor (for example, Notepad++).
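If you would rather get a quick overview from code than open each file by hand, something like the following may help. It is only a sketch that uses the JDK and no Lucene classes: the class name IndexFilePreview and its two helper methods are made up for illustration, and it assumes the directory holds the plain-text files written with SimpleTextCodec. Any file that is not plain text will simply print as unreadable characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of Lucene): lists the files in an index
// directory and prints the first few lines of each one.
public class IndexFilePreview {

    /** Returns the regular files directly inside indexDir, sorted by path. */
    public static List<Path> listIndexFiles(Path indexDir) throws IOException {
        try (Stream<Path> entries = Files.list(indexDir)) {
            return entries.filter(Files::isRegularFile)
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    /** Returns at most maxLines lines from the start of the file. */
    public static List<String> head(Path file, int maxLines) throws IOException {
        List<String> result = new ArrayList<>();
        // InputStreamReader substitutes a replacement character for bytes that
        // are not valid UTF-8, so a binary file does not abort the preview.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            while (result.size() < maxLines && (line = reader.readLine()) != null) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // "/path/to/index_dir" is the same placeholder used in the indexing code above.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "/path/to/index_dir");
        for (Path file : listIndexFiles(indexDir)) {
            System.out.println("== " + file.getFileName() + " (" + Files.size(file) + " bytes)");
            for (String line : head(file, 20)) {
                System.out.println("   " + line);
            }
        }
    }
}
```

Run it with the index directory as the only argument. The 20-line limit is arbitrary; raise it if you want to see more of each file.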

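In my case, the index data produced several files, but the one of interest here is my *.scf file: a "compound" file containing the various "virtual file" sections in which the human-readable index data is stored.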
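When reading through those sections, it may help to keep in mind the inverted index mentioned in the question: a mapping from each term to the documents that contain it. The toy class below is only a mental model written with plain JDK collections. It is not how Lucene implements or stores its index, and ToyInvertedIndex, its methods, and the naive tokenizer are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: NOT Lucene's data structure or file format.
public class ToyInvertedIndex {

    // term -> ids of the documents containing it, in ascending order
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Lowercases and splits the text naively, then records each term once for a new doc id. */
    public int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            List<Integer> docs = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Doc ids only grow, so checking the last entry is enough to avoid duplicates.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of the documents containing the term, or an empty list. */
    public List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    /** Read-only view of the whole term dictionary with its postings. */
    public Map<String, List<Integer>> postings() {
        return Collections.unmodifiableMap(postings);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Lucene builds an index");
        index.addDocument("Luke opens a Lucene index");
        // Print every term followed by the documents it appears in.
        index.postings().forEach((term, docs) -> System.out.println(term + " -> " + docs));
    }
}
```

A real index records far more than this, so treat the sketch as orientation for reading the plain-text files rather than as a description of them.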

score 1 · Accepted Answer

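If the index is large (for example, hundreds of GB), Luke sometimes fails to open it. There is a command-line alternative to Luke called I-Rex. It was developed for information retrieval research. Here is the link: https://github.com/souravsaha/I-REX/tree/shell-lucene8

Feel free to add to or edit the code.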
