3

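I am working on a project that is essentially a knowledge-based question answering system. My system takes a user query, downloads the relevant documents from Wikipedia, strips all HTML tags, and extracts the plain text. It then tokenizes the documents into sentences and builds a term-document (TD) matrix, with the query passed in as one more sentence. The TD matrix is fed to the pLSA (probabilistic latent semantic analysis) algorithm. Finally, the cosine similarity between each document (sentence) vector and the query vector is computed, and the sentences most similar to the query are displayed as the answer. (Stemming is also applied while the TD matrix is built.)

The problem is that results are displayed, but they are not the most relevant ones. Where am I going wrong? Is the strategy I am following correct, or is there another algorithm that might help? Below are some questions and the answers my system returned for them: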

What is photosynthesis?
ANSWER  1 :   The stroma contains stacks (grana) of thylakoids, which are the site of photosynthesis 

ANSWER  2 :   Factors leaf is the primary site of photosynthesis in plants 

ANSWER  3 :   Samuel Ruben and Martin Kamen used radioactive isotopes to determine that the oxygen liberated in photosynthesis came from the water 

ANSWER  4 :   In plants, algae and cyanobacteria, photosynthesis releases oxygen 

Another question:

What is Artificial Intelligence?
ANSWER  1 :   the problem of creating 'artificial intelligence' will substantially be solved" 

ANSWER  2 :   37 The leading-edge definition of artificial intelligence research is changing over time 

ANSWER  3 :   Stories of these creatures and their fates discuss many of the same hopes, fears and ethical concerns that are presented by artificial intelligence 

ANSWER  4 :   History of artificial intelligence and Timeline of artificial intelligence Thinking machines and artificial beings appear in Greek myths , such as Talos of Crete , the bronze robot of Hephaestus , and Pygmalion's Galatea 13 Human likenesses believed to have intelligence were built in every major civilization 

Another question:

Who is a hacker?

ANSWER  1 :   19 Hackers (short stories) Helba from the  

ANSWER  2 :   16 Rafael Núñez aka RaFa was a notorious most wanted hacker by the FBI since 2001 

ANSWER  3 :   Often, this type of 'white hat' hacker is called an ethical hacker 
ANSWER  4 :   Hackers also commonly use port scanners  

Another run:

What is biology?
ANSWER  1 :   Molecular biology is the study of biology at a molecular level 

ANSWER  2 :   molecular biology studies the complex interactions of systems of biological molecules 

ANSWER  3 :   The similarities and differences between cell types are particularly relevant to molecular biology 

ANSWER  4 :   Contents History Foundations of modern biology 2 
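
For reference, the ranking step described in the question can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: it builds raw term-count vectors and ranks sentences by cosine similarity, and it deliberately leaves out the stemming and pLSA steps. The function names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (no stemming here)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sentences(query, sentences):
    """Return (score, sentence) pairs, most similar to the query first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(s))), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    candidates = [
        "Hackers also commonly use port scanners",
        "In plants, algae and cyanobacteria, photosynthesis releases oxygen",
    ]
    for score, sentence in rank_sentences("What is photosynthesis?", candidates):
        print(f"{score:.3f}  {sentence}")
```

With plain term overlap like this, any sentence that mentions the query term scores above zero whether or not it defines the term, which may partly explain answers such as the ones above.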
4

2 Answers

2

This is a well-studied problem called Question Answering (QA). I have provided a summary of QA in another answer. In particular, all of your examples would fall under the category of "definition questions", according to TREC. I suggest perusing some of the papers returned by a search for "TREC definition questions" on Google or Google Scholar for ideas.
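
As a rough illustration of what treating these as definition questions could mean in practice, the sketch below checks whether a candidate sentence has a definitional shape such as "<term> is ...". This is an assumed, simplified heuristic for illustration only, not a method taken from any particular TREC paper; the function names and the regular expressions are made up for this example.

```python
import re


def extract_target(question):
    """Pull the target term out of a 'What is X?' or 'Who is X?' question."""
    match = re.match(
        r"\s*(?:what|who)\s+(?:is|are)\s+(?:an?\s+|the\s+)?(.+?)\s*\??\s*$",
        question,
        re.IGNORECASE,
    )
    return match.group(1) if match else None


def looks_like_definition(target, sentence):
    """True if the sentence opens with '<target> is/are', allowing an article
    before the target and a plural 's' after it."""
    pattern = r"\s*(?:an?\s+|the\s+)?" + re.escape(target) + r"s?\s+(?:is|are)\b"
    return re.match(pattern, sentence, re.IGNORECASE) is not None


if __name__ == "__main__":
    target = extract_target("What is photosynthesis?")
    # The first sentence is a made-up example; the second is from the question.
    for sentence in [
        "Photosynthesis is a process used by plants",
        "In plants, algae and cyanobacteria, photosynthesis releases oxygen",
    ]:
        print(looks_like_definition(target, sentence), "-", sentence)
```

A check like this could be used to boost or filter the sentences that the similarity ranking returns, rather than to replace the ranking.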

answered 2012-03-23T14:51:23.433
1

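I think it will be hard to improve your system if you stay with a purely statistical approach. From a statistical NLP point of view, you are doing the right things. What you can do now is fine-tune some parameters. To do that, you have to build a training corpus by telling the system which answer is the correct one, and then see which values the parameters must take to give you that answer.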
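
The tuning loop described above could look roughly like this. It is a minimal sketch under stated assumptions: the scorer, its `length_penalty` parameter, and the two-item labeled corpus are all hypothetical and invented for illustration. A real system would tune its own parameters (for example the number of pLSA topics) against a much larger labeled set.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score(query, sentence, length_penalty):
    """Hypothetical scorer: query-term overlap, discounted by sentence length."""
    query_terms = set(tokenize(query))
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    overlap = sum(1 for token in tokens if token in query_terms)
    return overlap / (len(tokens) ** length_penalty)


def top1_accuracy(labeled, length_penalty):
    """Fraction of queries whose best-scored candidate is the labeled answer."""
    hits = 0
    for query, candidates, correct_index in labeled:
        scores = [score(query, c, length_penalty) for c in candidates]
        best = max(range(len(candidates)), key=lambda i: scores[i])
        hits += int(best == correct_index)
    return hits / len(labeled)


def grid_search(labeled, grid):
    """Return the parameter value from the grid with the highest accuracy."""
    return max(grid, key=lambda value: top1_accuracy(labeled, value))


# Made-up training corpus: (query, candidate sentences, index of correct answer).
LABELED = [
    ("What is biology?",
     ["Molecular biology is the study of biology at a molecular level",
      "Biology is the science of life"],
     1),
    ("What is photosynthesis?",
     ["Photosynthesis is the process plants use to convert light into chemical energy",
      "In plants, algae and cyanobacteria, photosynthesis releases oxygen"],
     0),
]

if __name__ == "__main__":
    best = grid_search(LABELED, [0.0, 0.5, 1.0])
    print("best length_penalty:", best)
```

The same loop works for any parameter the system exposes: score every labeled query under each candidate setting and keep the setting that ranks the labeled answer first most often.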

That said, I don't think fine-tuning the parameters will improve your accuracy by more than 20% to 30%.

If you want to go further, you will need a more semantic approach that represents knowledge symbolically. See, for example, http://www.jfsowa.com/

answered 2012-03-23T14:03:04.073