
c# - Lucene.Net: searching file names/paths


This is my first attempt at using Lucene.Net.

I have an index of 500 documents (HTML and PDF) with several fields such as url, content, and title.

Searching on content and/or title works fine, but when I search for a URL I get no results at all.

The index does contain URLs like "/tlfdi/epapers/datenschutz2016/files/assets/common/downloads/page0004.pdf", yet a search for "page0004.pdf" finds nothing, and a wildcard search with "*" does not work either.

Both indexing and searching use WhitespaceAnalyzer. With StandardAnalyzer I also get zero results when searching for "/kontakt/index.aspx".
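A quick way to see why this happens is to print the tokens the analyzer actually produces for a URL (a minimal sketch, assuming Lucene.Net 3.0.3; the field name "url" and the sample path are only illustrative):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

class TokenDump
{
    static void Main()
    {
        var analyzer = new WhitespaceAnalyzer();
        TokenStream ts = analyzer.TokenStream("url", new StringReader("/kontakt/index.aspx"));
        ITermAttribute term = ts.AddAttribute<ITermAttribute>();
        while (ts.IncrementToken())
            Console.WriteLine(term.Term);
        // WhitespaceAnalyzer emits the whole path as one single term,
        // which is why a search for "page0004.pdf" alone finds nothing.
    }
}
```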

The search code:

    WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
    MultiFieldQueryParser parser = new MultiFieldQueryParser(
        Version.LUCENE_30,
        new[] { "url", "title", "description", "content", "keywords" },
        analyzer);
    Query query = parseQuery(searchQuery, parser);
    ScoreDoc[] hits = searcher.Search(query, null, hits_limit, Sort.RELEVANCE).ScoreDocs;

Can anyone help?

Best Answer

The standard analyzers won't do what you want here; instead, you must write a custom tokenizer and analyzer.

It's easy! We just need to whip up a tokenizer and an analyzer.

UrlTokenizer is responsible for generating the tokens.

    // UrlTokenizer delimits tokens by whitespace, '.' and '/'.
    using AttributeSource = Lucene.Net.Util.AttributeSource;

    public class UrlTokenizer : CharTokenizer
    {
        public UrlTokenizer(System.IO.TextReader @in)
            : base(@in)
        {
        }

        public UrlTokenizer(AttributeSource source, System.IO.TextReader @in)
            : base(source, @in)
        {
        }

        public UrlTokenizer(AttributeFactory factory, System.IO.TextReader @in)
            : base(factory, @in)
        {
        }

        //
        // This is where all the magic happens for our UrlTokenizer!
        // Whitespace, a forward slash, or a period is a token boundary.
        //
        protected override bool IsTokenChar(char c)
        {
            return !char.IsWhiteSpace(c) && c != '/' && c != '.';
        }
    }
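To sanity-check the tokenizer, feed it a sample path and print each token (a small sketch; the path is the one from the question):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Tokenattributes;

class UrlTokenizerDemo
{
    static void Main()
    {
        var tokenizer = new UrlTokenizer(new StringReader("/kontakt/index.aspx"));
        ITermAttribute term = tokenizer.AddAttribute<ITermAttribute>();
        while (tokenizer.IncrementToken())
            Console.WriteLine(term.Term); // kontakt, index, aspx
    }
}
```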

UrlAnalyzer consumes the input stream, applying the UrlTokenizer and a LowerCaseFilter so that searches are case-insensitive.

    // Custom Analyzer combining UrlTokenizer and LowerCaseFilter.
    public sealed class UrlAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
        {
            //
            // This is where all the magic happens for UrlAnalyzer!
            // UrlTokenizer token text is filtered to lowercase.
            //
            return new LowerCaseFilter(new UrlTokenizer(reader));
        }

        public override TokenStream ReusableTokenStream(System.String fieldName, System.IO.TextReader reader)
        {
            Tokenizer tokenizer = (Tokenizer)PreviousTokenStream;
            if (tokenizer == null)
            {
                tokenizer = new UrlTokenizer(reader);
                PreviousTokenStream = tokenizer;
            }
            else
            {
                tokenizer.Reset(reader);
            }
            // Re-wrap in LowerCaseFilter so reused streams lowercase too,
            // matching the behavior of TokenStream above.
            return new LowerCaseFilter(tokenizer);
        }
    }

Here is code demonstrating the UrlAnalyzer. For clarity I substituted a QueryParser for the MultiFieldQueryParser.

    //
    // Demonstrate UrlAnalyzer using an in-memory index.
    //
    public static void TestUrlAnalyzer()
    {
        string url = @"/tlfdi/epapers/datenschutz2016/files/assets/common/downloads/page0004.pdf";
        UrlAnalyzer analyzer = new UrlAnalyzer();
        Directory directory = new RAMDirectory();
        QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "url", analyzer);
        IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(2048));
        //
        // Index a document. We're only interested in the "url" field.
        //
        Document doc = new Document();
        Field field = new Field("url", url, Field.Store.NO, Field.Index.ANALYZED);
        doc.Add(field);
        writer.AddDocument(doc);
        writer.Commit();
        //
        // Search the index for any documents having 'page0004.pdf' in the url field.
        //
        string searchText = "url:page0004.pdf";
        IndexReader reader = IndexReader.Open(directory, true);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = parser.Parse(searchText);
        ScoreDoc[] hits = searcher.Search(query, null, 10, Sort.RELEVANCE).ScoreDocs;
        if (hits.Length == 0)
            throw new System.Exception("RamblinRose is fail!");
        //
        // Search the index for any documents having the full URL we indexed.
        //
        searchText = "url:\"" + url + "\"";
        query = parser.Parse(searchText);
        hits = searcher.Search(query, null, 10, Sort.RELEVANCE).ScoreDocs;
        if (hits.Length == 0)
            throw new System.Exception("RamblinRose is fail!");
    }

Lucene.Net is good stuff. I hope this code improves your understanding of Lucene analysis.

Good luck!

P.S. Beware of wildcard searches as a workaround: they are a real killer on large indexes.
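For context on that warning: QueryParser rejects a leading wildcard unless you opt in, and a leading wildcard forces Lucene to enumerate a large part of the term dictionary. The contrast below is a sketch assuming Lucene.Net 3.0.3 and the UrlAnalyzer defined above:

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// With WhitespaceAnalyzer the whole path is a single term, so only a
// leading-wildcard query can match a file-name suffix:
QueryParser wsParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "url", new WhitespaceAnalyzer());
wsParser.AllowLeadingWildcard = true; // off by default for a reason
Query slow = wsParser.Parse("url:*page0004.pdf"); // scans many terms on a big index

// With UrlAnalyzer the file name is tokenized on its own, so a cheap
// term/phrase query does the job instead:
QueryParser urlParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "url", new UrlAnalyzer());
Query fast = urlParser.Parse("url:page0004.pdf");
```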

Regarding "c# - Lucene.Net: searching file names/paths", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/41057413/
