go - Efficient way to implement full-text search in Golang

Reposted by: IT王子 | Updated: 2023-10-29 01:46:41

I am trying to implement a simple full-text search in Go, but all of my implementations are too slow to pass the time threshold.

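The task is as follows:

A document is a non-empty string of lowercase words separated by spaces
Each document has an implicit identifier equal to its index in the input array
New() builds the index
Search() accepts a query, which is also a string of lowercase words separated by spaces, and returns a sorted array of the unique identifiers of the documents that contain all words from the query, regardless of their order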

Example:

index := New([]string{
"this is the house that jack built",  //: 0
"this is the rat that ate the malt",  //: 1
})

index.Search("")  // -> []
index.Search("in the house that jack built")  // -> []
index.Search("malt rat")  // -> [1]
index.Search("is this the")  // -> [0, 1]

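I have already tried to implement:

A binary search tree, both per document and for all documents
A trie (prefix tree), both per document and for all documents
An inverted index search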

Binary search tree (one tree for all documents):

type Tree struct {
    m           map[int]bool
    word        string
    left        *Tree
    right       *Tree
}

type Index struct {
    tree *Tree
}

Binary search tree (one tree per document):

type Tree struct {
    word  string
    left  *Tree
    right *Tree
}

type Index struct {
    tree  *Tree
    index int
    next  *Index
}

Trie (one trie for all documents):

type Trie struct {
    m        map[uint8]*Trie
    end_node map[int]bool
}

type Index struct {
    trie *Trie
}

Trie (one trie per document):

type Trie struct {
    m        map[uint8]*Trie
    end_node bool
}

type Index struct {
    trie  *Trie
    index int
    next  *Index
}

Inverted index:

type Index struct {
    m map[string]map[int]bool
}

New and Search for the inverted index:

// New creates a fulltext search index for the given documents
func New(docs []string) *Index {
    m := make(map[string]map[int]bool)

    for i := 0; i < len(docs); i++ {
        words := strings.Fields(docs[i])
        for j := 0; j < len(words); j++ {
            if m[words[j]] == nil {
                m[words[j]] = make(map[int]bool)
            }
            m[words[j]][i+1] = true
        }
    }
    return &(Index{m})
}

// Search returns a slice of unique ids of documents that contain all words from the query.
func (idx *Index) Search(query string) []int {
    if query == "" {
        return []int{}
    }
    ret := make(map[int]bool)
    arr := strings.Fields(query)
    fl := 0
    for i := range arr {
        if idx.m[arr[i]] == nil {
            return []int{}
        }
        if fl == 0 {
            for value := range idx.m[arr[i]] {
                ret[value] = true
            }
            fl = 1
        } else {
            tmp := make(map[int]bool)
            for value := range ret {
                if idx.m[arr[i]][value] == true {
                    tmp[value] = true
                }
            }
            ret = tmp
        }
    }
    ret_arr := []int{}
    for value := range ret {
        ret_arr = append(ret_arr, value-1)
    }
    sort.Ints(ret_arr)
    return ret_arr
}
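
For comparison, here is one possible variant of the inverted index above. It is only a sketch and not part of the original code: it assumes that each word's document ids are kept in an ascending slice instead of a map[int]bool, so that two lists can be intersected with a linear merge and the result needs no final sort. The names PostingIndex, NewPostingIndex and intersect are hypothetical, chosen for illustration.

```go
package main

import "strings"

// PostingIndex (hypothetical name) maps each word to the ascending list of
// ids of the documents that contain it.
type PostingIndex struct {
	postings map[string][]int
}

// NewPostingIndex builds the index. Documents are visited in id order, so
// every posting list comes out sorted without an explicit sort.
func NewPostingIndex(docs []string) *PostingIndex {
	postings := make(map[string][]int)
	for id, doc := range docs {
		for _, w := range strings.Fields(doc) {
			list := postings[w]
			// Skip the append when the word already occurred in this document.
			if n := len(list); n == 0 || list[n-1] != id {
				postings[w] = append(list, id)
			}
		}
	}
	return &PostingIndex{postings: postings}
}

// intersect merges two ascending slices and keeps the ids present in both.
func intersect(a, b []int) []int {
	out := []int{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// Search returns the sorted ids of the documents containing every query word.
func (idx *PostingIndex) Search(query string) []int {
	words := strings.Fields(query)
	if len(words) == 0 {
		return []int{}
	}
	// A missing word yields a nil list, which intersects to an empty result.
	result := idx.postings[words[0]]
	for _, w := range words[1:] {
		if len(result) == 0 {
			break
		}
		result = intersect(result, idx.postings[w])
	}
	// Copy so that callers cannot modify the index's internal slices.
	return append([]int{}, result...)
}
```

To try it, add a main function that calls NewPostingIndex with the two example documents and then Search. Whether this is fast enough for the threshold depends on the test data, which the question does not show.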

Am I doing something wrong, or is there a better search algorithm for Go?

Any help is appreciated.

Best answer

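I can't really help you with the language-specific part, but if it is of any help, here is pseudocode that describes a trie implementation, along with a function that solves your problem in a reasonably efficient way.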

struct TrieNode {
    map[char] children      // maps a character to the corresponding child node
    set[int] contains       // ids of all documents that contain the word ending at this node
}

// classic trie lookup, except that it returns a set of document ids instead of a simple boolean
function get_doc_ids(TrieNode node, string w, int depth) {
    if (depth == length(w)) {
        return node.contains
    } else {
        if (node.hasChild(w[depth])) {
            return get_doc_ids(node.getChild(w[depth]), w, depth+1)
        } else {
            return empty_set()
        }
    }
}

// the query-answering function, as straightforward as it can be
function answer_query(TrieNode root, list_of_words L) {
    n = length(L)
    if (n == 0) {
        return empty_set()  // an empty query matches no documents
    }
    result = get_doc_ids(root, L[0], 0)
    for i from 1 to n-1 do {
        result = intersection(result, get_doc_ids(root, L[i], 0))  // set intersection
        if (result.is_empty()) {
            break  // no document contains them all, no need to check further
        }
    }
    return result
}

A similar question about "go - Efficient way to implement full-text search in Golang" can be found on Stack Overflow: https://stackoverflow.com/questions/55552357/
