gpt4 book ai didi

python - 带有真实 "Full Text Search"和拼写错误的 SQLite(FTS+spellfix 一起)

转载 作者:IT王子 更新时间:2023-10-29 06:28:36 26 4
gpt4 key购买 nike

假设我们有 100 万行这样的行:

import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')

背景:

我知道如何用 Sqlite 做到这一点:

  • 使用 spellfix 查找具有单词查询的行,最多有几个拼写错误模块和 Levenshtein 距离(我已经发布了一个 detailed answer here 关于如何编译它,如何使用它,...):

    db.enable_load_extension(True)
    db.load_extension('./spellfix')
    c.execute('SELECT * FROM mytable WHERE editdist3(description, "Riehmand") < 300'); print c.fetchall()

    #Query: 'Riehmand'
    #Answer: [(1, u'Riemann')]

    对于 1M 行,这会非常慢!作为detailed here , postgresql 可能会使用 trigrams 对此进行优化。 Sqlite 提供的一种快速解决方案是使用 VIRTUAL TABLE USING spellfix:

    c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
    c.execute('INSERT INTO mytable3(word) VALUES ("Riemann")')
    c.execute('SELECT * FROM mytable3 WHERE word MATCH "Riehmand"'); print c.fetchall()

    #Query: 'Riehmand'
    #Answer: [(u'Riemann', 1, 76, 0, 107, 7)], working!
  • 使用 FTS(“全文搜索”)查找包含匹配一个或多个单词的查询的表达式:

    c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
    c.execute('INSERT INTO mytable2 VALUES (2, "All the Carmichael numbers")')
    c.execute('SELECT * FROM mytable2 WHERE description MATCH "NUMBERS carmichael"'); print c.fetchall()

    #Query: 'NUMBERS carmichael'
    #Answer: [(2, u'All the Carmichael numbers')]

    它不区分大小写,您甚至可以使用顺序错误的两个单词等进行查询:FTS 确实非常强大。但缺点是每个查询关键字都必须正确拼写,即 FTS 本身不允许拼写错误。

问题:

如何使用 Sqlite 进行全文搜索 (FTS) 并允许拼写错误即“FTS + spellfix”一起使用

示例:

  • DB 中的行:“所有 Carmichael 数”
  • 查询:"NUMMBER carmickaeel" 应该匹配它!

如何用 Sqlite 做到这一点?

this page 以来,使用 Sqlite 可能是可能的状态:

Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.

相关问题:String similarity with Python + Sqlite (Levenshtein distance / edit distance)

最佳答案

spellfix1 文档实际上告诉您如何执行此操作。来自Overview section :

If you intend to use this virtual table in cooperation with an FTS4 table (for spelling correction of search terms) then you might extract the vocabulary using an fts4aux table:

INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';

SELECT term from search_aux WHERE col='*' 语句 extracts all the indexed tokens .

将此与您的示例联系起来,其中 mytable2 是您的 fts4 虚拟表,您可以创建一个 fts4aux 表并将这些标记插入到您的 mytable3 spellfix1 表:

CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';

您可能希望进一步限定该查询以跳过任何已插入到 spellfix1 中的术语,否则您最终会得到双重条目:

INSERT INTO mytable3(word)
SELECT term FROM mytable2_terms
WHERE col='*' AND
term not in (SELECT word from mytable3_vocab);

现在您可以使用 mytable3 将拼写错误的单词映射到更正的标记,然后在针对 mytable2MATCH 查询中使用这些更正的标记。

根据您的需要,这可能意味着您需要进行自己的 token 处理和查询构建;没有公开的 fts4 查询语法解析器。因此,您的双标记搜索字符串需要拆分,每个标记通过 spellfix1 表运行以映射到现有标记,然后将这些标记提供给 fts4 查询。

忽略 SQL 语法来处理这个问题,使用 Python 进行拆分很容易:

def spellcheck_terms(conn, terms):
cursor = conn.cursor()
base_spellfix = """
SELECT :term{0} as term, word FROM spellfix1data
WHERE word MATCH :term{0} and top=1
"""
terms = terms.split()
params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
query = " UNION ".join([
base_spellfix.format(i + 1) for i in range(len(params))])
cursor.execute(query, params)
correction_map = dict(cursor)
return " ".join([correction_map.get(t, t) for t in terms])

def spellchecked_search(conn, terms):
corrected_terms = spellcheck_terms(conn, terms)
cursor = conn.cursor()
fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
cursor.execute(fts_query, (corrected_terms,))
return cursor.fetchall()

然后返回 [('All the Carmichael numbers',)] for spellchecked_search(db, "NUMMBER carmickaeel")

在 Python 中保留拼写检查处理可以让您根据需要支持更复杂的 FTS 查询;你可能需要 reimplement the expression parser这样做,但至少 Python 为您提供了执行此操作的工具。

一个完整的例子,将上述方法打包到一个类中,它只是将术语提取为字母数字字符序列(根据我对表达式语法规范的阅读,这就足够了):

import re
import sqlite3
import sys

class FTS4SpellfixSearch(object):
def __init__(self, conn, spellfix1_path):
self.conn = conn
self.conn.enable_load_extension(True)
self.conn.load_extension(spellfix1_path)

def create_schema(self):
self.conn.executescript(
"""
CREATE VIRTUAL TABLE IF NOT EXISTS fts4data
USING fts4(description text);
CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms
USING fts4aux(fts4data);
CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data
USING spellfix1;
"""
)

def index_text(self, *text):
cursor = self.conn.cursor()
with self.conn:
params = ((t,) for t in text)
cursor.executemany("INSERT INTO fts4data VALUES (?)", params)
cursor.execute(
"""
INSERT INTO spellfix1data(word)
SELECT term FROM fts4data_terms
WHERE col='*' AND
term not in (SELECT word from spellfix1data_vocab)
"""
)

# fts3 / 4 search expression tokenizer
# no attempt is made to validate the expression, only
# to identify valid search terms and extract them.
# the fts3/4 tokenizer considers any alphanumeric ASCII character
# and character in the range U+0080 and over to be terms.
if sys.maxunicode == 0xFFFF:
# UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over
_fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+")
else:
# UCS4
_fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+")

def _terms_from_query(self, search_query):
"""Extract search terms from a fts3/4 query

Returns a list of terms and a template such that
template.format(*terms) reconstructs the original query.

terms using partial* syntax are ignored, as you can't distinguish
between a misspelled prefix search that happens to match existing
tokens and a valid spelling that happens to have 'near' tokens in
the spellfix1 database that would not otherwise be matched by fts4

"""
template, terms, lastpos = [], [], 0
for match in self._fts4_expr_terms.finditer(search_query):
token, (start, end) = match.group(), match.span()
# skip columnname: and partial* terms by checking next character
ismeta = search_query[end:end + 1] in {":", "*"}
# skip digits if preceded by "NEAR/"
ismeta = ismeta or (
token.isdigit() and template and template[-1] == "NEAR"
and "/" in search_query[lastpos:start])
if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta:
# full search term, not a keyword, column name or partial*
terms.append(token)
token = "{}"
template += search_query[lastpos:start], token
lastpos = end
template.append(search_query[lastpos:])
return terms, "".join(template)

def spellcheck_terms(self, search_query):
cursor = self.conn.cursor()
base_spellfix = """
SELECT :term{0} as term, word FROM spellfix1data
WHERE word MATCH :term{0} and top=1
"""
terms, template = self._terms_from_query(search_query)
params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
query = " UNION ".join(
[base_spellfix.format(i + 1) for i in range(len(params))]
)
cursor.execute(query, params)
correction_map = dict(cursor)
return template.format(*(correction_map.get(t, t) for t in terms))

def search(self, search_query):
corrected_query = self.spellcheck_terms(search_query)
cursor = self.conn.cursor()
fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?"
cursor.execute(fts_query, (corrected_query,))
return {
"terms": search_query,
"corrected": corrected_query,
"results": cursor.fetchall(),
}

以及使用该类的交互式演示:

>>> db = sqlite3.connect(":memory:")
>>> fts = FTS4SpellfixSearch(db, './spellfix')
>>> fts.create_schema()
>>> fts.index_text("All the Carmichael numbers") # your example
>>> from pprint import pprint
>>> pprint(fts.search('NUMMBER carmickaeel'))
{'corrected': 'numbers carmichael',
'results': [('All the Carmichael numbers',)],
'terms': 'NUMMBER carmickaeel'}
>>> fts.index_text(
... "They are great",
... "Here some other numbers",
... )
>>> pprint(fts.search('here some')) # edgecase, multiple spellfix matches
{'corrected': 'here some',
'results': [('Here some other numbers',)],
'terms': 'here some'}
>>> pprint(fts.search('NUMMBER NOT carmickaeel')) # using fts4 query syntax
{'corrected': 'numbers NOT carmichael',
'results': [('Here some other numbers',)],
'terms': 'NUMMBER NOT carmickaeel'}

关于python - 带有真实 "Full Text Search"和拼写错误的 SQLite(FTS+spellfix 一起),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/52803014/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com