python - SQLite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)

Reposted by: IT王子. Updated: 2023-10-29 06:28:36

Let's say we have 1 million rows like this:

import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'Riemann')")
c.execute("INSERT INTO mytable VALUES (2, 'All the Carmichael numbers')")

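Background:

I know how to do this with SQLite:

Find a row with a single-word query that contains up to a few spelling mistakes, using the spellfix module and Levenshtein distance (I have posted a detailed answer here about how to compile it, how to use it, ...):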

db.enable_load_extension(True)
db.load_extension('./spellfix')
c.execute("SELECT * FROM mytable WHERE editdist3(description, 'Riehmand') < 300")
print(c.fetchall())

# Query: 'Riehmand'
# Answer: [(1, 'Riemann')]
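
To make the notion of distance concrete, here is a minimal pure-Python sketch of the classic Levenshtein distance with unit costs. It is only an illustration: `editdist3` applies its own weighted costs, so its values (such as the 300 threshold above) are on a different scale. The name `levenshtein` is a hypothetical helper, not part of SQLite or spellfix.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between strings a and b (illustrative helper)."""
    # previous[j] is the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("Riehmand", "Riemann"))  # 2: drop the "h", change "d" to "n"
```

Calling a function like this once per row is a full table scan, which is the cost the next paragraph refers to.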

With 1M rows, this would be very slow! As detailed here, PostgreSQL might be able to optimize this with trigrams. A fast solution available in SQLite is to use a VIRTUAL TABLE USING spellfix:

c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
c.execute("INSERT INTO mytable3(word) VALUES ('Riemann')")
c.execute("SELECT * FROM mytable3 WHERE word MATCH 'Riehmand'")
print(c.fetchall())

# Query: 'Riehmand'
# Answer: [('Riemann', 1, 76, 0, 107, 7)], working!
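
The six values in that answer are, as far as I read the spellfix1 documentation, the visible columns of the virtual table: word, rank, distance, langid, score and matchlen. A small sketch that labels them (the name `SpellfixRow` is a hypothetical helper introduced here for illustration):

```python
from collections import namedtuple

# Visible columns of a spellfix1 virtual table, in declaration order
# (taken from the spellfix1 documentation; treat as an assumption).
SpellfixRow = namedtuple(
    "SpellfixRow", ["word", "rank", "distance", "langid", "score", "matchlen"]
)

# The row returned by the query above, copied verbatim.
row = SpellfixRow(*("Riemann", 1, 76, 0, 107, 7))
print(row.word, row.distance, row.score)  # Riemann 76 107
```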

Find an expression with a query matching one or multiple words, using FTS ("Full Text Search"):

c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
c.execute("INSERT INTO mytable2 VALUES (2, 'All the Carmichael numbers')")
c.execute("SELECT * FROM mytable2 WHERE description MATCH 'NUMBERS carmichael'")
print(c.fetchall())

# Query: 'NUMBERS carmichael'
# Answer: [(2, 'All the Carmichael numbers')]

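It is case-insensitive, and you can even use a query with two words in the wrong order, etc.: FTS is quite powerful indeed. The drawback is that every query keyword must be spelled correctly, i.e. FTS alone does not tolerate spelling mistakes.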
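
Both halves of that claim can be checked with the standard library alone, assuming the SQLite build behind Python's `sqlite3` module has FTS4 enabled. The table name `docs` and the helper `search` are hypothetical names used only in this sketch:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts4(description)")
db.execute("INSERT INTO docs VALUES ('All the Carmichael numbers')")

def search(query):
    # Bind the query as a parameter instead of pasting it into the SQL.
    sql = "SELECT description FROM docs WHERE description MATCH ?"
    return db.execute(sql, (query,)).fetchall()

good = search("NUMBERS carmichael")   # wrong case, wrong order: still matches
bad = search("NUMMBER carmickaeel")   # misspelled: no rows
print(good, bad)
```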

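Question:

How can I do a Full Text Search (FTS) with SQLite and also allow spelling mistakes? In other words, "FTS + spellfix" together.

Example:

Row in the DB: "All the Carmichael numbers"
Query: "NUMMBER carmickaeel" should match it!

How can this be done with SQLite?

It is probably possible with SQLite, since this page states: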

Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.

Best answer

The spellfix1 documentation actually tells you how to do this. From the Overview section:

If you intend to use this virtual table in cooperation with an FTS4 table (for spelling correction of search terms) then you might extract the vocabulary using an fts4aux table:
INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';

The SELECT term from search_aux WHERE col='*' statement extracts all the indexed tokens.

To connect this with your examples, where mytable2 is your fts4 virtual table, you can create an fts4aux table and insert those tokens into your mytable3 spellfix1 table:

CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';
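
To see what the fts4aux table actually exposes, the following sketch lists the indexed tokens for one row. It assumes an SQLite build with FTS4 and uses a hypothetical single-column table `docs` (with `docs_terms` on top of it) rather than `mytable2`, so that only the description text is tokenized:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts4(description)")
db.execute("INSERT INTO docs VALUES ('All the Carmichael numbers')")
# fts4aux gives read access to the full-text index of the named table.
db.execute("CREATE VIRTUAL TABLE docs_terms USING fts4aux(docs)")

# col='*' selects the per-term summary row across all columns.
terms = sorted(
    term for (term,) in
    db.execute("SELECT term FROM docs_terms WHERE col='*'")
)
print(terms)  # ['all', 'carmichael', 'numbers', 'the']
```

The tokens come back lower-cased because the default tokenizer folds ASCII case, which is also why the corrected terms later in this answer are lower-case.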

You probably want to further qualify that query to skip any terms already inserted into spellfix1, otherwise you end up with duplicate entries:

INSERT INTO mytable3(word)
    SELECT term FROM mytable2_terms
    WHERE col='*' AND 
        term not in (SELECT word from mytable3_vocab);

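Now you can use mytable3 to map misspelled words to corrected tokens, then use those corrected tokens in a MATCH query against mytable2.

Depending on your needs, this may mean you have to do your own token handling and query building; there is no exposed fts4 query syntax parser. So your two-token search string would need to be split, each token run through the spellfix1 table to map it to an existing token, and then those tokens fed to the fts4 query.

Ignoring FTS query syntax for now, the splitting is easy enough to do in Python: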

def spellcheck_terms(conn, terms):
    cursor = conn.cursor()
    # one SELECT per term; top=1 keeps only the best spellfix1 match
    base_spellfix = """
        SELECT :term{0} as term, word FROM mytable3
        WHERE word MATCH :term{0} and top=1
    """
    terms = terms.split()
    params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
    query = " UNION ".join([
        base_spellfix.format(i + 1) for i in range(len(params))])
    cursor.execute(query, params)
    # maps each original term to its corrected token
    correction_map = dict(cursor)
    # terms without a match are passed through unchanged
    return " ".join([correction_map.get(t, t) for t in terms])

def spellchecked_search(conn, terms):
    corrected_terms = spellcheck_terms(conn, terms)
    cursor = conn.cursor()
    fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
    cursor.execute(fts_query, (corrected_terms,))
    return cursor.fetchall()

This then returns [(2, 'All the Carmichael numbers')] for spellchecked_search(db, "NUMMBER carmickaeel").

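Keeping the spellcheck handling in Python then lets you support more complex FTS queries as needed; you may have to reimplement the expression parser to do so, but at least Python gives you the tools for that.

A complete example, packaging the above approach in a class, which simply extracts terms as sequences of alphanumeric characters (which, by my reading of the expression syntax specs, suffices):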

import re
import sqlite3
import sys

class FTS4SpellfixSearch(object):
    def __init__(self, conn, spellfix1_path):
        self.conn = conn
        self.conn.enable_load_extension(True)
        self.conn.load_extension(spellfix1_path)

    def create_schema(self):
        self.conn.executescript(
            """
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data
                USING fts4(description text);
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms
                USING fts4aux(fts4data);
            CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data
                USING spellfix1;
            """
        )

    def index_text(self, *text):
        cursor = self.conn.cursor()
        with self.conn:
            params = ((t,) for t in text)
            cursor.executemany("INSERT INTO fts4data VALUES (?)", params)
            cursor.execute(
                """
                INSERT INTO spellfix1data(word)
                SELECT term FROM fts4data_terms
                WHERE col='*' AND
                    term not in (SELECT word from spellfix1data_vocab)
                """
            )

    # fts3 / 4 search expression tokenizer
    # no attempt is made to validate the expression, only
    # to identify valid search terms and extract them.
    # the fts3/4 tokenizer considers any alphanumeric ASCII character
    # and character in the range U+0080 and over to be terms.
    if sys.maxunicode == 0xFFFF:
        # UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+")
    else:
        # UCS4
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+")

    def _terms_from_query(self, search_query):
        """Extract search terms from a fts3/4 query

        Returns a list of terms and a template such that
        template.format(*terms) reconstructs the original query.

        terms using partial* syntax are ignored, as you can't distinguish
        between a misspelled prefix search that happens to match existing
        tokens and a valid spelling that happens to have 'near' tokens in
        the spellfix1 database that would not otherwise be matched by fts4

        """
        template, terms, lastpos = [], [], 0
        for match in self._fts4_expr_terms.finditer(search_query):
            token, (start, end) = match.group(), match.span()
            # skip columnname: and partial* terms by checking next character
            ismeta = search_query[end:end + 1] in {":", "*"}
            # skip digits if preceded by "NEAR/"
            ismeta = ismeta or (
                token.isdigit() and template and template[-1] == "NEAR"
                and "/" in search_query[lastpos:start])
            if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta:
                # full search term, not a keyword, column name or partial*
                terms.append(token)
                token = "{}"
            template += search_query[lastpos:start], token
            lastpos = end
        template.append(search_query[lastpos:])
        return terms, "".join(template)

    def spellcheck_terms(self, search_query):
        cursor = self.conn.cursor()
        base_spellfix = """
            SELECT :term{0} as term, word FROM spellfix1data
            WHERE word MATCH :term{0} and top=1
        """
        terms, template = self._terms_from_query(search_query)
        params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
        query = " UNION ".join(
            [base_spellfix.format(i + 1) for i in range(len(params))]
        )
        cursor.execute(query, params)
        correction_map = dict(cursor)
        return template.format(*(correction_map.get(t, t) for t in terms))

    def search(self, search_query):
        corrected_query = self.spellcheck_terms(search_query)
        cursor = self.conn.cursor()
        fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?"
        cursor.execute(fts_query, (corrected_query,))
        return {
            "terms": search_query,
            "corrected": corrected_query,
            "results": cursor.fetchall(),
        }
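
The term extraction is the least obvious part of the class, so here is a standalone restatement of `_terms_from_query` that can be run without the extension. It assumes Python 3 and a query without literal `{` or `}` characters (they would confuse `str.format`); `split_query`, `TERM_RE` and `KEYWORDS` are hypothetical names used only in this sketch:

```python
import re

# Same character classes as in the class above (Python 3 strings only).
TERM_RE = re.compile("[a-zA-Z0-9\u0080-\U0010FFFF]+")
KEYWORDS = {"AND", "OR", "NOT", "NEAR"}

def split_query(search_query):
    """Return (terms, template); template.format(*terms) rebuilds the query."""
    template, terms, lastpos = [], [], 0
    for match in TERM_RE.finditer(search_query):
        token, (start, end) = match.group(), match.span()
        # column filters (name:) and prefix searches (part*) are left alone
        is_meta = search_query[end:end + 1] in {":", "*"}
        # so is the distance in NEAR/<n>
        if token.isdigit() and template and template[-1] == "NEAR":
            is_meta = is_meta or "/" in search_query[lastpos:start]
        if token not in KEYWORDS and not is_meta:
            terms.append(token)
            token = "{}"
        template += [search_query[lastpos:start], token]
        lastpos = end
    template.append(search_query[lastpos:])
    return terms, "".join(template)

print(split_query("NUMMBER NOT carmickaeel"))
# (['NUMMBER', 'carmickaeel'], '{} NOT {}')
print(split_query("description:numbrs NEAR/2 carm*"))
# (['numbrs'], 'description:{} NEAR/2 carm*')
```

Only the extracted terms are sent through spellfix1; operators, column names, prefix searches and NEAR distances stay in the template untouched.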

And an interactive demo using the class:

>>> db = sqlite3.connect(":memory:")
>>> fts = FTS4SpellfixSearch(db, './spellfix')
>>> fts.create_schema()
>>> fts.index_text("All the Carmichael numbers")  # your example
>>> from pprint import pprint
>>> pprint(fts.search('NUMMBER carmickaeel'))
{'corrected': 'numbers carmichael',
 'results': [('All the Carmichael numbers',)],
 'terms': 'NUMMBER carmickaeel'}
>>> fts.index_text(
...     "They are great",
...     "Here some other numbers",
... )
>>> pprint(fts.search('here some'))  # edgecase, multiple spellfix matches
{'corrected': 'here some',
 'results': [('Here some other numbers',)],
 'terms': 'here some'}
>>> pprint(fts.search('NUMMBER NOT carmickaeel'))  # using fts4 query syntax 
{'corrected': 'numbers NOT carmichael',
 'results': [('Here some other numbers',)],
 'terms': 'NUMMBER NOT carmickaeel'}

This article is based on a similar question we found on Stack Overflow: https://stackoverflow.com/questions/52803014/
