python - How to add additional currency characters in spaCy

Reposted by: 太空宇宙 · Updated: 2023-11-04 09:45:49

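I have documents in which the character \u0080 is used as the euro sign. I want to add this character, and others, to the list of currency symbols so that money entities are picked up by spaCy's NER. What is the best way to handle this?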

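In addition, I have cases where money is written as CAD 5,000, and the NER does not tag it as money. What is the best way to handle this: train the NER, or add CAD as a currency symbol?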

Best answer

1. The '\u0080' problem

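First, how the '\u0080' character is interpreted seems to depend on the platform you are using: it does not print on a Windows 7 machine, but it does on a Linux machine...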
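One plausible explanation for the platform difference (an assumption, not something stated in the answer) is that U+0080 is a C1 control character in Unicode, while the byte 0x80 is the euro sign in the Windows-1252 code page. The sketch below only uses the standard library to show both readings of the same value.

```python
import unicodedata

raw = "\u0080"

# In Unicode, U+0080 is a control character (category "Cc"), not a currency symbol.
print(unicodedata.category(raw))

# Re-reading the same byte value as Windows-1252 yields the euro sign.
# Assumption: the documents were originally encoded as cp1252.
euro = raw.encode("latin-1").decode("cp1252")
print(euro, unicodedata.category(euro))
```

If that assumption holds for your documents, the character is a decoding artifact rather than a real currency symbol, which is why it has to be listed explicitly or converted.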

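For completeness, I will assume that you get your text from HTML documents containing the '&#x80;' escape sequence (which a browser should render as €), the '\u0080' character, and some other arbitrary symbols that we want to recognize as currency.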

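Before passing the text to spaCy, we can call html.unescape, which takes care of translating &#x80; into €, which in turn is recognized as currency by the default configuration.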

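import html

# '&#x80' is a numeric character reference; '\u0080' is the raw character itself
text_html = ("I just found out that CAD 1,000 is about 641.3 &#x80. "
             "Some people call it 641.3 \u0080. "
             "Fantastic! But in the U.K. I'd rather pay 344🎅 or \U0001F33B56.")

# Resolve HTML character references before handing the text to spaCy
text = html.unescape(text_html)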
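To see what html.unescape does and does not change, here is a small standard-library check. The normalize_euro helper is hypothetical and not part of the answer; it sketches one possible alternative, assuming every raw '\u0080' in your data really stands for a euro sign.

```python
import html

sample = "641.3 &#x80. versus 641.3 \u0080."

# The character reference becomes a euro sign; the raw U+0080 is left as it is.
unescaped = html.unescape(sample)

def normalize_euro(text):
    # Hypothetical helper (assumption): treat every raw U+0080 as a euro sign.
    return text.replace("\u0080", "\u20ac")

normalized = normalize_euro(unescaped)

print(repr(unescaped))
print(repr(normalized))
```

Because the raw character survives unescaping, it still has to be handled separately, either by a replacement like this or by listing it as a currency symbol as described next.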

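Second, if there are symbols that are not recognized as currency, such as 🎅 and 🌻, we can change the Defaults of the language we are using so that they are treated as currency.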

This consists of replacing the lex_attr_getters[IS_CURRENCY] function with a custom one that holds a list of the symbols that denote currencies.

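import string

from spacy.lang.en import EnglishDefaults
from spacy.symbols import IS_CURRENCY

def is_currency_custom(text):
    # Strip punctuation so that tokens such as "CAD." still match the list
    table = str.maketrans({key: None for key in string.punctuation})
    stripped = text.translate(table)

    all_currencies = ["\U0001F385", "\U0001F33B", "\u0080", "CAD"]
    if stripped in all_currencies:
        return True
    # Fall back to the original check on the unstripped token; a token made
    # only of punctuation would otherwise be passed on as an empty string
    return is_currency_original(text)

# Keep a reference to the original is_currency function
is_currency_original = EnglishDefaults.lex_attr_getters[IS_CURRENCY]
# Assign a new function for IS_CURRENCY
EnglishDefaults.lex_attr_getters[IS_CURRENCY] = is_currency_custom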
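The wrapping idea can be tried without loading spaCy. In the sketch below, is_currency_original is a hypothetical stand-in, assumed to accept tokens whose characters all belong to Unicode category Sc; the real function comes from spaCy and may behave differently. EXTRA_CURRENCIES and _STRIP_PUNCT are illustrative names as well.

```python
import string
import unicodedata

def is_currency_original(text):
    # Stand-in for the wrapped checker (assumption): non-empty, all chars in "Sc".
    return bool(text) and all(unicodedata.category(ch) == "Sc" for ch in text)

EXTRA_CURRENCIES = {"\U0001F385", "\U0001F33B", "\u0080", "CAD"}
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def is_currency_custom(text):
    # Compare the punctuation-free form against the custom list...
    stripped = text.translate(_STRIP_PUNCT)
    if stripped in EXTRA_CURRENCIES:
        return True
    # ...but hand the untouched token to the fallback check.
    return is_currency_original(text)

for token in ["CAD", "CAD.", "\U0001F385", "$", ".", "cad"]:
    print(repr(token), is_currency_custom(token))
```

One design choice worth noting: the fallback receives the original token rather than the stripped one, so a lone "." is not turned into an empty string before it is checked. The comparison is case-sensitive, so "cad" is not treated as a currency.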

2. The CAD 5,000 problem

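For this problem, a simple solution is to define a special case. We tell the tokenizer that wherever it encounters CAD, it is a special case and must be handled the way we specify. We can set the IS_CURRENCY flag, among other things.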

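from spacy.symbols import ORTH, TAG, IS_CURRENCY

# Attributes to assign whenever the tokenizer sees the exact string 'CAD'
special_case = [{
        ORTH: u'CAD', 
        TAG: u'$', 
        IS_CURRENCY: True}]

# nlp is the pipeline object returned by spacy.load(...)
nlp.tokenizer.add_special_case(u'CAD', special_case)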

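Note that this is not perfect, because you may get false positives. Imagine a document from a Canadian company that sells CAD drawing services... So this is good, but not great.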

If we want to be more precise, we can create a Matcher object that looks for patterns such as CURRENCY[SPACE]NUMBER or NUMBER[SPACE]CURRENCY and associates the MONEY entity with them.

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Integer ID of the 'MONEY' label in the vocabulary's string store
MONEY = nlp.vocab.strings['MONEY']

# This is the matcher callback that sets the MONEY entity
def add_money_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((MONEY, start, end),)

# spaCy 2.x signature: matcher.add(key, on_match, *patterns)
matcher.add(
    'MoneyRedefined', 
    add_money_ent,
    [{'IS_CURRENCY': True}, {'IS_SPACE': True, 'OP': '?'}, {'LIKE_NUM': True}],
    [{'LIKE_NUM': True}, {'IS_SPACE': True, 'OP': '?'}, {'IS_CURRENCY': True}]
)

Then apply it to your doc object with matcher(doc). The 'OP': '?' key makes that token optional, so it may match 0 or 1 times.
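The shape of the two patterns can also be illustrated with a plain regular expression. This is only a rough sketch of the CURRENCY[SPACE]NUMBER / NUMBER[SPACE]CURRENCY idea, not a replacement for the Matcher: the CURRENCY and NUMBER expressions and the find_money helper are assumptions made for the example.

```python
import re

# Assumed currency markers: CAD, a few common signs, and the custom symbols.
CURRENCY = r"(?:CAD|[$€£]|\u0080|\U0001F385|\U0001F33B)"
# Assumed number shape: digits with optional thousands commas and decimals.
NUMBER = r"\d[\d,]*(?:\.\d+)?"

# Currency then number, or number then currency, with an optional single space.
MONEY_RE = re.compile(rf"{CURRENCY}\s?{NUMBER}|{NUMBER}\s?{CURRENCY}")

def find_money(text):
    """Return every substring that looks like an amount of money."""
    return [m.group(0) for m in MONEY_RE.finditer(text)]

sample = ("CAD 1,000 is about 641.3 €. We sell CAD drawings. "
          "Pay 344\U0001F385 or \U0001F33B56.")
print(find_money(sample))
```

Because a number is required next to the currency marker, the "CAD drawings" mention is not picked up, which is the false positive the special-case approach cannot avoid.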

3. Full code

import spacy
from spacy.symbols import IS_CURRENCY
from spacy.lang.en import EnglishDefaults
from spacy.matcher import Matcher
from spacy import displacy
import html
import string


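def is_currency_custom(text):
    # Strip punctuation so that tokens such as "CAD." still match the list
    table = str.maketrans({key: None for key in string.punctuation})
    stripped = text.translate(table)

    all_currencies = ["\U0001F385", "\U0001F33B", "\u0080", "CAD"]
    if stripped in all_currencies:
        return True
    # Fall back to the original check on the unstripped token; a token made
    # only of punctuation would otherwise be passed on as an empty string
    return is_currency_original(text)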

# Keep a reference to the original is_currency function
is_currency_original = EnglishDefaults.lex_attr_getters[IS_CURRENCY]
# Assign a new function for IS_CURRENCY
EnglishDefaults.lex_attr_getters[IS_CURRENCY] = is_currency_custom

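# 'en' is a spaCy 2.x shortcut link; later releases expect a full package
# name such as 'en_core_web_sm'
nlp = spacy.load('en')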

matcher = Matcher(nlp.vocab)

MONEY = nlp.vocab.strings['MONEY']

# This is the matcher callback that sets the MONEY entity
def add_money_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((MONEY, start, end),)

matcher.add(
    'MoneyRedefined', 
    add_money_ent,
    [{'IS_CURRENCY': True}, {'IS_SPACE': True, 'OP': '?'}, {'LIKE_NUM': True}],
    [{'LIKE_NUM': True}, {'IS_SPACE': True, 'OP': '?'}, {'IS_CURRENCY': True}]
)

text_html = ("I just found out that CAD 1,000 is about 641.3 &#x80. "
             "Some people call it 641.3 \u0080. "
             "Fantastic! But in the U.K. I'd rather pay 344🎅 or \U0001F33B56.")

text = html.unescape(text_html)

doc = nlp(text)

matcher(doc)

displacy.serve(doc, style='ent')