python-3.x - POS pattern mining with spacy

Reposted by: 行者123 Updated: 2023-12-03 16:51:11

I am trying to extract linguistic features from text using spacy in Python 3. My input looks like this:

Sent_id Text
1   I am exploring text analytics using spacy
2   amazing spacy is going to help me

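I am looking for output like the following: words extracted as trigram/bigram phrases according to specific POS patterns that I supply, such as NOUN VERB NOUN, ADJ NOUN, and so on, while preserving the dataframe structure. If a sentence contains more than one phrase, the record must be duplicated for each new phrase.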

Sent_id Text    Feature Pattern
1   I am exploring text analytics using spacy   exploring text analytics    VERB NOUN NOUN
1   I am exploring text analytics using spacy   analytics using spacy   NOUN VERB NOUN
2   amazing spacy is going to help me   amazing spacy   ADJ NOUN

Best answer

The code is explained in the comments.

import spacy
import pandas as pd
import re

# Load spacy model once and reuse 
nlp = spacy.load('en_core_web_sm')

# The dataframe with text
df = pd.DataFrame({
        'Sent_id': [1,2],
        'Text': [ "I am exploring text analytics using spacy", "amazing spacy is going to help me"]
    }) 

# Patterns we are interested in
patterns = ["VERB NOUN", "NOUN VERB NOUN"]

# Convert each pattern into a regular expression (raw string so \w is not treated as a string escape)
re_patterns = [" ".join([r"(\w+)_!" + pos for pos in p.split()]) for p in patterns]


def extract(nlp, text, patterns, re_patterns):
    """Extract the pieces of text matching the POS patterns in patterns.

    args:
        nlp : Loaded spaCy model object
        text: The input text
        patterns: The list of patterns to be searched
        re_patterns: The patterns converted into regex

    returns: A list of [t, p] pairs where
    t is the part of text matching the pattern p in patterns
    """
    doc = nlp(text)   
    matches = list()
    text_pos = " ".join([token.text+"_!"+token.pos_ for token in doc])
    for i, pattern in enumerate(re_patterns):
        for result in re.findall(pattern, text_pos):
            matches.append([" ".join(result), patterns[i]])
    return matches

# Test it 
print (extract(nlp, "A sleeping cat and walking dog", patterns, re_patterns))
# Returns
# [['sleeping cat', 'VERB NOUN'], ['walking dog', 'VERB NOUN']]

# Extract the matched patterns
df['matches'] = df['Text'].apply(lambda x: extract(nlp,x,patterns,re_patterns))


# Convert the list of tuples into rows
df = df.matches.apply(pd.Series).merge(df, right_index = True, left_index = True).drop(["matches"], axis = 1)\
.melt(id_vars = ['Sent_id', 'Text'], value_name = "matches").drop("variable", axis = 1)

# Add the matched text and matched patterns into new columns
df[['matched_text','matched_pattern']]= df.matches.apply(pd.Series)

# Drop the column and cleanup
df = df.drop("matches", axis = 1).sort_values('Sent_id')
df = df.drop_duplicates(subset =["matched_text", "matched_pattern"], keep='last')

Output:

    Sent_id     Text                                matched_text    matched_pattern
0   1   I am exploring text analytics using spacy   exploring text  VERB NOUN
2   1   I am exploring text analytics using spacy   using spacy     VERB NOUN
4   1   I am exploring text analytics using spacy   analytics using spacy   NOUN VERB NOUN
1   2   amazing spacy is going to help me           NaN              NaN
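
One caveat with the regex approach: re.findall returns non-overlapping matches, so when two phrases of the same pattern share a word, only the first is reported. A minimal sketch of a sliding-window alternative follows. It assumes the tokens are already tagged (for instance from token.text and token.pos_); the hand-written `tagged_sentence` and the helper name `extract_ngrams` are illustrative and not part of the answer.

```python
from typing import List, Tuple

# Hypothetical pre-tagged input: (word, coarse POS) pairs, hand-written here
# instead of being produced by a spaCy pipeline.
Tagged = List[Tuple[str, str]]


def extract_ngrams(tagged: Tagged, patterns: List[str]) -> List[Tuple[str, str]]:
    """Return (phrase, pattern) for every window whose POS tags equal a pattern."""
    matches = []
    tags = [pos for _, pos in tagged]
    for pattern in patterns:
        wanted = pattern.split()
        n = len(wanted)
        # Slide a window of length n over the sentence; windows may overlap.
        for start in range(len(tagged) - n + 1):
            if tags[start:start + n] == wanted:
                phrase = " ".join(word for word, _ in tagged[start:start + n])
                matches.append((phrase, pattern))
    return matches


tagged_sentence = [
    ("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN"),
    ("chase", "VERB"), ("mice", "NOUN"),
]
result = extract_ngrams(tagged_sentence, ["NOUN VERB NOUN", "VERB NOUN"])
print(result)
```

Here both "dogs chase cats" and "cats chase mice" are returned for NOUN VERB NOUN, whereas a single re.findall pass over the same tagged string would stop after the first because the shared noun has already been consumed.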

Regarding python-3.x - POS pattern mining with spacy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55393087/
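
The reshaping step in the answer (apply(pd.Series), melt, then drop_duplicates) can probably be simplified. Note that dropping duplicates on matched_text and matched_pattern alone would also collapse an identical phrase that occurs in two different sentences. The sketch below uses DataFrame.explode instead, and assumes a `matches` column that already holds a list of (phrase, pattern) pairs per sentence. The match values are typed in by hand for illustration and are not produced by a model here.

```python
import pandas as pd

# Hypothetical per-sentence matches, shaped like the output of extract():
# one list of (phrase, pattern) pairs per row. Values are hand-written.
df = pd.DataFrame({
    "Sent_id": [1, 2],
    "Text": [
        "I am exploring text analytics using spacy",
        "amazing spacy is going to help me",
    ],
    "matches": [
        [("exploring text", "VERB NOUN"),
         ("using spacy", "VERB NOUN"),
         ("analytics using spacy", "NOUN VERB NOUN")],
        [],  # no pattern matched in this sentence
    ],
})

# explode() emits one row per list element; an empty list becomes one NaN row,
# which dropna() removes here.
long_df = df.explode("matches").dropna(subset=["matches"]).reset_index(drop=True)

# Split each (phrase, pattern) pair into two separate columns.
long_df["matched_text"] = [m[0] for m in long_df["matches"]]
long_df["matched_pattern"] = [m[1] for m in long_df["matches"]]
long_df = long_df.drop(columns="matches")
print(long_df)
```

To keep sentences that have no match, as the output above does, the dropna() call can be left out and the two list comprehensions adapted to pass NaN through.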
