Python stemming (with a pandas DataFrame)

Reposted · Author: 太空狗 · Updated: 2023-10-30 02:02:56

I have created a DataFrame containing the sentences I want to stem. I would like to use SnowballStemmer so that my classification algorithm achieves higher accuracy. How can I do this?

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Use English stemmer.
stemmer = SnowballStemmer("english")

# Sentences to be stemmed.
data = ["programmers program with programming languages", "my code is working so there must be a bug in the interpreter"]

# Create the Pandas dataFrame.
df = pd.DataFrame(data, columns = ['unstemmed'])

# Split the sentences to lists of words.
df['unstemmed'] = df['unstemmed'].str.split()

# Make sure we see the full column (use None; -1 is deprecated in modern pandas).
pd.set_option('display.max_colwidth', None)

# Print dataframe.
df

+----+---------------------------------------------------------------+
| | unstemmed |
|----+---------------------------------------------------------------|
| 0 | ['programmers', 'program', 'with', 'programming', 'languages']|
| 1 | ['my', 'code', 'is', 'working', 'so', 'there', 'must', |
| | 'be', 'a', 'bug', 'in', 'the', 'interpreter'] |
+----+---------------------------------------------------------------+

Best Answer

You have to apply the stemmer to each word and store the result in a "stemmed" column.

df['stemmed'] = df['unstemmed'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.
df = df.drop(columns=['unstemmed']) # Get rid of the unstemmed column.
df # Print dataframe.

+----+--------------------------------------------------------------+
| | stemmed |
|----+--------------------------------------------------------------|
| 0 | ['program', 'program', 'with', 'program', 'languag'] |
| 1 | ['my', 'code', 'is', 'work', 'so', 'there', 'must', |
| | 'be', 'a', 'bug', 'in', 'the', 'interpret'] |
+----+--------------------------------------------------------------+
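Since the stated goal is better classification accuracy, note that most text vectorizers (for example scikit-learn's CountVectorizer or TfidfVectorizer) expect each row to be a plain string rather than a list of tokens. A minimal sketch, building on the answer above, that stems each token and joins the tokens back into one string per row:

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

data = ["programmers program with programming languages",
        "my code is working so there must be a bug in the interpreter"]
df = pd.DataFrame(data, columns=["unstemmed"])

# Stem every token, then join the tokens back into a single string
# per row, so the column can be fed directly to a text vectorizer.
df["stemmed"] = (
    df["unstemmed"]
    .str.split()
    .apply(lambda words: " ".join(stemmer.stem(w) for w in words))
)
```

The `df["stemmed"]` column can then be passed to a vectorizer's `fit_transform` before training the classifier; the vectorizer choice here is illustrative, not prescribed by the original question.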

Regarding Python stemming (with a pandas DataFrame), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37443138/
