
r - How to see which original words map to a specific stemmed word


I'm doing some text analysis with tm_map in R. I run the following code (with no errors) to generate a document-term matrix of (stemmed and otherwise preprocessed) words.

corpus = Corpus(VectorSource(textVector))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument, language="english")

dtm = DocumentTermMatrix(corpus)
mostFreqTerms = findFreqTerms(dtm, lowfreq=125)

But when I look at my (stemmed) mostFreqTerms, I see some terms that make me wonder, "Hmm, which words produced that stem?" Also, some of the stemmed terms make sense to me at first glance, but maybe I'm overlooking the fact that they actually lump together words with different meanings.

I'd like to apply the strategy/technique described in this SO answer to preserve specific terms during stemming (for example, to keep "natural" and "naturalized" from being reduced to the same stemmed term): Text-mining with the tm-package - word stemming
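As a rough illustration of the kind of protection I mean (just a sketch of my own, not necessarily how that answer does it, and protectTerms is a hypothetical helper): stem every token with SnowballC::wordStem except the words to be kept verbatim, and use that in place of the stemDocument step.

# hypothetical helper: stem all whitespace-separated tokens except those in `keep`
library(SnowballC)

protectTerms = function(x, keep) {
  words = unlist(strsplit(as.character(x), "\\s+"))
  stemmed = ifelse(words %in% keep, words,
                   wordStem(words, language = "english"))
  paste(stemmed, collapse = " ")
}

protectTerms("the natural naturalized citizen", keep = "naturalized")
# expected: "the natur naturalized citizen"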

But to do that as thoroughly as possible, I'd like to see a list of all the individual words that map to my most frequent stemmed terms. Is there a way to find the words that, after stemming, produced my mostFreqTerms list?

EDIT: reproducible example

textVector = c("Trisha Takinawa: Here comes Mayor Adam West 
himself. Mr. West do you have any words
for our viewers?Mayor Adam West: Box toaster
aluminum maple syrup... no I take that one
back. Im gonna hold onto that one.
Now MaxPower is adding adamant
so this example works")

corpus = Corpus(VectorSource(textVector))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument, language="english")

dtm = DocumentTermMatrix(corpus)
mostFreqTerms = findFreqTerms(dtm, lowfreq=2)
mostFreqTerms

...and the mostFreqTerms output from the code above:

[1] "adam" "one" "west"

I'm looking for a programmatic way to determine that the stem "adam" came from the original words "adam" and "adamant".
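Something along these lines is the kind of mapping I have in mind (a rough, unverified sketch of my own: redo the preprocessing without the stemming step, stem the unstemmed terms with SnowballC::wordStem, which I believe is what stemDocument uses internally, and group the original terms by stem):

# sketch only: rebuild the corpus, stopping before stemDocument
library(tm)
library(SnowballC)

corpusRaw = Corpus(VectorSource(textVector))
corpusRaw = tm_map(corpusRaw, tolower)
corpusRaw = tm_map(corpusRaw, PlainTextDocument)
corpusRaw = tm_map(corpusRaw, removePunctuation)
corpusRaw = tm_map(corpusRaw, removeWords, c(stopwords("english")))

dtmRaw = DocumentTermMatrix(corpusRaw)
originalTerms = Terms(dtmRaw)                          # unstemmed terms
stems = wordStem(originalTerms, language = "english")  # stem each term myself

stemMap = split(originalTerms, stems)  # named list: stem -> original words
stemMap[mostFreqTerms]                 # e.g. $adam should list "adam" "adamant"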

Best Answer

Here you can see that the stemmed word "west" came from the words "west", "west" and "wester".

import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
import string

st = RSLPStemmer()
punctuations = list(string.punctuation)
textVector = "Trisha Takinawa: Here comes Mayor adams West himself. Mr. \
West do you have any words for our viewers?Mayor Adam Wester: \
Box toaster aluminum maple syrup... no I take that one back. Im gonna hold \
onto that one. Now MaxPower is adding adamant so this example works"

tokens = word_tokenize(textVector.lower())
tokens = [w for w in tokens if not w in punctuations]
filtered_words = [w for w in tokens if not w in stopwords.words('english')]
steammed_words = [st.stem(w) for w in filtered_words ]

allWordDist = nltk.FreqDist(w for w in steammed_words)

# for the two most common stems, print each original word that produced it
for w in allWordDist.most_common(2):
    for i in range(len(steammed_words)):
        if steammed_words[i] == w[0]:
            print(str(w[0]) + "=" + filtered_words[i])

west=west
west=west
west=wester
ad=adams
ad=adam

Regarding "r - How to see which original words map to a specific stemmed word", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30005043/
