
python - Exact replication of R text preprocessing in Python

Reposted · Author: 太空狗 · Updated: 2023-10-30 01:15:17

I would like to preprocess a corpus of documents using Python in the same way that I can in R. For example, given an initial corpus, corpus, I would like to end up with a preprocessed corpus corresponding to the one produced with the following R code:

library(tm)
library(SnowballC)

corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("myword", stopwords("english")))
corpus = tm_map(corpus, stemDocument)

Is there an easy or straightforward (preferably pre-built) way of doing this in Python? Is there a way to ensure exactly the same results?


For example, I would like to preprocess

@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!

into

ear pod amaz best sound inear headphon ive ever
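(For reference, an approximate pure-Python attempt with NLTK's Porter stemmer does reproduce this particular example, though the stopword lists and stemming algorithms are not guaranteed to agree with tm/SnowballC on every input. The stopword set below is a hand-picked subset of stopwords("english") used purely for illustration:)

```python
import re
from nltk.stem import PorterStemmer

# Hand-picked subset of stopwords("english"), for illustration only;
# tm's full list differs from NLTK's and would need to match exactly
# for identical results on arbitrary text.
stopwords = {"are", "from", "had"}
removed = {"apple"} | stopwords  # custom word plus stopwords, as in removeWords

stemmer = PorterStemmer()

def preprocess(text):
    text = text.lower()                                      # tolower
    text = re.sub(r"[^\w\s]", "", text)                      # removePunctuation
    tokens = [t for t in text.split() if t not in removed]   # removeWords
    return " ".join(stemmer.stem(t) for t in tokens)         # stemDocument

tweet = "@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!"
print(preprocess(tweet))
# ear pod amaz best sound inear headphon ive ever
```

Note that stripping punctuation before stopword removal turns "I've" into "ive", which survives stopword removal, just as in the tm pipeline.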

Best answer

Getting things exactly the same between nltk and tm in the preprocessing steps seems tricky, so I think the best approach is to use rpy2 to run the preprocessing in R and pull the results into Python:

import rpy2.robjects as ro
preproc = [x[0] for x in ro.r('''
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)''')]

Then you can load it into scikit-learn -- the only thing you need to do to get things to match between CountVectorizer and DocumentTermMatrix is to remove terms of length less than 3:

from sklearn.feature_extraction.text import CountVectorizer
def mytokenizer(x):
    return [y for y in x.split() if len(y) > 2]

# Full document-term matrix
cv = CountVectorizer(tokenizer=mytokenizer)
X = cv.fit_transform(preproc)
X
# <1181x3289 sparse matrix of type '<type 'numpy.int64'>'
# with 8980 stored elements in Compressed Sparse Column format>

# Sparse terms removed
cv2 = CountVectorizer(tokenizer=mytokenizer, min_df=0.005)
X2 = cv2.fit_transform(preproc)
X2
# <1181x309 sparse matrix of type '<type 'numpy.int64'>'
# with 4669 stored elements in Compressed Sparse Column format>

Let's verify that this matches in R:

tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
dtm = DocumentTermMatrix(corpus)
dtm
# A document-term matrix (1181 documents, 3289 terms)
#
# Non-/sparse entries: 8980/3875329
# Sparsity : 100%
# Maximal term length: 115
# Weighting : term frequency (tf)

sparse = removeSparseTerms(dtm, 0.995)
sparse
# A document-term matrix (1181 documents, 309 terms)
#
# Non-/sparse entries: 4669/360260
# Sparsity : 99%
# Maximal term length: 20
# Weighting : term frequency (tf)

As you can see, the numbers of stored elements and terms now match exactly between the two approaches.

Regarding python - Exact replication of R text preprocessing in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22797393/
