
python - CSV file with labels

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:00:27

Following the suggestion in "Python Tf idf algorithm", I am using this code to get the frequency of words across a set of documents.

import pandas as pd
import csv
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
import codecs

def tokenize(text):
    # split the text into tokens and reduce each token to its stem
    tokens = word_tokenize(text)
    stemmer = PorterStemmer()
    stems = []
    for item in tokens:
        stems.append(stemmer.stem(item))
    return stems

with codecs.open("book1.txt",'r','utf-8') as i1,\
     codecs.open("book2.txt",'r','utf-8') as i2,\
     codecs.open("book3.txt",'r','utf-8') as i3:
    # your corpus
    t1=i1.read().replace('\n',' ')
    t2=i2.read().replace('\n',' ')
    t3=i3.read().replace('\n',' ')

text = [t1,t2,t3]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).toarray()
# transform the matrix to a pandas df
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names())
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out())
# sum the scores of each word over all documents (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)

top_words.to_csv('dict.csv', index=True, float_format="%f",encoding="utf-8")

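In the last line I create a csv file that lists every word together with its frequency. Is there a way to label the words, so I can see whether a word belongs only to the third document or to all documents? My goal is to remove from the csv file all words that appear only in the third document (book3).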

Best answer

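You can use the isin() method to filter the top_words of the third book out of the top_words of the whole corpus. Note that this removes every word that occurs in the third book, including words it shares with the other books.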
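As a minimal sketch of how the isin() mask behaves, here is a toy example. The words and scores are made up for illustration and merely stand in for the real top_words series.

```python
import pandas as pd

# Toy stand-ins for the real scores: index = word, value = summed TF-IDF score.
corpus_words = pd.Series({"whale": 1.2, "sea": 0.9, "ship": 0.7, "garden": 0.4})
book3_words = pd.Series({"sea": 0.8, "garden": 0.6})

# isin() returns a boolean array: True where a corpus word also occurs in book 3.
in_book3 = corpus_words.index.isin(book3_words.index)

# ~ inverts the mask, so only words absent from book 3 are kept.
remaining = corpus_words[~in_book3]
print(remaining.index.tolist())  # ['whale', 'ship']
```

"sea" is dropped here even though it would also occur in other books, because the mask only checks membership in the book 3 index.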

(For the example below, I downloaded three random books from http://www.gutenberg.org/)

import codecs
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# import nltk
# nltk.download('punkt')
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

def tokenize(text):
    # split the text into tokens and reduce each token to its stem
    tokens = word_tokenize(text)
    stemmer = PorterStemmer()
    stems = []
    for item in tokens:
        stems.append(stemmer.stem(item))
    return stems

with codecs.open("56732-0.txt",'r','utf-8') as i1,\
     codecs.open("56734-0.txt",'r','utf-8') as i2,\
     codecs.open("56736-0.txt",'r','utf-8') as i3:
    # your corpus
    t1=i1.read().replace('\n',' ')
    t2=i2.read().replace('\n',' ')
    t3=i3.read().replace('\n',' ')

text = [t1,t2,t3]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).toarray()
# transform the matrix to a pandas df
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names())
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out())
# sum the scores of each word over all documents (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)

# top_words for the 3rd book alone (this refits the vectorizer on book 3 only)
text = [" ".join(tokenize(t3.lower()))]
matrix = vectorizer.fit_transform(text).toarray()
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out())
top_words3 = matrix.sum(axis=0).sort_values(ascending=False)

# Mask out every word that occurs in t3 (including words shared with t1 or t2)
mask = ~top_words.index.isin(top_words3.index)
# Filter those words from top_words
top_words = top_words[mask]

top_words.to_csv('dict.csv', index=True, float_format="%f",encoding="utf-8")

Regarding "python - CSV file with labels", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/49283979/
