python - Scikit : Remove feature row if present in all documents

Reposted. Author: 行者123. Updated: 2023-11-30 09:29:48

I am doing text classification. I have about 32K documents (spam and ham).

import numpy as np
import pandas as pd
import sklearn.datasets as dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import LabelEncoder
import re
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.linear_model import SGDClassifier
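from bs4 import BeautifulSoup  # BeautifulSoup 4 is imported from the bs4 package
from sklearn.feature_extraction import text
from sklearn import model_selection  # replaces the removed sklearn.cross_validation
from sklearn import svm
from sklearn.model_selection import GridSearchCV  # replaces the removed sklearn.grid_search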
from sklearn.feature_selection import VarianceThreshold

# Now load files from spam and ham
data = dataset.load_files("/home/voila/Downloads/enron1/")
xData = data.data
yData = data.target
print(data.target_names)


countVector = CountVectorizer(decode_error='ignore', stop_words='english')
countmatrix = countVector.fit_transform(xData)

countmatrix will be a matrix in which countmatrix[i, j] is the count of word j in document i.

Now I want to remove every feature for which countmatrix[i, j] > 1 in more than 80% of the documents (meaning the word is too common).

How can I do this?

Thanks

Best answer

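You can do this by setting max_df to a value less than 1 when you construct the CountVectorizer; see the docs. A float max_df is read as a proportion of documents, so max_df=0.8 ignores every term whose document frequency is strictly higher than 80%. Note that a document counts toward that frequency as soon as the term appears in it at least once.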
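
As a minimal sketch of that suggestion: the toy corpus below is hypothetical and only stands in for the Enron files, and the 0.8 threshold is taken from the question. The second half applies the same cutoff by hand to a count matrix that already exists, which is an assumption about what an equivalent manual filter could look like, not something the answer itself spells out.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "money" occurs in every document.
docs = [
    "money meeting schedule",
    "free money offer",
    "money transfer report",
    "claim free money prize",
    "money lunch friday",
]

# Baseline vectorizer that keeps every term.
full = CountVectorizer()
full_counts = full.fit_transform(docs)

# max_df=0.8 drops terms found in strictly more than 80% of the documents.
pruned = CountVectorizer(max_df=0.8)
pruned_counts = pruned.fit_transform(docs)

dropped = sorted(set(full.vocabulary_) - set(pruned.vocabulary_))
print(dropped)  # ['money']

# Manual equivalent on the existing matrix: fraction of documents that
# contain each term at least once, then keep only the columns at or
# below the threshold.
n_docs = full_counts.shape[0]
doc_freq = np.asarray((full_counts > 0).sum(axis=0)).ravel() / n_docs
keep = np.flatnonzero(doc_freq <= 0.8)
manual_counts = full_counts[:, keep]

# Terms in column order, so they can be matched against the kept columns.
full_terms = sorted(full.vocabulary_, key=full.vocabulary_.get)
manual_terms = [full_terms[j] for j in keep]
```

Passing max_df at construction time is usually the simpler route, because the vocabulary and the matrix stay in sync; the manual filter is mainly useful when the matrix has already been built and refitting is expensive.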

Regarding python - Scikit : Remove feature row if present in all documents, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30483830/
