dataset - Preparing data for scikit-learn

Reposted. Author: 太空宇宙. Updated: 2023-11-03 17:56:36

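I'm working on a small NLP project on authorship attribution: I have some texts from two authors, and I want to determine which author wrote each one.

I have some preprocessed text (tokenized, POS-tagged, etc.) that I would like to load into scikit-learn.

The documents have the following shape: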

Testo   -   SPN Testo   testare+v+indic+pres+nil+1+sing testo+n+m+sing  O
: - XPS colon colon+punc O
" - XPO " quotation_mark+punc O
Buongiorno - I buongiorno buongiorno+inter buongiorno+n+m+_ O
a - E a a+prep O
tutti - PP tutto tutto+adj+m+plur+pst+ind tutto+pron+_+m+_+plur+ind O
. <eos> XPS full_stop full_stop+punc O
Ci - PP pro loc+pron+loc+_+3+_+clit pro+pron+accdat+_+1+plur+clit O
sarebbe - VI essere essere+v+cond+pres+nil+2+sing O
molto - B molto molto+adj+m+sing+pst+ind

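So each one is a tab-separated text file with 6 columns (word, end-of-sentence marker, part of speech, lemma, morphological information, and named-entity recognition tag).

Each file represents one document to classify.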
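As a rough illustration of the format described above, the sketch below pulls one column (for example the lemmas or the POS tags) out of the token lines and joins it into a single string per document, which is the kind of plain text a vectorizer can consume. The column names in `COLUMNS`, the helper `column_values`, and the simplified sample rows (one morphological analysis per token) are assumptions made for this example, not part of the original data.

```python
# Hypothetical labels for the six tab-separated fields described above.
COLUMNS = ("word", "eos", "pos", "lemma", "morph", "ner")


def column_values(lines, column="lemma"):
    """Return the values of one column from tab-separated token lines."""
    index = COLUMNS.index(column)
    values = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or truncated lines instead of guessing their content.
        if len(fields) > index and fields[index]:
            values.append(fields[index])
    return values


# Simplified sample rows (assumption: exactly six tab-separated fields each).
sample = [
    "Buongiorno\t-\tI\tbuongiorno\tbuongiorno+inter\tO\n",
    "a\t-\tE\ta\ta+prep\tO\n",
    "tutti\t-\tPP\ttutto\ttutto+adj+m+plur+pst+ind\tO\n",
]

lemma_text = " ".join(column_values(sample, "lemma"))
pos_text = " ".join(column_values(sample, "pos"))
print(lemma_text)  # buongiorno a tutto
print(pos_text)    # I E PP
```

Which column works best as a feature source for authorship attribution is an open choice; lemmas and POS tags are shown here only because they are columns the file already provides.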

What is the best way to shape them for scikit-learn?

Best answer

The directory structure used in the scikit-learn text example (https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) is described here: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html

Replace this:

# Load some categories from the training set
if opts.all_categories:
    categories = None
else:
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]

if opts.filtered:
    remove = ('headers', 'footers', 'quotes')
else:
    remove = ()

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)

with your own data-loading statements, for example:

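from sklearn.datasets import load_files

# Category names must match the subdirectory names under train/ and test/
categories = [
    'high',
    'low',
]

print("loading dataset for categories:")
print(categories if categories else "all")

# load_files uses each subdirectory name as the label of the files inside it
train_path = 'c:/Users/username/Documents/SciKit/train'
data_train = load_files(train_path, categories=categories, encoding='latin1')

test_path = 'c:/Users/username/Documents/SciKit/test'
data_test = load_files(test_path, categories=categories, encoding='latin1')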

Then, inside both the training and test directories, create "high" and "low" subdirectories to hold the files for each category.
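To make the expected layout concrete, here is a minimal sketch that writes a few documents into one subdirectory per category, using only the standard library. The helper `write_corpus`, the toy documents, and the temporary directory are hypothetical and stand in for the real preprocessed files and paths.

```python
import tempfile
from pathlib import Path


def write_corpus(root, documents):
    """Write (label, name, text) triples as root/<label>/<name>.txt."""
    root = Path(root)
    for label, name, text in documents:
        label_dir = root / label
        label_dir.mkdir(parents=True, exist_ok=True)
        (label_dir / f"{name}.txt").write_text(text, encoding="latin1")
    return root


# Hypothetical toy documents; real ones would be the preprocessed files.
train_docs = [
    ("high", "doc1", "Buongiorno a tutti ."),
    ("low", "doc2", "Ci sarebbe molto"),
]

train_path = write_corpus(Path(tempfile.mkdtemp()) / "train", train_docs)
layout = sorted(
    p.relative_to(train_path).as_posix() for p in train_path.rglob("*.txt")
)
print(layout)  # ['high/doc1.txt', 'low/doc2.txt']
```

Pointing `load_files` at a directory laid out this way should give a dataset whose `target_names` are the subdirectory names, here "high" and "low"; the same layout would be repeated for the test directory.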

A similar question about "dataset - Preparing data for scikit-learn" can be found on Stack Overflow: https://stackoverflow.com/questions/28386108/
