python - Loading a custom dataset (similar to the 20 newsgroups set) in scikit-learn to classify text documents


I am trying to run this scikit example code on my custom TED Talks dataset. Each directory is a topic, and under each topic are text files containing the description of each TED talk.

This is my dataset's tree structure. As you can see, each directory is a topic, and under it are the text files with the descriptions:

Topics/
|-- Activism
| |-- 1149.txt
| |-- 1444.txt
| |-- 157.txt
| |-- 1616.txt
| |-- 1706.txt
| |-- 1718.txt
|-- Adventure
| |-- 1036.txt
| |-- 1777.txt
| |-- 2930.txt
| |-- 2968.txt
| |-- 3027.txt
| |-- 3290.txt
|-- Advertising
| |-- 3673.txt
| |-- 3685.txt
| |-- 6567.txt
| `-- 6925.txt
|-- Africa
| |-- 1045.txt
| |-- 1072.txt
| |-- 1103.txt
| |-- 1112.txt
|-- Aging
| |-- 1848.txt
| |-- 2495.txt
| |-- 2782.txt
|-- Agriculture
| |-- 3469.txt
| |-- 4140.txt
| |-- 4733.txt
| |-- 4939.txt

I modeled my dataset on the 20 newsgroups set, which has the following tree structure:

20news-18828/
|-- alt.atheism
| |-- 49960
| |-- 51060
| |-- 51119

|-- comp.graphics
| |-- 37261
| |-- 37913
| |-- 37914
| |-- 37915
| |-- 37916
| |-- 37917
| |-- 37918
|-- comp.os.ms-windows.misc
| |-- 10000
| |-- 10001
| |-- 10002
| |-- 10003
| |-- 10004
| |-- 10005

In the original code (lines 98-124), this is how the training and test data are loaded directly from scikit:

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
shuffle=True, random_state=42,
remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
shuffle=True, random_state=42,
remove=remove)
print('data loaded')

categories = data_train.target_names # for case categories == None
def size_mb(docs):
return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
len(data_test.data), data_test_size_mb))
print("%d categories" % len(categories))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

Since this dataset ships with scikit, its labels and so on are built in. In my case, I know how to load the dataset (line 84):

dataset = load_files('./TED_dataset/Topics/')
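
From the scikit documentation, load_files returns a dict-like Bunch whose data field holds the raw documents (bytes unless an encoding is given) and whose target field holds one integer label per document, derived from the directory names. A quick inspection sketch, assuming the Topics/ layout above:

from sklearn.datasets import load_files

dataset = load_files('./TED_dataset/Topics/')  # pass encoding='utf-8' to get str instead of bytes
print(len(dataset.data))    # number of documents loaded
print(dataset.target[:10])  # integer labels, one per document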

What I don't know is what to do after that. I want to know how I should split this data into training and test sets, and generate these labels from my dataset:

data_train.data,  data_test.data 

To sum up, I just want to load my dataset and run this code on it without errors. I have uploaded the dataset here for anyone who may want to see it.

I have already referred to this question, which briefly covers loading train and test sets. I also want to know how to get data_train.target_names from my dataset.

Edit:

I tried getting the train and test splits, which returns an error:

dataset = load_files('./TED_dataset/Topics/')
train, test = train_test_split(dataset, train_size = 0.8)

The updated code is here.

Best Answer

I think you're looking for something like this (the train_test_split call in the question fails because it is given the whole Bunch object; pass it the parallel arrays bunch.data and bunch.target instead):

In [1]: from sklearn.datasets import load_files

In [2]: from sklearn.model_selection import train_test_split  # sklearn.cross_validation before scikit-learn 0.18

In [3]: bunch = load_files('./Topics')

In [4]: X_train, X_test, y_train, y_test = train_test_split(bunch.data, bunch.target, test_size=.4)

# Then proceed to train your model and validate.

Note that bunch.target is an array of integers, which are indices into the category names stored in bunch.target_names:

In [14]: X_test[:2]
Out[14]:
['Psychologist Philip Zimbardo asks, "Why are boys struggling?" He shares some stats (lower graduation rates, greater worries about intimacy and relationships) and suggests a few reasons -- and challenges the TED community to think about solutions.Philip Zimbardo was the leader of the notorious 1971 Stanford Prison Experiment -- and an expert witness at Abu Ghraib. His book The Lucifer Effect explores the nature of evil; now, in his new work, he studies the nature of heroism.',
'Human growth has strained the Earth\'s resources, but as Johan Rockstrom reminds us, our advances also give us the science to recognize this and change behavior. His research has found nine "planetary boundaries" that can guide us in protecting our planet\'s many overlapping ecosystems.If Earth is a self-regulating system, it\'s clear that human activity is capable of disrupting it. Johan Rockstrom has led a team of scientists to define the nine Earth systems that need to be kept within bounds for Earth to keep itself in balance.']

In [15]: y_test[:2]
Out[15]: array([ 84, 113])

In [16]: [bunch.target_names[idx] for idx in y_test[:2]]
Out[16]: ['Education', 'Global issues']
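
From there you can drop the split into the original example's pipeline in place of the fetch_20newsgroups variables. A minimal sketch, assuming the TfidfVectorizer settings used in the example (sublinear_tf=True, max_df=0.5, English stop words) and a LinearSVC; both are choices taken from the example, not requirements:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn import metrics

# vectorize the raw documents; fit the vocabulary on the training split only
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# train and evaluate, mirroring the benchmark loop in the example
clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)
pred = clf.predict(X_test_tfidf)
print("accuracy: %0.3f" % metrics.accuracy_score(y_test, pred))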

Regarding python - Loading a custom dataset (similar to the 20 newsgroups set) in scikit-learn to classify text documents, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33612296/
