elasticsearch - Unable to crawl data from Elasticsearch with StormCrawler

Reposted. Author: 行者123. Updated: 2023-12-03 02:21:46

I used the versions of the required libraries and resources recommended by this article:

https://medium.com/analytics-vidhya/web-scraping-and-indexing-with-stormcrawler-and-elasticsearch-a105cb9c02ca

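My Elasticsearch instance works fine when I add data to it manually. When I use StormCrawler, however, the status link at localhost:9200 works, but the content link at localhost:9200 shows no content, and after crawling the status shows FETCH_ERROR.

Here is my crawler.flux file: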

name: "crawler"

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false

  - resource: false
    file: "crawler-conf.yaml"
    override: true

  - resource: false
    file: "es-conf.yaml"
    override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 1

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

Best answer

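The latest ES tutorial would probably be a better starting point. You need to inject the URLs first: the crawler.flux topology above only reads URLs that are already in the status index, and it never adds seed URLs itself.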
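
For illustration, injection can be done with a separate, small Flux topology that reads seed URLs from a file and writes them to the status index. The following is a minimal sketch, not the tutorial's exact file. It assumes that a FileSpout class exists at the path shown in your StormCrawler version and accepts (directory, file name, emit-on-status-stream) as constructor arguments, and that a seeds.txt file (a hypothetical name) sits in the working directory with one URL per line. Check both against the version you have installed.

```yaml
# es-injector.flux (hypothetical file name): seeds the status index once
name: "injector"

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false

  - resource: false
    file: "crawler-conf.yaml"
    override: true

  - resource: false
    file: "es-conf.yaml"
    override: true

spouts:
  # Assumed class and constructor: reads ./seeds.txt and, with the third
  # argument set to true, emits each URL on the "status" stream
  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true

bolts:
  # Same status updater as in crawler.flux; it persists the seed URLs
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "filespout"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
```

The injector would be run once, before the crawl topology is started, so that the AggregationSpout in crawler.flux has URLs to pick up. The FIELDS grouping on "url" mirrors the groupings used for the status bolt in crawler.flux; the tutorial may use a different grouping.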

Regarding "elasticsearch - Unable to crawl data from Elasticsearch with StormCrawler", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62199597/
