
elasticsearch - StormCrawler is slow, with high latency, when crawling 300 domains

Reposted. Author: 行者123. Last updated: 2023-12-02 22:45:56

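I have been struggling with this issue for about 3 months. The crawler seems to fetch pages every 10 minutes but appears to do nothing in between, so overall throughput is very slow. I am crawling 300 domains in parallel. With a crawl delay of 10 seconds, that should give about 30 pages/second. Currently it is about 2 pages/second.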
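The expected rate quoted above follows from simple politeness arithmetic. Here is a minimal sketch, assuming each domain is fetched at most once every `fetcher.server.delay` seconds and that download time is negligible. The function name is illustrative and is not part of StormCrawler:

```python
def max_pages_per_second(num_domains: int, crawl_delay_s: float) -> float:
    """Upper bound on the fetch rate when every domain yields
    at most one page per crawl delay (assumption, see lead-in)."""
    if crawl_delay_s <= 0:
        raise ValueError("crawl_delay_s must be positive")
    return num_domains / crawl_delay_s


expected = max_pages_per_second(300, 10.0)  # figures from the question
observed = 2.0                              # pages/second reported in the question
utilisation = observed / expected

print(expected)               # 30.0
print(f"{utilisation:.1%}")   # 6.7%
```

Under these assumptions the crawl reaches only a small fraction of the politeness-bound rate, which suggests that the crawl delay itself is not the limiting factor.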

The topology runs on a PC with:

  • 8 GB of RAM
  • A regular hard disk drive
  • A Core Duo CPU
  • Ubuntu 16.04

Elasticsearch is installed on another machine with the same specs.

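Here you can see the metrics from the Grafana dashboard:

[Image: The Grafana Dashboard]

They are also reflected in the process latency shown in the Storm UI:

[Image: Storm UI]

My current StormCrawler architecture is: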

    spouts:
      - id: "spout"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
        parallelism: 25

    bolts:
      - id: "partitioner"
        className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
        parallelism: 1
      - id: "fetcher"
        className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
        parallelism: 6
      - id: "sitemap"
        className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
        parallelism: 1
      - id: "parse"
        className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
        parallelism: 1
      - id: "index"
        className: "de.hpi.bpStormcrawler.BPIndexerBolt"
        parallelism: 1
      - id: "status"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
        parallelism: 4
      - id: "status_metrics"
        className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
        parallelism: 1

Configuration (only the most relevant parts):

    config:
      topology.workers: 1
      topology.message.timeout.secs: 300
      topology.max.spout.pending: 100
      topology.debug: false

      fetcher.threads.number: 50

      worker.heap.memory.mb: 2049
      partition.url.mode: byDomain

      fetcher.server.delay: 10.0

This is the Storm configuration (again, only the relevant parts):

    nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"

    ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"

    supervisor.childopts: "-Djava.net.preferIPv4Stack=true"

    worker.childopts: "-Xmx1500m -Djava.net.preferIPv4Stack=true"


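Do you have any idea what the problem could be? Or is it just a hardware problem?

What I have already tried:

  • Increasing and decreasing fetcher.server.delay, which did not change anything
  • Decreasing and increasing the number of fetcher threads
  • Experimenting with the parallelism settings
  • Checking whether network bandwidth is the limit. The bandwidth is 400 Mbit/s; at the expected 30 pages/s with an average page size of 0.5 MB, the crawl needs 15 MB/s, i.e. 120 Mbit/s, so that should not be the problem either
  • Increasing the number of workers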
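The bandwidth estimate in the list above can be reproduced as a quick calculation. This is a rough sketch using only the figures given in the question (30 pages/s target, 0.5 MB average page size, 400 Mbit/s link) and the conversion 1 MB = 8 Mbit. The function name is illustrative:

```python
def required_mbit_per_s(pages_per_s: float, avg_page_mb: float) -> float:
    """Bandwidth needed to sustain a fetch rate, in Mbit/s (1 MB = 8 Mbit)."""
    mb_per_s = pages_per_s * avg_page_mb
    return mb_per_s * 8


required = required_mbit_per_s(30, 0.5)
link_mbit_per_s = 400  # available bandwidth stated in the question

print(required)                    # 120.0
print(required < link_mbit_per_s)  # True
```

Even at the full target rate the crawl would use less than a third of the link, which supports the conclusion that bandwidth is not the bottleneck.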

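Do you have any other ideas for what I should check, or reasons that could explain the slow fetching? Could it simply be the slow hardware? Or is Elasticsearch the bottleneck?

Thank you very much in advance.

EDIT:

I changed the topology to use two workers, and the following error now occurs frequently: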

    2018-07-03 17:18:46.326 c.d.s.e.p.AggregationSpout Thread-33-spout-executor[26 26] [INFO] [spout #12]  Populating buffer with nextFetchDate <= 2018-06-21T17:52:42+02:00
    2018-07-03 17:18:46.327 c.d.s.e.p.AggregationSpout I/O dispatcher 26 [ERROR] Exception with ES query
    java.io.IOException: Unable to parse response body for Response{requestLine=POST /status/status/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&preference=_shards%3A12&search_type=query_then_fetch&batched_reduce_size=512 HTTP/1.1, host=http://ts5565.byod.hpi.de:9200, response=HTTP/1.1 200 OK}
    at org.elasticsearch.client.RestHighLevelClient$1.onSuccess(RestHighLevelClient.java:548) [stormjar.jar:?]
    at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onSuccess(RestClient.java:600) [stormjar.jar:?]
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:355) [stormjar.jar:?]
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:346) [stormjar.jar:?]
    at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119) [stormjar.jar:?]
    at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177) [stormjar.jar:?]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436) [stormjar.jar:?]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326) [stormjar.jar:?]
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) [stormjar.jar:?]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) [stormjar.jar:?]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [stormjar.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
    Caused by: java.lang.NullPointerException


The crawling process seems much more balanced now, but it still does not fetch many links:

[Image]

Also, after the topology had been running for a few weeks, the latency went up a lot:

[Image]

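Best answer

Sorry for the late reply, I have just come back from holiday.

Judging from the graph, the worker gets restarted, which makes me think that something is blocking or crashing the topology. After a while with nothing happening, the worker is restarted, it processes a few URLs, and then the problem happens again.

Have you checked the logs for error messages? Is there a memory dump in the logs? Can you isolate the URLs that cause the problem?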
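One way to start on the log check suggested above is to count the ERROR and WARN lines per logger. The sketch below assumes the worker log uses the same line layout as the excerpt in the question (`<date> <time> <logger> <thread> [LEVEL] <message>`). The regular expression and function name are illustrative, not part of Storm or StormCrawler:

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpt in the question:
# "<date> <time> <logger> <thread ...> [LEVEL] <message>"
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} "
    r"(?P<logger>\S+) .*?\[(?P<level>ERROR|WARN)\] (?P<message>.*)$"
)


def summarize_errors(lines):
    """Count ERROR/WARN lines grouped by (logger, level, message)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            key = (match["logger"], match["level"], match["message"])
            counts[key] += 1
    return counts


# The two log lines quoted in the question.
SAMPLE = [
    "2018-07-03 17:18:46.326 c.d.s.e.p.AggregationSpout "
    "Thread-33-spout-executor[26 26] [INFO] [spout #12]  "
    "Populating buffer with nextFetchDate <= 2018-06-21T17:52:42+02:00",
    "2018-07-03 17:18:46.327 c.d.s.e.p.AggregationSpout "
    "I/O dispatcher 26 [ERROR] Exception with ES query",
]

for (logger, level, message), n in summarize_errors(SAMPLE).most_common():
    print(n, level, logger, message)
```

To run it against a real log, pass an open file object to `summarize_errors`; the location of the worker log depends on the Storm installation. Stack trace lines are skipped because they do not start with a timestamp, so only the first line of each error is counted.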

Regarding "elasticsearch - StormCrawler is slow, with high latency, when crawling 300 domains", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/50950750/
