
elasticsearch - Optimal settings for StormCrawler -> Elasticsearch when crawl politeness is not a concern?


Our university web system has roughly 1,200 sites comprising several million pages. We have StormCrawler installed and configured on a machine that runs Apache locally, with a drive mapped to the file system of the web environment. This means StormCrawler can crawl as fast as we like without generating network traffic and without affecting the public web presence. We also run the Tika parser so that .doc, .pdf, and similar files get indexed.

  • All of the sites live under the *.example.com domain.
  • We have a single Elasticsearch instance running with plenty of CPU, memory, and disk.
  • The main index has 4 shards.
  • The metrics index has 1 shard.
  • The status index has 10 shards.

  • Given all of that, what is the best crawl configuration to have the crawler ignore politeness and, within our local network environment, fetch everything as fast as possible?

Here are the current spout and bolt settings in es-crawler.flux:
    name: "www-all-crawler"

    includes:
    - resource: true
    file: "/crawler-default.yaml"
    override: false

    - resource: false
    file: "crawler-conf.yaml"
    override: true

    - resource: false
    file: "es-conf.yaml"
    override: true

    spouts:
    - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

    bolts:
    - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
    - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 2
    - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
    - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
    - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1
    - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
    - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1
    - id: "redirection_bolt"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1
    - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1

    streams:
    - from: "spout"
    to: "partitioner"
    grouping:
    type: SHUFFLE

    - from: "spout"
    to: "status_metrics"
    grouping:
    type: SHUFFLE
    - from: "partitioner"
    to: "fetcher"
    grouping:
    type: FIELDS
    args: ["key"]

    - from: "fetcher"
    to: "sitemap"
    grouping:
    type: LOCAL_OR_SHUFFLE

    - from: "sitemap"
    to: "parse"
    grouping:
    type: LOCAL_OR_SHUFFLE

    - from: "parse"
    to: "index"
    grouping:
    type: LOCAL_OR_SHUFFLE

    - from: "fetcher"
    to: "status"
    grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

    - from: "sitemap"
    to: "status"
    grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"
    - from: "parse"
    to: "status"
    grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

    - from: "index"
    to: "status"
    grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

    - from: "parse"
    to: "redirection_bolt"
    grouping:
    type: LOCAL_OR_SHUFFLE

    - from: "redirection_bolt"
    to: "parser_bolt"
    grouping:
    type: LOCAL_OR_SHUFFLE

    - from: "redirection_bolt"
    to: "index"
    grouping:
    type: LOCAL_OR_SHUFFLE


    - from: "parser_bolt"
    to: "index"
    grouping:
    type: LOCAL_OR_SHUFFLE

    - from: "redirection_bolt"
    to: "parser_bolt"
    grouping:
    type: LOCAL_OR_SHUFFLE
    streamId: "tika"

And crawler-conf.yaml:
    # Custom configuration for StormCrawler
    # This is used to override the default values from crawler-default.xml and provide additional ones
    # for your custom components.
    # Use this file with the parameter -conf when launching your extension of ConfigurableTopology.
    # This file does not contain all the key values but only the most frequently used ones. See crawler-default.xml for an extensive list.

    config:
      topology.workers: 2
      topology.message.timeout.secs: 300
      topology.max.spout.pending: 100
      topology.debug: false

      fetcher.threads.number: 50

      # give 2 GB to the workers
      worker.heap.memory.mb: 2048

      # mandatory when using Flux
      topology.kryo.register:
        - com.digitalpebble.stormcrawler.Metadata

      # metadata to transfer to the outlinks
      # used by Fetcher for redirections, sitemapparser, etc...
      # these are also persisted for the parent document (see below)
      # metadata.transfer:
      # - customMetadataName

      # lists the metadata to persist to storage
      # these are not transferred to the outlinks
      metadata.persist:
        - _redirTo
        - error.cause
        - error.source
        - isSitemap
        - isFeed

      http.agent.name: "Storm Crawler"
      http.agent.version: "1.0"
      http.agent.description: "built with StormCrawler Archetype 1.13"
      http.agent.url: "http://example.com/"
      http.agent.email: "noreply@example"

      # The maximum number of bytes for returned HTTP response bodies.
      # The fetched page will be trimmed to 2 MB in this case.
      # Set -1 to disable the limit.
      http.content.limit: 2000000
      jsoup.treat.non.html.as.error: false

      # FetcherBolt queue dump => comment out to activate
      # if a file exists on the worker machine with the corresponding port number
      # the FetcherBolt will log the content of its internal queues to the logs
      # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"

      parsefilters.config.file: "parsefilters.json"
      urlfilters.config.file: "urlfilters.json"

      # revisit a page every 2 days (value in minutes)
      # set it to -1 to never refetch a page
      fetchInterval.default: 2880

      # revisit a page with a fetch error after 2 hours (value in minutes)
      # set it to -1 to never refetch a page
      fetchInterval.fetch.error: 120

      # never revisit a page with an error (or set a value in minutes)
      ### Currently set to check back in 1 month.
      fetchInterval.error: 40320

      # text extraction for JSoupParserBolt
      textextractor.include.pattern:
        - DIV[id="block-edu-bootstrap-subtheme-content" class="block block-system block-system-main-block"]
        - MAIN[role="main"]
        - DIV[id="content--news"]
        - DIV[id="content--person"]
        - ARTICLE[class="node container node--type-facility facility-full node-101895 node--promoted node--view-mode-full py-5"]
        - ARTICLE[class="node container node--type-spotlight spotlight-full node-90543 node--promoted node--view-mode-full py-5"]
        - DIV[class="field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items"]
        - ARTICLE
        - BODY
        # - DIV[id="maincontent"]
        # - DIV[itemprop="articleBody"]
        # - ARTICLE

      textextractor.exclude.tags:
        - STYLE
        - SCRIPT
        - FOOTER

      # custom fetch interval to be used when a document has the key/value in its metadata
      # and has been fetched successfully (value in minutes)
      # fetchInterval.FETCH_ERROR.isFeed=true: 30
      # fetchInterval.isFeed=true: 10

      # configuration for the classes extending AbstractIndexerBolt
      # indexer.md.filter: "someKey=aValue"
      indexer.url.fieldname: "url"
      indexer.text.fieldname: "content"
      indexer.canonical.name: "canonical"
      indexer.md.mapping:
        - parse.title=title
        - parse.keywords=keywords
        - parse.description=description
        - domain=domain

      # Metrics consumers:
      topology.metrics.consumer.register:
        - class: "org.apache.storm.metric.LoggingMetricsConsumer"
          parallelism.hint: 1
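For reference: since everything here lives under *.example.com, the urlfilters.json referenced above is also worth a look to keep the crawl inside the domain. A minimal sketch, assuming the archetype's default RegexURLFilter backed by a Nutch-style regex rules file (the exact file name is whatever urlfilters.json points at):

    # accept anything under *.example.com (hypothetical domain from this question)
    +^https?://([a-z0-9-]+\.)*example\.com/
    # reject everything else
    -.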

And es-conf.yaml:
    # configuration for Elasticsearch resources

    config:
      # ES indexer bolt
      # addresses can be specified as a full URL
      # if not we assume that the protocol is http and the port 9200
      es.indexer.addresses: "https://example.com:9200"
      es.indexer.index.name: "www-all-index"
      # es.indexer.pipeline: "_PIPELINE_"
      #### Check the document type thoroughly; it needs to match the Elasticsearch index mapping ####
      es.indexer.doc.type: "doc"
      es.indexer.user: "{username}"
      es.indexer.password: "{password}"
      es.indexer.create: false
      #### Change the cluster name ####
      es.indexer.settings:
        cluster.name: "edu-web"

      # ES metricsConsumer
      es.metrics.addresses: "https://example.com:9200"
      es.metrics.index.name: "www-all-metrics"
      #### Check the document type thoroughly; it needs to match the Elasticsearch index mapping ####
      es.metrics.doc.type: "datapoint"
      es.metrics.user: "{username}"
      es.metrics.password: "{password}"
      #### Change the cluster name ####
      es.metrics.settings:
        cluster.name: "edu-web"

      # ES spout and persistence bolt
      es.status.addresses: "https://example.com:9200"
      es.status.index.name: "www-all-status"
      #### Check the document type thoroughly; it needs to match the Elasticsearch index mapping ####
      es.status.doc.type: "status"
      es.status.user: "{username}"
      es.status.password: "{password}"
      # the routing is done on the value of 'partition.url.mode'
      es.status.routing: true
      # stores the value used for the routing as a separate field
      # needed by the spout implementations
      es.status.routing.fieldname: "metadata.hostname"
      es.status.bulkActions: 500
      es.status.flushInterval: "5s"
      es.status.concurrentRequests: 1
      #### Change the cluster name ####
      es.status.settings:
        cluster.name: "edu-web"

      ################
      # spout config #
      ################

      # positive or negative filter parsable by the Lucene Query Parser
      # es.status.filterQuery: "-(metadata.hostname:stormcrawler.net)"

      # time in secs for which the URLs will be considered for fetching after an ack or a fail
      spout.ttl.purgatory: 30

      # min time (in msecs) to allow between 2 successive queries to ES
      spout.min.delay.queries: 1000

      # delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
      # Setting this to -1 or a large value means that ES will cache the results, but also that fewer and fewer results
      # might be returned.
      spout.reset.fetchdate.after: 120

      es.status.max.buckets: 50
      es.status.max.urls.per.bucket: 20
      # field to group the URLs into buckets
      es.status.bucket.field: "metadata.hostname"
      # field to sort the URLs within a bucket
      es.status.bucket.sort.field: "nextFetchDate"
      # field to sort the buckets
      es.status.global.sort.field: "nextFetchDate"

      # CollapsingSpout: limits the deep paging by resetting the start offset for the ES query
      es.status.max.start.offset: 500

      # AggregationSpout: sampling improves the performance on large crawls
      es.status.sample: false

      # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
      # uses it as nextFetchDate
      es.status.recentDate.increase: -1
      es.status.recentDate.min.gap: -1

      topology.metrics.consumer.register:
        - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
          parallelism.hint: 1
          #whitelist:
          #  - "fetcher_counter"
          #  - "fetcher_average.bytes_fetched"
          #blacklist:
          #  - "__receive.*"

    pom.xml:
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

      <modelVersion>4.0.0</modelVersion>
      <groupId>www.all.edu</groupId>
      <artifactId>www-all</artifactId>
      <version>1.0-SNAPSHOT</version>
      <packaging>jar</packaging>

      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <stormcrawler.version>1.13</stormcrawler.version>
      </properties>

      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.2</version>
            <configuration>
              <source>1.8</source>
              <target>1.8</target>
            </configuration>
          </plugin>
          <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <version>1.3.2</version>
            <executions>
              <execution>
                <goals>
                  <goal>exec</goal>
                </goals>
              </execution>
            </executions>
            <configuration>
              <executable>java</executable>
              <includeProjectDependencies>true</includeProjectDependencies>
              <includePluginDependencies>false</includePluginDependencies>
              <classpathScope>compile</classpathScope>
            </configuration>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>1.3.3</version>
            <executions>
              <execution>
                <phase>package</phase>
                <goals>
                  <goal>shade</goal>
                </goals>
                <configuration>
                  <createDependencyReducedPom>false</createDependencyReducedPom>
                  <transformers>
                    <transformer
                        implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                    <transformer
                        implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                      <mainClass>org.apache.storm.flux.Flux</mainClass>
                      <manifestEntries>
                        <Change></Change>
                        <Build-Date></Build-Date>
                      </manifestEntries>
                    </transformer>
                  </transformers>
                  <!-- The filters below are necessary if you want to include the Tika module -->
                  <filters>
                    <filter>
                      <artifact>*:*</artifact>
                      <excludes>
                        <exclude>META-INF/*.SF</exclude>
                        <exclude>META-INF/*.DSA</exclude>
                        <exclude>META-INF/*.RSA</exclude>
                      </excludes>
                    </filter>
                    <filter>
                      <!-- https://issues.apache.org/jira/browse/STORM-2428 -->
                      <artifact>org.apache.storm:flux-core</artifact>
                      <excludes>
                        <exclude>org/apache/commons/**</exclude>
                        <exclude>org/apache/http/**</exclude>
                        <exclude>org/yaml/**</exclude>
                      </excludes>
                    </filter>
                  </filters>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>

      <dependencies>
        <dependency>
          <groupId>com.digitalpebble.stormcrawler</groupId>
          <artifactId>storm-crawler-core</artifactId>
          <version>${stormcrawler.version}</version>
        </dependency>
        <dependency>
          <groupId>com.digitalpebble.stormcrawler</groupId>
          <artifactId>storm-crawler-tika</artifactId>
          <version>${stormcrawler.version}</version>
        </dependency>
        <dependency>
          <groupId>com.digitalpebble.stormcrawler</groupId>
          <artifactId>storm-crawler-elasticsearch</artifactId>
          <version>${stormcrawler.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.storm</groupId>
          <artifactId>storm-core</artifactId>
          <version>1.2.2</version>
          <scope>provided</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.storm</groupId>
          <artifactId>flux-core</artifactId>
          <version>1.2.2</version>
        </dependency>
      </dependencies>
    </project>
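For reference, with this pom the topology would typically be built and submitted along these lines (the jar name follows from the artifactId and version above; --local can replace --remote when testing):

    mvn clean package
    storm jar target/www-all-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote es-crawler.flux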

Best Answer

OK, so you are actually dealing with a fairly small number of distinct hostnames. You could even hold it all on a single ES shard with a single ES spout. The point is that the fetcher enforces politeness per hostname, so the crawl will be relatively slow. You probably don't need more than one FetcherBolt instance either.
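A minimal sketch of that simpler layout, assuming the status index were recreated with a single shard (standard Elasticsearch index-settings API; the index name and address are the ones configured above):

    curl -X PUT "https://example.com:9200/www-all-status" \
      -H 'Content-Type: application/json' \
      -d '{ "settings": { "index": { "number_of_shards": 1 } } }'

with the spout parallelism in es-crawler.flux reduced to match, since the ES spouts run one instance per status-index shard:

    spouts:
      - id: "spout"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
        parallelism: 1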

Since you are crawling your own sites, you can be more aggressive with the crawler and allow several fetch threads to pull from the same hostname at the same time; try setting

    fetcher.threads.per.queue: 25

and also retrieve more URLs from ES on each query:

    es.status.max.urls.per.bucket: 200
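Put together, the two overrides would sit in the config files shown earlier (a sketch; fetcher.threads.per.queue goes in crawler-conf.yaml, the bucket size in es-conf.yaml):

    # crawler-conf.yaml
    config:
      # let up to 25 fetch threads hit the same hostname queue concurrently
      fetcher.threads.per.queue: 25

    # es-conf.yaml
    config:
      # return up to 200 URLs per hostname bucket on each spout query
      es.status.max.urls.per.bucket: 200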

This should make your crawl significantly faster.

BTW: drop me an email if you'd be happy to be listed on https://github.com/DigitalPebble/storm-crawler/wiki/Powered-By?

A note for other readers: this is advisable only when you are crawling your own sites. Being aggressive with third-party websites is both impolite and counterproductive, as you risk getting blacklisted.

Regarding "elasticsearch - Optimal settings for StormCrawler -> Elasticsearch when crawl politeness is not a concern?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55281184/
