
python - How to avoid re-downloading media to S3 in Scrapy?


I asked a similar question before ( How does Scrapy avoid re-downloading media that was downloaded recently? ), but since I didn't get a definitive answer, I'll ask again.

I use Scrapy's Files Pipeline to download a large number of files to an AWS S3 bucket. According to the documentation ( https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images ), this pipeline avoids "re-downloading media that was downloaded recently", but it does not say how long ago "recently" is or how to set this parameter.

Looking at the implementation of the FilesPipeline class at https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py , it appears this comes from the FILES_EXPIRES setting, with a default of 90 days:

class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading

    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.

    `new` files are those that pipeline never processed and needs to be
    downloaded from supplier site the first time.

    `uptodate` files are the ones that the pipeline processed and are still
    valid files.

    `expired` files are those that pipeline already processed but the last
    modification was made long time ago, so a reprocessing is recommended to
    refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download

Is my understanding correct? Also, I don't see a boolean check similar to age_days in the S3FilesStore class; is the age check also implemented for files on S3? (I also couldn't find any tests exercising this age-check feature for S3.)

Best Answer

FILES_EXPIRES is indeed the setting that tells the FilesPipeline how "old" a file can be before it gets downloaded (again).
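For concreteness, a minimal settings.py sketch: the plain FILES_EXPIRES key is a standard Scrapy setting; the class-prefixed variant is inferred from the resolve('FILES_EXPIRES') lookup in the __init__ quoted above, so treat that part as an assumption:

# settings.py
FILES_EXPIRES = 180  # days; stored files older than this are re-downloaded

# Inferred from the _key_for_pipe/resolve call above: a subclass such as
# `class MyFilesPipeline(FilesPipeline)` should be targetable with a
# class-name-prefixed key that takes precedence over the plain one.
MYFILESPIPELINE_FILES_EXPIRES = 7  # hypothetical subclass-specific override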

The crucial part of the code is in media_to_download: the _onsuccess callback checks the result of the pipeline's self.store.stat_file call, and for your question it specifically looks at the "last_modified" entry. If the last modification is older than the "expires days", the download is triggered (the callback returns None, which forces a re-download).
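As a standalone illustration of that check (made-up values, not pipeline code):

import time

# Suppose stat_file reported that the stored copy was last modified 100 days ago:
last_modified = time.time() - 100 * 24 * 60 * 60
age_days = (time.time() - last_modified) / 60 / 60 / 24
print(age_days > 90)  # True with the default FILES_EXPIRES of 90, so
                      # _onsuccess returns None and the file is re-downloaded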

You can look at how the S3store gets the "last modified" information (S3FilesStore lives in the same scrapy/pipelines/files.py linked above). It depends on botocore being available.
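For reference, a paraphrased sketch of the botocore path of S3FilesStore.stat_file at the time of writing (reconstructed from the linked source, not a verbatim copy; check GitHub for the authoritative version):

import time

def stat_file(self, path, info):
    # HEAD the object on S3 and turn its metadata into the dict that
    # FilesPipeline.media_to_download's _onsuccess callback expects.
    def _onsuccess(boto_key):
        checksum = boto_key['ETag'].strip('"')    # S3 ETag doubles as the checksum
        last_modified = boto_key['LastModified']  # datetime from the HEAD response
        modified_stamp = time.mktime(last_modified.timetuple())
        return {'checksum': checksum, 'last_modified': modified_stamp}

    return self._get_boto_key(path).addCallback(_onsuccess)

In other words, the age comparison itself lives only in FilesPipeline.media_to_download; S3FilesStore just has to report a last_modified timestamp, which is why there is no separate age_days check (or dedicated age test) in the S3 store class.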

Regarding "python - How to avoid re-downloading media to S3 in Scrapy?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44824013/
