python - How to efficiently read a large (30GB+) TAR file with BZ2 JSON twitter files into PostgreSQL

Reposted by: 太空狗. Updated: 2023-10-29 22:22:13

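I am trying to get Twitter data from the archive.org archive and load it into a database. My plan is to first load all tweets for a specific month, then make a selection and stage only the tweets I am interested in (for example by locale or hashtag).

The script below does what I want, but it is very slow. It has been running for about half an hour and has read only ~6 of the 50,000 inner .bz2 files in one TAR file.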

Some stats for an example TAR file:

Total size: ~30-40 GB
Number of inner .bz2 files (arranged in folders): 50,000
Size of one .bz2 file: ~600 KB
Size of one extracted JSON file: ~5 MB, ~3600 tweets.

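What should I look at when optimizing this process for speed?

Should I extract the files to disk instead of buffering them in Python?
Should I consider multithreading for part of the process? Which part of the process would benefit most from it?
Or is the speed I am currently getting relatively normal for a script like this?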

The script currently uses ~3% CPU and ~6% RAM.

Any help is greatly appreciated.

import bz2
import datetime
import json
import tarfile

import dataset  # Using dataset as I'm still iteratively developing the table structure(s)


def scrape_tar_contents(filename):
    """Iterates over an input TAR filename, retrieving each .bz2 container:
       extracts & retrieves JSON contents; stores JSON contents in a postgreSQL database"""
    tar = tarfile.open(filename, 'r')
    inner_files = [name for name in tar.getnames() if name.endswith('.bz2')]

    num_bz2_files = len(inner_files)
    print('Starting work on file... ' + filename[-20:])
    # Loop over all files in the TAR archive
    for bz2_count, bz2_filename in enumerate(inner_files, start=1):
        print('Starting work on inner file... ' + bz2_filename[-20:] + ': ' + str(bz2_count) + '/' + str(num_bz2_files))
        t_extract = tar.extractfile(bz2_filename)
        data = t_extract.read()
        txt = bz2.decompress(data).decode('utf-8')  # bytes -> str on Python 3

        tweet_errors = 0
        lines = txt.split('\n')
        num_lines = len(lines)
        # Loop over the lines in the resulting text file.
        for current_line, line in enumerate(lines, start=1):
            if current_line % 100 == 0:
                print('Working on line ' + str(current_line) + '/' + str(num_lines))
            try:
                tweet = json.loads(line)
            except ValueError as e:
                error_log = {'Date_time': datetime.datetime.now(),
                             'File_TAR': filename,
                             'File_BZ2': bz2_filename,
                             'Line_number': current_line,
                             'Line': line,
                             'Error': str(e)}
                tweet_errors += 1
                db['error_log'].upsert(error_log, ['File_TAR', 'File_BZ2', 'Line_number'])
                print('Error occurred, now at ' + str(tweet_errors))
                continue  # No parsed tweet to store for this line
            try:
                row = {'tweet_id': tweet['id'],
                       'tweet_text': tweet['text'],
                       'tweet_locale': tweet['lang'],
                       'created_at_str': tweet['created_at'],
                       'date_loaded': datetime.datetime.now(),
                       'tweet_json': tweet}
                db['tweets'].upsert(row, ['tweet_id'])
            except KeyError as e:
                error_log = {'Date_time': datetime.datetime.now(),
                             'File_TAR': filename,
                             'File_BZ2': bz2_filename,
                             'Line_number': current_line,
                             'Line': line,
                             'Error': str(e)}
                tweet_errors += 1
                db['error_log'].upsert(error_log, ['File_TAR', 'File_BZ2', 'Line_number'])
                print('Error occurred, now at ' + str(tweet_errors))
                continue

if __name__ == "__main__":
    with open("postgresConnecString.txt", 'r') as f:
        db_connectionstring = f.readline()
    db = dataset.connect(db_connectionstring)

    filename = r'H:/Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar'
    scrape_tar_contents(filename)

Best answer

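A tar file does not contain an index of where its files are located. Moreover, a tar file can contain more than one copy of the same file. Therefore, when you extract one file, the entire tar file must be read. Even after the file has been found, the rest of the tar file must still be read to check whether a later copy exists.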

This makes extracting one file as expensive as extracting all of them.
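
The duplicate-copy point can be checked with a small sketch that uses only Python's standard tarfile module. The archive is built in memory, and the member name a.txt and its contents are made up for illustration. It shows that one archive can hold two members with the same name, and that looking a member up by name returns the later copy, so a reader cannot stop at the first match.

```python
import io
import tarfile


def build_tar_with_duplicate(name="a.txt"):
    """Build an in-memory tar archive holding two members with the same name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for payload in (b"first copy", b"second copy"):
            info = tarfile.TarInfo(name=name)  # hypothetical member name
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def inspect_archive(buf, name="a.txt"):
    """Return the list of member names and the content found when looking up `name`."""
    with tarfile.open(fileobj=buf, mode="r") as tar:
        names = tar.getnames()                  # scans every header in the archive
        content = tar.extractfile(name).read()  # lookup by name picks the last occurrence
    return names, content


names, content = inspect_archive(build_tar_with_duplicate())
print(names)
print(content)
```

Both copies appear in the name list, and the content returned is that of the second copy.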

Therefore, never use tar.extractfile(...) on a large tar file (unless you need only one file or do not have enough space to extract everything).

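If you have enough space (and given the size of modern hard drives, you almost certainly do), extract everything, either with tar.extractall or with a system call to tar xf ..., and then process the extracted files.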
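
A minimal sketch of that approach, under some assumptions: the function names extract_archive and iter_tweets are hypothetical, each .bz2 file is assumed to hold one UTF-8 JSON document per line (as the question describes), and database loading is left out. The filter="data" argument is only passed when the running Python version provides it; in any case, extract only archives you trust.

```python
import bz2
import json
import os
import tarfile


def extract_archive(tar_path, dest_dir):
    """Unpack the whole archive in a single sequential pass."""
    with tarfile.open(tar_path, "r") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: use the safer "data" extraction filter.
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)


def iter_tweets(root_dir):
    """Yield (path, line_number, tweet) for each parseable JSON line
    in every .bz2 file found under root_dir."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for fname in sorted(filenames):
            if not fname.endswith(".bz2"):
                continue
            path = os.path.join(dirpath, fname)
            # bz2.open in text mode decompresses and decodes line by line.
            with bz2.open(path, "rt", encoding="utf-8") as fh:
                for line_number, line in enumerate(fh, start=1):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        tweet = json.loads(line)
                    except ValueError:
                        continue  # a real loader would log this line instead
                    yield path, line_number, tweet
```

A caller would run extract_archive once per TAR file and then loop over iter_tweets(dest_dir), handing each tweet to whatever database code it already has.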

Regarding python - How to efficiently read a large (30GB+) TAR file with BZ2 JSON twitter files into PostgreSQL, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27838842/
