python - How can I optimize searching large TSV files across two tuples in Python?


Reposted by: 太空宇宙 · Updated: 2023-11-03 18:56:08


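Hello. I'm new to Python and have been trying to find matching tuple elements across two separate tuples. The files I'm working with have up to 3M lines, and the search is very slow. I've been reading posts for weeks but can't seem to piece the code together correctly. Here is what I have so far (the data has been edited and simplified for clarity). For example, I have: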

authList = ('jennifer', 35, 20), ('john', 20, 34), ('fred', 34, 89)  # this is a tuple of
# unique tweet authors with their x, y coordinates exported from MS Access in the form
# of a txt file.

rtAuthors = ('larry', 57, 24, 'simon'), ('jeremy', 24, 15, 'john'), ('sandra', 39, 24, 'fred')
# this is a tuple of tuples including the author, their x, y coordinates, and the
# author whom they are retweeting (taken from the "RT @" portion of their tweet)

I'm trying to create a new tuple (rtAuthList) that pulls the x, y coordinates from authList for every retweeted author that appears in rtAuthors.

So I would end up with a new tuple like this:

 rtAuthList = ('jeremy', 24, 15, 'john', 20, 34), ('sandra', 39, 24, 'fred', 34, 89)

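My question really has two parts, so I wasn't sure whether to post two questions or rename this one to cover both. First, the way I've written it, this process takes about an hour to run. There must be a faster way.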

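The other part of my question is why it only completes about half of the final tuple. With my current dataset, after the first two steps there are about 250,000 rows in authList and 500,000 rows in rtAuthors. But when I run step 3 and open rtAuthList at the end, it has only covered the first 10 days of my data and ignored the last 20 days (I'm working with one month of tweets). I don't know why it isn't checking the whole rtAuthors list.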

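I've included my entire code below so you can see what I'm trying to do, but it is step 3, after the authList and rtAuthors tuples have been created, where I really need help. Please understand that I'm completely new to programming, so write your answer as if I know nothing, although that is probably obvious when you look at my code.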

import csv
import sys
import os

authors= ""

class TwitterFields:             ### associated with monthly tweets from Twitter API
    def __init__(self, ID, COORD1, COORD2,TIME, AUTH, TEXT): 
        self.ID = ID
        self.COORD1 = COORD1
        self.COORD2 = COORD2
        self.TIME = TIME
        self.AUTH=AUTH
        self.TEXT=TEXT
        self.RTAUTH=""
        self.RTX=""
        self.RTY=""

        description="Twitter Data Class: holds twitter data fields from API "
        author=""

class AuthorFields:             ## associated with the txt file exported from MS Access
    def __init__(self, AUTH, COORD1, COORD2):
        self.AUTH=AUTH
        self.COORD1 = COORD1
        self.COORD2 = COORD2
        self.RTAUTH=""
        self.RTX=""
        self.RTY=""

        description="Author Data Class: holds author data fields from MS Access export"
        author=""


tw = [] #empty list to hold data from class AuthorFields (MS Access author export)
rt = [] #empty list to hold data from class TwitterFields (monthly tweets)


authList = ()        ## tuple for holding auth, x, and y from tw list
rtAuthors = ()      ## tuple for holding tuples from rt where "RT @" is in tweet text
rtAuthList =()      ## tuple for holding results of set intersection 

e = ()                  # tuple for authList
b=()                    # tuple for rtAuthors
c=()                    # tuple for rtAuthList
bad_data = []      #A container for bad data 

with open(r'C:\Users\Amy\Desktop\Code\Merge2.txt') as g:   #open MS Access export file
    for line in g:                                             
        strLine = line.rstrip('\r\n').split("\t")
        tw.append(AuthorFields( str(strLine[0]),   #reads author name       
                                 strLine[1],       # x coordinate
                                 strLine[2]))      # y coordinate


## Step 1 ##
# Loop through the unique author dataset (tw) and make a list of all authors,x, y
try:
    for i in range(1, len(tw)): 
                e=((tw[i].AUTH[:tw[i].AUTH.index(" (")], tw[i].COORD1,tw[i].COORD2))
                authList = authList +(e,)
except:
    bad_data.append(i)

print "length of authList = ", len(authList)    


# Loop through tweet txt file from MS Access 

with open(r'C:\Users\Amy\Desktop\Code\Syria_2012_08UTCedits3.txt') as f:
    for line in f:
        strLine=line.rstrip('\r\n').split('\t') # parse each line for tab spaces
        rt.append(TwitterFields(str(strLine[0]) ,      #reads tweet ID              
                              strLine[5],                         # x coordinate
                              strLine[6],                         # y coordinate
                              strLine[8],                         # time stamp
                              strLine[9],                         # author
                              strLine[12] ))                    # tweet text

## Step 2 ##
## Loop through new list (rt) to find all instances of "RT @" and retrieve author name

for i in range(1, len(rt)):        # creates tuple of (authors, x, y, rtauth, rtx, rty)
    if (rt[i].TEXT[:4] == 'RT @'): # finds author in tweet text between "RT @" and ":"
            end = rt[i].TEXT.find(":")
            rt[i].RTAUTH=rt[i].TEXT[4:end]
            b = ((rt[i].AUTH, rt[i].COORD1, rt[i].COORD2, rt[i].TIME, rt[i].RTAUTH))
            rtAuthors = rtAuthors + (b,)
    else:
        pass

print "length of rtAuthors = ", len(rtAuthors)


## Step 3 ##

## Loop through new rtAuthors tuple and find where rt[i].RTAUTH matches tw[i].AUTH in
## authList.


set1 = set(k[4] for k in rtAuthors).intersection(x[0] for x in authList)
#e = iter(set1).next()
set2 = list(set1)


print "Length of first set = ", len(set2)

# For each match, grab the x and y from authList and copy to rt[i].RTX and rt[i].RTY

for i in range(1, len(rtAuthors)):
    if rt[i].RTAUTH in set2:
        authListIndex = [x[0] for x in authList].index(rt[i].RTAUTH) #get record # 
        rt[i].RTX= authList[authListIndex][1] # grab the x 
        rt[i].RTY = authList[authListIndex][2] # grab the y
        c = ((rt[i].AUTH, rt[i].COORD1, rt[i].COORD2, rt[i].TIME, rt[i].RTAUTH,
        rt[i].RTX, rt[i].RTY))
        rtAuthList = rtAuthList + (c,)   # create new tuple of tuples with matches

else:
    pass

print "length of rtAuthList = ", len(rtAuthList)

Best answer

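In step 3 you are using an O(n²) algorithm to match the tuples: for every record, the code rebuilds the list of author names and scans it with .index(). If you build a lookup dictionary for authList, you can do the matching in O(n) time: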

>>> authList = ('jennifer', 35, 20), ('john', 20, 34), ('fred', 34, 89)
>>> rtAuthors = ('larry', 57, 24, 'simon'), ('jeremy', 24, 15, 'john'), ('sandra', 39, 24, 'fred')
>>> authDict = {t[0]: t[1:] for t in authList}
>>> rtAuthList = [t + authDict[t[-1]] for t in rtAuthors if t[-1] in authDict]
>>> print rtAuthList
[('jeremy', 24, 15, 'john', 20, 34), ('sandra', 39, 24, 'fred', 34, 89)]
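
To connect the answer back to the question's real data, here is a minimal sketch of the same dictionary lookup applied to records shaped like the question's rtAuthors tuples, which carry a time stamp before the retweeted author: (author, x, y, time, retweeted_author). The helper name join_retweets and the sample values (including the time stamps) are hypothetical, and the code is written for Python 3, whereas the question uses Python 2 print statements.

```python
def join_retweets(auth_list, rt_authors):
    """Append the retweeted author's x, y to each retweet record.

    auth_list:  iterable of (author, x, y)
    rt_authors: iterable of (author, x, y, time, retweeted_author)
    Records whose retweeted author is not in auth_list are skipped.
    """
    # One pass to build the lookup table: author -> (x, y).
    # If an author appears more than once, the last entry wins.
    auth_dict = {name: (x, y) for name, x, y in auth_list}

    joined = []
    for record in rt_authors:
        coords = auth_dict.get(record[-1])  # average O(1) lookup
        if coords is not None:
            joined.append(record + coords)  # list append, no tuple copying
    return joined


# Hypothetical sample data in the question's record shape.
auth_list = [("jennifer", 35, 20), ("john", 20, 34), ("fred", 34, 89)]
rt_authors = [
    ("larry", 57, 24, "2012-08-01 10:00", "simon"),
    ("jeremy", 24, 15, "2012-08-02 11:30", "john"),
    ("sandra", 39, 24, "2012-08-03 09:15", "fred"),
]

rt_auth_list = join_retweets(auth_list, rt_authors)
for row in rt_auth_list:
    print(row)
```

Steps 1 and 2 in the question also grow their tuples with authList = authList + (e,), which copies the whole tuple on every iteration. Collecting results in a list with append, as the sketch does, avoids that repeated copying.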

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/17219831/
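
On the second part of the question (why only about half of the results appear), one likely cause is visible in step 3: the loop runs over range(1, len(rtAuthors)) but indexes rt[i], the full tweet list. Since rtAuthors holds only the retweets, it is shorter than rt, so the loop stops after the first len(rtAuthors) tweets and never reaches the later ones. Starting each range at 1 also skips index 0 of every list. The toy example below reproduces the effect with made-up data in Python 3 syntax; the names tweets, retweets, and retweeted_author are illustrative and not from the original code.

```python
# Made-up stand-in for the question's `rt` list: (author, text) per tweet,
# in chronological order. Every second tweet is a retweet.
tweets = [
    ("amy", "hello"),
    ("bob", "RT @john: hi"),
    ("cat", "lunch"),
    ("dan", "RT @fred: news"),
    ("eve", "rain"),
    ("fay", "RT @john: later"),
]


def retweeted_author(text):
    """Return the name between 'RT @' and ':', or None if not a retweet."""
    if not text.startswith("RT @"):
        return None
    end = text.find(":")
    return text[4:end] if end != -1 else None


# Stand-in for the question's `rtAuthors`: only the retweets.
retweets = [t for t in tweets if retweeted_author(t[1]) is not None]

# Pattern from step 3: the range comes from the shorter list (retweets),
# but the index is applied to the longer one (tweets).
seen_by_index = [
    tweets[i][0]
    for i in range(1, len(retweets))
    if retweeted_author(tweets[i][1]) is not None
]

# Iterating over the retweet records themselves visits all of them.
seen_directly = [author for author, _ in retweets]

print(seen_by_index)   # only the retweets that fall early in `tweets`
print(seen_directly)   # every retweet
```

With this data the indexed loop finds one retweet while direct iteration finds all three, which mirrors the "first 10 days only" symptom: the later part of the month is never indexed.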
