python - Operating on large data sets

Reposted by: 太空狗 | Updated: 2023-10-30 01:37:56

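I have to run some analysis on PSL records that describe DNA sequence fragments. Basically, I have to find entries that come from the same read and fall within the same contig (both are values in a PSL entry). The problem is that the PSL records are large (10-30 MB text files). I wrote a program that handles both short and long records if given enough time, but it takes far longer than allowed. I was told the program should not take more than 15 seconds. Mine took more than 15 minutes.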

The PSL records look like this:

275 11  0   0   0   0   0   0   -   M02034:35:000000000-A7UU0:1:1101:19443:1992/2   286 0   286 NODE_406138_length_13407_cov_13.425076  13465   408 694 1   286,    0,  408,

171 5   0   0   0   0   0   0   +   M02034:35:000000000-A7UU0:1:1101:13497:2001/2   294 0   176 NODE_500869_length_34598_cov_30.643419  34656   34334   34510   1   176,    0,  34334,

188 14  0   10  0   0   0   0   +   M02034:35:000000000-A7UU0:1:1101:18225:2002/1   257 45  257 NODE_455027_length_12018_cov_13.759444  12076   11322   11534   1   212,    45, 11322,
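For orientation, here is a minimal sketch of which columns in the records above matter for this problem. It assumes the standard 21-column PSL layout; the helper name `parse_psl_line` and the dictionary keys are made up for illustration and are not part of the original program.

```python
# Sketch: pull out the PSL columns used in this question.
# Column positions assume the standard 21-column PSL layout.
def parse_psl_line(line):
    """Return the fields of interest as a dict, or None for short lines."""
    cols = line.split()
    if len(cols) < 17:
        return None
    q_name = cols[9]
    return {
        "strand": cols[8],
        "read": q_name[:-1],      # read name without the trailing 1 or 2
        "mate": q_name[-1],       # '1' or '2'
        "contig": cols[13],
        "tSize": int(cols[14]),
        "tStart": int(cols[15]),
    }

sample = ("275 11 0 0 0 0 0 0 - M02034:35:000000000-A7UU0:1:1101:19443:1992/2 "
          "286 0 286 NODE_406138_length_13407_cov_13.425076 13465 408 694 1 "
          "286, 0, 408,")
record = parse_psl_line(sample)
print(record["read"], record["mate"], record["contig"])
```

Two entries belong to the same pair when their `read` values match, and they support the same contig when their `contig` values match as well.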

My code looks like this:

import sys
class PSLreader :
    '''
    Class to provide reading of a file containing psl alignments
    formatted sequences:
    object instantiation:
    myPSLreader = PSLreader(<file name>):

    object attributes:
    fname: the initial file name

    methods:
    readPSL() : reads psl file, yielding those alignments that are within the first or last
                1000 nt

    readPSLpairs() : yields psl pairs that support a circular hypothesis 

    Author: David Bernick
    Date: May 12, 2013
    '''

    def __init__ (self, fname=''):
        '''constructor: saves attribute fname '''

        self.fname = fname

    def doOpen (self):
        if self.fname == '':
            return sys.stdin
        else:
            return open(self.fname)

    def readPSL (self):
        '''
        using filename given in init, returns each filtered psl records
        that contain alignments that are within the terminal 1000nt of
        the target. Incomplete psl records are discarded.
        If filename was not provided, stdin is used.

        This method selects for alignments that could be part of a
        circle.

        Illumina pairs aligned to the top strand would have read1(+) and read2(-).
        For the bottom strand, read1(-) and read2(+).

        For potential circularity,
        these are the conditions that can support circularity:
        read1(+) near the 3' terminus
        read1(-) near the 5' terminus
        read2(-) near the 5' terminus
        read2(+) near the 3' terminus

        so...
        any read(+) near the 3', or
        any read(-) near the 5'

        '''

        nearEnd = 1000   # this constant determines "near the end"
        with self.doOpen() as fileH:

            for line in fileH:
                pslList = line.split()
                if len(pslList) < 17:
                    continue
                tSize = int(pslList[14])
                tStart = int(pslList[15])
                strand = str(pslList[8])

                if strand.startswith('+') and (tSize - tStart > nearEnd):
                    continue
                elif strand.startswith('-') and (tStart > nearEnd):
                    continue

                yield line

    def readPSLpairs (self):
        read1 = []
        read2 = []

        for psl in self.readPSL():
            parsed_psl = psl.split()
            strand = parsed_psl[9][-1]
            if strand == '1':
                read1.append(parsed_psl)
            elif strand == '2':
                read2.append(parsed_psl)

        output = {}
        for psl1 in read1:
            name1 = psl1[9][:-1]
            contig1 = psl1[13]
            for psl2 in read2:
                name2 = psl2[9][:-1]
                contig2 = psl2[13]
                if  name1 == name2 and contig1 == contig2:
                    # count this pair once, then stop scanning read2
                    output[contig1] = output.get(contig1, 0) + 1
                    break

        print(output)


PSL_obj = PSLreader('EEV14-Vf.filtered.psl')
PSL_obj.readPSLpairs()

I was given some sample code, shown below:

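def doSomethingPairwise (a):
    # a[1] holds the group's items numbered 1, a[2] those numbered 2
    for leftItem in a[1]:
        for rightItem in a[2]:
            # compare values with ==, not object identity with "is"
            if leftItem[1] == rightItem[1]:
                print (a)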
thisStream = [['David', 'guitar', 1], ['David', 'guitar', 2],
['John', 'violin', 1], ['John', 'oboe', 2],
['Patrick', 'theremin', 1], ['Patrick', 'lute',2] ]
thisGroup = None
thisGroupList = [ [], [], [] ]

for name, instrument, num in thisStream:
    if name != thisGroup:

        doSomethingPairwise(thisGroupList)

        thisGroup = name
        thisGroupList = [ [], [], [] ]

    thisGroupList[num].append([name, instrument, num])
doSomethingPairwise(thisGroupList)

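But when I tried to implement it, my program still took a long time. Am I thinking about this the wrong way? I realize nested loops are slow, but I don't see an alternative.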

Edit: I figured it out. The data is pre-sorted, which makes my brute-force solution very impractical and unnecessary.
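Since the data is pre-sorted, one way to exploit that is `itertools.groupby`, which only merges consecutive items and therefore needs a single pass. The following is a sketch, not the original solution: it assumes that all entries for one read are adjacent in the file, that the input has already been reduced to `(qName, tName)` tuples, and it counts each read/contig combination once. The name `count_pairs` and the demo data are hypothetical.

```python
from collections import Counter
from itertools import groupby

def count_pairs(records):
    """records: iterable of (qName, tName) tuples, already sorted so that
    all entries for one read are adjacent (an assumption of this sketch)."""
    counts = Counter()
    # groupby only merges *consecutive* items with equal keys,
    # so one pass over the stream is enough
    for read, group in groupby(records, key=lambda r: r[0][:-1]):
        contigs = {"1": set(), "2": set()}
        for q_name, t_name in group:
            mate = q_name[-1]
            if mate in contigs:
                contigs[mate].add(t_name)
        # contigs hit by both mates of this read
        for contig in contigs["1"] & contigs["2"]:
            counts[contig] += 1
    return counts

# made-up records, for illustration only
demo = [
    ("readA/1", "NODE_1"),
    ("readA/2", "NODE_1"),
    ("readB/1", "NODE_1"),
    ("readB/2", "NODE_2"),
    ("readC/1", "NODE_2"),
    ("readC/2", "NODE_2"),
]
pair_counts = count_pairs(demo)
print(dict(pair_counts))
```

Only the entries of one read are held in memory at a time, so the work grows with the number of lines rather than with the product of the two read lists.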

Best answer

I hope this helps; this question would really benefit from a sample input file.

import sys

#it is better to create a PSLRecord class
class PSLRecord:
  def __init__(self, line):
    pslList = line.split()
    properties = ("matches", "misMatches", "repMatches", "nCount",
                 "qNumInsert", "qBaseInsert", "tNumInsert",
                 "tBaseInsert", "strand", "qName", "qSize", "qStart",
                 "qEnd", "tName", "tSize", "tStart", "tEnd", "blockCount",
                 "blockSizes", "qStarts", "tStarts")
    self.__dict__.update(dict(zip(properties, pslList)))

class PSLreader :
  def __init__ (self, fname=''):
    self.fname = fname

  def doOpen (self):
    if self.fname == '':
      return sys.stdin
    else:
      return open(self.fname)

  def readPSL (self):
    with self.doOpen() as fileH:
      for line in fileH:
        #skip blank or incomplete records, as the OP's reader does
        if len(line.split()) < 17:
          continue
        pslrc = PSLRecord(line)
        yield pslrc

  #return a dictionary with all psl records grouped by qName and tName
  def readPSLpairs (self):
    dictpsl = {}
    for pslrc in self.readPSL():
      #OP requirement: strip the trailing '1' or '2' with pslrc.qName[:-1]
      key = (pslrc.qName[:-1], pslrc.tName)
      if key not in dictpsl:
        dictpsl[key] = []
      dictpsl[key].append(pslrc)
    return dictpsl

#the filter function is better kept outside the class and self-contained
def f_filter(pslrec, nearEnd = 1000):
  if (pslrec.strand.startswith('+') and  
     (int(pslrec.tSize) - int(pslrec.tStart) > nearEnd)):
    return False
  if (pslrec.strand.startswith('-') and 
     (int(pslrec.tStart) > nearEnd)):
    return False
  return True

PSL_obj = PSLreader('EEV14-Vf.filtered.psl')

#read dictionary of pairs
dictpsl = PSL_obj.readPSLpairs()

from itertools import product
#product from itertools
#(1) x (2,3) = (1,2),(1,3)

output = {}
for key, v in dictpsl.items():
  name, contig = key
  #filtered alignments from read 1 (qName ends in '1')
  strand_princ = [pslrec for pslrec in v if f_filter(pslrec) and
                 pslrec.qName[-1] == '1']
  #filtered alignments from read 2 (qName ends in '2')
  strand_sec = [pslrec for pslrec in v if f_filter(pslrec) and
               pslrec.qName[-1] == '2']
  for pslrec_princ, pslrec_sec in product(strand_princ, strand_sec):
    #far fewer comparisons here, because the records were grouped first
    #start from 0 so that the first pair counts as 1, not 2
    output[contig] = output.get(contig, 0) + 1

print(output)

Note: if you ask me, 10-30 MB is not a large file.
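The file size matters less than how often each entry is compared. As a hedged illustration with made-up keys (the function names `pairs_nested` and `pairs_indexed` are hypothetical), the two functions below return the same pair count, but the first compares every read-1 entry with every read-2 entry, while the second builds a dictionary once and then does one lookup per entry.

```python
from collections import defaultdict

def pairs_nested(read1, read2):
    # every read-1 key is compared with every read-2 key: O(n * m)
    return sum(1 for a in read1 for b in read2 if a == b)

def pairs_indexed(read1, read2):
    # index read 2 once, then each read-1 key is one dictionary lookup: O(n + m)
    index = defaultdict(int)
    for key in read2:
        index[key] += 1
    return sum(index.get(key, 0) for key in read1)

# synthetic (read name, contig) keys, for illustration only
read1_keys = [("read%d" % i, "NODE_%d" % (i % 7)) for i in range(300)]
read2_keys = [("read%d" % i, "NODE_%d" % (i % 5)) for i in range(300)]

nested_total = pairs_nested(read1_keys, read2_keys)
indexed_total = pairs_indexed(read1_keys, read2_keys)
print(nested_total == indexed_total)
```

With a few hundred thousand alignments per list, the nested version performs tens of billions of comparisons, which is where the minutes go.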

Regarding python - operating on large data sets, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30448434/
