python - Reducing the number of hits in a tabular BLAST output file with Python


Reposted by: 行者123 Updated: 2023-11-30 22:54:38


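I have a large BLAST file in tabular format in which the number of target sequences was not limited, so parsing it takes a long time. I want to reduce the number of hits for each query sequence to the top 10. My Python is basic, but this is what I have so far: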

import sys

blastfile = open(sys.argv[1],"r")

column1list=[]

for line in blastfile:
    b = line.split()[0]
    column1list.append(b)

uniqcolumn1 = list(set(column1list))

counter = 0

for val in uniqcolumn1:
    #print val
    for line in blastfile:
        #print line
        while counter <= 10:
            if line.startswith(val):
                print line
                counter =+ 1

Here is an example line from the BLAST output file. The name of the query sequence is in the first column, in this case "c8208_g1_i2":

c8208_g1_i2 gi|851252702|ref|WP_048131971.1|    79.30   797 165 0   4881    2491    1   797 0.0 1336    acetyl-CoA decarbonylase/synthase complex subunit alpha [Methanosaeta concilii]

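I think the first part of the code works fine, up to 'uniqcolumn1 = list(set(column1list))', but after that I can't get it to print the first ten lines that start with each string in the list.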

Best answer

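The problem here seems to be that you are iterating over the file object twice. In Python, a file object works much like a pointer that advances as each line is read. If you don't move the pointer back, there is nothing left to read.

What you need to do is use the .seek method to move this pointer back to the start. For example, suppose you have file_to_read.txt and python_script.py.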

file_to_read.txt

Hello! My name is Bob and I can't think of anything to
put in this file so I'm blabbering on about nonsense
in hopes that you won't realise that this text is not
important but the code in the actually file, though I
think that you wouldn't mind reading this long file.

python_script.py

f = open("file_to_read.txt", "r")
for line in f: print line
for line in f: print line

If you run this code (and no error about directories occurs), file_to_read.txt is printed only once. To fix this, just add f.seek(0, 0) between the reads. For example:

f = open("file_to_read.txt", "r")
for line in f: print line
f.seek(0, 0)
for line in f: print line
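
The same behaviour can be checked without preparing a file by hand. The following is a minimal sketch written for Python 3 (unlike the Python 2 snippets in this answer); the temporary file and the variable names first_pass, second_pass and third_pass are assumptions made up for the illustration.

```python
import os
import tempfile

# Write a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

with open(path, "r") as f:
    first_pass = [line for line in f]    # reads every line
    second_pass = [line for line in f]   # position is at the end: nothing left
    f.seek(0)                            # move back to the start of the file
    third_pass = [line for line in f]    # reads every line again

os.remove(path)

print(len(first_pass), len(second_pass), len(third_pass))
```

The second pass comes back empty because the file position is already at the end; only after seek(0) does iteration return the lines again.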

Now, back to your context, you can see how this applies to your code:

import sys
# Here is your reading of file
blastfile = open(sys.argv[1],"r")
column1list = []
# Here is the first time you read the file
for line in blastfile:
    b = line.split()[0]
    column1list.append(b)
# Add a line to move back to the start before the
# next reading
blastfile.seek(0, 0)

uniqcolumn1 = list(set(column1list))

for val in uniqcolumn1:
    # Move the counter inside to reset it for every query ID
    counter = 0
    # Here is the second time you read your file
    for line in blastfile:
        # An if instead of a while: a while loop here never ends
        # for a line that does not start with val
        if line.startswith(val) and counter < 10:
            print line,
            counter += 1
    # Since you are going to read the file the next iteration,
    # .seek the file
    blastfile.seek(0, 0)

Edit

Here is the second half of the code, fixed. You can do it like this:

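for val in uniqcolumn1:
    # Move the counter in
    counter = 0
    # Move the while loop out and read one line at a time
    while counter < 10:
        line = blastfile.readline()
        if not line:
            # End of file reached before 10 hits were found
            break
        if line.startswith(val):
            print line,
            counter += 1
    blastfile.seek(0, 0)

The advantage of this is that the loop terminates early: it stops once ten hits have been printed instead of reading the whole file.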

Another way is to use this:

for val in uniqcolumn1:
    # Move counter in
    counter = 0
    # Remove while statement
    for line in blastfile:
        # Add additional condition to if statement
        # (counter < 10 prints exactly ten lines; <= 10 would print eleven)
        if line.startswith(val) and counter < 10:
            print line,
            counter += 1
        elif counter >= 10:
            break
    blastfile.seek(0, 0)

The advantage of this is that it looks simpler.
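
All of the variants above re-read the file once per query ID. As an alternative that is not part of the original answer, the sketch below (Python 3) makes a single pass and counts how many lines have already been kept for each query. The function name limit_hits, its max_hits parameter, and the sample lines are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def limit_hits(lines, max_hits=10):
    """Yield at most max_hits lines per query ID (the first column)."""
    seen = defaultdict(int)  # query ID -> number of lines already yielded
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        query = fields[0]
        if seen[query] < max_hits:
            seen[query] += 1
            yield line

# Tiny made-up input: three hits for q1 and one hit for q10.
sample = [
    "q1 hitA 99.0\n",
    "q1 hitB 98.0\n",
    "q10 hitC 97.0\n",
    "q1 hitD 96.0\n",
]
kept = list(limit_hits(sample, max_hits=2))
print("".join(kept), end="")
```

With a real file, something like sys.stdout.writelines(limit_hits(open(sys.argv[1]))) should behave the same way, keeping the first ten lines per query in file order. Comparing the whole first column rather than using startswith also avoids treating "c8208_g1_i20" as a hit for "c8208_g1_i2".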

Regarding python - Reducing the number of hits in a tabular BLAST output file with Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37703222/
