
python - Algorithm to parse a file, selecting redundant index values by maximum and minimum

Reprinted. Author: 塔克拉玛干 Updated: 2023-11-03 05:34:01

I am trying to write a Python program that reads a file in the following format:

ID                  chrom   txStart txEnd   score   strand
ENSMUSG00000042429  chr1    1       100     0       -
ENSMUSG00000042429  chr1    110     500     0       -
ENSMUSG00000042500  chr2    12      40      0       -
ENSMUSG00000042500  chr2    200     10000   0       -
ENSMUSG00000042500  chr2    4       50      0       -
ENSMUSG00000042429  chr3    40      33      0       -
ENSMUSG00000025909  chr3    10000   200000  0       -
ENSMUSG00000025909  chr3    1       5       0       -
ENSMUSG00000025909  chr3    400     2000    0       -

Then it outputs a file in the same structure, BUT if the ID is redundant, it combines rows, selecting the minimum value of txStart and the maximum value of txEnd.

For instance, for ENSMUSG00000042429, since it appears twice, it will select txStart as 1 and txEnd as 500 (these are the minimum and maximum respectively). The expected output of the above data would be:

ID                  chrom   txStart txEnd   score   strand
ENSMUSG00000042429  chr1    1       500     0       -
ENSMUSG00000042500  chr2    4       10000   0       -
ENSMUSG00000042429  chr3    40      33      0       -
ENSMUSG00000025909  chr3    1       200000  0       -

I can't figure out how to get this done. I started by reading the file in Python with pandas, assigning the first column as an index:

data = pd.read_table("Input.txt", sep="\t")

Then I thought of creating a dictionary where the key is the index and the value is the rest of the row. That would be:

dictionary = {}
for item in data.index:
    k, v = data.ix[item], data.ix[item, c("chrom", "txStart", "txEnd", "score", "strand"]

This resulted in an error, and I don't know where to go from here... What is the best algorithm to get the desired output?

Best Answer

Your idea of using a dictionary keyed by record ID seems like a good one. Here is an outline.

records = {}

# Open the file and deal with the header line.
with open(...) as fh:
    header = next(fh)

    # Process the input data.
    for line in fh:

        # Parse the line and get the ID. You might need
        # more robust parsing logic, depending on the messiness
        # of the data.
        fields = line.split()
        rec_id = fields[0]

        # Either add a new record, or modify an existing record,
        # based on the logic you need.
        if rec_id in records:
            pass  # Modify records[rec_id].
        else:
            records[rec_id] = fields
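As a concrete illustration (not part of the original outline), here is one way the min/max logic might be filled in. It is a minimal sketch that assumes tab-separated input and keys the dictionary on the (ID, chrom) pair, so that the same gene appearing on different chromosomes stays as separate records, matching the expected output above:

```python
import io

# Sample data from the question, tab-separated.
DATA = """\
ID\tchrom\ttxStart\ttxEnd\tscore\tstrand
ENSMUSG00000042429\tchr1\t1\t100\t0\t-
ENSMUSG00000042429\tchr1\t110\t500\t0\t-
ENSMUSG00000042500\tchr2\t12\t40\t0\t-
ENSMUSG00000042500\tchr2\t200\t10000\t0\t-
ENSMUSG00000042500\tchr2\t4\t50\t0\t-
ENSMUSG00000042429\tchr3\t40\t33\t0\t-
ENSMUSG00000025909\tchr3\t10000\t200000\t0\t-
ENSMUSG00000025909\tchr3\t1\t5\t0\t-
ENSMUSG00000025909\tchr3\t400\t2000\t0\t-
"""

def merge_records(fh):
    header = next(fh)
    records = {}  # (ID, chrom) -> fields; dicts preserve insertion order
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1])
        if key in records:
            rec = records[key]
            rec[2] = str(min(int(rec[2]), int(fields[2])))  # min txStart
            rec[3] = str(max(int(rec[3]), int(fields[3])))  # max txEnd
        else:
            records[key] = fields
    return header, list(records.values())

header, merged = merge_records(io.StringIO(DATA))
for rec in merged:
    print("\t".join(rec))
```

Run on the sample data, this prints the four merged rows shown in the expected output. The `merge_records` name and the (ID, chrom) key are my own choices, not from the original answer.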

That approach assumes you can hold the entire file in memory. If not, you need to be more careful: process the file one chunk at a time and make sure you collect all consecutive lines sharing a common ID (assuming those lines really are consecutive). Here is an outline of that strategy:

def file_chunks(path):
    with open(path) as fh:
        header = next(fh)
        chunk = []
        prev_id = None

        for line in fh:
            fields = line.split()
            rec_id = fields[0]
            if chunk and rec_id != prev_id:
                yield chunk
                chunk = []
            chunk.append(fields)
            prev_id = rec_id

        if chunk:
            yield chunk

def main():
    records = {}

    for chunk in file_chunks(...):
        # Process the chunk of lines having the same ID.
        pass

main()
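To complete that sketch, the per-chunk step could be a small helper that reduces one run of same-ID lines to a single merged record. The `collapse_chunk` name below is hypothetical (not from the original answer); it assumes each chunk is a list of field lists with txStart and txEnd at indices 2 and 3:

```python
def collapse_chunk(chunk):
    """Collapse consecutive rows sharing an ID into one record,
    taking the minimum txStart and the maximum txEnd."""
    merged = list(chunk[0])
    merged[2] = str(min(int(row[2]) for row in chunk))  # min txStart
    merged[3] = str(max(int(row[3]) for row in chunk))  # max txEnd
    return merged

# Example: two consecutive rows for the same gene on chr1.
chunk = [
    ["ENSMUSG00000042429", "chr1", "1", "100", "0", "-"],
    ["ENSMUSG00000042429", "chr1", "110", "500", "0", "-"],
]
print(collapse_chunk(chunk))
# ['ENSMUSG00000042429', 'chr1', '1', '500', '0', '-']
```

Each merged record can then be written out as a tab-joined line, so only one chunk of rows needs to be in memory at a time.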

For python - Algorithm to parse a file, selecting redundant index values by maximum and minimum, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34669622/
