gpt4 book ai didi

python - Snakemake:通配符不存在时出错

转载 作者:太空宇宙 更新时间:2023-11-03 14:07:45 24 4
gpt4 key购买 nike

Edit 2 : I figure it out. I posted my answer as reply.

Edit 1 : I added a beginning of solution at the end of the question following @bli advices and https://stackoverflow.com/a/41185568/1025741

我正在编写一个 Snakemake 文件,在其中解析示例表文件(在 yaml 配置文件中定义),以便连接此示例表中列出的文件。

示例表如下所示:

sample  unit    fq1 fq2
A lane1 A.l1.1.R1.txt A.l1.1.R2.txt
A lane1 A.l1.2.R1.txt A.l1.2.R2.txt
A lane2 A.l2.R1.txt A.l2.R2.txt
B lane1 B.l1.R1.txt B.l1.R2.txt

这个想法是连接来自相同样本和样本单元的文件(在 fq1 和 fq2 中列出)。在这种情况下:

  • A.l1.1.R1.txtA.l2.2.R1.txt 将连接
  • A.l1.1.R2.txtA.l2.2.R2.txt 将连接

其他文件不会被串联,但也会在此目录结构中报告:

{sample}/
{sample}_{unit}_merged_R1.txt
{sample}_{unit}_merged_R2.txt

所以在这个例子的最后我应该:

A/
A_lane1_merged_R1.txt
A_lane1_merged_R2.txt
A_lane2_merged_R1.txt
A_lane2_merged_R2.txt
B/
B_lane1_merged_R1.txt
B_lane1_merged_R2.txt

这是我执行此类任务的 Snakemake 文件:

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

# open samplesheet
units = pd.read_table(config["units"], dtype=str)
units = units.set_index(["sample", "unit"])


rule all:
input:
expand("{sample}/{sample}_{unit}_merge_R1.txt",
sample=units.index.get_level_values('sample').unique(),
unit=units.index.get_level_values('unit').unique()),
expand("{sample}/{sample}_{unit}_merge_R2.txt",
sample=units.index.get_level_values('sample').unique(),
unit=units.index.get_level_values('unit').unique())


def get_fastq_r1(wildcards):
return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()


rule merge:
input:
r1 = get_fastq_r1,
r2 = get_fastq_r2
output:
"{sample}/{sample}_{unit}_merge_R1.txt",
"{sample}/{sample}_{unit}_merge_R2.txt"
shell:
"""
echo {input.r1} > {sample}/{sample}_{unit}_merge_R1.txt
echo {input.r2} > {sample}/{sample}_{unit}_merge_R2.txt
"""

和 config.yaml :

units: units.tsv

但我有一个错误,因为我没有单位 = lane2 的示例 B:

InputFunctionException in line 29 of /home/nrosewick/Documents/analysis/pilot_data_ADX17009/workflow/test_snakemake/Snakefile:
KeyError: ('B', 'lane2')
Wildcards:
sample=B
unit=lane2

有没有办法避免这种错误?谢谢

Beginning of solution

按照@bli的建议,我使用了itertools.product的过滤版本,将其包装在高阶生成器中,检查生成的通配符组合是否在预先建立的列表中:

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

###
from itertools import product

def filter_combinator(combinator, inlist):
def filtered_combinator(*args, **kwargs):
for wc_comb in combinator(*args, **kwargs):
# Use frozenset instead of tuple
# in order to accomodate
# unpredictable wildcard order
if frozenset(wc_comb) in inlist:
yield wc_comb
return filtered_combinator

# open samplesheet
units = pd.read_table(config["units"], dtype=str)

# list of pair sample-unit included in the samplesheet
inList={
frozenset({("sample", "A"), ("unit", "lane1")}),
frozenset({("sample", "A"), ("unit", "lane2")}),
frozenset({("sample", "B"), ("unit", "lane1")})}

# set df index
units = units.set_index(["sample", "unit"])

# build new iterator
filtered_product = filter_combinator(product, inList)

rule all:
input:
expand("{sample}/{sample}_{unit}_merge_R1.txt",
filtered_product,
sample=units.index.get_level_values('sample').unique().values,
unit=units.index.get_level_values('unit').unique().values),
expand("{sample}/{sample}_{unit}_merge_R2.txt",
filtered_product,
sample=units.index.get_level_values('sample').unique().values,
unit=units.index.get_level_values('unit').unique().values)


def get_fastq_r1(wildcards):
return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
input:
r1 = get_fastq_r1,
r2 = get_fastq_r2
output:
"{sample}/{sample}_{unit}_merge_R1.txt",
"{sample}/{sample}_{unit}_merge_R2.txt"
message:
"test"
shell:
"""
cat {input.r1} > {sample}/{sample}_{unit}_merge_R1.txt
cat {input.r2} > {sample}/{sample}_{unit}_merge_R2.txt
"""

但在运行 snakemake -n 时它返回一个错误:

Job 1: test

RuleException in line 53 of /home/nrosewick/Documents/analysis/pilot_data_ADX17009/workflow/test_snakemake/Snakefile:
NameError: The name 'sample' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}

有什么线索吗?

最佳答案

这是我根据https://stackoverflow.com/a/41185568/1025741找到的解决方案:

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

###
from itertools import product

def filter_combinator(combinator, inlist):
def filtered_combinator(*args, **kwargs):
for wc_comb in combinator(*args, **kwargs):
# Use frozenset instead of tuple
# in order to accomodate
# unpredictable wildcard order
if frozenset(wc_comb) in inlist:
yield wc_comb
return filtered_combinator

# open samplesheet
units = pd.read_table(config["units"], dtype=str)

# list of pair sample-unit
#inList=units[["sample","unit"]].drop_duplicates().to_dict('r')
inList={
frozenset({("sample", "A"), ("unit", "lane1")}),
frozenset({("sample", "A"), ("unit", "lane2")}),
frozenset({("sample", "B"), ("unit", "lane1")})}

# set df index
units=units.set_index(["sample","unit"])

# build new iterator
filtered_product = filter_combinator(product, inList)

rule all:
input:
expand("{sample}/{sample}_{unit}_merge_R1.txt",
filtered_product,
sample=units.index.get_level_values('sample').unique().values,
unit=units.index.get_level_values('unit').unique().values),
expand("{sample}/{sample}_{unit}_merge_R2.txt",
filtered_product,
sample=units.index.get_level_values('sample').unique().values,
unit=units.index.get_level_values('unit').unique().values)


def get_fastq_r1(wildcards):
return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
input:
r1=get_fastq_r1,
r2=get_fastq_r2
output:
r1_o="{sample}/{sample}_{unit}_merge_R1.txt",
r2_o="{sample}/{sample}_{unit}_merge_R2.txt"
message:
"test"
shell:
"""
cat {input.r1} > {output.r1_o}
cat {input.r2} > {output.r2_o}
"""

关于python - Snakemake:通配符不存在时出错,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/48747790/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com