python - Snakemake: error when a wildcard does not exist

Reposted by: 太空宇宙 · Updated: 2023-11-03 14:07:45

Edit 2: I figured it out and posted my solution as an answer.

Edit 1: I added the beginning of a solution at the end of the question, following @bli's advice and https://stackoverflow.com/a/41185568/1025741

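I am writing a Snakemake file that parses a sample sheet (whose path is set in a YAML config file) in order to concatenate the files listed in that sample sheet.

The sample sheet looks like this: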

sample  unit    fq1             fq2
A       lane1   A.l1.1.R1.txt   A.l1.1.R2.txt
A       lane1   A.l1.2.R1.txt   A.l1.2.R2.txt
A       lane2   A.l2.R1.txt     A.l2.R2.txt
B       lane1   B.l1.R1.txt     B.l1.R2.txt

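The idea is to concatenate the files (listed in fq1 and fq2) that belong to the same sample and the same unit. In this case:

A.l1.1.R1.txt and A.l1.2.R1.txt will be concatenated
A.l1.1.R2.txt and A.l1.2.R2.txt will be concatenated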

The other files are not concatenated, but they are still written out in this directory structure:

{sample}/
    {sample}_{unit}_merge_R1.txt
    {sample}_{unit}_merge_R2.txt

So at the end of this example I should have:

A/
  A_lane1_merge_R1.txt
  A_lane1_merge_R2.txt
  A_lane2_merge_R1.txt
  A_lane2_merge_R2.txt
B/
  B_lane1_merge_R1.txt
  B_lane1_merge_R2.txt

Here is my Snakemake file for this task:

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

# open samplesheet
units = pd.read_table(config["units"], dtype=str)
units = units.set_index(["sample", "unit"])


rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
            sample=units.index.get_level_values('sample').unique(),
            unit=units.index.get_level_values('unit').unique()),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
            sample=units.index.get_level_values('sample').unique(),
            unit=units.index.get_level_values('unit').unique())


def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()


rule merge:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        "{sample}/{sample}_{unit}_merge_R1.txt",
        "{sample}/{sample}_{unit}_merge_R2.txt"
    shell:
        """
        echo {input.r1} > {sample}/{sample}_{unit}_merge_R1.txt
        echo {input.r2} > {sample}/{sample}_{unit}_merge_R2.txt
        """

And config.yaml:

units: units.tsv

But I get an error, because sample B has no unit lane2:

InputFunctionException in line 29 of /home/nrosewick/Documents/analysis/pilot_data_ADX17009/workflow/test_snakemake/Snakefile:
KeyError: ('B', 'lane2')
Wildcards:
sample=B
unit=lane2

Is there a way to avoid this kind of error? Thanks.
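
The KeyError follows from how the targets in rule all are generated: by default, expand combines the wildcard values as a Cartesian product, so every sample is paired with every unit, including pairs that never occur in the sample sheet. The sketch below reproduces that mismatch in plain Python, without Snakemake or pandas. The sample sheet is inlined as a string, and the names SHEET, existing, requested and missing are illustrative, not part of the original Snakefile.

```python
import csv
import io
from itertools import product

# Inlined copy of the sample sheet from the question (tab-separated).
SHEET = (
    "sample\tunit\tfq1\tfq2\n"
    "A\tlane1\tA.l1.1.R1.txt\tA.l1.1.R2.txt\n"
    "A\tlane1\tA.l1.2.R1.txt\tA.l1.2.R2.txt\n"
    "A\tlane2\tA.l2.R1.txt\tA.l2.R2.txt\n"
    "B\tlane1\tB.l1.R1.txt\tB.l1.R2.txt\n"
)

rows = list(csv.DictReader(io.StringIO(SHEET), delimiter="\t"))

# (sample, unit) pairs that really exist in the sheet.
existing = {(r["sample"], r["unit"]) for r in rows}

# What a Cartesian product over the unique values asks for.
samples = sorted({r["sample"] for r in rows})
lanes = sorted({r["unit"] for r in rows})
requested = set(product(samples, lanes))

# Pairs that are requested as targets but have no row in the sheet.
missing = requested - existing
print(missing)
```

The only pair left in missing is ('B', 'lane2'), which is the key reported in the traceback.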

Beginning of solution

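Following @bli's suggestion, I used a filtered version of itertools.product, wrapping it in a higher-order generator that checks whether each generated wildcard combination is in a pre-established list: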

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

### 
from itertools import product

def filter_combinator(combinator, inlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Use frozenset instead of tuple
            # in order to accommodate
            # unpredictable wildcard order
            if frozenset(wc_comb) in inlist:
                yield wc_comb
    return filtered_combinator

# open samplesheet
units = pd.read_table(config["units"], dtype=str)

# list of pair sample-unit included in the samplesheet
inList={
    frozenset({("sample", "A"), ("unit", "lane1")}),
    frozenset({("sample", "A"), ("unit", "lane2")}),
    frozenset({("sample", "B"), ("unit", "lane1")})}

# set df index
units = units.set_index(["sample", "unit"])

# build new iterator
filtered_product = filter_combinator(product, inList)

rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values)


def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        "{sample}/{sample}_{unit}_merge_R1.txt",
        "{sample}/{sample}_{unit}_merge_R2.txt"
    message:
        "test"
    shell:
        """
        cat {input.r1} > {sample}/{sample}_{unit}_merge_R1.txt
        cat {input.r2} > {sample}/{sample}_{unit}_merge_R2.txt
        """

But running snakemake -n returns an error:

Job 1: test

RuleException in line 53 of /home/nrosewick/Documents/analysis/pilot_data_ADX17009/workflow/test_snakemake/Snakefile:
NameError: The name 'sample' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}

Any clue?
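
The message appears to come from the shell block of rule merge rather than from the filtering: inside a shell command, a bare {sample} is not a known name, whereas {wildcards.sample} or a reference to {output} is. The sketch below is only an analogy built on Python's str.format, not Snakemake's own formatter. The namespace dictionary and the render helper are hypothetical, and plain str.format raises KeyError where Snakemake reports NameError.

```python
from types import SimpleNamespace

# Illustrative stand-ins for the objects available to a shell command.
wildcards = SimpleNamespace(sample="A", unit="lane1")
output = SimpleNamespace(r1_o="A/A_lane1_merge_R1.txt")
namespace = {"wildcards": wildcards, "output": output}


def render(template):
    """Fill a command template using only the names in `namespace`."""
    return template.format(**namespace)


# Referring to the output, or to wildcards explicitly, resolves fine.
ok_1 = render("cat in.txt > {output.r1_o}")
ok_2 = render(
    "cat in.txt > {wildcards.sample}/"
    "{wildcards.sample}_{wildcards.unit}_merge_R1.txt"
)

# A bare wildcard name is not in the namespace, so the lookup fails.
try:
    render("cat in.txt > {sample}/{sample}_{unit}_merge_R1.txt")
    bare_name_failed = False
    missing_name = None
except KeyError as err:
    bare_name_failed = True
    missing_name = err.args[0]

print(ok_1)
print(bare_name_failed, missing_name)
```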

Best answer

Here is the solution I found, based on https://stackoverflow.com/a/41185568/1025741:

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

### 
from itertools import product

def filter_combinator(combinator, inlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Use frozenset instead of tuple
            # in order to accommodate
            # unpredictable wildcard order
            if frozenset(wc_comb) in inlist:
                yield wc_comb
    return filtered_combinator

# open samplesheet
units = pd.read_table(config["units"], dtype=str)

# list of pair sample-unit
#inList=units[["sample","unit"]].drop_duplicates().to_dict('r')
inList={
    frozenset({("sample", "A"), ("unit", "lane1")}),
    frozenset({("sample", "A"), ("unit", "lane2")}),
    frozenset({("sample", "B"), ("unit", "lane1")})}

# set df index
units=units.set_index(["sample","unit"])

# build new iterator
filtered_product = filter_combinator(product, inList)

rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values)


def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
    input:
        r1=get_fastq_r1,
        r2=get_fastq_r2
    output:
        r1_o="{sample}/{sample}_{unit}_merge_R1.txt",
        r2_o="{sample}/{sample}_{unit}_merge_R2.txt"
    message:
        "test"
    shell:
        """
        cat {input.r1} > {output.r1_o}
        cat {input.r2} > {output.r2_o}
        """
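
The inList above is written out by hand, so it has to be edited whenever the sample sheet changes. As a rough sketch, the same set could be derived from the sheet itself. The example below uses only the standard library so it runs without Snakemake; the inlined SHEET, the allowed set and the sample_axis / unit_axis lists are illustrative. It assumes, as the frozenset comparison in the answer implies, that expand hands each wildcard to the combinator as a list of (name, value) pairs.

```python
import csv
import io
from itertools import product

# Inlined copy of the sample sheet from the question (tab-separated).
SHEET = (
    "sample\tunit\tfq1\tfq2\n"
    "A\tlane1\tA.l1.1.R1.txt\tA.l1.1.R2.txt\n"
    "A\tlane1\tA.l1.2.R1.txt\tA.l1.2.R2.txt\n"
    "A\tlane2\tA.l2.R1.txt\tA.l2.R2.txt\n"
    "B\tlane1\tB.l1.R1.txt\tB.l1.R2.txt\n"
)


def filter_combinator(combinator, allowed):
    """Wrap a combinator so it only yields combinations found in `allowed`."""
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            if frozenset(wc_comb) in allowed:
                yield wc_comb
    return filtered_combinator


rows = list(csv.DictReader(io.StringIO(SHEET), delimiter="\t"))

# Derive the allowed combinations from the sheet instead of hard-coding them.
# Duplicate rows (A/lane1 appears twice) collapse because this is a set.
allowed = {
    frozenset({("sample", r["sample"]), ("unit", r["unit"])})
    for r in rows
}

filtered_product = filter_combinator(product, allowed)

# Mimic the (name, value) lists the combinator is assumed to receive.
sample_axis = [("sample", s) for s in sorted({r["sample"] for r in rows})]
unit_axis = [("unit", u) for u in sorted({r["unit"] for r in rows})]

targets = [
    "{sample}/{sample}_{unit}_merge_R1.txt".format(**dict(comb))
    for comb in filtered_product(sample_axis, unit_axis)
]
print(targets)
```

In the Snakefile, the equivalent would presumably be built from the units data frame before set_index is called, along the lines of the commented-out drop_duplicates line. Another option that may be simpler is to pass zip to expand as the combinator, with two equal-length lists of samples and units taken pairwise from the de-duplicated sheet.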

Regarding "python - Snakemake: error when a wildcard does not exist", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48747790/
