python - Replacing strings in a pandas DataFrame

Reposted by: 行者123  Updated: 2023-11-30 22:37:34

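I have a pandas.DataFrame containing boolean rules that indicate whether an enzyme is expressed. Some rules are simple (expression depends on one gene), while others are more complex (expression depends on several genes).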

>>gprs.head()

Out[362]: 
        Rxn                             rule
0     13DAMPPOX      HGNC:549 or HGNC:550 or HGNC:80
6  24_25VITD2Hm      HGNC:2602
8     25VITD2Hm      HGNC:16354 or (HGNC:249 and HGNC:250) or (HGNC:249 and HGNC:251) or (HGNC:250 and HGNC:251) or HGNC:252 or HGNC:253 or HGNC:255 or HGNC:256

...

A dictionary holds the gene expression information (1 = expressed, 0 = not expressed):

>>translation

'HGNC:80':1
'HGNC:2602':0
 etc...

I want to substitute the expression values from the "translation" dictionary into the rules in my "gprs" pandas.DataFrame. So far I have:

import re

for index, row in gprs.iterrows():
    # strip parentheses on the row copy, then split the rule into gene IDs
    row['rule'] = row['rule'].replace('(', "")
    row['rule'] = row['rule'].replace(')', "")
    ruleGenes = re.split(" and | or ", row['rule'])
    for gene in ruleGenes:
        if re.match("HGNC:HGNC:", gene):
            # drop a duplicated 'HGNC:' prefix
            gene = gene[5:]
            try:
                gprs = gprs.replace(gene, translation[gene])
            except KeyError:
                print('error in ', gene)
        else:
            try:
                gprs = gprs.replace(gene, translation[gene])
            except KeyError:
                print('error in ', gene)

This only works when the rule is simple (one element); it fails when the rule is more complex:

>>gprs.head()

0     13DAMPPOX  HGNC:549 or HGNC:550 or HGNC:80
6  24_25VITD2Hm                                0
7  24_25VITD3Hm  HGNC:16354 or (HGNC:249 and HGNC:250) or (HGNC:249 and HGNC:251) or (HGNC:250 and HGNC:251) or HGNC:252 or HGNC:253 or HGNC:255 or HGNC:256

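Ultimately I want to replace "or" with the max() function and "and" with the min() function, then evaluate the boolean rules.

Any suggestions?

Edit:

With EFT's code, a problem arises when one key is a substring of another, e.g. "HGNC:54" and "HGNC:549":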

>>translation

'HGNC:54':0
'HGNC:549':1

Result:

>>gprs.head(1)

         Rxn                             rule                  translation 
0     13DAMPPOX       HGNC:549 or HGNC:550 or HGNC:80         09 or 1 or 0

How can I replace only whole strings and not substrings?

Edit 2:

It works with:

for_eval = {k+'(?![0-9])' : str(v) for k, v in translation.items()}
gprs['translation'] = gprs['rule'].replace(for_eval, regex=True)

Thanks to EFT for the suggestion.

Best answer

The translation of the input can be done with

>>>for_eval = {k+'(?![0-9])': str(v) for k, v in translation.items()}
>>>gprs['translation'] = gprs['rule'].replace(for_eval, regex=True)

Explanation:

The first line

>>>for_eval = {k+'(?![0-9])': str(v) for k, v in translation.items()}

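converts the 0 and 1 values to their string forms, '0' and '1', so that the second line can insert them into the rule strings. Appending "(?![0-9])" to each key adds a negative lookahead that rejects any match followed by another digit, which stops a key from matching only the leading part of a longer ID.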
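The effect of the lookahead can be checked outside pandas with the standard `re` module. The sketch below is illustrative only: the helper `substitute` and the sample rule are assumptions made for this example, and the final `single_pass` variant is an alternative technique (matching each whole ID once and looking it up), not the method used in the answer.

```python
import re

translation = {"HGNC:54": 0, "HGNC:549": 1}
rule = "HGNC:549 or HGNC:54"


def substitute(rule, patterns):
    """Apply each regex -> replacement pair in turn (hypothetical helper)."""
    for pattern, value in patterns.items():
        rule = re.sub(pattern, value, rule)
    return rule


# Plain keys: 'HGNC:54' also matches the start of 'HGNC:549'.
naive = substitute(rule, {k: str(v) for k, v in translation.items()})

# Keys with a negative lookahead: a match is rejected if a digit follows it.
guarded = substitute(
    rule, {k + r"(?![0-9])": str(v) for k, v in translation.items()}
)

# Alternative: match every complete ID once and look it up in the dictionary.
single_pass = re.sub(r"HGNC:\d+", lambda m: str(translation[m.group(0)]), rule)

print(naive)        # 09 or 0
print(guarded)      # 1 or 0
print(single_pass)  # 1 or 0
```

The single-pass variant raises a KeyError for an ID that is missing from the dictionary, which may or may not be the behavior you want.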

The second line

>>>gprs['translation'] = gprs['rule'].replace(for_eval, regex=True)

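performs the replacement as a column-wise operation in pandas instead of iterating over each row in Python, which is much slower on larger datasets (here, roughly 30 entries or more).

Without regex=True, replace only substitutes cells whose entire value equals a key, which causes the same problem you ran into with the longer rules.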
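A small comparison makes the difference visible. This is a minimal sketch that assumes pandas is installed; the two-row Series and the `mapping` dictionary are made up for illustration, and the keys here have no prefix collisions, so the lookahead is left out for brevity.

```python
import pandas as pd

rules = pd.Series(["HGNC:80", "HGNC:549 or HGNC:80"])
mapping = {"HGNC:80": "1", "HGNC:549": "0"}

# Default behaviour: a cell is replaced only if its whole value equals a key.
exact = rules.replace(mapping)

# regex=True: each key is treated as a pattern and substituted inside the string.
partial = rules.replace(mapping, regex=True)

print(exact.tolist())    # ['1', 'HGNC:549 or HGNC:80']
print(partial.tolist())  # ['1', '0 or 1']
```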

Example, with the test case credited to u/Stephen Rauch:

In [2]:import pandas as pd

In [3]:translation = {
    'HGNC:80': 1,
    'HGNC:249': 1,
    'HGNC:250': 1,
    'HGNC:251': 0,
    'HGNC:252': 1,
    'HGNC:253': 0,
    'HGNC:255': 1,
    'HGNC:256': 1,
    'HGNC:549': 0,
    'HGNC:550': 1,
    'HGNC:2602': 0,
    'HGNC:16354': 1,
}

In [4]:gprs = pd.DataFrame([
    ('HGNC:550', 1),
    ('HGNC:2602', 0),
    ('HGNC:253 or HGNC:549', 0),
    ('HGNC:549 or HGNC:550 or HGNC:80', 1),
    ('HGNC:549 or (HGNC:550 and HGNC:2602)', 0),
    ('HGNC:549 or (HGNC:550 and HGNC:16354)', 1),
    ('HGNC:16354 or (HGNC:249 and HGNC:250) or (HGNC:249 and HGNC:251)', 1)
], columns = ['rule', 'target'])

In [5]:for_eval = {k + '(?![0-9])': str(v) for k, v in translation.items()}

In [6]:gprs['translation'] = gprs['rule'].replace(for_eval, regex=True)

In [7]:gprs['translation']

Out[7]:
0                              1
1                              0
2                         0 or 0
3                    0 or 1 or 1
4                 0 or (1 and 0)
5                 0 or (1 and 1)
6    1 or (1 and 1) or (1 and 0)
Name: translation, dtype: object

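For the second part you plan to look at later, eval (as mentioned and detailed in u/Stephen Rauch's answer) can evaluate the expressions contained in the resulting strings. pd.Series.map applies an element-wise operation to a Series faster than iterrows does. Here, that looks like this: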

In [10]:gprs['translation'].map(eval)
Out[10]: 
0    1
1    0
2    0
3    1
4    0
5    1
6    1
Name: translation, dtype: int64
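If calling eval on strings is a concern, or if "or" and "and" should literally mean max() and min() as the question describes, a small evaluator can be built on the standard `ast` module. This is a sketch rather than part of the original answer: the function name `evaluate_rule` is hypothetical, and it assumes the translated strings contain only numbers, parentheses, "and", and "or".

```python
import ast


def evaluate_rule(expr):
    """Evaluate a string such as '0 or (1 and 0)' with or -> max, and -> min."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.BoolOp):
            values = [visit(v) for v in node.values]
            return max(values) if isinstance(node.op, ast.Or) else min(values)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        # Anything else (names, calls, arithmetic, ...) is rejected.
        raise ValueError(f"unsupported element in rule: {ast.dump(node)}")

    return visit(ast.parse(expr, mode="eval"))


print(evaluate_rule("0 or (1 and 0)"))               # 0
print(evaluate_rule("1 or (1 and 1) or (1 and 0)"))  # 1
print(evaluate_rule("0.2 or (0.9 and 0.5)"))         # 0.5
```

For 0/1 inputs the results agree with eval. They differ for graded values: Python's own "or" returns the first truthy operand, so eval gives 0.2 for the last rule, whereas max/min gives 0.5. Under the same assumptions, the function could be passed to map in place of eval, i.e. gprs['translation'].map(evaluate_rule).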

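Alternatively, if you are trying to squeeze out as much performance as possible, you can use regex pattern matching on the output instead of map. This depends more closely on how your rules are worded, but if they are all formatted as cleanly as the three in your post, with every "and" paired and parenthesized and no nesting, then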

# set any 'and' term with a zero in it to zero
>>>ands = gprs['translation'].str.replace(r'0 and \d|\d and 0', '0', regex=True)
# if any ones remain, only 'or's and '1 and 1' statements are left
>>>ors = ands.replace('1', 1, regex=True)
# faster to force it to numeric than to search the remaining terms for zeros
>>>out = pd.to_numeric(ors, errors='coerce').fillna(0)
>>>out
0    1.0
1    0.0
2    0.0
3    1.0
4    0.0
5    1.0
6    1.0
Name: translation, dtype: float64

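Measured with the timeit module, this should be roughly five times faster for several thousand rows or more, with a break-even point at around 60 to 70 entries.

Regarding python - Replacing strings in a pandas DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43849666/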
