
python - Extracting information from a string with regular expressions

Reposted. Author: 太空狗. Updated: 2023-10-29 21:50:51

This is a follow-up to, and a more complicated version of, this question: Extracting contents of a string within parentheses.

In that question I had the following string:

"Will Farrell (Nick Hasley), Rebecca Hall (Samantha)"

and I wanted to get a list of tuples in the form (actor, character):

[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha')]
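For reference, this simpler case can be handled with a single `re.findall` call. The following is only a minimal sketch, assuming every actor is followed by exactly one parenthesized character name; the pattern and the variable names are illustrative and are not taken from the earlier question.

```python
import re

simple = "Will Farrell (Nick Hasley), Rebecca Hall (Samantha)"

# Group 1: the actor name (no commas or parentheses, surrounding spaces excluded).
# Group 2: whatever sits between the parentheses.
pair_re = re.compile(r"\s*([^,(]+?)\s*\(([^)]*)\)")

simple_pairs = pair_re.findall(simple)
print(simple_pairs)
```

With two capture groups, `findall` returns one tuple per match, which is already the requested shape. It does not cover actors without a character or multiple characters per actor.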

In general, I have a slightly more complicated string from which I need to extract the same information. My string is:

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary), 
with Stephen Root and Laura Dern (Delilah)"

I need to format it as follows:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
('Stephen Root', ''), ('Laura Dern', 'Delilah')]

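I know I can replace the filler words (with, and, &, etc.), but I can't quite figure out how to add a blank entry ('') when an actor has no character name (in this case, Stephen Root). What would be the best way to do this?

Finally, I need to account for an actor having multiple roles, and build one tuple for each of that actor's roles. The final string I have is: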

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
Stephen Root and Laura Dern (Delilah, Stacy)"

I need to build a list of tuples like this:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

Thanks.

Best answer

import re
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
Stephen Root and Laura Dern (Delilah, Stacy)"""

# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")

# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?")

# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")

characters = splitre.split(credits)
pairs = []
for character in characters:
    if character:
        match = matchre.match(character)
        if match:
            actor = match.group(1).strip()
            if match.group(2):
                parts = splitparts.split(match.group(2))
                for part in parts:
                    pairs.append((actor, part))
            else:
                pairs.append((actor, ""))

print(pairs)

Output:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), 
('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''),
('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
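The same logic can be wrapped in a function so it can be reused on the other strings from the question. This is a sketch rather than part of the original answer: the three patterns are copied from the code above, while the function name `parse_credits` and the constant names are made up for illustration.

```python
import re

# Same three patterns as in the answer above.
SPLIT_RE = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")
MATCH_RE = re.compile(r"([^(]*)(?:\(([^)]*)\))?")
ROLE_SPLIT_RE = re.compile(r"\s*,\s*")


def parse_credits(credits):
    """Return a list of (actor, character) tuples (hypothetical helper)."""
    pairs = []
    for chunk in SPLIT_RE.split(credits):
        # MATCH_RE can match the empty string, so match() never returns None.
        match = MATCH_RE.match(chunk)
        actor = match.group(1).strip()
        if not actor:
            continue  # skip empty pieces left between adjacent separators
        roles = match.group(2)
        if roles:
            # One tuple per comma-separated role inside the parentheses.
            pairs.extend((actor, role) for role in ROLE_SPLIT_RE.split(roles.strip()))
        else:
            # No parentheses (or empty ones): record a blank character name.
            pairs.append((actor, ""))
    return pairs


second = ("Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), "
          "Glenn Howerton (Gary), with Stephen Root and Laura Dern (Delilah)")
print(parse_credits(second))
```

One caveat: the split pattern treats "with" and "and" as separators anywhere in the string, so a character name such as "(Bonnie and Clyde)" would be broken apart. If such names can occur, the separator rule would need the same outside-parentheses check that the comma already has.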

Regarding "python - Extracting information from a string with regular expressions", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/7010672/
