python - How do I scrape with BeautifulSoup using recursion?

Reposted. Author: 行者123. Updated: 2023-12-01 06:58:03

I am trying to scrape an XML file with the following code, which works fine:

    from bs4 import BeautifulSoup

    with open("sample_data.xml", "r") as f:
        contents = f.read()
    soup = BeautifulSoup(contents, features="xml")
    for component in soup.find_all("component"):
        for section in component.find_all("section"):
            for entry in section.find_all("entry"):
                for encounter in entry.find_all("encounter"):
                    for participant in encounter.find_all("participant"):
                        for participantRole in participant.find_all("participantRole"):
                            for playingEntity in participantRole.find_all("playingEntity"):
                                for name in playingEntity.find_all("name"):
                                    print(name.text)

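However, rather than using so many for loops, I would like to do this with recursion. To that end, I created a list that serves as the traversal path to the required element: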

    traversal_path = ['component', 'section', 'entry', 'encounter', 'participant', 'participantRole', 'playingEntity', 'name']

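As the base case of the recursive function we can use the last item of the traversal path, which in our case is name. As we walk through traversal_path, the first item of the list is removed until only the last item remains. Based on this, my function now looks like this: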

    f = open("sample_data.xml", "r")
    contents = f.read()
    soup = BeautifulSoup(contents, features="xml")
    traversal_path = ['component', 'section', 'entry', 'encounter', 'participant', 'participantRole', 'playingEntity', 'name']

    def rec(traversal_path, soup):
        print(traversal_path)
        if len(traversal_path) == 1:
            for last_item in soup.find_all(traversal_path[0]):
                print(last_item.text)
        else:
            t = traversal_path.pop(0)
            for first_item in soup.find_all(t):
                return rec(traversal_path, first_item)

    rec(traversal_path, soup)

The only output I get is the printed traversal path, as shown below:

    ['component', 'section', 'entry', 'encounter', 'participant', 'participantRole', 'playingEntity', 'name']
    ['section', 'entry', 'encounter', 'participant', 'participantRole', 'playingEntity', 'name']
    ['entry', 'encounter', 'participant', 'participantRole', 'playingEntity', 'name']
    ['encounter', 'participant', 'participantRole', 'playingEntity', 'name']

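When I print soup instead of traversal_path, soup is only printed down to entry.

The problem in my function also seems to be in the else branch, which does not continue into the recursion. Any help with this would be much appreciated.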

Best answer

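    def rec(traversal_path, soup):
        if len(traversal_path) == 1:
            # Base case: only the target tag is left, so print every match.
            for last_item in soup.find_all(traversal_path[0]):
                print(last_item.text)
        else:
            for first_item in soup.find_all(traversal_path[0]):
                # No return here, so every match is visited. The slice is a
                # new list, so sibling elements still see the same remaining path.
                rec(traversal_path[1:], first_item)

    # The path may also be built from a space-separated string, as in
    # field['traversal_path'].split(" "), where field is defined elsewhere.
    rec(traversal_path, soup)

Just remove the return statement: it makes the function exit on the first matching element, so only the first branch is ever followed. Passing a slice of the path, rather than popping from the shared list, keeps the path intact for sibling elements and avoids the IndexError that an emptied list would otherwise raise.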
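The effect of the return statement can be checked on a small document. What follows is a minimal sketch, not the original poster's code: it collects the text into a list instead of printing it, and it hands each recursive call a slice of the path so that no call changes the list seen by the others. To run without extra installs it assumes the standard library's xml.etree.ElementTree in place of BeautifulSoup, where `node.findall(".//tag")` plays the role of `find_all("tag")`. The function name `collect_text` and the inline sample XML are hypothetical, since sample_data.xml is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real sample_data.xml is not shown.
SAMPLE_XML = """
<root>
  <component>
    <section>
      <entry>
        <name>Alice</name>
      </entry>
      <entry>
        <name>Bob</name>
      </entry>
    </section>
  </component>
  <component>
    <section>
      <entry>
        <name>Carol</name>
      </entry>
    </section>
  </component>
  <name>not under an entry</name>
</root>
"""


def collect_text(path, node):
    """Return the text of every element reached by following `path` from `node`."""
    if not path:
        # Empty path: `node` itself is the target element.
        return [node.text or ""]
    results = []
    # ".//tag" matches descendants at any depth, similar to find_all(tag).
    for child in node.findall(f".//{path[0]}"):
        # path[1:] is a new list, so sibling branches are unaffected.
        results.extend(collect_text(path[1:], child))
    return results


root = ET.fromstring(SAMPLE_XML)
names = collect_text(["component", "section", "entry", "name"], root)
print(names)  # ['Alice', 'Bob', 'Carol']
```

Because the loop never returns early, the second entry and the second component are both visited, and the top-level name element is skipped because it does not lie on the path.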

Regarding "python - How do I scrape with BeautifulSoup using recursion?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58724656/
