
python - Beautiful Soup cannot handle a large file


I have a huge XML file (1.2 GB) containing information on millions of MusicAlbums, each in a simple format like this:

<MusicAlbum>
    <MusicType>P</MusicType>
    <Title>22 Exitos de Oro [Brentwood]</Title>
    <Performer>Chayito Valdéz</Performer>
</MusicAlbum>
...
<MusicAlbum>
    <MusicType>A</MusicType>
    <Title>Bye Bye</Title>
    <Performer>Emma Aster</Performer>
</MusicAlbum>

I can read and load the file in Python just fine, but when I pass it to BeautifulSoup

from bs4 import BeautifulSoup

html = FID.read()  # FID is the already-opened handle to the 1.2 GB XML file
print "Converting to Soup"
soup = BeautifulSoup(html)
print "Conversion Completed"

I get

Converting to Soup
Killed

Apparently Killed is printed as the process dies (most likely the OS killing it for running out of memory).
One solution would be to split the file into blocks, each containing one <MusicAlbum> ... </MusicAlbum> record, and feed those to BeautifulSoup one at a time (a sketch of that idea follows below), but I wanted to check first whether there is a simpler solution.
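A minimal sketch of that chunking idea, assuming the opening and closing tags sit on their own lines as in the sample above; the file name albums.xml and the helper iter_album_blocks are made up for illustration:

from bs4 import BeautifulSoup

def iter_album_blocks(path):
    # Yield one <MusicAlbum>...</MusicAlbum> string at a time.
    block = []
    with open(path) as fid:
        for line in fid:
            if '<MusicAlbum>' in line:
                block = [line]              # start of a new record
            elif '</MusicAlbum>' in line:
                block.append(line)
                yield ''.join(block)        # record complete
                block = []
            elif block:
                block.append(line)          # line inside the current record

for chunk in iter_album_blocks('albums.xml'):
    soup = BeautifulSoup(chunk, 'html.parser')
    print(soup.find('title').get_text())    # html.parser lowercases tag names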

Best Answer

Check whether this works for you; it won't be fast, but it shouldn't use more memory than you need:

# encoding:utf-8
import re

data = """ <MusicAlbum>
<MusicType>P</MusicType>
<Title>22 Exitos de Oro [Brentwood]</Title>
<Performer>Chayito Valdéz</Performer>
</MusicAlbum>
...
<MusicAlbum>
<MusicType>A</MusicType>
<Title>Bye Bye</Title>
<Performer>Emma Aster</Performer>
</MusicAlbum>"""

# One compiled pattern per field; DOTALL lets an album block span several lines.
MA = re.compile(r'<MusicAlbum>(.*?)</MusicAlbum>', re.DOTALL)
TY = re.compile(r'<MusicType>(.*)</MusicType>')
TI = re.compile(r'<Title>(.*)</Title>')
P = re.compile(r'<Performer>(.*)</Performer>')

albums = []
for album in re.findall(MA, data):
    albums.append({
        'type': re.search(TY, album).group(1),       # group(1) is the text inside the tags
        'title': re.search(TI, album).group(1),
        'performer': re.search(P, album).group(1)})


print albums
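The regex approach above still assumes the text it scans fits in memory; for the full 1.2 GB file, the standard library's xml.etree.ElementTree.iterparse offers a streaming alternative. A minimal sketch, assuming the real file is well-formed XML under a single root element (the sample shows no root, so this is an assumption) and using an illustrative file name:

# Streaming parse: elements are built and discarded one at a time,
# so memory use stays bounded regardless of file size.
import xml.etree.ElementTree as ET

for event, elem in ET.iterparse('albums.xml', events=('end',)):
    if elem.tag == 'MusicAlbum':
        album = {
            'type': elem.findtext('MusicType'),
            'title': elem.findtext('Title'),
            'performer': elem.findtext('Performer'),
        }
        print(album)
        elem.clear()  # drop the finished element to keep memory flat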

Regarding python - Beautiful Soup cannot handle a large file, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21886386/
