
python - Retrieving all content between a closing and opening HTML tag using Beautiful Soup


I am parsing content with Python and Beautiful Soup and then writing it to a CSV file, and I have run into a problem getting a certain set of data. The data is run through a TidyHTML implementation that I crafted, and then other unwanted data is stripped out.

The issue is that I need to retrieve all of the data between a set of <h3> tags.

Sample data:

<h3><a href="Vol-1-pages-001.pdf">Pages 1-18</a></h3>
<ul><li>September 13 1880. First regular meeting of the faculty;
September 14 1880. Discussion of curricular matters. Students are
debarred from taking algebra until they have completed both mental
and fractional arithmetic; October 4 1880.</li><li>All members present.</li></ul>
<ul><li>Moved the faculty henceforth hold regular weekkly meetings in the
President's room of the University building; 11 October 1880. All
members present; 18 October 1880. Regular meeting 2. Moved that the
President wait on the property holders on 12th street and request
them to abate the nuisance on their property; 25 October 1880.
Moved that the senior and junior classes for rhetoricals be...</li></ul>
<h3><a href="Vol-1-pages-019.pdf">Pages 19-33</a></h3>

I need to retrieve everything between the first closing </h3> tag and the next opening <h3> tag. This shouldn't be difficult, but my thick head isn't making the necessary connections. I can grab all of the <ul> tags, but that doesn't work because there is not a one-to-one relationship between the <h3> tags and the <ul> tags.

The output that I am hoping to achieve is:

Pages 1-18|Vol-1-pages-001.pdf|content between the tags

The first two pieces are not a problem, but the content between a set of tags is difficult for me.

My current code is as follows:

import glob, re, os, csv
from BeautifulSoup import BeautifulSoup
from tidylib import tidy_document
from collections import deque

html_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1'
csv_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1\\archiveVol1.csv'

html_cleanup = {'\r\r\n':'', '\n\n':'', '\n':'', '\r':'', '\r\r': '', '<img src="UOSymbol1.jpg" alt="" />':''}

for infile in glob.glob( os.path.join(html_path, '*.html') ):
    print "current file is: " + infile

    html = open(infile).read()

    for i, j in html_cleanup.iteritems():
        html = html.replace(i, j)

    #parse cleaned up html with Beautiful Soup
    soup = BeautifulSoup(html)

    #print soup
    html_to_csv = csv.writer(open(csv_path, 'a'), delimiter='|',
                             quoting=csv.QUOTE_NONE, escapechar=' ')
    #retrieve the string that has the page range and file name
    volume = deque()
    fileName = deque()
    summary = deque()
    i = 0
    for title in soup.findAll('a'):
        if title['href'].startswith('V'):
            #print title.string
            volume.append(title.string)
            i += 1
            #print soup('a')[i]['href']
            fileName.append(soup('a')[i]['href'])
            #print html_to_csv
            #html_to_csv.writerow([volume, fileName])

    #retrieve the summary of each archive and store
    #for body in soup.findAll('ul') or soup.findAll('ol'):
    #    summary.append(body)
    for body in soup.findAll('h3'):
        body.findNextSibling(text=True)
        summary.append(body)

    #print out each field into the csv file
    for c in range(i):
        pages = volume.popleft()
        path = fileName.popleft()
        notes = summary
        if not summary:
            notes = "help"
        if summary:
            notes = summary.popleft()
        html_to_csv.writerow([pages, path, notes])

Best answer

To extract the content between the </h3> and <h3> tags:

from itertools import takewhile

h3s = soup('h3')  # find all <h3> elements
for h3, h3next in zip(h3s, h3s[1:]):
    # get elements in between
    between_it = takewhile(lambda el: el is not h3next, h3.nextSiblingGenerator())
    # extract text
    print(''.join(getattr(el, 'text', el) for el in between_it))
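takewhile() consumes the sibling stream lazily and stops as soon as it reaches the next <h3>, so only the nodes strictly between the two headings are visited. The getattr(el, 'text', el) fallback handles both node types in that stream: Tag objects expose their inner text through .text, while NavigableString objects are plain strings and are used as-is.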

The code assumes that all <h3> elements are siblings. If that is not the case, you could use h3.nextGenerator() instead of h3.nextSiblingGenerator().
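For the non-sibling case, here is a minimal sketch of that variant (assuming the same BeautifulSoup 3 environment as the question). nextGenerator() walks every following node in document order, including nested ones, so keeping only the NavigableString text nodes avoids counting the text of nested tags twice:

from itertools import takewhile
from BeautifulSoup import NavigableString

h3s = soup('h3')
for h3, h3next in zip(h3s, h3s[1:]):
    # walk all following nodes (not just siblings) until the next <h3>
    between_it = takewhile(lambda el: el is not h3next, h3.nextGenerator())
    # keep only text nodes so the text of nested tags is not duplicated
    print(''.join(el for el in between_it if isinstance(el, NavigableString)))

Either way, the joined string can serve as the third pipe-delimited field, appended to summary in place of the findNextSibling() loop in the question's code.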

Regarding retrieving all content between a closing and an opening HTML tag with Beautiful Soup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8731848/
