I want to extract the city data from every page of the site below. I have the code beneath, but the loop keeps running and extracting the same data over and over. It looks like I'm missing something, can you help?
import re

import requests
from bs4 import BeautifulSoup

cities = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        url = f'https://www.kununu.com/de/volkswagen/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        new_comments = [
            city.find_next_sibling('div').text.strip()
            for city in soup.find_all('div', text=re.compile('Stadt'))
        ]
        cities += new_comments
        print(cities)
        page += 1
#print(cities)
You have no exit condition, so the loop never terminates. You need to break out of it at some point.

For example:
import re

import requests
from bs4 import BeautifulSoup

cities = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        if page >= 99:  # exit condition: stop after page 98
            break
        url = f'https://www.kununu.com/de/volkswagen/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        new_comments = [
            city.find_next_sibling('div').text.strip()
            for city in soup.find_all('div', text=re.compile('Stadt'))
        ]
        cities += new_comments
        print(cities)
        page += 1

print(cities)  # this will print after 98 pages
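A hard cap of 99 pages is arbitrary: you may stop too early or fetch many empty pages. Assuming the site eventually serves a page with no "Stadt" entries once you run past the last comment page, a more robust exit is to break as soon as a page yields nothing new. A minimal sketch of that pattern, with the request-and-parse step factored out into a `fetch_page` callable (the fake fetcher below stands in for the real HTTP call):

```python
def collect_until_empty(fetch_page, max_pages=1000):
    """Accumulate results page by page, stopping at the first empty page.

    fetch_page(page) should return the list of items for that page
    (e.g. the city names parsed from one kununu comment page).
    max_pages is a safety cap in case the site never returns an empty page.
    """
    results = []
    page = 1
    while page <= max_pages:
        items = fetch_page(page)
        if not items:  # an empty page means we ran past the last real page
            break
        results += items
        page += 1
    return results

# Demo with a fake fetcher: three pages of data, then empty pages.
fake_site = {1: ['Wolfsburg', 'Berlin'], 2: ['Kassel'], 3: ['Hannover']}
print(collect_until_empty(lambda p: fake_site.get(p, [])))
# → ['Wolfsburg', 'Berlin', 'Kassel', 'Hannover']
```

In the real scraper, `fetch_page` would do the `session.get(...)` plus the BeautifulSoup parsing for one page number; whether an empty result actually marks the last page is an assumption you should verify against the site's behavior.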