
python - requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied


I'm working on a web-scraping project and ran into the following error:

requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

My code is below. I pull all the links out of an HTML table, and they print as expected. But when I try to loop through them and call requests.get on each link, I get the error above.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print(links)
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = []
        # Find all the divs we need in one go.
        divs = soup.find_all('div', {'id': ['units_box_1', 'units_box_2', 'units_box_3']})
        for div in divs:
            # Find all the enclosing a tags.
            anchors = div.find_all('a')
            for anchor in anchors:
                # Now we have groups of 3 list item (li) tags.
                lis = anchor.find_all('li')
                # Clean up the text from the group of 3 li tags and add them as a list to our table list.
                table.append([unicodedata.normalize("NFKD", lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
        # We have all the data, so we add it to a DataFrame.
        headers = ['Number', 'Tenant', 'Square Footage']
        df = DataFrame(table, columns=headers)
        print(df)

Best Answer

Your mistake is in the second for loop in your code:

for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print(links)
    for link in links:

ref['href'] gives you a single URL (a string), but in the next for loop you use it as if it were a list.

So you effectively have

for link in ref['href']:

which gives you the first character of the URL http://properties.kimcore..., i.e. h.
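To see this concretely: iterating over a string in Python yields its characters one at a time, and requests.get raises exactly this error when handed a bare character. A minimal sketch, using a hypothetical URL:

import requests

url = "http://properties.example.com/"  # hypothetical URL for illustration
for link in url:
    print(link)  # first iteration prints 'h' - characters, not URLs
    break

try:
    requests.get("h")  # what the buggy loop effectively calls
except requests.exceptions.MissingSchema as error:
    print(error)  # Invalid URL 'h': No schema supplied. Perhaps you meant http://h?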

Full working code:

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = []
    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id': ['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # Find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list item (li) tags.
            lis = anchor.find_all('li')
            # Clean up the text from the group of 3 li tags and add them as a list to our table list.
            table.append([unicodedata.normalize("NFKD", lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
    # We have all the data, so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(table, columns=headers)
    print(df)
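One caveat: the code above assumes every href in the table is an absolute URL. If the site ever emitted relative hrefs, requests.get would fail on those too. A small sketch of how urllib.parse.urljoin (standard library) could normalize them against the page URL, with a made-up relative path for illustration:

from urllib.parse import urljoin

base = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

# urljoin leaves absolute URLs untouched and resolves relative ones against the base.
print(urljoin(base, "http://example.com/x"))  # -> http://example.com/x (unchanged)
print(urljoin(base, "/property/123/"))        # -> http://properties.kimcorealty.com/property/123/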

BTW: if you put a comma inside the parentheses, (ref['href'],), you would get a tuple, and the second for loop would then work correctly.
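A minimal sketch of that distinction, with a made-up URL:

url = "http://example.com/page"
still_a_string = (url)   # parentheses alone change nothing; this is still a str
one_tuple = (url,)       # the trailing comma makes a one-element tuple

for link in still_a_string:
    print(link)  # prints 'h' - iterating a string yields characters
    break

for link in one_tuple:
    print(link)  # prints the whole URL once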


EDIT: the code below creates a list table_data at the start and appends all the data to it; at the end that list is converted into a DataFrame.

But now I see it reads the same page several times, because every column in a row carries the same URL. You have to take the URL from only one column.

EDIT: now it doesn't read the same URL more than once.

EDIT: now, when it calls append(), it also takes the text and href from the link in the first column and adds them to every element appended to the list.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# All rows in the table except the first ([1:]) - the headers.
rows = soup.select('table tr')[1:]
for row in rows:

    # Link in the first column (td[0]).
    #link = row.select('td')[0].find('a')
    link = row.find('a')

    link_href = link['href']
    link_text = link.text

    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id': ['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

print('table_data size:', len(table_data))

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)
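If duplicate hrefs could still slip through (say, the same property linked from several rows), a seen set would skip repeated downloads. A sketch of the idea, not part of the answer's code:

from bs4 import BeautifulSoup
import requests

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

seen = set()
for row in soup.select('table tr')[1:]:
    link = row.find('a')
    if link is None:  # defensive: skip rows without a link
        continue
    href = link['href']
    if href in seen:  # this page was already fetched, skip it
        continue
    seen.add(href)
    print('fetching:', href)
    # page = requests.get(href)  # fetch and parse as in the code above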

Regarding python - requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47898368/
