
python - requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied


I'm working on a web-scraping project and ran into the following error:

requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

My code is below. I pull all the links out of an HTML table, and they print as expected. But when I try to loop through them and call requests.get on each link, I get the error above.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print(links)
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = []
        # Find all the divs we need in one go.
        divs = soup.find_all('div', {'id': ['units_box_1', 'units_box_2', 'units_box_3']})
        for div in divs:
            # Find all the enclosing a tags.
            anchors = div.find_all('a')
            for anchor in anchors:
                # Now we have groups of 3 list item (li) tags.
                lis = anchor.find_all('li')
                # Clean up the text from the group of 3 li tags and add them as a list to our table list.
                table.append([unicodedata.normalize("NFKD", lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
        # We have all the data, so we add it to a DataFrame.
        headers = ['Number', 'Tenant', 'Square Footage']
        df = DataFrame(table, columns=headers)
        print(df)

Best Answer

Your mistake is in the second for loop in your code:

for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print(links)
    for link in links:

ref['href'] gives you a single URL (a string), but in the next for loop you use it as if it were a list.

So you effectively have

for link in ref['href']:

which gives you the first character of the URL http://properties.kimcore..., i.e. h.
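To see this concretely: iterating over a string in Python yields its characters one at a time, and requests.get raises exactly this error when handed a bare character. A minimal sketch, using a hypothetical URL:

import requests

url = "http://properties.example.com/"  # hypothetical URL for illustration
for link in url:
    print(link)  # first iteration prints 'h' - characters, not URLs
    break

try:
    requests.get("h")  # what the buggy loop effectively calls
except requests.exceptions.MissingSchema as error:
    print(error)  # Invalid URL 'h': No schema supplied. Perhaps you meant http://h?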

Full working code:

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = []
    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id': ['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # Find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list item (li) tags.
            lis = anchor.find_all('li')
            # Clean up the text from the group of 3 li tags and add them as a list to our table list.
            table.append([unicodedata.normalize("NFKD", lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
    # We have all the data, so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(table, columns=headers)
    print(df)
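One caveat: the code above assumes every href in the table is an absolute URL. If the site ever emitted relative hrefs, requests.get would fail on those too. A small sketch of how urllib.parse.urljoin (standard library) could normalize them against the page URL, with a made-up relative path for illustration:

from urllib.parse import urljoin

base = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

# urljoin leaves absolute URLs untouched and resolves relative ones against the base.
print(urljoin(base, "http://example.com/x"))  # -> http://example.com/x (unchanged)
print(urljoin(base, "/property/123/"))        # -> http://properties.kimcorealty.com/property/123/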

BTW: if you put a comma inside the parentheses, (ref['href'],), you would get a tuple, and the second for loop would then work correctly.
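A minimal sketch of that distinction, with a made-up URL:

url = "http://example.com/page"
still_a_string = (url)   # parentheses alone change nothing; this is still a str
one_tuple = (url,)       # the trailing comma makes a one-element tuple

for link in still_a_string:
    print(link)  # prints 'h' - iterating a string yields characters
    break

for link in one_tuple:
    print(link)  # prints the whole URL once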


EDIT: the code below creates a list table_data at the start and appends all the data to it; at the end that list is converted into a DataFrame.

But now I see it reads the same page several times, because every column in a row carries the same URL. You have to take the URL from only one column.

EDIT: now it doesn't read the same URL more than once.

EDIT: now, when it calls append(), it also takes the text and href from the link in the first column and adds them to every element appended to the list.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# All rows in the table except the first ([1:]) - the headers.
rows = soup.select('table tr')[1:]
for row in rows:

    # Link in the first column (td[0]).
    #link = row.select('td')[0].find('a')
    link = row.find('a')

    link_href = link['href']
    link_text = link.text

    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id': ['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

print('table_data size:', len(table_data))

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)
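If duplicate hrefs could still slip through (say, the same property linked from several rows), a seen set would skip repeated downloads. A sketch of the idea, not part of the answer's code:

from bs4 import BeautifulSoup
import requests

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

seen = set()
for row in soup.select('table tr')[1:]:
    link = row.find('a')
    if link is None:  # defensive: skip rows without a link
        continue
    href = link['href']
    if href in seen:  # this page was already fetched, skip it
        continue
    seen.add(href)
    print('fetching:', href)
    # page = requests.get(href)  # fetch and parse as in the code above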

Regarding python - requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47898368/
