python - Web scraping returns an empty dictionary

Reposted by: 行者123. Last updated: 2023-12-04 07:20:39

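I am trying to scrape all the data from this website, https://ricetta.it/ricette-secondi, using Python Selenium.
I want to put the data into a dictionary, as in the code below.
However, it just returns an empty list.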

import pprint
detail_recipes = []
for recipe in list_recipes:
  title = ""
  description = ""
  ingredient = ""
  if(len(recipe.find_elements_by_css_selector(".post-title")) > 0):
    title = recipe.find_elements_by_css_selector(".post-title")[0].text
  if(len(recipe.find_elements_by_css_selector(".post-excerpt")) > 0):
    description = recipe.find_elements_by_css_selector(".post-excerpt")[0].text
  if(len(recipe.find_elements_by_css_selector(".nm-ingr")) > 0):
    ingredient = recipe.find_elements_by_css_selector(".nm-ingr")[0].text

  detail_recipes.append({'title': title,
                        'description': description,
                        'ingredient': ingredient
                        })

len(detail_recipes)
pprint.pprint(detail_recipes[0:10])

Best answer

You can try this:

import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "https://ricetta.it/ricette-secondi"

# The recipes are in the static HTML, so a plain GET request is enough.
page = requests.get(url)
soup = bs(page.content, 'lxml')

df = {'title': [], 'description': [], 'ingredient': []}

# Each recipe sits in a <div class="post-bordered">.
for div in soup.find_all("div", class_="post-bordered"):
    df["title"].append(div.find(class_="post-title").text)
    try:
        df["description"].append(div.find(class_="post-excerpt").text)
    except AttributeError:
        # find() returned None: this recipe has no description.
        df["description"].append(np.nan)
    ingredients = div.find_all(class_="nm-ingr")
    if len(ingredients) > 0:
        df["ingredient"].append([j.text for j in ingredients])
    else:
        df["ingredient"].append(np.nan)

df = pd.DataFrame(df)

# Drop recipes that lack a description or an ingredient list.
df.dropna(axis=0, inplace=True)

print(df)

Output:

                               title  ...                                         ingredient
0       Polpette di pane e formaggio  ...  [uovo, pane, pangrattato, parmigiano, latte, s...
1     Torta 7 vasetti alle melanzane  ...  [uovo, olio, latte, yogurt, farina 00, fecola ...
2  Torta a sole con zucchine e speck  ...  [pasta sfoglia, zucchina, ricotta, uovo, speck...
3                    Pesto di limoni  ...  [limone, pinoli, parmigiano, basilico, prezzem...
4                    Bombe di patate  ...  [patata, farina 00, uovo, parmigiano, sale e p...
5             Polpettone di zucchine  ...  [zucchina, uovo, parmigiano, pangrattato, pros...
6                  Insalata di pollo  ...  [petto di pollo, zucchina, pomodorino, insalat...
7                      Club sandwich  ...  [pane, petto di pollo, pomodoro, lattuga, maio...
8                Crostata di verdure  ...  [farina 00, burro, acqua, sale, zucchina, pomo...
9              Pesto di barbabietola  ...  [barbabietola, parmigiano, pinoli, olio, sale,...

[10 rows x 3 columns]

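I don't know whether you use these libraries, but this website does not load its data with JavaScript, so we can scrape it with requests and bs4. When a site does not load its data with JavaScript, most people prefer these libraries because they are easier and faster than Selenium. To display the data I am using pandas, which is also the preferred library for handling tabular data. It prints the data in a table structure, and you can also save the scraped data to a CSV or Excel file.
If you also want to scrape the data from the following pages, try this: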
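One way to check whether a page really works without JavaScript is to look for the target class in the raw HTML that requests returns. The sketch below does this with the standard library's HTMLParser so that it runs without extra packages. ClassCounter, count_class, and both HTML strings are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags that carry a given CSS class (hypothetical helper)."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_class(html, css_class):
    """Return how many elements in the raw HTML carry css_class."""
    counter = ClassCounter(css_class)
    counter.feed(html)
    return counter.count


# Made-up markup standing in for the text of a requests response.
static_html = '<div class="post-bordered"><a class="post-title" href="/x">X</a></div>'
js_shell = '<div id="app"></div><script src="bundle.js"></script>'

print(count_class(static_html, "post-title"))  # the content is in the HTML itself
print(count_class(js_shell, "post-title"))     # nothing to parse without a browser
```

If the count is zero for the real response text, the data is probably injected by JavaScript, and a browser-driven tool such as Selenium would be the more likely fit.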

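df = {'title': [], 'description': [], 'ingredient': []}

# Assumption: the pages are numbered 1 through 107.
for page_number in range(1, 108):
    url = f"https://ricetta.it/ricette-secondi?page={page_number}"
    page = requests.get(url)
    soup = bs(page.content, 'lxml')

    for div in soup.find_all("div", class_="post-bordered"):
        df["title"].append(div.find(class_="post-title").text)
        try:
            df["description"].append(div.find(class_="post-excerpt").text)
        except AttributeError:
            # find() returned None: this recipe has no description.
            df["description"].append(np.nan)
        ingredients = div.find_all(class_="nm-ingr")
        if len(ingredients) > 0:
            df["ingredient"].append([j.text for j in ingredients])
        else:
            df["ingredient"].append(np.nan)

# Convert the dict of lists to a DataFrame so that it can be saved.
df = pd.DataFrame(df)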

This will scrape the data from all 107 pages of the website.
You can save this df to a CSV or Excel file with:
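The page count is hard-coded in the loop above, so it will go stale if the site adds or removes recipes. A possible alternative, sketched here under the assumption that a page past the end contains no recipes, is to keep requesting pages until one comes back empty. The names scrape_until_empty, fetch_page, and parse_recipes are hypothetical, and the fake site below only stands in for the network and the parser.

```python
def scrape_until_empty(fetch_page, parse_recipes, max_pages=500):
    """Collect recipes page by page, stopping at the first empty page.

    fetch_page(n) returns the HTML of page n and parse_recipes(html)
    returns a list of recipes; both are supplied by the caller.
    """
    recipes = []
    # max_pages is a safety cap in case the site never returns an empty page.
    for page_number in range(1, max_pages + 1):
        found = parse_recipes(fetch_page(page_number))
        if not found:
            break
        recipes.extend(found)
    return recipes


# Stand-ins for the real site: three fake pages, then nothing.
fake_site = {1: "a,b", 2: "c", 3: "d,e"}


def fetch(page_number):
    return fake_site.get(page_number, "")


def parse(html):
    return html.split(",") if html else []


print(scrape_until_empty(fetch, parse))
```

In real use, fetch_page would wrap the requests.get call and parse_recipes would wrap the BeautifulSoup loop from the answer.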

df.to_csv("<filename.csv>")
# or for excel:
df.to_excel("<filename.xlsx>")

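Edit:
Since you asked how to scrape the links of all the recipes, I came up with two approaches. The first is to build the link yourself by replacing the spaces in the title with -, which gives the link of that recipe. The second is to scrape the link from the page itself, and for that you can use this code: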

div.find(class_="post-title")["href"]

It returns the link of that recipe. For the other approach, you can do this:

df["links"] = df["title"].apply(lambda x: "https://ricetta.it/" + x.replace(" ", "-").lower())
# .lower() only keeps the URL in a uniform case; the link also works without it.

Personally, though, I recommend simply scraping the links from the website, because when we build the links ourselves we may make mistakes.
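If the scraped href values turn out to be relative paths, they can be resolved against the page URL with urljoin from the standard library. This is only a sketch: absolute_link is a hypothetical helper, and the href values below are invented examples rather than values taken from the real page.

```python
from urllib.parse import urljoin

BASE_URL = "https://ricetta.it/ricette-secondi"


def absolute_link(href, base=BASE_URL):
    """Resolve a scraped href against the page it was found on."""
    return urljoin(base, href)


# Invented href values covering the three common cases.
print(absolute_link("/polpette-di-pane"))                 # root-relative path
print(absolute_link("bombe-di-patate"))                   # path relative to the page
print(absolute_link("https://ricetta.it/club-sandwich"))  # already absolute
```

An href that is already absolute comes back unchanged, so the helper can be applied to every scraped link without checking its form first.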

Regarding "python - Web scraping returns an empty dictionary", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68522447/
