python - Scraping images and text with Python

Reposted by: 太空宇宙 | Updated: 2023-11-03 18:38:35

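Using Python, how would you scrape images and text from a website? For example, say I want to scrape both the images and the text here; which Python tools/libraries would I use? Are there any tutorials?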

Best answer

Please don't use regular expressions; they were not designed for parsing HTML.
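To illustrate the point about regular expressions, here is a minimal sketch that compares a regex with lxml.html, one of the parsers recommended in this answer. The HTML snippet and the variable names are made up for the illustration.

```python
import re

import lxml.html

# Hypothetical markup: one real link and one link inside an HTML comment.
html = '''
<div>
  <a href="/a">one</a>
  <!-- <a href="/old">old link, commented out</a> -->
</div>
'''

# A regex sees only characters, so it also matches the commented-out link.
regex_links = re.findall(r'<a\s+href="([^"]*)"', html)

# A parser builds the document tree; the comment is not an <a> element.
dom = lxml.html.fromstring(html)
parsed_links = [str(href) for href in dom.xpath('//a/@href')]

print(regex_links)   # ['/a', '/old']
print(parsed_links)  # ['/a']
```

The regex could be patched to skip comments, but each such patch is another special case that a real parser already handles.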

I usually use the following combination of tools:

the requests module
lxml.html
beautifulsoup4, to detect the website's encoding
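As a rough sketch of how the encoding detection fits in, the snippet below feeds raw bytes to `UnicodeDammit` and passes the declared encoding on to the lxml parser. The byte string stands in for a real response body and is an assumption made for the example; no network request is performed.

```python
import lxml.html
from bs4 import UnicodeDammit

# Hypothetical response body: Latin-1 bytes with a matching <meta> declaration.
raw = ('<html><head><meta charset="iso-8859-1"></head>'
       '<body><p>café</p></body></html>').encode('iso-8859-1')

# With is_html=True, UnicodeDammit looks for an encoding declared in the markup.
dammit = UnicodeDammit(raw, is_html=True)
declared = dammit.declared_html_encoding  # None if the page declares nothing
print(declared, dammit.original_encoding)

# Hand the declared encoding to lxml so the bytes are decoded consistently.
parser = lxml.html.HTMLParser(encoding=declared)
dom = lxml.html.document_fromstring(raw, parser=parser)
print(dom.findtext('.//p'))
```

If the page declares no encoding, `declared` is `None` and lxml falls back to its own detection; `dammit.original_encoding` holds the encoding that `UnicodeDammit` actually used to decode the bytes.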

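One approach looks like this; I hope you get the idea (the code only illustrates the concept and is untested, so do not expect it to work as-is):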

import lxml.html
import requests
from cssselect import HTMLTranslator, SelectorError
from bs4 import UnicodeDammit


def scrape(url, selector):
    # First do the HTTP request with the requests module
    r = requests.get(url)
    html = r.content  # raw bytes; a requests response has no read() method

    # Try to parse/decode the HTML result with lxml and beautifulsoup4
    try:
        doc = UnicodeDammit(html, is_html=True)
        parser = lxml.html.HTMLParser(encoding=doc.declared_html_encoding)
        dom = lxml.html.document_fromstring(html, parser=parser)
        dom.resolve_base_href()
    except Exception as e:
        print('Some error occurred while lxml tried to parse: {}'.format(e))
        return False

    # Try to extract all data that we are interested in with CSS selectors!
    try:
        results = dom.xpath(HTMLTranslator().css_to_xpath(selector))
        for el in results:
            # access elements like
            print(el.get('href'))     # access href attribute
            print(el.text_content())  # the content as text
            # or process further
            found = el.xpath(HTMLTranslator().css_to_xpath('h3.r > a:first-child'))
    except SelectorError as e:
        print('Invalid CSS selector: {}'.format(e))
        return False
    return True


# The selector below is a placeholder; replace it with one that targets your page.
scrape('http://example.com', 'some css selector to target the DOM')
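Since the question also asks about images, here is a hedged sketch of how image URLs and text could be pulled from a parsed page with lxml and then saved with requests. The function names, the `images` folder, and the sample HTML are all assumptions made for this illustration and are not part of the original answer.

```python
import os
from urllib.parse import urlparse

import lxml.html
import requests


def extract_images_and_text(html, base_url):
    """Return (absolute image URLs, visible text) for an HTML document."""
    dom = lxml.html.document_fromstring(html)
    dom.make_links_absolute(base_url)  # turn relative src/href values into full URLs
    image_urls = [str(src) for src in dom.xpath('//img/@src')]
    # Join the individual text nodes with spaces so adjacent elements do not run together.
    text = ' '.join(t.strip() for t in dom.itertext() if t.strip())
    return image_urls, text


def download_image(url, folder='images'):
    """Save one image into `folder` and return its path (needs network access)."""
    os.makedirs(folder, exist_ok=True)
    filename = os.path.basename(urlparse(url).path) or 'unnamed'
    path = os.path.join(folder, filename)
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    with open(path, 'wb') as fh:
        fh.write(response.content)
    return path


# Hypothetical page content, parsed offline.
sample = ('<html><body><h1>Cats</h1><p>Two photos.</p>'
          '<img src="/img/a.png"><img src="b.jpg"></body></html>')
urls, text = extract_images_and_text(sample, 'http://example.com/gallery/')
print(urls)
print(text)
```

Calling `download_image(u)` for each `u` in `urls` would fetch the files; that step is left out of the demo because it needs a live site.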

Regarding python - Scraping images and text with Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21069984/
