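Using Python, how would you scrape images and text from a website? For example, say I want to scrape both the images and the text here; which Python tools/libraries would I use? Are there any tutorials?
Please don't use regular expressions; they were not designed for parsing HTML.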
I usually use the following combination of tools:
- the requests module
- lxml.html
- beautifulsoup4, to detect the website's encoding
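To illustrate the last item, here is a minimal sketch of how UnicodeDammit from beautifulsoup4 can guess the encoding of raw bytes before they are handed to a parser. The helper name detect_encoding and the sample bytes are assumptions made up for this illustration; they are not part of any library.

```python
from bs4 import UnicodeDammit


def detect_encoding(raw_bytes):
    """Return (encoding, decoded_text) as guessed by UnicodeDammit."""
    dammit = UnicodeDammit(raw_bytes, is_html=True)
    # original_encoding is the encoding UnicodeDammit settled on;
    # unicode_markup is the document decoded to a str.
    return dammit.original_encoding, dammit.unicode_markup


# Hypothetical page that declares its encoding in a <meta> tag.
sample = (
    '<html><head><meta charset="iso-8859-1"></head>'
    '<body>café</body></html>'
).encode('iso-8859-1')

encoding, text = detect_encoding(sample)
print(encoding)        # the encoding that was guessed
print('café' in text)  # whether the accented character survived decoding
```

In a real scraper the bytes would come from the HTTP response body rather than from a literal string.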
One approach looks like this; I hope you get the idea (the code only illustrates the concept, is untested, and won't work as-is):
import lxml.html
import requests
from cssselect import HTMLTranslator, SelectorError
from bs4 import UnicodeDammit


def scrape(url, css_selector):
    # First do the HTTP request with the requests module
    r = requests.get(url)
    html = r.content  # raw bytes, so the encoding can be detected below

    # Try to parse/decode the HTML result with lxml and beautifulsoup4
    try:
        doc = UnicodeDammit(html, is_html=True)
        parser = lxml.html.HTMLParser(encoding=doc.declared_html_encoding)
        dom = lxml.html.document_fromstring(html, parser=parser)
        dom.resolve_base_href()
    except Exception as err:
        print('Some error occurred while lxml tried to parse: {}'.format(err))
        return False

    # Try to extract all data that we are interested in with CSS selectors!
    try:
        results = dom.xpath(HTMLTranslator().css_to_xpath(css_selector))
        for e in results:
            # access elements like
            print(e.get('href'))  # access href attribute
            print(e.text_content())  # the content as text
            # or process further
            found = e.xpath(HTMLTranslator().css_to_xpath('h3.r > a:first-child'))
    except SelectorError as err:
        print('Invalid CSS selector: {}'.format(err))
    except Exception as err:
        print(err)
    return True


scrape('http://example.com', 'some css selector to target the DOM')
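Since the question asks about images as well as text, the same tools can be sketched for that case too. This is only a minimal example, assuming the images are ordinary img tags and the text sits in p tags. The helper names extract_images_and_text and download_image, the sample HTML, and the base URL are all hypothetical and chosen for illustration.

```python
import os
from urllib.parse import urlparse

import lxml.html
import requests


def extract_images_and_text(html, base_url):
    """Return (image_urls, paragraphs) found in an HTML string."""
    dom = lxml.html.fromstring(html)
    # Turn relative links such as /img/a.png into absolute URLs.
    dom.make_links_absolute(base_url)
    image_urls = [img.get('src') for img in dom.xpath('//img[@src]')]
    paragraphs = [p.text_content().strip() for p in dom.xpath('//p')]
    return image_urls, paragraphs


def download_image(url, folder='.'):
    """Save one image to disk and return its path (needs network access)."""
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    # Fall back to a generic name if the URL path has no file name.
    filename = os.path.basename(urlparse(url).path) or 'image'
    path = os.path.join(folder, filename)
    with open(path, 'wb') as f:
        f.write(r.content)  # r.content holds the raw bytes of the response
    return path


# Hypothetical sample page, so the parsing step runs without a network.
sample_html = '''
<html><body>
  <p>First paragraph.</p>
  <img src="/img/a.png">
  <p>Second <b>paragraph</b>.</p>
  <img src="http://example.com/img/b.jpg">
</body></html>
'''

images, texts = extract_images_and_text(sample_html, 'http://example.com/')
print(images)
print(texts)
```

To fetch the files themselves, each URL in images could be passed to download_image. Pages that load images through JavaScript or CSS backgrounds would need a different approach, which this sketch does not cover.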