
python - How do I create a list from a sitemap.xml file to extract the URLs in Python?

Repost · Author: 太空宇宙 · Updated: 2023-11-03 15:45:44

I need to write code that finds a word inside image links. To explain: starting from the page's sitemap.xml, my code must try every link listed in that XML file and, on each page, check whether the image links contain a specific word.

The sitemap is the adidas one: http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml

This is the code I wrote to search for images whose link contains the word "zoom":

```python
import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.adidas.it/scarpe-superstar/C77124.html').text
bs = BeautifulSoup(html, 'html.parser')
possible_links = bs.find_all('img')
for link in possible_links:
    if link.has_attr('src') and 'zoom' in link['src']:
        print(link['src'])
```
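The src-filtering step above can be exercised offline against an inline HTML snippet (the markup and image paths below are invented for illustration):

```python
from bs4 import BeautifulSoup

# hypothetical product-page markup standing in for the real adidas page
html = """
<html><body>
<img src="/images/zoom/C77124_01.jpg">
<img src="/images/thumb/C77124_01.jpg">
<img alt="no src attribute">
</body></html>
"""

bs = BeautifulSoup(html, 'html.parser')
# keep only <img> tags that have a src containing 'zoom'
zoom_srcs = [img['src'] for img in bs.find_all('img')
             if img.has_attr('src') and 'zoom' in img['src']]
print(zoom_srcs)  # only the first image matches
```

The `has_attr` guard matters because an `<img>` without a `src` would otherwise raise a `KeyError`.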

But I am looking for a way to scrape the list of pages automatically.

Thanks a lot.

I tried this to build the list:

```python
from bs4 import BeautifulSoup
import requests

url = "http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'xml')

for loc in soup.find_all("loc"):
    print(loc.text)
```

But I can't chain the requests together: I need to fetch every link found in sitemap.xml and search for the word "zoom" in the image links of each page.

Thanks a lot.
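Collecting the `<loc>` values into a plain Python list — the step that fails above — can be sketched offline against an inline sitemap fragment (the two product URLs below are invented for illustration):

```python
from bs4 import BeautifulSoup

# hypothetical two-entry sitemap standing in for the real adidas file
sitemap_xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.adidas.it/scarpe-superstar/C77124.html</loc></url>
  <url><loc>http://www.adidas.it/scarpe-stan-smith/M12345.html</loc></url>
</urlset>
"""

soup = BeautifulSoup(sitemap_xml, 'html.parser')
# gather every <loc> text into a list instead of printing it
product_urls = [loc.text for loc in soup.find_all('loc')]
print(product_urls)
# each entry can then be fetched in turn with requests.get(url).text
```

Once the URLs are in a list, the per-page "zoom" check from the first snippet can run inside a loop over that list.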

Best answer

```python
import requests
from bs4 import BeautifulSoup
import re

def make_soup(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    return soup

# put the sitemap urls in a list
def get_xml_urls(soup):
    urls = [loc.string for loc in soup.find_all('loc')]
    return urls

# get the img urls whose src contains the given string
def get_src_contain_str(soup, string):
    srcs = [img['src'] for img in soup.find_all('img', src=re.compile(string))]
    return srcs

if __name__ == '__main__':
    xml = 'http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml'
    soup = make_soup(xml)
    urls = get_xml_urls(soup)
    # loop through the urls
    for url in urls:
        url_soup = make_soup(url)
        srcs = get_src_contain_str(url_soup, 'zoom')
        print(srcs)
```
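Since sitemaps declare the http://www.sitemaps.org/schemas/sitemap/0.9 namespace, the `<loc>` extraction can also be done with the standard library's `xml.etree.ElementTree` when `lxml` is not installed; a minimal sketch against an invented one-entry fragment:

```python
import xml.etree.ElementTree as ET

# hypothetical sitemap fragment; the real file has the same structure
sitemap_xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.adidas.it/scarpe-superstar/C77124.html</loc></url>
</urlset>
"""

root = ET.fromstring(sitemap_xml)
# map a prefix to the sitemap namespace so findall can match <loc>
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
locs = [loc.text for loc in root.findall('.//sm:loc', ns)]
print(locs)
```

Without the namespace mapping, `findall('.//loc')` would return nothing, because ElementTree stores the tags as `{http://www.sitemaps.org/schemas/sitemap/0.9}loc`.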

Regarding "python - How do I create a list from a sitemap.xml file to extract the URLs in Python?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41781054/
