
Python: find all matches in text using a regular expression

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:02:43

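I am trying to parse a website and get the URLs of 4 video files. Example link: https://cs510400.vk.me/3/u381845574/videos/e8f1419d5b.720.mp4

First, I fetch the HTML and find the tag that contains my links. Then I find the line inside that tag that holds them.

My code: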

# coding: utf-8
import re

import requests
from bs4 import BeautifulSoup

r = requests.get('https://vk.com/video-63758929_456249306')

soup = BeautifulSoup(r.content, 'lxml')
scripts = soup.find_all('script')
current_tag = scripts[-1].string

links = re.findall('^.*source.*$', current_tag, re.MULTILINE)
current_line = []
for x in links:
    current_line.append(x)

print(current_line)

I get this result:

[u'ajax.preload(\'al_video.php\', {"act":"show","video":"-63758929_456249306","module":"direct"}, ["\u041d\u0435\u043c\u043d\u043e\u0433\u043e \u043f\u043e\u0442\u0430\u0441\u043a\u0443\u0445\u0430","<div id=\\"video_box_wrap-63758929_456249306\\" class=\\"video_box_wrap\\">\\n  <video id=\\"video_player\\" poster=\\"https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg\\" preload=\\"none\\" controls  onplaying=\\"cur.incViews && cur.incViews()\\">\\n    <source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.720.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.480.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.360.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.240.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source>\\n    <div class=\\"video_box_background\\" style=\\"background-image:url(https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg);\\"><\\/div>\\n    <div class=\\"video_box_cant_play\\">\u0414\u0430\u043d\u043d\u043e\u0435 \u0432\u0438\u0434\u0435\u043e \u043d\u0435 \u043c\u043e\u0436\u0435\u0442 \u0431\u044b\u0442\u044c \u043f\u0440\u043e\u0438\u0433\u0440\u0430\u043d\u043e \u043d\u0430 \u044d\u0442\u043e\u043c 
\u0443\u0441\u0442\u0440\u043e\u0439\u0441\u0442\u0432\u0435<\\/div>\\n  <\\/video>\\n<\\/div>","\\naddTemplates({\\"_\\":\\"_\\",\\"audio_row\\":\\"<div class=\\\\\\"audio_row _audio_row _audio_row_%1%_%0% %cls% clear_fix\\\\\\" onclick=\\\\\\"return getAudioPlayer().toggleAudio(this, event)\\\\\\" data-audio=\\\\\\"%serialized%\\\\\\" data-full-id=\\\\\\"%1%_%0%\\\\\\" id=\\\\\\"audio_%1%_%0%\\\\\\">\\\\n  <div class=\\\\\\"audio_play_wrap\\\\\\" data-nodrag=\\\\\\"1\\\\\\"><button class=\\\\\\"audio_play _audio_play\\\\\\" id=\\\\\\"play_%1%_%0%\\\\\\" aria-label=\\\\\\"\\\\\\"><\\\\\\/button><\\\\\\/div>\\\\n  <div class=\\\\\\"audio_info\\\\\\">\\\\n    <div class=\\\\\\"audio_duration_wrap _audio_duration_wrap\\\\\\">\\\\n      <div class=\\\\\\"audio_hq_label\\\\\\"><\\\\\\/div>\\\\n      <div class=\\\\\\"audio_duration _audio_duration\\\\\\">%duration%<\\\\\\/div>\\\\n      <div class=\\\\\\"audio_acts\\\\\\">\\\\n        <div class=\\\\\\"audio_act\\\\\\" id=\\\\\\"recom\\\\\\" onmouseover=\\\\\\"audioShowActionTooltip(this, \'%1%_%0%\')\\\\\\" onclick=\\\\\\"AudioPage(this).showRecoms(this, \'%1%_%0%\', event)\\\\\\"><div><\\\\\\/div><\\\\\\/d
...

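But I only need my 4 links. What am I doing wrong? How can I get just the links out of this big tag?

Best answer

I included your result as a string and added a regular expression to extract the URLs.

Regex: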

(?<=src\=\\\")(https:\\\/\\\/c[\s\S]*?mp4)

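Regex demo: https://regex101.com/r/GDMBqH/2

When using the regex in Python, there is no need to escape the forward slash (/). In a raw string, only the literal backslashes that appear in the text have to be written as \\.

Python code: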
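To make the escaping concrete, here is a minimal sketch of the lookbehind regex written as a Python raw string. The `sample` text is a shortened, hypothetical stand-in for the script content (the poster file name and the `extra=abc` query are made up), not the real page.

```python
import re

# Shortened, hypothetical stand-in for the script text. In the real page the
# HTML sits inside a JavaScript string, so quotes appear as \" and slashes as \/.
sample = (
    r'<video poster=\"https:\/\/pp.vk.me\/c836534\/poster.jpg\">'
    r'<source src=\"https:\/\/cs510400.vk.me\/3\/videos\/e8f1419d5b.720.mp4?extra=abc\" type=\"video\/mp4\">'
    r'<source src=\"https:\/\/cs510603.vk.me\/3\/videos\/e8f1419d5b.240.mp4?extra=abc\" type=\"video\/mp4\">'
)

# In a raw string, "\\" matches one literal backslash; "/" and "=" need no escape.
# The lookbehind requires the match to be preceded by src=\" without consuming it.
pattern = re.compile(r'(?<=src=\\")https:\\/\\/c[\s\S]*?mp4')

matches = pattern.findall(sample)
for url in matches:
    print(url)
```

The poster URL is skipped because it is not preceded by `src=\"`, and the lazy `*?` stops each match at the first `mp4` that follows.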

import re
results = '''[u'ajax.preload(\'al_video.php\', {"act":"show","video":"-63758929_456249306","module":"direct"}, ["\u041d\u0435\u043c\u043d\u043e\u0433\u043e \u043f\u043e\u0442\u0430\u0441\u043a\u0443\u0445\u0430","<div id=\\"video_box_wrap-63758929_456249306\\" class=\\"video_box_wrap\\">\\n <video id=\\"video_player\\" poster=\\"https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg\\" preload=\\"none\\" controls onplaying=\\"cur.incViews && cur.incViews()\\">\\n <source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.720.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.480.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.360.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.240.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source>\\n <div class=\\"video_box_background\\" style=\\"background-image:url(https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg);\\"><\\/div>\\n <div class=\\"video_box_cant_play\\">\u0414\u0430\u043d\u043d\u043e\u0435 \u0432\u0438\u0434\u0435\u043e \u043d\u0435 \u043c\u043e\u0436\u0435\u0442 \u0431\u044b\u0442\u044c \u043f\u0440\u043e\u0438\u0433\u0440\u0430\u043d\u043e \u043d\u0430 \u044d\u0442\u043e\u043c 
\u0443\u0441\u0442\u0440\u043e\u0439\u0441\u0442\u0432\u0435<\\/div>\\n <\\/video>\\n<\\/div>","\\naddTemplates({\\"_\\":\\"_\\",\\"audio_row\\":\\"<div class=\\\\\\"audio_row _audio_row _audio_row_%1%_%0% %cls% clear_fix\\\\\\" onclick=\\\\\\"return getAudioPlayer().toggleAudio(this, event)\\\\\\" data-audio=\\\\\\"%serialized%\\\\\\" data-full-id=\\\\\\"%1%_%0%\\\\\\" id=\\\\\\"audio_%1%_%0%\\\\\\">\\\\n <div class=\\\\\\"audio_play_wrap\\\\\\" data-nodrag=\\\\\\"1\\\\\\"><button class=\\\\\\"audio_play _audio_play\\\\\\" id=\\\\\\"play_%1%_%0%\\\\\\" aria-label=\\\\\\"\\\\\\"><\\\\\\/button><\\\\\\/div>\\\\n <div class=\\\\\\"audio_info\\\\\\">\\\\n <div class=\\\\\\"audio_duration_wrap _audio_duration_wrap\\\\\\">\\\\n <div class=\\\\\\"audio_hq_label\\\\\\"><\\\\\\/div>\\\\n <div class=\\\\\\"audio_duration _audio_duration\\\\\\">%duration%<\\\\\\/div>\\\\n <div class=\\\\\\"audio_acts\\\\\\">\\\\n <div class=\\\\\\"audio_act\\\\\\" id=\\\\\\"recom\\\\\\" onmouseover=\\\\\\"audioShowActionTooltip(this, \'%1%_%0%\')\\\\\\" onclick=\\\\\\"AudioPage(this).showRecoms(this, \'%1%_%0%\', event)\\\\\\"><div><\\\\\\/div><\\\\\\/d'''
# Lazily match from each "https:\/\/c" up to the first "mp4" that follows
for m in re.finditer(r"(https:\\/\\/c[\s\S]*?mp4)", results):
    print(m.group(0))

Demo: https://repl.it/EQkR/1
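The matched strings still contain the JavaScript-escaped slashes (`\/`). As a possible follow-up step, the sketch below wraps the extraction in a helper and converts each match to a plain URL. The function name `extract_mp4_urls` and the shortened `sample` string (with a made-up `extra=abc` query) are assumptions for illustration, not part of the original answer.

```python
import re


def extract_mp4_urls(text):
    """Return plain .mp4 URLs found in JS-escaped text (slashes written as \\/)."""
    escaped = re.findall(r'https:\\/\\/c[\s\S]*?mp4', text)
    # Turn each "\/" back into "/" so the result is a usable URL.
    return [u.replace('\\/', '/') for u in escaped]


# Hypothetical, shortened sample of one escaped <source> tag.
sample = (
    r'<source src=\"https:\/\/cs510400.vk.me\/3\/u381845574\/videos\/'
    r'e8f1419d5b.720.mp4?extra=abc\" type=\"video\/mp4\">'
)

print(extract_mp4_urls(sample))
```

Because the pattern stops at `mp4`, the `?extra=...` query string is not part of the result. Whether the server still serves the file without it is not something the original thread confirms.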

Regarding "Python: find all matches in text using a regular expression", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/40463214/
