
python - Regular expression pattern for content inside HTML tags

Reposted. Author: 行者123. Updated: 2023-11-28 17:50:49

I wrote a simple Python script that connects to a particular website and fetches all the links on it.

import urllib2
import re


request = urllib2.urlopen('http://www.securitytube.net/')
content = request.read()
match = re.findall(r'<a href=".\w+.\d+">.+</a>', content)
if match:
    for i in match:
        print i + "\n"
else:
    print 'Not Found!'

Result:

<a href="/video/3878"><img class="corner iradius20  ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3878.jpg" alt=
"avatar" /></a>

<a href="/video/3878">NodeZero Linux Review</a>

<a href="/video/3877"><img class="corner iradius20 ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3877.jpg" alt=
"avatar" /></a>

<a href="/video/3877">Post Attack Uploading Shell in Real Time</a>

<a href="/video/3867"><img class="corner iradius20 ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3867.jpg" alt=
"avatar" /></a>

<a href="/video/3867">Using SQLMAP in Real Time (SQLinjection WEB)</a>

<a href="/video/3866"><img class="corner iradius20 ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3866.jpg" alt=
"avatar" /></a>
....
...
...

I'm trying to get only the links that have a human-readable title, for example <a href="/video/3867">Using SQLMAP in Real Time (SQLinjection WEB)</a>.

My pattern is: <a href=".\w+.\d+">.+</a>

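Best answer

If you really want to use a regular expression rather than a proper parser, you can match capturing groups and access them afterwards.

See http://docs.python.org/library/re.html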

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed
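As a small illustration of the group behaviour quoted above, here is a sketch that runs the pattern against two hard-coded lines instead of the live site. The sample markup is an assumption modelled on the output in the question, and the code is written for Python 3, whereas the code in this post is Python 2.

```python
import re

# Hypothetical sample markup, modelled on the output shown in the question.
html = (
    '<a href="/video/3878"><img src="3878.jpg" alt="avatar" /></a>\n'
    '<a href="/video/3878">NodeZero Linux Review</a>\n'
)

# Two capturing groups: the href value and the text between the tags.
pattern = re.compile(r'<a href="(.*?)".*>(.*)</a>')

# With more than one group, findall returns a list of tuples, one per match.
pairs = pattern.findall(html)
for link, title in pairs:
    print("link %s -> %s" % (link, title))
```

The image-only anchor still matches, but its second group is an empty string, so it shows up as a pair with a blank title.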

Try:

import re
import urllib2

request = urllib2.urlopen('http://www.securitytube.net/')
content = request.read()
# Group 1 captures the href value, group 2 the text between the tags.
match = re.findall(r'<a href="(.*?)".*>(.*)</a>', content)
if match:
    for link, title in match:
        print "link %s -> %s" % (link, title)

This outputs:

link /video/3822 -> SecurityTube SpeakUp: Cloud Computing
link /video/3587 ->
link /video/3587 -> Securitytube Speak Up: Antivirus Evasion attacks
link /video/3489 ->
link /video/3489 -> SecurityTube SpeakUp: ThePirateBay LOSS
link /video/3375 ->
link /video/3375 -> SecurityTube SpeakUp: .COM and .NET Domain Seizures
link /video/3130 ->
link /video/3130 -> SecurityTube Speak Up: The MS12-020 Fiasco!
...etc
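The blank titles in this output most likely come from anchors that wrap only an <img> tag, as in the question's result. One possible way to skip them at match time is to require the title to contain no nested tag. This is only a sketch in Python 3 with hypothetical sample markup, not part of the original answer.

```python
import re

# Hypothetical sample markup, modelled on the output shown in the question.
html = (
    '<a href="/video/3867"><img src="3867.jpg" alt="avatar" /></a>\n'
    '<a href="/video/3867">Using SQLMAP in Real Time (SQLinjection WEB)</a>\n'
)

# [^>]* stays inside the opening tag; [^<]+ demands at least one character
# of plain text, so an anchor whose content starts with a tag cannot match.
pattern = re.compile(r'<a href="([^"]*)"[^>]*>([^<]+)</a>')

titled = pattern.findall(html)
for link, title in titled:
    print("link %s -> %s" % (link, title))
```

This tighter pattern also misses anchors that mix text with inline tags such as <b>, which is one more reason the regex approach stays fragile.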

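You can of course filter the links so that only those with a matching title are considered. You will also want to discard links that start with #. As you can see, a proper parser would give you better results.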
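For comparison, a parser-based version might look like the sketch below. It uses html.parser from the Python 3 standard library. The class name LinkCollector and the sample markup are made up for illustration, and the filtering rules (skip anchors without text, skip hrefs starting with #) are one reading of the advice above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for anchors that contain visible text."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, title) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            # Skip image-only anchors (no text) and in-page "#" links.
            if title and not self._href.startswith("#"):
                self.links.append((self._href, title))
            self._href = None


# Hypothetical sample markup, modelled on the output shown in the question.
html = (
    '<a href="/video/3867"><img src="3867.jpg" alt="avatar" /></a>\n'
    '<a href="/video/3867">Using SQLMAP in Real Time (SQLinjection WEB)</a>\n'
    '<a href="#top">Back to top</a>\n'
)

collector = LinkCollector()
collector.feed(html)
collector.close()
for link, title in collector.links:
    print("link %s -> %s" % (link, title))
```

To run it against a real page, the downloaded content would be decoded to text and passed to feed() in place of the sample string.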

For more on python - Regular expression pattern for content inside HTML tags, see the similar question on Stack Overflow: https://stackoverflow.com/questions/10248963/
