
python - Extract the context URL of a substring

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:43:09

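I am building a scrapy application and need to extract the full URL whenever a substring of that URL matches.

For example:

Suppose a page contains the following URLs that I am interested in: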

  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html
  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://www.brpreiss.com/books/opus7/html/book.html
  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://www.diveintopython.net/
  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/
  • [18 more]

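But my search string is flag?cat=Computers/Programming/Languages/Python/Books

Only the matching part of the URL is returned, not the full URL. How can I get the full URLs listed above?

Here is a simple scrapy test case based on the example: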

from scrapy.spiders import Spider
import scrapy


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # scrapy.shell.inspect_response(response, self)
        # .re() returns only the text captured by the group, not the whole href
        results = response.xpath('//body').re(r'(flag\?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks)')
        print(results)

Output:

[
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks'
]

Expected output:

[
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130260363%2C00%252Ben-USS_01DBC.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.brpreiss.com%2Fbooks%2Fopus7%2Fhtml%2Fbook.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.diveintopython.net%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Frhodesmill.org%2Fbrandon%2F2011%2Ffoundations-of-python-network-programming%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.techbooksforfree.com%2Fperlpython.shtml"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.freetechbooks.com%2Fpython-f6.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fgreenteapress.com%2Fthinkpython%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.network-theory.co.uk%2Fpython%2Fintro%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.freenetpages.co.uk%2Fhp%2Falan.gauld%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.wiley.com%2FWileyCDA%2FWileyTitle%2FproductCd-0471219754.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fhetland.org%2Fwriting%2Fpractical-python%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fsysadminpy.com%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.qtrac.eu%2Fpy3book.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.wiley.com%2FWileyCDA%2FWileyTitle%2FproductCd-0764548077.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=https%3A%2F%2Fwww.packtpub.com%2Fpython-3-object-oriented-programming%2Fbook"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.network-theory.co.uk%2Fpython%2Flanguage%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130409561%2C00%252Ben-USS_01DBC.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.informit.com%2Fstore%2Fproduct.aspx%3Fisbn%3D0201616165%26redir%3D1"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0201748843%2C00%252Ben-USS_01DBC.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.informit.com%2Fstore%2Fproduct.aspx%3Fisbn%3D0672317354"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fgnosis.cx%2FTPiP%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.informit.com%2Fstore%2Fproduct.aspx%3Fisbn%3D0130211192"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
]

Best answer

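The problem is that .re() returns only the part that matches the expression. If you want to keep using a regular-expression check, apply the re:test() function inside the XPath expression instead, so that the whole href value is selected: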

response.xpath(r'//body//a/@href[re:test(., "flag\?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks")]').extract()

On my side this produces the following:

[
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130260363%2C00%252Ben-USS_01DBC.html',
u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&url=http%3A%2F%2Fwww.brpreiss.com%2Fbooks%2Fopus7%2Fhtml%2Fbook.html',
...
]
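
The difference between the two approaches can also be reproduced without scrapy. The following is only a sketch using the standard library (html.parser and re), not the scrapy API: SAMPLE_HTML, HrefCollector and matching_hrefs are hypothetical names made up for illustration, and the markup is a shortened stand-in for the real page. It shows that matching against the raw page text yields only the matched fragment, while testing each href and keeping the whole attribute value yields the full URL.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup, modeled on the hrefs shown in the question.
SAMPLE_HTML = """
<body>
  <a href="/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.diveintopython.net%2F">flag</a>
  <a href="/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fgnosis.cx%2FTPiP%2F">flag</a>
  <a href="/Computers/Programming/Languages/Python/">unrelated</a>
</body>
"""

# Same search expression as in the question; "?" is escaped because it is
# a regex metacharacter.
PATTERN = re.compile(r"flag\?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks")


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def matching_hrefs(html):
    """Return whole href values that contain a match for PATTERN."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return [href for href in collector.hrefs if PATTERN.search(href)]


# Matching against the raw text returns only the matched fragment ...
fragments = PATTERN.findall(SAMPLE_HTML)
# ... whereas testing each href and keeping its whole value returns full URLs.
full_urls = matching_hrefs(SAMPLE_HTML)

print(fragments)
print(full_urls)
```

Inside a spider the re:test() expression above is the more direct route. The sketch is only meant to make the "test the attribute, then return the whole attribute" idea visible.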

Regarding python - Extract the context URL of a substring, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36604507/
