python - Match a single keyword in re.compile that has a list of keywords

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:13:41

I have keywords like this:

cat="AUTHORISATION,FORTHCOMING BOARD MEETINGS,PREVIOUS BOARD MEETINGS,BOARD MEETINGS,BOARD MEETING,MINUTES,BOARD PAPERS,AGENDA,COMMUNITY PROFILES,FORTHCOMING GOVERNOR MEETINGS,PREVIOUS GOVERNOR MEETINGS,GOVERNOR MEETINGS,GOVERNOR MEETING,GOVERNOR,COUNCIL OF GOVERNORS,GOVERNING BODY MEETINGS,COMPARISON,APC SUMMARY OF DECISIONS"

I do some preprocessing on them like this:

import re

cat_list=cat.split(',')
cat_list=filter(None, cat_list)
cat_list=[s.strip() for s in cat_list]
cat_list=[re.sub('\r\n' , ' ', s) for s in cat_list]
cat_list=[re.sub(r'([^\s])\s([^\s])', r'\1+(.)+\2',x) for x in cat_list]
cat_list=[re.sub(r'([a-z][a-z]+)', r'(\1)',a,flags=re.I) for a in cat_list]
regexes_cat=[re.compile((r'(?:%s)' % '|'.join(cat_list)),re.IGNORECASE),]

This gives me a list containing a re.compile expression that I run re.search with. After processing, the final regex pattern looks like this:

(?:(AUTHORISATION)|(FORTHCOMING)+(.)+(BOARD)+(.)+(MEETINGS)|(PREVIOUS)+(.)+(BOARD)+(.)+(MEETINGS)|(BOARD)+(.)+(MEETINGS)|(BOARD)+(.)+(MEETING)|(MINUTES)|(BOARD)+(.)+(PAPERS)|(AGENDA)|(COMMUNITY)+(.)+(PROFILES)|(FORTHCOMING)+(.)+(GOVERNOR)+(.)+(MEETINGS)|(PREVIOUS)+(.)+(GOVERNOR)+(.)+(MEETINGS)|(GOVERNOR)+(.)+(MEETINGS)|(GOVERNOR)+(.)+(MEETING)|(GOVERNOR)|(COUNCIL)+(.)+(OF)+(.)+(GOVERNORS)|(GOVERNING)+(.)+(BODY)+(.)+(MEETINGS)|(COMPARISON)|(APC)+(.)+(SUMMARY)+(.)+(OF)+(.)+(DECISIONS))

But if I print group(0), I get a result like this:

GOVERNORS-MEETINGS.ASP?P=GOVERNORS%27.COUNCIL.MEETINGS

So I searched and found that I have to use ? to make the match non-greedy, but I still cannot get the desired output, which should be:

GOVERNORS-MEETINGS

I am working with URLs and text that appear on web pages, for example:

http://www.qehkl.nhs.uk/governors-meetings.asp?p=governors%27.council.meetings&s=main&ss=becoming.a.foundation.trust

Best answer

The solution I suggest is based on the following assumptions:

  • The regex match should occur in the last subpart of the path (i.e. in the file part, before any query string)
  • The query string is optional
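
As a minimal sketch of what these two assumptions mean in practice, the snippet below isolates the last path subpart of a URL whether or not a query string is present. The helper name last_path_segment and the second, query-less URL are hypothetical and used only for illustration; the code assumes Python 3, where urlparse lives in urllib.parse.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse


def last_path_segment(url):
    """Return the final path component of a URL, ignoring any query string.

    Hypothetical helper, not part of the original answer.
    """
    path = urlparse(url.strip()).path  # the path excludes "?..." and "#..."
    return path.split('/')[-1]         # keep only the part after the last "/"


# URL from the question, with a query string
with_query = ("http://www.qehkl.nhs.uk/governors-meetings.asp"
              "?p=governors%27.council.meetings&s=main&ss=becoming.a.foundation.trust")
# Same page without a query string (illustrative)
without_query = "http://www.qehkl.nhs.uk/governors-meetings.asp"

print(last_path_segment(with_query))     # governors-meetings.asp
print(last_path_segment(without_query))  # governors-meetings.asp
```

Because the query string is dropped before the regex runs, keywords that appear only in the query string can no longer take part in a match.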

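So the solution is to first parse the URL with urlparse and extract only the string the regex should run against, which avoids any need for lookarounds. Then use a lazy (.*?), which matches any 0+ characters but as few as possible, instead of (.)+: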

import re
from urllib.parse import urlparse # Python 3; on Python 2 use: from urlparse import urlparse

cat="AUTHORISATION,FORTHCOMING BOARD MEETINGS,PREVIOUS BOARD MEETINGS,BOARD MEETINGS,BOARD MEETING,MINUTES,BOARD PAPERS,AGENDA,COMMUNITY PROFILES,FORTHCOMING GOVERNOR MEETINGS,PREVIOUS GOVERNOR MEETINGS,GOVERNOR MEETINGS,GOVERNOR MEETING,GOVERNOR,COUNCIL OF GOVERNORS,GOVERNING BODY MEETINGS,COMPARISON,APC SUMMARY OF DECISIONS"
cat_list=cat.split(',')
cat_list=filter(None, cat_list)
cat_list=[s.strip() for s in cat_list]
cat_list=[re.sub('\r\n' , ' ', s) for s in cat_list]
cat_list=[re.sub(r'([^\s])\s([^\s])', r'\1(.*?)\2',x) for x in cat_list] # Allow anything in between the keywords, but as few as possible
cat_list=[re.sub(r'([a-z][a-z]+)', r'(\1)', a, flags=re.I) for a in cat_list]
regex_cat=re.compile(r"(?:{})".format('|'.join(cat_list)),re.IGNORECASE)
#print(regex_cat.pattern)
urls = "GOVERNORS/GOVERNORS-MEETINGS.ASP?P=GOVERNORS%27.COUNCIL.MEETINGS "
o = urlparse(urls) # Parse the URL
last_subpart = o.path.split('/').pop() # Get the last subpart
m = regex_cat.search(last_subpart) # Run the regex search
if m: # If there is a match...
    print(m.group()) # Print or do anything with the value

See the Python demo.
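
To make the difference between the two quantifiers concrete, here is a small sketch that runs a single alternative from the pattern, first in the question's greedy form and then in the lazy form, against the string from the question. The variable names are illustrative only and are not taken from the original code.

```python
import re

# The string from the question for which group(0) returned too much
text = "GOVERNORS-MEETINGS.ASP?P=GOVERNORS%27.COUNCIL.MEETINGS"

# Greedy form produced by the question's preprocessing: (.)+ takes as much as it can
greedy = re.compile(r"(GOVERNOR)+(.)+(MEETINGS)", re.IGNORECASE)
# Lazy form from the answer: (.*?) takes as little as it can
lazy = re.compile(r"(GOVERNOR)(.*?)(MEETINGS)", re.IGNORECASE)

greedy_match = greedy.search(text).group(0)
lazy_match = lazy.search(text).group(0)

print(greedy_match)  # GOVERNORS-MEETINGS.ASP?P=GOVERNORS%27.COUNCIL.MEETINGS
print(lazy_match)    # GOVERNORS-MEETINGS
```

The greedy version runs on to the last MEETINGS in the string, while the lazy version stops at the first one. The lazy quantifier alone already yields the desired result for this input; parsing the URL first additionally keeps the match from reaching into the query string.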

For "python - Match a single keyword in re.compile that has a list of keywords", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44042109/
