
python - Regex to capture specific percentages/decimals

Reposted. Author: 行者123. Updated: 2023-12-01 01:48:33

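I am trying to scrape interest rates from several websites. The data is fairly unstructured, but the formats are close enough to each other. What I want to capture:

x.xx% to xx.xx%

Sample data: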

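All loans made by WebBank, Member FDIC. Your actual rate depends upon credit score, loan amount, loan term, and credit usage and history. The APR ranges from 5.98% to 35.89%. For example, you could receive a loan of $6,000 with an interest rate of 7.99% and a 5.00% origination fee of $300 for an APR of 11.51%. In this example, you will receive $5,700 and will make 36 monthly payments of $187.99. The total amount repayable will be $6,767.64. Your APR will be determined based on your credit at time of application. The origination fee ranges from 1% to 6%, and the average origination fee is 5.49% as of Q1 2017. There is no down payment and there is never a prepayment penalty. Closing of your loan is contingent upon your agreement to all the required agreements and disclosures on the www.lendingclub.com website. All loans via LendingClub have a minimum repayment term of 36 months or longer.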

3.09% – 14.24%*

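Fixed rates: 6.99% to 24.99% APR. Lock in your rate. Your monthly payment will never change.

The parts I want to capture are the rate ranges (5.98% to 35.89%, 3.09% – 14.24%, and 6.99% to 24.99%). My current regex looks like this:

(?i)(\d\.\d\d% (?:to|-) \d\d\.\d\d%)

The actual code is as follows: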

import datetime
import re

import requests as r
from bs4 import BeautifulSoup as bs

plcompetitors = ['https://www.lendingclub.com/loans/personal-loans',
                 'https://www.marcus.com/us/en/personal-loans',
                 'https://www.discover.com/personal-loans/',
                 'https://www.lightstream.com/',
                 'https://www.prosper.com/']

# cycle through links in array until it finds APR rates/fixed or variable using regex
for link in plcompetitors:
    cdate = datetime.date.today()
    l = r.get(link)
    l.encoding = 'utf-8'
    data = l.text
    soup = bs(data, 'html.parser')
    # every text node that contains a digit followed by a percent sign
    paragraph = soup.find_all(text=re.compile('[0-9]%'))
    for n in paragraph:
        # one result row per text node: matched ranges, then date and source link
        matches = []
        matches.extend(re.findall(r'(?i)(\d\.\d\d% (?:to|-) \d\d\.\d\d%)', n.string))
        matches.append(cdate.isoformat())
        matches.append(link)
        print(matches)
    paragraph.append(cdate.isoformat())
    paragraph.append(link)

New output:

['5.98% to 35.89%', '2018-06-22', 'https://www.lendingclub.com/loans/personal-loans']
['2018-06-22', 'https://www.lendingclub.com/loans/personal-loans']
['6.99% to 24.99%', '6.99% to 24.99%', '6.99% to 24.99%', '6.99% to 24.99%', '2018-06-22', 'https://www.marcus.com/us/en/personal-loans']
['2018-06-22', 'https://www.marcus.com/us/en/personal-loans']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['6.99% to 24.99%', '2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']

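Best answer

The line paragraph = soup.find_all(text=re.compile('(?i)(\d\.\d\d% (?:to|-) \d\d\.\d\d%)')) returns every node whose text matches your pattern, that is, the whole node rather than just the matching substring. You still need to extract the matches from those paragraphs.

Use something like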

matches = []
for n in paragraph:
    matches.extend(re.findall(pattern, n.string))
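As a rough, self-contained sketch of that loop: the plain strings below are hypothetical stand-ins for the text nodes that soup.find_all returns, and pattern is the regex from the question written as a raw string. It only illustrates that re.findall yields the matched substrings rather than the full node text.

```python
import re

# The pattern from the question, as a raw string so the backslashes are kept literally.
pattern = r'(?i)(\d\.\d\d% (?:to|-) \d\d\.\d\d%)'

# Hypothetical stand-ins for the text nodes returned by soup.find_all(text=...).
paragraph = [
    "The APR ranges from 5.98% to 35.89%.",
    "The origination fee ranges from 1% to 6%.",
    "Fixed rates: 6.99% to 24.99% APR.",
]

matches = []
for text in paragraph:
    # findall returns only the captured substrings, not the surrounding sentence
    matches.extend(re.findall(pattern, text))

print(matches)
```

With this stricter pattern the "1% to 6%" sentence contributes nothing, because it has no decimal part.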

As for the pattern itself, you can use

(?i)\d+(?:\.\d+)?%\s*(?:to|-)\s*\d+(?:\.\d+)?%

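See the regex demo. Details:

  • (?i) - case-insensitive matching is turned on
  • \d+(?:\.\d+)? - 1+ digits, optionally followed by . and 1+ digits
  • % - a % sign
  • \s* - 0+ whitespace characters
  • (?:to|-) - to or -
  • \s*\d+(?:\.\d+)?% - see above (in short, whitespace, then an int or float value followed by %).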
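To see how the pieces listed above behave together, here is a small sketch that runs the answer's pattern over a few strings modelled on the sample data. The names RATE_RANGE, WITH_EN_DASH, and samples are illustrative only, and the en-dash variant is an assumed extension that is not part of the original answer.

```python
import re

# The pattern from the answer, compiled once.
RATE_RANGE = re.compile(r'(?i)\d+(?:\.\d+)?%\s*(?:to|-)\s*\d+(?:\.\d+)?%')

# Assumed extension (not in the answer): also accept an en dash (U+2013) as separator.
WITH_EN_DASH = re.compile(r'(?i)\d+(?:\.\d+)?%\s*(?:to|[-\u2013])\s*\d+(?:\.\d+)?%')

samples = [
    "The APR ranges from 5.98% to 35.89%.",
    "The origination fee ranges from 1% to 6%.",
    "Fixed rates: 6.99% TO 24.99% APR",
    "3.09%-14.24%*",
    "3.09% \u2013 14.24%*",  # en dash, as in the sample data
]

found = [RATE_RANGE.findall(s) for s in samples]
found_en_dash = WITH_EN_DASH.findall(samples[-1])

for sample, hits in zip(samples, found):
    print(repr(sample), '->', hits)
print('en dash variant ->', found_en_dash)
```

Two things are worth noticing. The looser pattern also picks up "1% to 6%", which is an origination fee rather than an interest rate, so some filtering by context may still be needed. And a plain hyphen in the pattern does not match the en dash used in "3.09% – 14.24%*", which is why the last sample only matches with the extended variant.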

Regarding "python - Regex to capture specific percentages/decimals", the original question is on Stack Overflow: https://stackoverflow.com/questions/50977971/
