python - pandas extractall() is not extracting all matches for a given regex?

Reposted. Author: 太空狗. Updated: 2023-10-30 00:16:23

I have a nested list of strings from which I want to extract dates. The date format is:

Two digits (from 01 to 12), a hyphen, three letters (a valid month abbreviation), a hyphen, and two digits, for example: 08-Jan—07 or 03-Oct—01

I tried the following regex:

r'\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}'

Then I tested it like this:

import pandas as pd
df = pd.DataFrame({'blobs':['6-Feb- 1 4 Facebook’s virtual-reality division created a 3-EBÚ7 11 network of 500 free demo stations in Best Buy stores to give people a taste of VR using the Oculus Rift 90 GT 48 headset. But according to a Wednesday report from Business Insider, about 200 of the demo stations will close after low interest from consumers. 17-Feb-2014',
'I think in a store environment getting people to sit down and go through that experience of getting a headset on and getting set up is quite a difficult thing to achieve,” said Geoff Blaber, a CCS Insight analyst. 29—Oct-2012 Blaber 32 FAX 2978 expects that it will get easier when companies can convince 18-Oct-12 credit cards. '
]})
df

Then:

df['blobs'].str.extractall(r'\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}')

However, this did not work. The regex above returned nothing useful (only the hyphen -):

    Col
0 NaN
1 -
2 -
3 NaN
4 NaN
5 -
...
n -

How can I fix it to get the following?

           Col
0 6-Feb-14, 17-Feb-2014
1 29—Oct-2012, 18-Oct-12

Update

I also tried:

import re
df['col'] = df.blobs.apply(lambda x: re.findall('\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}',x))
s = df.apply(lambda x: pd.Series(x['col']),axis=1).stack().reset_index(level=1, drop=True)
s.name = "col"
df = df.drop('col')
df

However, I got:

ValueError                                Traceback (most recent call last)
<ipython-input-4-5e9a34bd159f> in <module>()
3 s = df.apply(lambda x: pd.Series(x['col']),axis=1).stack().reset_index(level=1, drop=True)
4 s.name = "col"
----> 5 df = df.drop('col')
6 df

/usr/local/lib/python3.5/site-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, errors)
1905 new_axis = axis.drop(labels, level=level, errors=errors)
1906 else:
-> 1907 new_axis = axis.drop(labels, errors=errors)
1908 dropped = self.reindex(**{axis_name: new_axis})
1909 try:

/usr/local/lib/python3.5/site-packages/pandas/indexes/base.py in drop(self, labels, errors)
3260 if errors != 'ignore':
3261 raise ValueError('labels %s not contained in axis' %
-> 3262 labels[mask])
3263 indexer = indexer[~mask]
3264 return self.delete(indexer)

ValueError: labels ['col'] not contained in axis

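Best answer

When you use Series.str.extract or Series.str.extractall, what is returned is the captured substrings, not the whole match. So you need to make sure you capture (i.e. wrap in parentheses) the parts of the pattern you actually need.

Having several expected matches per row makes extractall harder to use here. It seems you can use Series.str.findall instead, which returns all whole matches when the pattern defines no capturing groups.

Use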
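To make the capturing behaviour concrete, here is a minimal sketch. The one-row Series and the variable names sep_only and full are illustrative assumptions, not part of the original question. It compares the question's pattern, which captures only the separator, with a variant that wraps the whole date in a capturing group.

```python
import pandas as pd

# Hypothetical one-row sample containing two dates.
s = pd.Series(["17-Feb-2014 and 18-Oct-12"])
months = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# The only capturing group is (—|-), so only the separator is returned.
sep_only = s.str.extractall(r"\d{2}(—|-)(?:" + months + r")-\d{2,4}")

# Wrapping the entire pattern in one capturing group returns the full date.
full = s.str.extractall(r"(\d{2}[-—](?:" + months + r")-\d{2,4})")

print(sep_only[0].tolist())  # separators only
print(full[0].tolist())      # whole dates
```

Both calls return a DataFrame with one row per match and one column per capturing group, which is why the first variant yields a column of hyphens.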

rx = r'\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[-–—](?:\d{4}|\d{2})\b'
df['Col'] = df['blobs'].str.findall(rx).apply(','.join)

.apply(','.join) turns each list of matches into a comma-separated string in the Col column.

Pattern details:
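As a rough check, the snippet above can be run on shortened versions of the two sample rows. The shortened strings are an assumption made for brevity; they keep the date-like fragments from the question and drop most of the surrounding prose.

```python
import pandas as pd

# Shortened versions of the question's two rows (assumption: the omitted
# prose contains no date-like text).
df = pd.DataFrame({"blobs": [
    "6-Feb- 1 4 Facebook's virtual-reality division created a 3-EBÚ7 11 network. 17-Feb-2014",
    "said Geoff Blaber, a CCS Insight analyst. 29—Oct-2012 Blaber 32 FAX 2978 expects 18-Oct-12 credit cards. ",
]})

rx = r"\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[-–—](?:\d{4}|\d{2})\b"

# rx has no capturing groups, so findall returns a list of whole matches per row.
matches = df["blobs"].str.findall(rx)

# Join each list into a single comma-separated string.
df["Col"] = matches.apply(",".join)
print(df["Col"].tolist())
```

Note that the fragment "6-Feb- 1 4" in the first row is not matched, because the year is broken up by spaces. Recovering it as 6-Feb-14 would need a separate cleaning step before matching.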

  • \b - a word boundary
  • \d{1,2} - 1 or 2 digits
  • [-–—] - a hyphen, en dash, or em dash
  • (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) - any of the 12 month abbreviations
  • [-–—] - a hyphen, en dash, or em dash
  • (?:\d{4}|\d{2}) - 4 or 2 digits
  • \b - a word boundary
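The breakdown above can be exercised with the standard re module. This is only a sketch: the helper find_dates and the test strings are hypothetical, chosen to probe each part of the pattern.

```python
import re

rx = re.compile(
    r"\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[-–—](?:\d{4}|\d{2})\b"
)

def find_dates(text):
    """Return every whole match of rx in text (hypothetical helper)."""
    return rx.findall(text)

# One-digit day, em dash, and en dash separators are all accepted.
print(find_dates("6-Feb-14"))
print(find_dates("29—Oct-2012"))
print(find_dates("18–Oct–12"))

# Rejected: broken year, three-digit day, three-digit year, invalid month.
print(find_dates("6-Feb- 1 4"))
print(find_dates("117-Feb-14"))
print(find_dates("17-Feb-201"))
print(find_dates("17-Foo-2014"))
```

Writing the year as (?:\d{4}|\d{2}) rather than \d{2,4} is what rejects a three-digit year such as 17-Feb-201, and the word boundaries keep the pattern from matching inside longer digit runs.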

Regarding python - pandas extractall() is not extracting all matches for a given regex?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42254384/
