
Python Regex: splitting comments into a DataFrame

Reposted. Author: 行者123. Last updated: 2023-12-04 15:31:51

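I have a bunch of strings in which users have entered various comments, all concatenated together. Sometimes, when the comments span several days, they type in a date. I am trying to find a way to split out each date and its corresponding comment. The comment text might look like this: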

raw_text = ['3/30: The dog is red. 4/01: The dog is blue', 'there is a green door', '3-25:Foobar baz'] 

I would like to turn that text into:

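import numpy as np
import pandas as pd

# the column names must be passed as a single list
df = pd.DataFrame([[0, '3/30', 'The dog is red.'], [0, '4/01', 'The dog is blue'], [1, np.nan, 'there is a green door'], [2, '3-25', 'Foobar baz']], columns=['row_id', 'date', 'text'])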

print(df)

   row_id  date                   text
0       0  3/30        The dog is red.
1       0  4/01        The dog is blue
2       1   NaN  there is a green door
3       2  3-25             Foobar baz

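I think what I need to do is find each colon and then look back to the first number before it to identify the date (sometimes they use / as the separator, sometimes -).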
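The idea described above can be sketched with a single pattern. This is only an illustration: the name DATE_BEFORE_COLON and the assumption that a date is always one or two digits, a / or -, and one or two digits are not part of the original question.

```python
import re

# Hypothetical pattern (an assumption about the date shape):
# 1-2 digits, '/' or '-', 1-2 digits, immediately followed by a colon.
# (?<!\d) stops the match from starting in the middle of a longer number.
DATE_BEFORE_COLON = re.compile(r'(?<!\d)\d{1,2}[/-]\d{1,2}(?=:)')

raw_text = ['3/30: The dog is red. 4/01: The dog is blue',
            'there is a green door',
            '3-25:Foobar baz']

# One list of date strings per input row; an empty list means no date was found.
dates_per_row = [DATE_BEFORE_COLON.findall(text) for text in raw_text]
print(dates_per_row)  # [['3/30', '4/01'], [], ['3-25']]
```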

Any ideas on how to approach this with regular expressions would be much appreciated; this is beyond my simple split/find knowledge.

Thanks!

Best answer

I don't know regular expressions very well (so there may be a better solution), but this seems to work:

import pandas as pd

# sample list
raw_text = ['10-30: The dog is red. 4/01: The dog is blue', 'there is a green door',
            '3-25:Foobar baz', '11-25:Foobar baz. 12/20: something else']

# zero-width lookaheads ('n' below stands for a digit): split just before
# 'nn/nn:' OR 'nn-nn:' OR ' n-nn:' OR ' n/nn:' OR ' nn-nn:' OR ' nn/nn:',
# or at the start of a string that begins with a digit
regex = r'(?=\d\d/\d\d:)|(?=\d\d-\d\d:)|(?= \d-\d\d:)|(?= \d/\d\d:)|(?= \d\d-\d\d:)|(?= \d\d/\d\d:)|(?=^\d)'
# split at the start of a string that begins with a non-digit, or on a ':'
regex2 = r'(?=^\D)|:'

# create a Series by splitting on regex, then explode so each piece gets its own row
# (the position in raw_text is kept as the index)
s = pd.DataFrame(raw_text)[0].str.split(regex).explode()
# boolean indexing to remove the empty and single-space pieces left by the split
s2 = s[(s != '') & (s != ' ')]

# strip leading or trailing white space, then split into date and text on regex2
df = s2.str.strip().str.split(regex2, expand=True).reset_index()
# rename columns
df.columns = ['row_id', 'date', 'text']
# remove the space that follows the ':' in entries such as '10-30: The dog is red.'
df['text'] = df['text'].str.strip()


   row_id   date                   text
0       0  10-30        The dog is red.
1       0   4/01        The dog is blue
2       1         there is a green door
3       2   3-25             Foobar baz
4       3  11-25            Foobar baz.
5       3  12/20         something else
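As a possible alternative (not the answerer's method), the same result can be sketched with one capturing pattern and re.split, which avoids listing every digit and separator combination. The names DATE_TAG and split_comments are hypothetical, and the pattern assumes that a date is always one or two digits, a / or -, one or two digits, and then a colon.

```python
import re

# Assumed date tag: 1-2 digits, '/' or '-', 1-2 digits, then a colon.
# The date itself is captured; the colon is consumed by the split.
DATE_TAG = re.compile(r'(?<!\d)(\d{1,2}[/-]\d{1,2}):')


def split_comments(raw_text):
    """Return a list of (row_id, date, text) tuples; date is None when absent."""
    rows = []
    for row_id, entry in enumerate(raw_text):
        # With one capture group, re.split returns
        # [text_before_first_tag, date1, text1, date2, text2, ...].
        parts = DATE_TAG.split(entry)
        leading = parts[0].strip()
        if leading:
            # Text that is not preceded by any date tag.
            rows.append((row_id, None, leading))
        for date, text in zip(parts[1::2], parts[2::2]):
            rows.append((row_id, date, text.strip()))
    return rows


raw_text = ['3/30: The dog is red. 4/01: The dog is blue',
            'there is a green door',
            '3-25:Foobar baz']
rows = split_comments(raw_text)
for row in rows:
    print(row)
```

Passing the result to pd.DataFrame(rows, columns=['row_id', 'date', 'text']) should give a frame with the layout requested in the question, with None in place of the missing date. Text that merely contains a date without a trailing colon, such as "until 5/15.", is left inside the comment.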

For more on Python Regex: splitting comments into a DataFrame, see the similar question on Stack Overflow: https://stackoverflow.com/questions/61129366/
