
python - Regular expression for extracting specific variables and values

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:12:26

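I'm using the Google Vision API to extract text (both handwritten and printed) from an image of an application form. The response is one long string, shown below.

String: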

"A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No

Religion: Muslim

Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"

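The full response is of no use to me as it stands; I need to parse it to get specific fields such as name, father's name, NIC number, gender, age, DoB, domicile, and contact number.

I'm using Python's regular expression library (re) to define a pattern for each field. For example: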

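import re

# Raw strings keep backslashes such as \w and \s intact for the regex engine
name = r'Name: \w+\s\w+'
fatherName = r"Father's Name: \w+\s\w+\s\w+"
age = r'Age: \D+\d+'

# `string` holds the OCR text shown above
print(re.search(name, string).group())
print(re.search(fatherName, string).group())
print(re.search(age, string).group())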

Output:

"Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Age: (in years) 22"

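But none of these patterns is reliable, and I don't know whether this approach is a good one. I also can't extract fields that share a line, such as gender and age.

How can I solve this?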

Best answer

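It may not be robust enough, but you can design an expression that extracts the three parameters you asked about. An online regex tester can help you do so. You might want an expression with multiple boundaries: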

(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))

It's best to focus on the text you actually want to extract.
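One way to focus on just the values is to give each field its own named capturing group. What follows is a minimal sketch rather than a drop-in solution: the shortened sample text, the group names `name`, `father_name`, and `age`, and the use of `re.MULTILINE` are all assumptions made for illustration.

```python
import re

# Shortened sample of the OCR text from the question (assumed for illustration).
text = """Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
"""

# One alternation with one named group per field.
# [ \t] is used instead of \s so a value cannot run onto the next line.
pattern = re.compile(
    r"^Name:[ \t]*(?P<name>[A-Za-z. \t]+)$"
    r"|^Father's[ \t]+Name:[ \t]*(?P<father_name>[A-Za-z. \t]+)$"
    r"|Age:[ \t]*\(in[ \t]+years\)[ \t]*(?P<age>\d+)",
    re.MULTILINE,
)

fields = {}
for m in pattern.finditer(text):
    # Only one named group takes part in each match; lastgroup is its name.
    fields[m.lastgroup] = m.group(m.lastgroup).strip()

print(fields)
```

Because `Age:` is not anchored to the start of a line, it is found even though it sits in the middle of the `Gender: ... Date of Birth` line.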

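Differences

  • Age: this variable seems easy to extract.
  • Name and father's name: you may want to check what the values of these two variables can look like, so that you can add the right characters to the character list. I simply assumed that this list would do: [A-Z-a-z\s\.]. You can change or simplify it as needed.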
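One caveat with that character list: `\s` also matches a newline, so a greedy class containing it can swallow part of the next line. The comparison below is a small sketch under the assumption that names contain only letters, dots, spaces, and tabs; the variable names `loose` and `strict` are made up for the example.

```python
import re

text = "Name: MUHAMMAD HANIE\nFather's Name: MUHAMMAD Y AQOOB\n"

# \s matches "\n" as well, so this class keeps going into the following line
# until it hits the apostrophe in "Father's".
loose = re.search(r"Name:([A-Za-z\s.]+)", text).group(1)

# Limiting whitespace to spaces and tabs keeps the value on a single line.
strict = re.search(r"Name:[ \t]*([A-Za-z. \t]+)", text).group(1)

print(repr(loose))   # contains "\n" and the start of the next line
print(repr(strict))  # just the name
```

If real forms contain hyphenated or apostrophized names, those characters would have to be added to the class as well.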


RegEx descriptive graph

[Figure: diagram visualizing the expression]

Python test

# -*- coding: UTF-8 -*-
import re

string = """
A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No

Religion: Muslim

Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"""
expression = r'(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))'
match = re.search(expression, string)
if match:
    # group(2) ends with the newline consumed by the pattern, so strip it
    print("YAAAY! \"" + match.group(2).strip() + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches!')

Output

YAAAY! "Name: MUHAMMAD HANIE" is a match 💚💚💚
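The test above only reports the first match. For the fields that share a line (gender, age, date of birth, domicile, contact number), a separate pattern per label may be easier to maintain. The sketch below is one possible approach, not the original answer's method: the `PATTERNS` table, the field keys, and the `extract_fields` helper are hypothetical names, and the patterns assume the labels come through OCR exactly as in the sample.

```python
import re

# The lines of the OCR text that carry the remaining fields.
text = """Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
"""

# Hypothetical per-field patterns; each one captures only the value.
PATTERNS = {
    "nic": r"NIC No\.[ \t]*([\d \t-]+)",
    "gender": r"Gender:[ \t]*(\w+)",
    "age": r"Age:[ \t]*\(in years\)[ \t]*(\d+)",
    "dob": r"Date of Birth\D*([\d-]+)",
    "domicile": r"Domicile \(District\):[ \t]*(\w+)",
    "contact": r"Contact No\.[ \t]*([\d-]+)",
}


def extract_fields(text):
    """Return {field: value or None} for every pattern in PATTERNS."""
    result = {}
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, text)
        result[field] = m.group(1).strip() if m else None
    return result


fields = extract_fields(text)
# OCR scattered spaces inside the NIC digit groups, so remove them.
if fields["nic"]:
    fields["nic"] = re.sub(r"\s+", "", fields["nic"])
print(fields)
```

A field whose label was misread by OCR comes back as `None` instead of raising an exception, which makes it simple to flag forms that need manual review.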

Regarding "python - Regular expression for extracting specific variables and values", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56094441/
