
Python: using a regular expression to read the next n lines based on the preceding line

Reposted. Author: 行者123. Updated: 2023-12-01 08:43:29

CREATE TABLE `cluster_diagnostic_report`(
`run_id` string COMMENT 'format: <hostname>_<datetime> - to uniquely identify the a particular execution instance of Cluster Diag job',
`execution_hostname` string COMMENT 'Machine Name from where Test Case Executed',
`module` string COMMENT 'Test Case Module',
`expected_result` string COMMENT 'Test Case Module expected Result',
`actual_result` string COMMENT 'Test Case Module actual Result',
`validation_result` string COMMENT 'Test Case Module validation Result',
`start_time` string COMMENT 'Test Case Module Start Time',
`end_time` string COMMENT 'Test Case Module Elapsed Time',
`elapsed_time` string COMMENT 'from deserializer',
`total_time_seconds` int COMMENT 'total elapsed time for this step')
PARTITIONED BY (
`cluster_name` string,
`rptg_dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'

From the DDL above, I only need the partition column names and types. For this example, I want to get the following details:

col_name = cluster_name, type = string
col_name = rptg_dt, type = string

What I tried is shown below; it returns None:

partitionResult = re.match(r"PARTITIONED\s\w+\s\((\n){2}",line)
if partitionResult == None:
    pass
else:
    print(partitionResult.group(1),sep='\t')

Can anyone suggest how to do this?

Best answer

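This solution uses \G (which anchors at the start of the string or at the end of the previous match) to match any number of partition column/type pairs: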

Online test (must be run with the PCRE flavor)

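Sample code (requires the third-party regex package for Python, because the standard re module does not support \G, \K, or the (?|...) branch-reset group):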

import regex as re

regex = r"(?|PARTITIONED\s+BY\s+\(\s+`(\w+)`\s+(\w+),?|\G\s*`(\w+)`\s+(\w+),?)\K"

test_str = ("CREATE TABLE `cluster_diagnostic_report`(\n"
" `run_id` string COMMENT 'format: <hostname>_<datetime> - to uniquely identify the a particular execution instance of Cluster Diag job',\n"
" `execution_hostname` string COMMENT 'Machine Name from where Test Case Executed',\n"
" `module` string COMMENT 'Test Case Module',\n"
" `expected_result` string COMMENT 'Test Case Module expected Result',\n"
" `actual_result` string COMMENT 'Test Case Module actual Result',\n"
" `validation_result` string COMMENT 'Test Case Module validation Result',\n"
" `start_time` string COMMENT 'Test Case Module Start Time',\n"
" `end_time` string COMMENT 'Test Case Module Elapsed Time',\n"
" `elapsed_time` string COMMENT 'from deserializer',\n"
" `total_time_seconds` int COMMENT 'total elapsed time for this step')\n"
"PARTITIONED BY (\n"
" `cluster_name` string,\n"
" `cluster_name2` string,`rptg_dt` string,\n"
"`cluster_name2` string,)\n"
"ROW FORMAT SERDE\n"
" 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches):
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

Output:

Group 1 found at 789-801: cluster_name
Group 2 found at 803-809: string
Group 1 found at 813-826: cluster_name2
Group 2 found at 828-834: string
Group 1 found at 836-843: rptg_dt
Group 2 found at 845-851: string
Group 1 found at 854-867: cluster_name2
Group 2 found at 869-875: string
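
If installing the regex package is not an option, a similar result can probably be obtained with the standard re module by splitting the work into two steps: first isolate the text inside the PARTITIONED BY (...) parentheses, then collect every backtick-quoted name and the word that follows it. This is only a minimal sketch, not the method from the answer above. The helper name partition_columns is hypothetical, the DDL is a shortened copy of the one in the question, and the sketch assumes simple type names without parentheses (a type such as decimal(10,2) would cut the block short).

```python
import re


def partition_columns(ddl):
    """Return (name, type) pairs from the PARTITIONED BY clause of a Hive DDL string."""
    # Step 1: capture everything between the parentheses after PARTITIONED BY.
    # Assumption: no partition type contains ")" itself.
    block = re.search(r"PARTITIONED\s+BY\s*\(([^)]*)\)", ddl, re.IGNORECASE)
    if block is None:
        return []
    # Step 2: collect every `name` type pair inside that block.
    return re.findall(r"`(\w+)`\s+(\w+)", block.group(1))


# Shortened version of the DDL from the question.
ddl = (
    "CREATE TABLE `cluster_diagnostic_report`(\n"
    "  `total_time_seconds` int COMMENT 'total elapsed time for this step')\n"
    "PARTITIONED BY (\n"
    "  `cluster_name` string,\n"
    "  `rptg_dt` string)\n"
    "ROW FORMAT SERDE\n"
    "  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'"
)

for col_name, col_type in partition_columns(ddl):
    print("col_name = {}, type = {}".format(col_name, col_type))
```

With this input the loop prints one "col_name = ..., type = ..." line for cluster_name and one for rptg_dt. Note that the whole DDL has to be passed as a single string: matching one line at a time, as in the question, can never see the lines that follow PARTITIONED BY.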

For more on using a regular expression in Python to read the next n lines based on the preceding line, see the similar question on Stack Overflow: https://stackoverflow.com/questions/53398469/
