
python - Reading a file containing incorrectly quoted dictionaries

Reposted. Author: 行者123. Updated: 2023-12-01 06:07:41

I have a file containing a list of dictionaries, most of which are quoted incorrectly. Here is an example:

{game:Available,player:Available,location:"Chelsea, London, England",time:Available}
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}

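As you can see, the keys can also differ from one dictionary to the next.

I tried reading the file with the json module and with the csv module's DictReader, but every attempt ran into trouble because double quotes always appear around the location value but not always around the other keys or values. So far I see two possibilities:

  1. Replace "," with ";" inside the location value, and strip all quotes.
  2. Add quotes around every key and value (except location).

PS: My end goal is to format all these dictionaries into an SQL table whose columns are the union of the keys of all the dictionaries, with one row per dictionary and a blank wherever a value is missing.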
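To make the difficulty concrete, here is a minimal sketch (written for Python 3) that feeds the two sample lines above to `json.loads`. The helper name `try_parse` is only an illustration and is not part of the question.

```python
import json

# The two sample lines from the question.
malformed = '{game:Available,player:Available,location:"Chelsea, London, England",time:Available}'
well_formed = ('{"game":"Available","player":"Available",'
               '"location":"Chelsea, London, England","time":"Available","date":"Available"}')


def try_parse(line):
    """Return the parsed dict, or None when the line is not valid JSON."""
    try:
        return json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


parsed_ok = try_parse(well_formed)
parsed_bad = try_parse(malformed)

print(parsed_ok)   # a regular dict
print(parsed_bad)  # None: unquoted keys and values are rejected
```

Only the fully quoted line parses, which is why a more tolerant parser is needed for the rest of the file.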

Best answer

I think this is a fairly complete piece of code.

First, I created the following file:

{surprise : "perturbating at start  ", game:Available Universal Dices Game,
player:FTROE875574,location
:"Lakeview School, Kingsmere Boulevard, Saskatoon, Saskatchewan , Canada",time:15h18}

{"game":"Available"," player":"LOI4531",
"location": "Perth, Australia","time":"08h13","date":"Available"}

{"game":Available,player:PLLI874,location:"Chelsea, London, England",time:20h35}

{special:"midnight happening",game:"Available","player":YTR44,
"location":"Paris, France","time":"02h24"
,
"date":"Available"}

{game:Available,surprise:" hretyuuhuhu ",player:FT875,location
:,"time":11h22}

{"game":"Available","player":"LOI4531","location":
"Damas,Syria","time":"unavailable","date":"Available"}

{"surprise " : GARAMANANATALA Tower , game:Available Dices,player :
PuLuLu874,location:" Westminster, London, England ",time:20h01}

{"game":"Available",special:"overnight", "player":YTR44,"location":
"Madrid, Spain" , "time":
"12h33",
date:"Available"
}


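The following code then processes the content of the file in two phases:

  • First, it runs through the content and collects every key that occurs in any of the dictionaries.

  • From these keys it derives the dictionary posis, which gives, for each key, the position its value must occupy in a row.

  • Second, it runs through the file again, building the rows one by one and collecting them in a list.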
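The two phases can be sketched on their own, independently of the parsing. The snippet below is a simplified Python 3 illustration that assumes the dictionaries have already been parsed; the `records` data is made up for the example, and `Counter` stands in for the hand-written counting used in the answer.

```python
from collections import Counter

# Hypothetical, already-parsed records standing in for the dictionaries in the file.
records = [
    {"surprise": "x", "game": "Available", "player": "A1", "time": "15h18"},
    {"game": "Available", "player": "B2", "time": "08h13", "date": "Available"},
    {"game": "Available", "player": "C3", "time": "20h35"},
]

# Phase 1: count how often each key occurs (first-seen order is preserved).
counts = Counter()
for rec in records:
    counts.update(rec.keys())

# Most frequent keys first; the sort is stable, so ties keep first-seen order.
columns = sorted(counts, key=lambda k: counts[k], reverse=True)
posis = {key: index for index, key in enumerate(columns)}

# Phase 2: one fixed-width row per record, '' where a key is missing.
rows = []
for rec in records:
    row = [""] * len(columns)
    for key, value in rec.items():
        row[posis[key]] = value
    rows.append(row)

print(columns)
for row in rows:
    print(row)
```

Sorting by frequency pushes the rarely used keys to the right-hand columns, so most of the blanks end up at the end of each row.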

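Note, by the way, that your condition on the value associated with the key 'location' is respected: that value keeps its surrounding quotes.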
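As a small illustration of how the two regular expressions used below behave, here is a Python 3 sketch that applies them to one line of the test file. The patterns are the ones from the answer, written as raw strings. The reading of the pattern is mine: `(?(1) *")` requires a closing quote only when group 1 matched an opening quote, and `(?(3)|(" *)?)` skips the optional opening quote when the key is the literal `location`, so the quotes stay inside the captured value.

```python
import re

# Body of one dictionary: everything after '{' up to and including '}'.
dicreg = re.compile(r'(?<=\{)[^}]*}')

# One key/value pair; group 2 is the key, group 5 is the value.
kvregx = re.compile(r'[ \r\n]*'
                    r'(" *)?((location)|[^:]+?)(?(1) *")'
                    r'[ \r\n]*'
                    r':'
                    r'[ \r\n]*'
                    r'(?(3)|(" *)?)([^:]*?)(?(4) *")'
                    r'[ \r\n]*(?:,(?=[^,]+?:)|\})')

sample = '{"game":Available,player:PLLI874,location:"Chelsea, London, England",time:20h35}'

pairs = [mat_kv.group(2, 5)
         for mat_dic in dicreg.finditer(sample)
         for mat_kv in kvregx.finditer(mat_dic.group())]

for key, value in pairs:
    print('%s : %s' % (key, value))
```

The quoted key `"game"` loses its quotes, while the location value is captured together with its quotes, commas included.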


import re

# Body of one dictionary: everything after '{' up to and including '}'
dicreg = re.compile(r'(?<=\{)[^}]*}')

# One key/value pair: group 2 is the key, group 5 is the value
kvregx = re.compile(r'[ \r\n]*'
                    r'(" *)?((location)|[^:]+?)(?(1) *")'
                    r'[ \r\n]*'
                    r':'
                    r'[ \r\n]*'
                    r'(?(3)|(" *)?)([^:]*?)(?(4) *")'
                    r'[ \r\n]*(?:,(?=[^,]+?:)|\})')


checking_dict = {}   # key -> number of occurrences
checking_list = []   # keys in order of first appearance

filename = 'zzz.txt'

with open(filename) as f:

    ######## First part: to gather all the keys in all the dictionaries

    prec,chunk = '','go'
    ecr = []
    while chunk:
        chunk = f.read(120)
        ss = ''.join((prec,chunk))
        ecr.append('\n\n------------------------------------------------------------\nss == %r' %ss)
        mat_dic = None
        for mat_dic in dicreg.finditer(ss):
            ecr.append('\nmmmmmmm dictionary found in ss mmmmmmmmmmmmmm')
            for mat_kv in kvregx.finditer(mat_dic.group()):
                k,v = mat_kv.group(2,5)
                ecr.append('%s : %s' % (k,v))
                if k in checking_list:
                    checking_dict[k] += 1
                else:
                    checking_list.append(k)
                    checking_dict[k] = 1
        # keep the text after the last complete dictionary for the next round
        if mat_dic:
            prec = ss[mat_dic.end():]
        else:
            prec += chunk

    # print(...) with a single argument works the same in Python 2 and 3
    print('\n'.join(ecr))
    print('\n\n\nchecking_dict == %s\n\nchecking_list == %s' % (checking_dict,checking_list))

    ######## The keys are sorted in order that the less frequent ones are at the end
    checking_list.sort(key=lambda k: checking_dict[k], reverse=True)
    posis = dict((k,i) for i,k in enumerate(checking_list))
    print('\nchecking_list sorted == %s\n\nposis == %s' % (checking_list,posis))



    ######## Now, the file is read again to build a list of rows

    f.seek(0,0) # the file's pointer is moved back to the beginning of the file

    prec,chunk = '','go'
    base = [''] * len(checking_list)   # template row: one blank per column
    rows = []
    while chunk:
        chunk = f.read(110)
        ss = ''.join((prec,chunk))
        mat_dic = None
        for mat_dic in dicreg.finditer(ss):
            li = base[:]
            for mat_kv in kvregx.finditer(mat_dic.group()):
                k,v = mat_kv.group(2,5)
                li[posis[k]] = v
            rows.append(li)
        if mat_dic:
            prec = ss[mat_dic.end():]
        else:
            prec += chunk


    print('\n\n%s\n%s' % (checking_list,30*'___'))
    print('\n'.join(str(li) for li in rows))

Result

------------------------------------------------------------
ss == '{surprise : "perturbating at start ", game:Available Universal Dices Game,\n player:FTROE875574,location\n:"Lakeview S'


------------------------------------------------------------
ss == '{surprise : "perturbating at start ", game:Available Universal Dices Game,\n player:FTROE875574,location\n:"Lakeview School, Kingsmere Boulevard, Saskatoon, Saskatchewan , Canada",time:15h18}\n\n{"game":"Available"," player":"LOI4531",\n"l'

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm
surprise : perturbating at start
game : Available Universal Dices Game
player : FTROE875574
location : "Lakeview School, Kingsmere Boulevard, Saskatoon, Saskatchewan , Canada"
time : 15h18


------------------------------------------------------------
ss == '\n\n{"game":"Available"," player":"LOI4531",\n"location": "Perth, Australia","time":"08h13","date":"Available"}\n\n{"game":Available,player:PLLI874,location:"Chelsea, Lo'

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm
game : Available
player : LOI4531
location : "Perth, Australia"
time : 08h13
date : Available


------------------------------------------------------------
ss == '\n\n{"game":Available,player:PLLI874,location:"Chelsea, London, England",time:20h35}\n\n{special:"midnight happening",game:"Available","player":YTR44,\n"location":"Paris, France","t'

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm
game : Available
player : PLLI874
location : "Chelsea, London, England"
time : 20h35


------------------------------------------------------------
ss == '\n\n{special:"midnight happening",game:"Available","player":YTR44,\n"location":"Paris, France","time":"02h24"\n,\n"date":"Available"}\n\n{game:Available,surprise:" hretyuuhuhu ",player:FT875,location\n:,"time":11h22}\n\n{"'

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm
special : midnight happening
game : Available
player : YTR44
location : "Paris, France"
time : 02h24
date : Available

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm
game : Available
surprise : hretyuuhuhu
player : FT875
location :
time : 11h22


------------------------------------------------------------
ss == '\n\n{"game":"Available","player":"LOI4531","location":\n"Damas,Syria","time":"unavailable","date":"Available"}\n\n{"surprise " '

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm
game : Available
player : LOI4531
location : "Damas,Syria"
time : unavailable
date : Available


------------------------------------------------------------
ss == '\n\n{"surprise " : GARAMANANATALA Tower , game:Available Dices,player :\n PuLuLu874,location:" Westminster, London, England ",time:20'


------------------------------------------------------------
ss == '\n\n{"surprise " : GARAMANANATALA Tower , game:Available Dices,player :\n PuLuLu874,location:" Westminster, London, England ",time:20h01}\n\n{"game":"Available",special:"overnight", "player":YTR44,"location":\n"Madrid, Spain" , "time":\n"12h33",\nda'

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm
surprise : GARAMANANATALA Tower
game : Available Dices
player : PuLuLu874
location : " Westminster, London, England "
time : 20h01


------------------------------------------------------------
ss == '\n\n{"game":"Available",special:"overnight", "player":YTR44,"location":\n"Madrid, Spain" , "time":\n"12h33",\ndate:"Available"\n}'

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm
game : Available
special : overnight
player : YTR44
location : "Madrid, Spain"
time : 12h33
date : Available


------------------------------------------------------------
ss == ''



checking_dict == {'player': 8, 'game': 8, 'location': 8, 'time': 8, 'date': 4, 'surprise': 3, 'special': 2}

checking_list == ['surprise', 'game', 'player', 'location', 'time', 'date', 'special']

checking_list sorted == ['game', 'player', 'location', 'time', 'date', 'surprise', 'special']

posis == {'player': 1, 'game': 0, 'location': 2, 'time': 3, 'date': 4, 'surprise': 5, 'special': 6}


['game', 'player', 'location', 'time', 'date', 'surprise', 'special']
__________________________________________________________________________________________
['Available Universal Dices Game', 'FTROE875574', '"Lakeview School, Kingsmere Boulevard, Saskatoon, Saskatchewan , Canada"', '15h18', '', 'perturbating at start', '']
['Available', 'LOI4531', '"Perth, Australia"', '08h13', 'Available', '', '']
['Available', 'PLLI874', '"Chelsea, London, England"', '20h35', '', '', '']
['Available', 'YTR44', '"Paris, France"', '02h24', 'Available', '', 'midnight happening']
['Available', 'FT875', '', '11h22', '', 'hretyuuhuhu', '']
['Available', 'LOI4531', '"Damas,Syria"', 'unavailable', 'Available', '', '']
['Available Dices', 'PuLuLu874', '" Westminster, London, England "', '20h01', '', 'GARAMANANATALA Tower', '']
['Available', 'YTR44', '"Madrid, Spain"', '12h33', 'Available', '', 'overnight']


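I wrote the code above with a huge file of several GB in mind, one that cannot be read into memory in full: such a very large file has to be processed one chunk after another. That is the reason for these instructions: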

while chunk:
    chunk = f.read(120)
    ss = ''.join((prec,chunk))
    ecr.append('\n\n------------------------------------------------------------\nss == %r' %ss)
    mat_dic = None
    for mat_dic in dicreg.finditer(ss):
        ............
        ...............
    if mat_dic:
        prec = ss[mat_dic.end():]
    else:
        prec += chunk
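The carry-over logic of this loop can be isolated in a small generator. The following Python 3 sketch is one possible way to package it, not the form used in the answer: the function name `iter_dict_bodies`, the tiny chunk size and the `io.StringIO` test data are all assumptions made for the demonstration.

```python
import io
import re

dicreg = re.compile(r'(?<=\{)[^}]*}')


def iter_dict_bodies(stream, chunk_size=120):
    """Yield the body of each {...} block, reading `stream` chunk by chunk."""
    prec = ''
    while True:
        chunk = stream.read(chunk_size)
        ss = prec + chunk
        last = None
        for last in dicreg.finditer(ss):
            yield last.group()
        # Keep only the unconsumed tail, which may hold a partial dictionary.
        prec = ss[last.end():] if last else ss
        if not chunk:  # end of file: the remaining tail has been scanned
            break


# A deliberately small chunk size forces dictionaries to straddle chunks.
stream = io.StringIO('{a:1,b:2}\n\n{"c":"3"}\n{d:4}')
bodies = list(iter_dict_bodies(stream, chunk_size=7))
print(bodies)
```

Each yielded body still ends with `}`, as in the answer's code, because the key/value pattern relies on that closing brace to terminate the last pair.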

But, obviously, if the file is not too big and can therefore be read in one go, the code can be simplified:

import re

# Body of one dictionary: everything after '{' up to and including '}'
dicreg = re.compile(r'(?<=\{)[^}]*}')

# One key/value pair: group 2 is the key, group 5 is the value
kvregx = re.compile(r'[ \r\n]*'
                    r'(" *)?((location)|[^:]+?)(?(1) *")'
                    r'[ \r\n]*'
                    r':'
                    r'[ \r\n]*'
                    r'(?(3)|(" *)?)([^:]*?)(?(4) *")'
                    r'[ \r\n]*(?:,(?=[^,]+?:)|\})')


checking_dict = {}   # key -> number of occurrences
checking_list = []   # keys in order of first appearance

filename = 'zzz.txt'

with open(filename) as f:
    content = f.read()



######## First part: to gather all the keys in all the dictionaries

ecr = []

for mat_dic in dicreg.finditer(content):
    ecr.append('\nmmmmmmm dictionary found in ss mmmmmmmmmmmmmm')
    for mat_kv in kvregx.finditer(mat_dic.group()):
        k,v = mat_kv.group(2,5)
        ecr.append('%s : %s' % (k,v))
        if k in checking_list:
            checking_dict[k] += 1
        else:
            checking_list.append(k)
            checking_dict[k] = 1


# print(...) with a single argument works the same in Python 2 and 3
print('\n'.join(ecr))
print('\n\n\nchecking_dict == %s\n\nchecking_list == %s' % (checking_dict,checking_list))

######## The keys are sorted in order that the less frequent ones are at the end
checking_list.sort(key=lambda k: checking_dict[k], reverse=True)
posis = dict((k,i) for i,k in enumerate(checking_list))
print('\nchecking_list sorted == %s\n\nposis == %s' % (checking_list,posis))



######## Now, the content is scanned again to build a list of rows


base = [''] * len(checking_list)   # template row: one blank per column
rows = []

for mat_dic in dicreg.finditer(content):
    li = base[:]
    for mat_kv in kvregx.finditer(mat_dic.group()):
        k,v = mat_kv.group(2,5)
        li[posis[k]] = v
    rows.append(li)


print('\n\n%s\n%s' % (checking_list,30*'___'))
print('\n'.join(str(li) for li in rows))
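Since the stated goal of the question is an SQL table, here is a hedged Python 3 sketch of how the column list and the rows might be loaded with the standard sqlite3 module. The column names and the two rows are copied from the result shown earlier; the in-memory database and the table name `games` are assumptions made for illustration.

```python
import sqlite3

# Column names and two rows copied from the result shown earlier.
columns = ['game', 'player', 'location', 'time', 'date', 'surprise', 'special']
rows = [
    ['Available', 'LOI4531', '"Perth, Australia"', '08h13', 'Available', '', ''],
    ['Available', 'PLLI874', '"Chelsea, London, England"', '20h35', '', '', ''],
]

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only

# Quote every column name as an identifier; the table name "games" is hypothetical.
column_defs = ', '.join('"%s" TEXT' % name.replace('"', '""') for name in columns)
conn.execute('CREATE TABLE games (%s)' % column_defs)

# Values go through placeholders, never through string formatting.
placeholders = ', '.join('?' for _ in columns)
conn.executemany('INSERT INTO games VALUES (%s)' % placeholders, rows)
conn.commit()

stored = conn.execute(
    'SELECT "player", "location", "date" FROM games ORDER BY rowid').fetchall()
print(stored)
conn.close()
```

Missing values are stored here as empty strings, matching the blanks in the rows; storing NULL instead would be a reasonable alternative, depending on how the table is queried later.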

For this question (python - reading a file containing incorrectly quoted dictionaries), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/7368302/
