
python - Convert tuple containing an OrderedDict with tagged parts to table with columns named from tagged parts

Reposted. Author: 行者123. Updated: 2023-11-28 18:38:38

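The fuller title: Convert tuple containing an OrderedDict with tagged parts to table with columns named from tagged parts (variable number of tagged parts and variable number of occurrences of tags).

I understand address parsing better than I understand Python, which is probably the root of the problem. How to do this might be obvious. The usaddress library returns its results this way intentionally, and the format is potentially useful.

I am using usaddress, which "is a python library for parsing unstructured address strings into address components, using advanced NLP methods", and it seems to work well. Here are the usaddress source and website.

So I run it on a file that looks like this: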

2244 NE 29TH DR
1742 NW 57TH ST
1241 NE EAST DEVILS LAKE RD
4239 SW HWY 101, UNIT 19
1315 NE HARBOR RIDGE
4850 SE 51ST ST
1501 SE EAST DEVILS LAKE RD
1525 NE REGATTA WAY
6458 NE MAST AVE
4009 SW HWY 101
814 SW 9TH ST
1665 SALMON RIVER HWY
3500 NE WEST DEVILS LAKE RD, UNIT 18
1912 NE 56TH DR
3334 NE SURF AVE
2734 SW DUNE CT
2558 NE 33RD ST
2600 NE 33RD ST
5617 NW JETTY AVE

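I would like to convert these results into something more table-like (eventually a CSV file or a database).

I was not sure what data type was being returned. The documentation tells me that the tag method returns a tuple containing an OrderedDict with the tagged parts. The parse method seems to return a slightly different type. This question helped me determine that it is a list of tuples (each apparently carrying a tag). Searching for how to convert a python list with tagged parts to a table was unsuccessful.

Searching for how to convert a tuple containing an OrderedDict did not turn up much either. This is the closest I found. I also found that pandas is good at all sorts of formatting tasks, although it is not clear to me how to apply pandas here. Many of the closest questions I found, like the opposite question or one with named tuples, have low scores.

I also made some exploratory attempts to see whether this is workable (below). I could see several ways of accessing the data, and using zip as in this Matrix Transpose question gets a little closer to a table, since the data and the tag names are now separated, although the rows are not uniform. Is there a way to turn these results, which come as a tagged list or as a tuple containing an OrderedDict with tagged parts, into a table? Is there a fairly direct way from the returned results?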
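As a small illustration of the zip trick mentioned above, the sketch below transposes a list of (token, label) pairs into one column of tokens and one column of labels. The pairs are written out by hand from the first sample address rather than produced by usaddress, so the snippet runs without the library.

```python
# (token, label) pairs shaped like the parse output for "2244 NE 29TH DR";
# typed in by hand here so the example does not need usaddress installed.
pairs = [
    ("2244", "AddressNumber"),
    ("NE", "StreetNamePreDirectional"),
    ("29TH", "StreetName"),
    ("DR", "StreetNamePostType"),
]

# zip(*pairs) transposes rows into columns: one tuple of tokens, one of labels.
tokens, labels = zip(*pairs)

# Zipping the two columns back together recovers the original pairs.
restored = list(zip(tokens, labels))

print(tokens)
print(labels)
```

The transposed form separates values from tag names, but each address still yields its own label sequence, which is why the rows do not line up as a table yet.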

Here is the parse method:

## Get a library
import usaddress

## Open the file with read only permission
f = open('address_sample.txt')

## Read the first line
line = f.readline()

## If the file is not empty keep reading line one at a time
## until the file is empty
while line:
    ## Try the parse method
    parsed = usaddress.parse(line)
    ## See what the parse results look like
    zippy = [list(i) for i in zip(*parsed)]
    print(zippy)
    ## read the next line
    line = f.readline()

## close the file
f.close()

And the resulting output (note that when a tag covers multiple parts, the tag is repeated):

[['2244', 'NE', '29TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1742', 'NW', '57TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1241', 'NE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['4239', 'SW', 'HWY', '101,', 'UNIT', '19'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier']]
[['1315', 'NE', 'HARBOR', 'RIDGE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4850', 'SE', '51ST', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1501', 'SE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['1525', 'NE', 'REGATTA', 'WAY'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['6458', 'NE', 'MAST', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4009', 'SW', 'HWY', '101'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName']]
[['814', 'SW', '9TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1665', 'SALMON', 'RIVER', 'HWY'], ['AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['3500', 'NE', 'WEST', 'DEVILS', 'LAKE', 'RD,', 'UNIT', '18'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier']]
[['1912', 'NE', '56TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['3334', 'NE', 'SURF', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2734', 'SW', 'DUNE', 'CT'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2558', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2600', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['5617', 'NW', 'JETTY', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]

Here is the tag method:

## Get a library
import usaddress

## Open the file with read only permission
f = open('address_sample.txt')

## Read the first line
line = f.readline()

## If the file is not empty keep reading line one at a time
## until the file is empty
while line:
    ## Try tag method
    tagged = usaddress.tag(line)
    ## See what the tag results look like
    items_ = list(tagged[0].items())
    zippy2 = [list(i) for i in zip(*items_)]
    print(zippy2)
    ## read the next line
    line = f.readline()

## close the file
f.close()

This produces the following output, which does a better job of combining multiple parts that share the same tag:

[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2244', 'NE', '29TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1742', 'NW', '57TH', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1241', 'NE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier'], ['4239', 'SW', 'HWY', '101', 'UNIT', '19']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1315', 'NE', 'HARBOR', 'RIDGE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['4850', 'SE', '51ST', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1501', 'SE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1525', 'NE', 'REGATTA', 'WAY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['6458', 'NE', 'MAST', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName'], ['4009', 'SW', 'HWY', '101']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['814', 'SW', '9TH', 'ST']]
[['AddressNumber', 'StreetName', 'StreetNamePostType'], ['1665', 'SALMON RIVER', 'HWY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier'], ['3500', 'NE', 'WEST DEVILS LAKE', 'RD', 'UNIT', '18']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1912', 'NE', '56TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['3334', 'NE', 'SURF', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2734', 'SW', 'DUNE', 'CT']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2558', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2600', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['5617', 'NW', 'JETTY', 'AVE']]
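The difference between the two outputs can be reproduced by hand, which may make the tag result easier to reason about. The sketch below is a hand-rolled approximation, not usaddress's own implementation: `merge_labels` is a hypothetical helper that joins tokens sharing a label with a space and strips trailing commas. Unlike the real library, it merges a repeated label even when its occurrences are not adjacent.

```python
from collections import OrderedDict


def merge_labels(pairs):
    """Join tokens that share a label into one space-separated value."""
    merged = OrderedDict()
    for token, label in pairs:
        token = token.rstrip(",")  # drop the trailing comma seen in 'RD,' and '101,'
        if label in merged:
            merged[label] += " " + token
        else:
            merged[label] = token
    return merged


# (token, label) pairs copied from the parse output for
# "3500 NE WEST DEVILS LAKE RD, UNIT 18".
pairs = [
    ("3500", "AddressNumber"),
    ("NE", "StreetNamePreDirectional"),
    ("WEST", "StreetName"),
    ("DEVILS", "StreetName"),
    ("LAKE", "StreetName"),
    ("RD,", "StreetNamePostType"),
    ("UNIT", "OccupancyType"),
    ("18", "OccupancyIdentifier"),
]

row = merge_labels(pairs)
print(dict(row))
```

Each merged row is a mapping from tag name to value, which is already the shape of one table row; the remaining problem is only that different addresses carry different sets of keys.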

Best answer

Just use the csv.DictWriter class with your tag approach:

from csv import DictWriter
import usaddress

tagged_lines = []
fields = set()
# Note 1: Use the 'with' statement instead of worrying about opening
# and closing your file manually
with open('address_sample.txt') as in_file:
    # Note 2: You don't need to mess with readline() and while loops;
    # just iterate over the file handle directly, it produces lines.
    for line in in_file:
        tagged = usaddress.tag(line)[0]
        tagged_lines.append(tagged)
        fields.update(tagged.keys())  # keep track of all field names we see

# newline='' stops the csv module from writing blank rows on Windows (Python 3)
with open('address_sample.csv', 'w', newline='') as out_file:
    # sorted() gives DictWriter a stable column order; a bare set has none
    writer = DictWriter(out_file, fieldnames=sorted(fields))
    writer.writeheader()
    writer.writerows(tagged_lines)

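Note that this is inefficient for large files, because it holds the entire contents of your input in memory at once; the only reason for doing so is that the set of field names (that is, the CSV column headers) is not known beforehand.

If you know the full set, you can do the conversion in a single streaming pass, writing the tagged output as you read each line. Alternatively, you can make one pass over the file to build the set of headers, then a second pass to do the conversion.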
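A minimal sketch of the two-pass variant follows. The function name `two_pass_convert` and its `tag_line` parameter are assumptions made for illustration: `tag_line` stands for any callable that maps an address string to a dict of tag names and values, so the sketch can be tried without usaddress installed. Blank lines are skipped, which is also an assumption about the input.

```python
import csv


def two_pass_convert(in_path, out_path, tag_line):
    """Convert addresses to CSV without holding every tagged row in memory.

    tag_line is a stand-in parameter: any callable that takes an address
    string and returns a dict of {label: value}.
    """
    # Pass 1: collect column names in first-seen order.
    fieldnames = []
    seen = set()
    with open(in_path) as in_file:
        for line in in_file:
            line = line.strip()
            if not line:
                continue
            for key in tag_line(line):
                if key not in seen:
                    seen.add(key)
                    fieldnames.append(key)

    # Pass 2: re-read the input and write each row as soon as it is tagged.
    with open(in_path) as in_file, open(out_path, "w", newline="") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=fieldnames)
        writer.writeheader()
        for line in in_file:
            line = line.strip()
            if line:
                # Labels missing from a row are written as empty cells.
                writer.writerow(tag_line(line))
    return fieldnames
```

With usaddress available, the call would presumably look like `two_pass_convert('address_sample.txt', 'address_sample.csv', lambda s: usaddress.tag(s)[0])`. The trade-off is that every line is tagged twice in exchange for constant memory use. If the full set of column names is known in advance, the first pass can be dropped and the names passed to DictWriter directly.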

Regarding "python - Convert tuple containing an OrderedDict with tagged parts to table with columns named from tagged parts", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/29782125/
