Python pandas read_csv - loading a tgz-compressed dataset into a data frame

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:01:57

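I am trying to load the "California Housing" dataset into a pandas data frame directly from its source URL. The URL points to a tgz file that contains two files, cal_housing.data and cal_housing.domain.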

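Loading the file with pandas read_csv runs without raising anything, but the result contains a defect that I don't understand and want to get rid of: the first value of the data frame (first row, first column) is replaced by the file name.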

This is what cal_housing.data looks like:

0 -122.230000,37.880000,41.000000,880.000000,129.000000,322.000000,126.000000,8.325200,452600.000000
1 -122.220000,37.860000,21.000000,7099.000000,1106.000000,2401.000000,1138.000000,8.301400,358500.000000
2 -122.240000,37.850000,52.000000,1467.000000,190.000000,496.000000,177.000000,7.257400,352100.000000
3 ...

This is what cal_housing.domain looks like:

0 longitude: continuous.
1 latitude: continuous.
2 housingMedianAge: continuous.
3 totalRooms: continuous.
4 totalBedrooms: continuous.
5 population: continuous.
6 households: continuous.
7 medianIncome: continuous.
8 medianHouseValue: continuous.

This is what I did:

import pandas as pd
source = 'http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
col_names = ['longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms', 'population', 'households', 'medianIncome', 'medianHouseValue']
data = pd.read_csv(source, compression='gzip', header=None, names=col_names).dropna()
print(type(data))

This is what I get:

0      CaliforniaHousing/cal_housing.data     37.88              41.0   ...
1 -122.220000 37.86 21.0 ...
2 -122.240000 37.85 52.0 ...
...

And finally, this is what I want to get:

0      -122.230000     37.88              41.0   ...
1 -122.220000 37.86 21.0 ...
2 -122.240000 37.85 52.0 ...
...
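
For context on why the file name appears: a .tgz file is a tar archive wrapped in gzip, and compression='gzip' removes only the gzip layer. What read_csv then parses is the raw tar stream, which starts with a 512-byte header block whose first field is the member name. The sketch below illustrates this with a small in-memory archive rather than the real download; the make_tgz helper and the sample rows are hypothetical stand-ins.

```python
import gzip
import io
import tarfile


def make_tgz(files):
    """Build a gzip-compressed tar archive in memory (hypothetical stand-in for cal_housing.tgz)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


archive = make_tgz({
    "CaliforniaHousing/cal_housing.data": b"-122.23,37.88\n-122.22,37.86\n",
})

# Undo only the gzip layer, which is all that compression='gzip' does.
raw_tar = gzip.decompress(archive)

# The tar header occupies the first 512 bytes; the member name is the
# first 100 bytes, padded with NUL characters.
member_name = raw_tar[:100].rstrip(b"\0").decode("ascii")
print(member_name)

# The first CSV "line" therefore begins with the header, not with the data.
first_line = raw_tar.split(b"\n", 1)[0]
print(first_line.startswith(b"CaliforniaHousing/"))
```

Because the header and the first data row share one line as far as a CSV parser is concerned, the first field of the first row ends up holding header bytes instead of the expected number, which matches the symptom described above.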

Best answer

Well, after some experimenting I found a solution. It is a lot more complicated than I had hoped, so if you find a better one, feel free to post it.

import pandas as pd
import io
import tarfile
import urllib.request
source = 'http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
col_names = ['longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms', 'population', 'households', 'medianIncome', 'medianHouseValue']
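# Open the archive as a gzip-compressed tar stream straight from the HTTP response
tar = tarfile.open(fileobj=urllib.request.urlopen(source), mode="r|gz")
for member in tar:
    # Only cal_housing.data matches; cal_housing.domain holds the column descriptions
    if 'data' in member.name:
        content = tar.extractfile(member).read()
        data = pd.read_csv(io.BytesIO(content), encoding='utf8', header=None, names=col_names)
print(data)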

This is what I get:

0      -122.230000     37.88              41.0   ...
1 -122.220000 37.86 21.0 ...
2 -122.240000 37.85 52.0 ...
...
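
As a possible variation on the approach above, the archive can be read fully into memory and opened in random-access mode ("r:gz"), so that both members can be picked by suffix and the column names taken from cal_housing.domain instead of being hard-coded. This is only a sketch: load_housing and make_tgz are hypothetical helpers, the sample archive is a tiny stand-in for the real one, and it assumes each line of the domain file has the form "name: continuous." (that is, the leading numbers in the listing above are display line numbers, not file content).

```python
import io
import tarfile

import pandas as pd


def make_tgz(files):
    """Build a gzip-compressed tar archive in memory (hypothetical stand-in for cal_housing.tgz)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def load_housing(archive_bytes):
    """Read the .data member into a DataFrame, naming columns from the .domain member."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        members = {m.name: m for m in tar.getmembers() if m.isfile()}
        data_name = next(n for n in members if n.endswith(".data"))
        domain_name = next(n for n in members if n.endswith(".domain"))
        domain_text = tar.extractfile(members[domain_name]).read().decode("utf8")
        data_bytes = tar.extractfile(members[data_name]).read()

    # Assumed line format: "longitude: continuous." -> keep the part before the colon.
    col_names = [
        line.split(":", 1)[0].strip()
        for line in domain_text.splitlines()
        if ":" in line
    ]
    return pd.read_csv(io.BytesIO(data_bytes), header=None, names=col_names)


sample = make_tgz({
    "CaliforniaHousing/cal_housing.data": b"-122.23,37.88,41.0\n-122.22,37.86,21.0\n",
    "CaliforniaHousing/cal_housing.domain": (
        b"longitude: continuous.\n"
        b"latitude: continuous.\n"
        b"housingMedianAge: continuous.\n"
    ),
})

df = load_housing(sample)
print(df)
```

For the real dataset, the bytes could presumably be obtained with urllib.request.urlopen(source).read() and passed to load_housing. Holding the whole archive in memory is a trade-off: it allows random access to the members, whereas the streaming mode "r|gz" used above reads them strictly in order.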

Regarding "Python pandas read_csv - loading a tgz-compressed dataset into a data frame", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44998868/
