
python - Pandas DataFrame [cell=(label,value)], split into 2 separate DataFrames

Reposted. Author: 行者123. Updated: 2023-12-01 04:00:03

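I found a neat way to parse HTML with pandas. My data is in a slightly odd format (see below), and I want to split it into 2 separate DataFrames.

Notice how each cell is separated by ... Is there a really efficient way to split all of these cells and create 2 DataFrames, one for the labels and one for the ( value ) in parentheses?

NumPy has all these ufuncs. Is there a way to use them on string dtypes, given that the DataFrame can be converted to an np.array with DF.as_matrix()? I am trying to steer clear of for loops. I could iterate over all the indices and populate an empty array, but that is pretty barbaric.

By the way, I'm using Beaker Notebook and it's really cool (highly recommended).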


import pandas as pd

#Set URL Destination
url = "http://www.reef.org/print/db/stats"

#Process raw table
DF_raw = pd.read_html(url)[0]

#Get start/end indices of table
start_label = "10 Most Frequent Species"
start_idx = (DF_raw.iloc[:,0] == start_label).argmax()
end_label = "Top 10 Sites for Species Richness"
end_idx = (DF_raw.iloc[:,0] == end_label).argmax()

#Process table (to_numpy() replaces as_matrix(), which was removed in pandas 1.0)
DF_freqSpecies = pd.DataFrame(
    DF_raw.to_numpy()[(start_idx + 1):end_idx,:],
    columns = DF_raw.iloc[0,:]
)
DF_freqSpecies
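
The row-slicing step above relies on argmax over a boolean Series, which returns the position of the first True. Here is a minimal sketch of that idea on a toy frame. The frame `raw` and its cell values are made up for illustration and are not the real reef.org table. One caveat: if a marker label is missing, argmax silently returns 0, so a real script may want to check that the label exists first.

```python
import pandas as pd

# Toy stand-in for the scraped table (hypothetical values, not real data)
raw = pd.DataFrame({
    0: ["Region A", "10 Most Frequent Species", "Bluehead (85)",
        "Blue Tang (84.8)", "Top 10 Sites for Species Richness", "Site X (120)"],
    1: ["Region B", "10 Most Frequent Species", "Copper Rockfish (54.6)",
        "Lingcod (53.2)", "Top 10 Sites for Species Richness", "Site Y (99)"],
})

first_col = raw.iloc[:, 0]

# argmax on a boolean Series gives the position of the first True
start_idx = (first_col == "10 Most Frequent Species").argmax()
end_idx = (first_col == "Top 10 Sites for Species Richness").argmax()

# Keep only the rows strictly between the two marker rows
block = raw.iloc[start_idx + 1:end_idx].reset_index(drop=True)
print(block)
```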

#Split these into 2 separate DataFrames

Here is my naive way of doing it:

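import re
import numpy as np

# Pre-allocate two frames with the same shape and columns as the source
DF_species = pd.DataFrame(np.zeros_like(DF_freqSpecies),columns=DF_freqSpecies.columns)
DF_freq = pd.DataFrame(np.zeros_like(DF_freqSpecies).astype(str),columns=DF_freqSpecies.columns)

dims = DF_freqSpecies.shape
for i in range(dims[0]):
    for j in range(dims[1]):
        #Parse current dataframe cell, e.g. "Bluehead (85)" -> "Bluehead", "85)"
        #Split on " (" only; also matching \d here would swallow the first digit
        species, freq = re.split(r"\s\(",DF_freqSpecies.iloc[i,j])
        freq = float(freq[:-1])  #drop the trailing ")"
        #Populate split DataFrames
        DF_species.iloc[i,j] = species
        DF_freq.iloc[i,j] = freq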

I want these 2 DataFrames as my output:

(1) species; (2) frequency

Best answer

You can do it this way:

DF1:

In [182]: df1 = DF_freqSpecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)

In [183]: df1.head()
Out[183]:
0  Tropical Western Atlantic  California, Pacific Northwest and Alaska  \
0                   Bluehead                           Copper Rockfish
1                  Blue Tang                                   Lingcod
2       Stoplight Parrotfish                         Painted Greenling
3         Bicolor Damselfish                            Sunflower Star
4               French Grunt                           Plumose Anemone

0                      Hawaii  Tropical Eastern Pacific  \
0               Saddle Wrasse            King Angelfish
1  Hawaiian Whitespotted Toby           Mexican Hogfish
2       Raccoon Butterflyfish                Barberfish
3            Manybar Goatfish             Flag Cabrilla
4                Moorish Idol    Panamic Sergeant Major

0              South Pacific  Northeast US and Eastern Canada  \
0            Regal Angelfish                           Cunner
1  Bluestreak Cleaner Wrasse                  Winter Flounder
2           Manybar Goatfish                      Rock Gunnel
3             Brushtail Tang                          Pollock
4       Two-spined Angelfish                   Grubby Sculpin

0  South Atlantic States       Central Indo-Pacific
0          Slippery Dick               Moorish Idol
1        Belted Sandfish       Three-spot Dascyllus
2         Black Sea Bass  Bluestreak Cleaner Wrasse
3                Tomtate     Blacklip Butterflyfish
4                 Cubbyu        Clark's Anemonefish

and DF2:

In [193]: df2 = DF_freqSpecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True)

In [194]: df2.head()
Out[194]:
0  Tropical Western Atlantic  California, Pacific Northwest and Alaska  Hawaii  \
0                         85                                      54.6      92
1                       84.8                                      53.2    85.8
2                         81                                      50.8    85.7
3                       79.9                                      50.2    85.7
4                       74.8                                      49.7    82.9

0  Tropical Eastern Pacific  South Pacific  Northeast US and Eastern Canada  \
0                      85.7             79                             67.4
1                      82.5           77.3                             46.6
2                      75.2           73.9                             26.2
3                      68.9           73.3                             25.2
4                      67.9           72.8                             23.7

0  South Atlantic States  Central Indo-Pacific
0                   79.7                  80.1
1                   78.5                  75.6
2                   78.5                  73.5
3                   72.7                  71.4
4                   65.7                  70.2
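
One thing worth noting about the output above: replace() only rewrites strings, so the values in df2 are still text ("85", "84.8"), not numbers. Below is a small sketch of how the frequencies could be converted to floats with astype(float). The frame `sample` and its column names are hypothetical stand-ins for DF_freqSpecies.

```python
import pandas as pd

# Hypothetical stand-in for DF_freqSpecies, using the same "label (value)" cell format
sample = pd.DataFrame({
    "Region A": ["Bluehead (85)", "Blue Tang (84.8)"],
    "Region B": ["Copper Rockfish (54.6)", "Lingcod (53.2)"],
})

# Labels: strip the trailing "(number)" part
labels = sample.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)

# Values: keep only the captured number, then convert the strings to floats
values = sample.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True).astype(float)

print(labels)
print(values.dtypes)
```

With numeric dtypes in place, operations such as values.mean() or sorting by frequency behave as expected.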

RegEx debugging and explanation:

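We basically want to remove everything except the number in parentheses:

(\d+\.*\d*) - group(1) - this is our number

\((\d+\.*\d*)\) - our number in parentheses

.*\((\d+\.*\d*)\).* - the whole thing - anything before '(', then '(', our number, ')', and anything after it up to the end of the cell

The whole match is replaced with group(1) - our number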
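
To make the breakdown above concrete, here is a short sketch that applies the same two patterns to a single cell with Python's re module. The sample string `cell` is only an illustration. As an aside, \.* accepts any number of dots, so \.? would be a slightly stricter choice for decimals, although both behave the same on this data.

```python
import re

cell = "Blue Tang (84.8)"  # illustrative cell in the "label (value)" format

# Pattern used for df2: the whole cell, with the number captured as group(1)
value_pattern = re.compile(r'.*\((\d+\.*\d*)\).*')
match = value_pattern.fullmatch(cell)
number = match.group(1)                      # "84.8"

# Replacing the whole match with group(1) leaves only the number
replaced = value_pattern.sub(r'\1', cell)    # "84.8"

# Pattern used for df1: delete the optional whitespace plus "(number)"
label = re.sub(r'\s*\(\d+\.*\d*\)', '', cell)  # "Blue Tang"

print(number, replaced, label)
```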

Regarding python - Pandas DataFrame [cell=(label,value)], split into 2 separate DataFrames, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36729551/
