gpt4 book ai didi

python - 多列+条件连接

转载 作者:行者123 更新时间:2023-12-01 07:31:32 27 4
gpt4 key购买 nike

我有两个数据框想要连接在一起(自行车骑行数据框和自行车站数据框)。

我一直在使用 pandas 库,但我似乎无法编写代码来完美地操作连接。最初,我只是加入关键“station_id”,但我发现最近更新的电台数据集包含更多电台,问题是有些电台没有 station_id。对于这些站点,我想参与匹配纬度和经度坐标。

当我刚刚使用 station_id 加入数据帧时的初始代码

rides_df = rides_df.rename(columns = {'start_station_id': 'station_id'})
rides_df = rides_df.merge(stations_df[['station_id','station_name']],
on = 'station_id', how = 'left')
rides_df = rides_df.rename(columns = {'station_id':'start_station_id',
'station_name':'station_name_start'})

#merge ending station name
rides_df = rides_df.rename(columns = {'end_station_id': 'station_id'})
rides_df = rides_df.merge(stations_df[['station_id', 'station_name']],
on = 'station_id', how = 'left')
rides_df = rides_df.rename(columns = {'station_id':'end_station_id',
'station_name': 'station_name_end'})

游乐设施数据帧的结构如下(采样):

rides_df = pd.DataFrame([[1912818,'Round Trip',3014,34.0566101,-118.23721,3014,34.0566101,-118.23721],
[1933383,'Round Trip',3016,34.0528984,-118.24156,3016,34.0528984,-118.24156],
[1944197,'Round Trip',3016,34.0528984,-118.24156,3016,34.0528984,-118.24156],
[1940317,'Round Trip','NaN',34.03352,-118.24184,'NaN',34.03352,-118.24184],
[1944075,'One Way',3021,34.0456085,-118.23703,3016,34.0566101,-118.23721]]
, columns = ['trip_id','trip_route_category','start_station_id','start_lat',
'start_lon','end_station_id','end_lat','end_lon'])

enter image description here

站点数据帧的结构如下(采样):

stations_df = pd.DataFrame([['Union Station West Portal',34.05661,-118.23721,3014],
['Los Angeles & Temple',34.0529,-118.24156,3016],
['Grand & Olympic',34.04373,-118.26014,3018],
['12th & Hill',34.03861,-118.26086,3019],
['Hill & Washington',34.03105,-118.26709,3020],
['Row DTLA',34.03352,-118.24184,'NaN']],
columns = ['station_name', 'lat', 'lon','station_id'])

enter image description here我想要的是在游乐设施数据框中添加起始位置和结束位置的车站名称,这样我就会有一列“Start_Station_Name”和“End_Station_Name”。我想加入“station_id”,但如果 station_id 为 NaN,则匹配开始和结束的经纬度。

The data frame that I want as a result is structured as follows:
want_df = pd.DataFrame([[1912818,'Round Trip','Union Station West Portal',3014,34.0566101,-118.23721,'Union Station West Portal',3014,34.0566101,-118.23721],
[1933383,'Round Trip','Los Angeles & Temple',3016,34.0528984,-118.24156,'Los Angeles & Temple',3016,34.0528984,-118.24156],
[1944197,'Round Trip','Los Angeles & Temple',3016,34.0528984,-118.24156,'Los Angeles & Temple',3016,34.0528984,-118.24156],
[1940317,'Round Trip','Row DTLA','Nan',34.03352,-118.24184,'Row DTLA','Nan',34.03352,-118.24184],
[1944075,'One Way','NaN',3021,34.0456085,-118.23703,'Los Angeles & Temple',3016,34.0566101,-118.23721]]
, columns = ['trip_id','trip_route_category','start_station_name','start_station_id','start_lat',
'start_lon','end_station_name','end_station_id','end_lat','end_lon'])

enter image description here

最佳答案

这是实现此目的的代码的更新版本:

# rides_df and station_df are slightly modified to make sure that the code works as intended
rides_df = pd.DataFrame([[1912818,'Round Trip',3014,34.0566101,-118.23721,3014,34.0566101,-118.23721],
[1933383,'Round Trip',3016,34.0528984,-118.24156,3016,34.0528984,-118.24156],
[1944197,'Round Trip',3016,34.0528984,-118.24156,3016,34.0528984,-118.24156],
[1940317,'Round Trip','NaN' ,34.03352,-118.24184,3018,34.03352,-118.24184],
[1944075,'One Way',3021,34.0456085,-118.23703,3016,34.0566101,-118.23721]]
, columns = ['trip_id','trip_route_category','start_station_id','start_lat',
'start_lon','end_station_id','end_lat','end_lon'])

stations_df = pd.DataFrame([['Union Station West Portal',34.05661,-118.23721,'NaN'],
['Los Angeles & Temple',34.0529,-118.24156,3016],
['Grand & Olympic',34.04373,-118.26014,3018],
['12th & Hill',34.03861,-118.26086,3019],
['Hill & Washington',34.03105,-118.26709,3020],
['Row DTLA',34.03352,-118.24184,'NaN']],
columns = ['station_name', 'lat', 'lon','station_id'])


# Convert to floats to match NaNs
rides_df[["start_station_id", "end_station_id"]] = rides_df[["start_station_id", "end_station_id"]].astype(float)
stations_df["station_id"] = stations_df["station_id"].astype(float)
# Convert the NaNs to another invalid id so they stop matching on merge
stations_df.loc[stations_df["station_id"].isnull(), "station_id"] = -1
# Round so numbers are an exact match
rides_df = rides_df.round(5)

# Merge beginning station name
rides_df = rides_df.rename(columns = {'start_station_id': 'station_id',
'start_lat': 'lat', 'start_lon': 'lon'})
rides_df = rides_df.merge(stations_df[['station_id','station_name']],
on = 'station_id', how = 'left')
# Merge again by looking at lat/lon values
rides_df = rides_df.merge(stations_df[['lat', 'lon','station_name']],
on = ['lat', 'lon'], how = 'left')
# Merge the two merge results
rides_df.loc[:, "station_name"] = rides_df["station_name_x"].combine(rides_df["station_name_y"], lambda x,y: x if not x!=x else y)
rides_df.drop(["station_name_x", "station_name_y"], axis=1, inplace=True)
rides_df = rides_df.rename(columns = {'station_id':'start_station_id',
'station_name':'start_station_name',
'lat':'start_lat', 'lon':'start_lon'})

# Merge ending station name
rides_df = rides_df.rename(columns = {'end_station_id': 'station_id',
'start_lat': 'lat', 'start_lon': 'lon'})
rides_df = rides_df.merge(stations_df[['station_id', 'station_name']],
on = 'station_id', how = 'left')
rides_df = rides_df.merge(stations_df[['lat', 'lon','station_name']],
on = ['lat', 'lon'], how = 'left')
rides_df.loc[:, "station_name"] = rides_df["station_name_x"].combine(rides_df["station_name_y"], lambda x,y: x if not x!=x else y)
rides_df.drop(["station_name_x", "station_name_y"], axis=1, inplace=True)
rides_df = rides_df.rename(columns = {'station_id':'end_station_id',
'station_name': 'end_station_name',
'lat':'start_lat', 'lon':'start_lon'})

print(rides_df)

输出:

   trip_id trip_route_category  start_station_id  start_lat  start_lon  end_station_id   end_lat    end_lon         start_station_name           end_station_name
0 1912818 Round Trip 3014.0 34.05661 -118.23721 3014.0 34.05661 -118.23721 Union Station West Portal Union Station West Portal
1 1933383 Round Trip 3016.0 34.05290 -118.24156 3016.0 34.05290 -118.24156 Los Angeles & Temple Los Angeles & Temple
2 1944197 Round Trip 3016.0 34.05290 -118.24156 3016.0 34.05290 -118.24156 Los Angeles & Temple Los Angeles & Temple
3 1940317 Round Trip NaN 34.03352 -118.24184 3018.0 34.03352 -118.24184 Row DTLA Grand & Olympic
4 1944075 One Way 3021.0 34.04561 -118.23703 3016.0 34.05661 -118.23721 NaN Los Angeles & Temple

关于python - 多列+条件连接,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/57190358/

27 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com