This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
What this post (and other posts by me on this thread) will not go through:
- Performance-related discussions and timings (for now), beyond notable mentions of better alternatives where appropriate.
- Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough talk - just show me how to use merge!
Setup & Basics
import pandas as pd
import numpy as np

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
Note
This, along with the forthcoming figures all follow this convention:
- blue indicates rows that are present in the merge result
- red indicates rows that are excluded from the result (i.e., removed)
- green indicates missing values that are replaced with NaNs in the result
To perform an INNER JOIN, call merge
on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
This returns only rows from left and right which share a common key (in this example, "B" and "D").
A LEFT OUTER JOIN, or LEFT JOIN is represented by
This can be performed by specifying how='left'.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...
...specify how='right':
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, given by
specify how='outer'.
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
LEFT-Excluding JOINs and RIGHT-Excluding JOINs can each be done in two steps.
For LEFT-Excluding JOIN, represented as
Start by performing a LEFT OUTER JOIN and then filtering to rows coming from left only (excluding everything from the right),
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop(columns='_merge'))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (in other words, performing an ANTI-JOIN),
You can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
Different names for key columns
If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
When merging on keyLeft from left and keyRight from right, if you only want either of keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')); you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
Merging on multiple columns
To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
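As a runnable sketch of the on=['key1', 'key2'] pattern (the frames and key names here are made up for illustration, since left and right in this post only have a single key column):

```python
import pandas as pd

# Hypothetical frames with a two-part key.
left_mk = pd.DataFrame({'key1': ['A', 'A', 'B'], 'key2': [1, 2, 1], 'lval': [10, 20, 30]})
right_mk = pd.DataFrame({'key1': ['A', 'B', 'B'], 'key2': [2, 1, 2], 'rval': [100, 200, 300]})

# A row appears in the (inner) result only if BOTH key columns match.
out = left_mk.merge(right_mk, on=['key1', 'key2'])
print(out)
#   key1  key2  lval  rval
# 0    A     2    20   100
# 1    B     1    30   200
```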
Other useful merge* operations and functions
Merging a DataFrame with Series on index: See this answer.
Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
pd.merge_ordered is a useful function for ordered JOINs.
pd.merge_asof (read: merge as-of) is useful for approximate joins.
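To make "approximate join" concrete, here is a minimal merge_asof sketch with invented data: each row of trades is matched to the most recent quotes row at or before its time (both frames must be sorted on the key):

```python
import pandas as pd

# Made-up data for illustration only.
quotes = pd.DataFrame({'time': [1, 3, 5], 'quote': [100.0, 101.0, 102.0]})
trades = pd.DataFrame({'time': [2, 4, 6], 'qty': [10, 20, 30]})

# For each trade time, take the last quote whose time is <= it.
out = pd.merge_asof(trades, quotes, on='time')
print(out)
#    time  qty  quote
# 0     2   10  100.0
# 1     4   20  101.0
# 2     6   30  102.0
```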
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
*You are here.
A supplemental visual view of pd.concat([df0, df1], kwargs). Notice that the meaning of the kwarg axis=0 or axis=1 is not as intuitive as df.mean() or df.apply(func).
Joins 101
These animations might explain it better visually.
Credits: Garrick Aden-Buie's tidyexplain repo
Inner Join
Outer Join or Full Join
Right Join
Left Join
In this answer, I will consider practical examples of:
pandas.concat
pandas.DataFrame.merge
to merge dataframes from the index of one and the column of another.
We will be using different dataframes for each of the cases.
1. pandas.concat
Considering the following DataFrames with the same column names:
Price2018 with size (8784, 5)
Year Month Day Hour Price
0 2018 1 1 1 6.74
1 2018 1 1 2 4.74
2 2018 1 1 3 3.66
3 2018 1 1 4 2.30
4 2018 1 1 5 2.30
5 2018 1 1 6 2.06
6 2018 1 1 7 2.06
7 2018 1 1 8 2.06
8 2018 1 1 9 2.30
9 2018 1 1 10 2.30
Price2019 with size (8760, 5)
Year Month Day Hour Price
0 2019 1 1 1 66.88
1 2019 1 1 2 66.88
2 2019 1 1 3 66.00
3 2019 1 1 4 63.64
4 2019 1 1 5 58.85
5 2019 1 1 6 55.47
6 2019 1 1 7 56.00
7 2019 1 1 8 61.09
8 2019 1 1 9 61.01
9 2019 1 1 10 61.00
One can combine them using pandas.concat, by simply
import pandas as pd
frames = [Price2018, Price2019]
df_merged = pd.concat(frames)
Which results in a DataFrame with size (17544, 5)
If one wants to have a clear picture of what happened, it works like this
(Source)
2. pandas.DataFrame.merge
In this section, we will consider a specific case: merging the index of one dataframe and the column of another dataframe.
Let's say one has the dataframe Geo with 54 columns, one of them being Date, which is of type datetime64[ns].
Date 1 2 ... 51 52 53
0 2010-01-01 00:00:00 0.565919 0.892376 ... 0.593049 0.775082 0.680621
1 2010-01-01 01:00:00 0.358960 0.531418 ... 0.734619 0.480450 0.926735
2 2010-01-01 02:00:00 0.531870 0.221768 ... 0.902369 0.027840 0.398864
3 2010-01-01 03:00:00 0.475463 0.245810 ... 0.306405 0.645762 0.541882
4 2010-01-01 04:00:00 0.954546 0.867960 ... 0.912257 0.039772 0.627696
And the dataframe Price that has one column with the price named Price, and whose index corresponds to the dates (Date)
Price
Date
2010-01-01 00:00:00 29.10
2010-01-01 01:00:00 9.57
2010-01-01 02:00:00 0.00
2010-01-01 03:00:00 0.00
2010-01-01 04:00:00 0.00
In order to merge them, one can use pandas.DataFrame.merge as follows
df_merged = pd.merge(Price, Geo, left_index=True, right_on='Date')
where Geo and Price are the previous dataframes.
That results in the following dataframe
Price Date 1 ... 51 52 53
0 29.10 2010-01-01 00:00:00 0.565919 ... 0.593049 0.775082 0.680621
1 9.57 2010-01-01 01:00:00 0.358960 ... 0.734619 0.480450 0.926735
2 0.00 2010-01-01 02:00:00 0.531870 ... 0.902369 0.027840 0.398864
3 0.00 2010-01-01 03:00:00 0.475463 ... 0.306405 0.645762 0.541882
4 0.00 2010-01-01 04:00:00 0.954546 ... 0.912257 0.039772 0.627696
This post will go through the following topics:
- Merging with index under different conditions
- options for index-based joins: merge, join, concat
- merging on indexes
- merging on index of one, column of other
- effectively using named indexes to simplify merging syntax
BACK TO TOP
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
- supports inner/left/right/full
- can only join two at a time
- supports column-column, index-column, index-index joins
DataFrame.join (joins on index)
- supports inner/left (default)/right/full
- can join multiple DataFrames at a time
- supports index-index joins
pd.concat (joins on index)
- supports inner/full (default)
- can join multiple DataFrames at a time
- supports index-index joins
Index to index joins
Setup & Basics
import pandas as pd
import numpy as np
np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)},
index=['A', 'B', 'C', 'D'])
right = pd.DataFrame(data={'value': np.random.randn(4)},
index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Other joins follow similar syntax.
Notable Alternatives
DataFrame.join defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
This is because the column names are the same. It would not be a problem if they were named differently.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so join='inner' is required here.
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For more information on concat, see this post.
Index to Column joins
To perform an inner join using the index of left and a column of right, use DataFrame.merge with a combination of left_index=True and right_on=...:
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.
join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.
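A minimal sketch of that pre-step (frame contents here are invented for illustration): the right frame's join key lives in a column, so move it into the index before calling join, which only aligns index-to-index.

```python
import pandas as pd

# Hypothetical frames: left is keyed by its index, right2 by a column.
left = pd.DataFrame({'value': [1.0, 2.0]},
                    index=pd.Index(['B', 'D'], name='idxkey'))
right2 = pd.DataFrame({'colkey': ['B', 'D', 'E'], 'value': [10.0, 20.0, 30.0]})

# set_index('colkey') lets join align on index values; suffixes are needed
# because both frames have a 'value' column.
out = left.join(right2.set_index('colkey'), lsuffix='_x', rsuffix='_y')
print(out)
#         value_x  value_y
# idxkey
# B           1.0     10.0
# D           2.0     20.0
```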
Effectively using Named Index [pandas >= 0.23]
If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to specify the index name to on (or left_on and right_on as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For the previous example of merging with the index of left and a column of right, you can use left_on with the index name of left:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
* you are here
This post will go through the following topics:
- how to correctly generalize to multiple DataFrames (and why merge has shortcomings here)
- merging on unique keys
- merging on non-unique keys
BACK TO TOP
Generalizing to multiple DataFrames
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
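For completeness, the chained-merge pattern itself can be generalized to a list of arbitrary length with functools.reduce (a sketch with made-up frames; it still inherits merge's pairwise, two-at-a-time behavior):

```python
from functools import reduce

import pandas as pd

# A hypothetical list of frames sharing a 'key' column.
frames = [
    pd.DataFrame({'key': ['A', 'B'], 'v1': [1, 2]}),
    pd.DataFrame({'key': ['A', 'B'], 'v2': [3, 4]}),
    pd.DataFrame({'key': ['A', 'C'], 'v3': [5, 6]}),
]

# reduce folds merge over the list: ((f0 ⋈ f1) ⋈ f2) ⋈ ...
out = reduce(lambda l, r: l.merge(r, on='key'), frames)
print(out)
#   key  v1  v2  v3
# 0   A   1   3   5
```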
Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.
# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note: the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
Multiway merge on unique keys
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
# Merge on `key` column. You'll need to set the index before concatenating
pd.concat(
[df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Merge on `key` index.
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
Multiway merge on keys with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
# Join on `key` column. Set as the index first.
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join([B2, C2], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Join on `key` index.
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
* you are here
Pandas at the moment does not support inequality joins within the merge syntax; one option is the conditional_join function from pyjanitor (I am a contributor to this library):
# pip install pyjanitor
import pandas as pd
import janitor
left.conditional_join(right, ('value', 'value', '>'))
left right
key value key value
0 A 1.764052 D -0.977278
1 A 1.764052 F -0.151357
2 A 1.764052 E 0.950088
3 B 0.400157 D -0.977278
4 B 0.400157 F -0.151357
5 C 0.978738 D -0.977278
6 C 0.978738 F -0.151357
7 C 0.978738 E 0.950088
8 D 2.240893 D -0.977278
9 D 2.240893 F -0.151357
10 D 2.240893 E 0.950088
11 D 2.240893 B 1.867558
left.conditional_join(right, ('value', 'value', '<'))
left right
key value key value
0 A 1.764052 B 1.867558
1 B 0.400157 E 0.950088
2 B 0.400157 B 1.867558
3 C 0.978738 B 1.867558
The columns are passed as a variable argument of tuples, each tuple comprising a column from the left dataframe, a column from the right dataframe, and the join operator, which can be any of (>, <, >=, <=, !=). In the example above, a MultiIndex column is returned because of overlaps in the column names.
Performance wise, this is better than a naive cross join:
np.random.seed(0)
dd = pd.DataFrame({'value':np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start':np.random.randint(100000, size=1_000),
'end':np.random.randint(100000, size=1_000)})
dd.head()
value
0 68268
1 43567
2 42613
3 45891
4 21243
df.head()
start end
0 71915 47005
1 64284 44913
2 13377 96626
3 75823 38673
4 29151 575
%%timeit
out = df.merge(dd, how='cross')
out.loc[(out.start < out.value) & (out.end > out.value)]
5.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
280 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True)
124 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
out = df.merge(dd, how='cross')
out = out.loc[(out.start < out.value) & (out.end > out.value)]
A = df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
columns = A.columns.tolist()
A = A.sort_values(columns, ignore_index = True)
out = out.sort_values(columns, ignore_index = True)
A.equals(out)
True
Depending on the data size, you could get more performance when an equi join is present. In this case, pandas merge function is used, but the final data frame is delayed until the non-equi joins are computed. Let's look at data from here:
import pandas as pd
import numpy as np
import random
import datetime
def random_dt_bw(start_date,end_date):
days_between = (end_date - start_date).days
random_num_days = random.randrange(days_between)
random_dt = start_date + datetime.timedelta(days=random_num_days)
return random_dt
def generate_data(n=1000):
items = [f"i_{x}" for x in range(n)]
start_dates = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(n)]
end_dates = [x + datetime.timedelta(days=random.randint(1,10)) for x in start_dates]
offerDf = pd.DataFrame({"Item":items,
"StartDt":start_dates,
"EndDt":end_dates})
transaction_items = [f"i_{random.randint(0,n)}" for x in range(5*n)]
transaction_dt = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(5*n)]
sales_amt = [random.randint(0,1000) for x in range(5*n)]
transactionDf = pd.DataFrame({"Item":transaction_items,"TransactionDt":transaction_dt,"Sales":sales_amt})
return offerDf,transactionDf
offerDf,transactionDf = generate_data(n=100000)
offerDf = (offerDf
.assign(StartDt = offerDf.StartDt.astype(np.datetime64),
EndDt = offerDf.EndDt.astype(np.datetime64)
)
)
transactionDf = transactionDf.assign(TransactionDt = transactionDf.TransactionDt.astype(np.datetime64))
# you can get more performance when using ints/datetimes
# in the equi join, compared to strings
offerDf = offerDf.assign(Itemr = offerDf.Item.str[2:].astype(int))
transactionDf = transactionDf.assign(Itemr = transactionDf.Item.str[2:].astype(int))
transactionDf.head()
Item TransactionDt Sales Itemr
0 i_43407 2020-05-29 692 43407
1 i_95044 2020-07-22 964 95044
2 i_94560 2020-01-09 462 94560
3 i_11246 2020-02-26 690 11246
4 i_55974 2020-03-07 219 55974
offerDf.head()
Item StartDt EndDt Itemr
0 i_0 2020-04-18 2020-04-19 0
1 i_1 2020-02-28 2020-03-07 1
2 i_2 2020-03-28 2020-03-30 2
3 i_3 2020-08-03 2020-08-13 3
4 i_4 2020-05-26 2020-06-04 4
# merge on integers ... usually faster
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on strings
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on integers
cond_join_int = (transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
# merge on strings
cond_join_str = (transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
292 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
253 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
256 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
71.8 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# check that both dataframes are equal
cols = ['Item', 'TransactionDt', 'Sales', 'Itemr_y','StartDt', 'EndDt', 'Itemr_x']
cond_join_str = cond_join_str.drop(columns=('right', 'Item')).set_axis(cols, axis=1)
(cond_join_str
.sort_values(cond_join_str.columns.tolist())
.reset_index(drop=True)
.reindex(columns=classic_str.columns)
.equals(
classic_str
.sort_values(classic_str.columns.tolist())
.reset_index(drop=True)
))
True
I think you should include this in your explanation, as it is a relevant merge that I see fairly often, termed a cross join I believe. This is the merge that occurs when two DataFrames share no columns, and it simply merges the two side-by-side:
The setup:
names1 = [{'A':'Jack', 'B':'Jill'}]
names2 = [{'C':'Tommy', 'D':'Tammy'}]
df1=pd.DataFrame(names1)
df2=pd.DataFrame(names2)
df_merged = pd.merge(df1.assign(X=1), df2.assign(X=1), on='X').drop(columns='X')
This creates a dummy X column, merges on the X, and then drops it to produce
df_merged:
A B C D
0 Jack Jill Tommy Tammy
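As an aside, on pandas >= 1.2 the same result can be had without the dummy-column trick, since merge accepts how='cross' directly; a small sketch using the same frames:

```python
import pandas as pd

df1 = pd.DataFrame([{'A': 'Jack', 'B': 'Jill'}])
df2 = pd.DataFrame([{'C': 'Tommy', 'D': 'Tammy'}])

# Cartesian product of the two frames, no helper column needed.
df_merged = df1.merge(df2, how='cross')
print(df_merged)
#       A     B      C      D
# 0  Jack  Jill  Tommy  Tammy
```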
If anyone is confused by the table of contents at the end of each post, I split up this massive answer into 4 separate ones, 3 on this question and 1 on another. The way it was setup previously made it harder to reference folks to specific topics. This allows you to bookmark separate topics easily now!
This is an awesome resource! The only question I still have is why call it merge instead of join, and join instead of merge?
This is a nice diagram. May I ask how you produced it?
google doc's built-in "insert ==> drawing... ==> new" (as of 2019-May). But, to be clear: the only reason I used google doc for this picture is because my notes is stored in google doc, and I would like a picture that can be modified quickly within google doc itself. Actually now you mentioned it, the google doc's drawing tool is pretty neat.
Wow, this is great. Coming from the SQL world, a "vertical" join is not a join in my head, as the table's structure is always fixed. Now I even think pandas should consolidate concat and merge with a direction parameter being horizontal or vertical.
@Ufos Isn't that exactly what axis=1 and axis=0 is?
yes, there're now merge and concat and axis and whatever. However, as @eliu shows, it's all just the same concept of merge with "left" and "right" and "horizontal" or "vertical". I, personally, have to look into the documentation every time I have to remember which "axis" is 0 and which is 1.
Please check the second comment under the question. Cross join was initially a part of this (see edit history) but was later edited out into its own post for volume.
Seeing as cross join was not meant to be covered here, yes... However I appreciate your intent to contribute in good faith :)