
How do I create a new column where the values are selected based on existing columns?

Reposted. Author: bug小助手. Updated: 2023-10-28 21:17:33



How do I add a color column to the following dataframe so that color='green' if Set == 'Z', and color='red' otherwise?



  Type Set
1    A   Z
2    B   Z
3    B   X
4    C   Y

Comments

Related: Create new column based on values from other columns / apply a function of multiple columns, row-wise in Pandas (same idea, but the selection criteria are based on multiple columns)


Top answers

If you only have two choices to select from then use np.where:



df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,



import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields



  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red



If you have more than two conditions then use np.select. For example, if you want color to be




  • yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')

  • otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')

  • otherwise purple when (df['Type'] == 'B')

  • otherwise black,


then use



df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields



  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black
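
The order of the conditions matters here, because a row can satisfy more than one of them. The following is a minimal sketch (not part of the original answer) that reuses the same dataframe and simply reverses the two lists to show that np.select takes the first matching condition for each row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# Row 1 (Set == 'Z', Type == 'B') satisfies both the second and the third condition.
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B'),
]
choices = ['yellow', 'blue', 'purple']

# np.select uses the choice of the first condition that is True in each row.
in_order = np.select(conditions, choices, default='black')

# Reversing both lists puts the broad "Type == 'B'" test first, so it wins for row 1.
reversed_order = np.select(conditions[::-1], choices[::-1], default='black')

print(list(in_order))        # ['yellow', 'blue', 'purple', 'black']
print(list(reversed_order))  # ['yellow', 'purple', 'purple', 'black']
```

Listing the most specific conditions first, as the answer does, keeps the "otherwise" reading of the bullet list intact.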


List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.




Example list comprehension:




df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]


%timeit tests:




import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop
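
%timeit only works inside IPython or Jupyter. As a rough sketch of how the same comparison could be run from a plain Python script, the standard-library timeit module can be used instead; the function names and the repeat count below are illustrative choices, not part of the original answer, and the numbers will differ from machine to machine:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

def with_list_comprehension():
    return ['red' if x == 'Z' else 'green' for x in df['Set']]

def with_np_where():
    return np.where(df['Set'] == 'Z', 'red', 'green')

def with_map():
    return df['Set'].map(lambda x: 'red' if x == 'Z' else 'green')

# Hypothetical repeat count; raise it for steadier numbers.
n_runs = 200
for func in (with_list_comprehension, with_np_where, with_map):
    seconds = timeit.timeit(func, number=n_runs)
    print(f"{func.__name__}: {seconds / n_runs * 1e6:.1f} µs per call")
```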


The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.




Simple example using just the "Set" column:




def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)


  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green


Example with more colours and more columns taken into account:




def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)


  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue


Edit (21/06/2019): Using plydata



It is also possible to use plydata for this kind of thing (though it seems even slower than using assign and apply).



from plydata import define, if_else


Simple if_else:




df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)


  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green


Nested if_else:




df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)


  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green


Another way in which this could be achieved is




df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')


Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:




def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))


Here is what the result looks like:



df
Out[2]:
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4


This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).




And of course you could always do this:




df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)


But that approach is more than three times as slow as the apply approach from above, on my machine.




And you could also do this, using dict.get:




df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]
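
The three dictionary variants above differ in how they treat a value that has no entry in the dictionary: the apply version indexes the dict directly and raises KeyError, Series.map leaves NaN, and dict.get returns whatever fallback you pass. A small sketch of the last two, where the extra 'E' category and the -1 sentinel are made up for illustration:

```python
import pandas as pd

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
# 'E' is deliberately absent from values_dict (hypothetical extra category).
df = pd.DataFrame({'INDICATOR': ['A', 'B', 'E']})

# Series.map leaves NaN where the key is missing, so the result becomes float.
mapped = df['INDICATOR'].map(values_dict)

# dict.get lets you choose the fallback; -1 here is an arbitrary sentinel.
with_default = [values_dict.get(v, -1) for v in df['INDICATOR']]

print(mapped.isna().tolist())  # [False, False, True]
print(with_default)            # [1, 2, -1]
```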


You can simply use the powerful .loc method and use one condition or several depending on your need (tested with pandas=1.0.5).



Code Summary:



df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"

#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"


Explanation:



df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

# df so far:
  Type Set
0    A   Z
1    B   Z
2    B   X
3    C   Y

Add a 'Color' column and set all values to "red":


df['Color'] = "red"

Apply your single condition:



df.loc[(df['Set']=="Z"), 'Color'] = "green"


# df:
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

or multiple conditions if you want:



df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

You can read on Pandas logical operators and conditional selection here:
Logical operators for boolean indexing in Pandas
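
The multi-condition example mixes & and | without grouping them. In Python, & binds more tightly than |, so it is read as "(Set is Z and Type is B) or Type is C". The sketch below (the variable names are illustrative) makes that grouping explicit and shows that the other grouping selects different rows:

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

is_z = df['Set'] == 'Z'
is_b = df['Type'] == 'B'
is_c = df['Type'] == 'C'

# & binds tighter than |, so the unparenthesised form groups as (is_z & is_b) | is_c.
implicit = is_z & is_b | is_c
explicit = (is_z & is_b) | is_c

# Grouping the other way selects different rows.
alternative = is_z & (is_b | is_c)

print(implicit.tolist())     # [False, True, False, True]
print(alternative.tolist())  # [False, True, False, False]
```

Writing the parentheses out is a cheap way to avoid surprises when conditions get longer.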




You can use pandas methods where and mask:



df['color'] = 'green'
df['color'] = df['color'].where(df['Set']=='Z', other='red')
# Replace values where the condition is False

or


df['color'] = 'red'
df['color'] = df['color'].mask(df['Set']=='Z', other='green')
# Replace values where the condition is True
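
where and mask are mirror images of each other, which is easy to mix up. A minimal sketch that builds the two constant series separately (the names base_green and base_red are illustrative) and checks that both routes give the same column:

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})
is_z = df['Set'] == 'Z'

# Start from a constant series, then overwrite part of it.
base_green = pd.Series('green', index=df.index)
base_red = pd.Series('red', index=df.index)

# where: keep the value when the condition is True, replace it otherwise.
via_where = base_green.where(is_z, other='red')

# mask: replace the value when the condition is True, keep it otherwise.
via_mask = base_red.mask(is_z, other='green')

print(via_where.tolist())  # ['green', 'green', 'red', 'red']
print(via_mask.tolist())   # ['green', 'green', 'red', 'red']
```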

Alternatively, you can use the method transform with a lambda function:



df['color'] = df['Set'].transform(lambda x: 'green' if x == 'Z' else 'red')

Output:



  Type Set  color
1    A   Z  green
2    B   Z  green
3    B   X    red
4    C   Y    red

Performance comparison from @chai:



import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})

%timeit df['color1'] = 'red'; df['color1'].where(df['Set']=='Z','green')
%timeit df['color2'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color3'] = np.where(df['Set']=='Z', 'red', 'green')
%timeit df['color4'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green')

397 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
976 ms ± 241 ms per loop
673 ms ± 139 ms per loop
796 ms ± 182 ms per loop


If you have only two choices, use np.where():


df = pd.DataFrame({'A':range(3)})
df['B'] = np.where(df.A>2, 'yes', 'no')

If you have more than two choices, apply() may work. Given this input:


arr = pd.DataFrame({'A':list('abc'), 'B':range(3), 'C':range(3,6), 'D':range(6, 9)})

arr is:


   A  B  C  D
0  a  0  3  6
1  b  1  4  7
2  c  2  5  8

If you want column E to be arr.B when arr.A == 'a', arr.C when arr.A == 'b', arr.D when arr.A == 'c', and some other value otherwise:


arr['E'] = arr.apply(lambda x: x['B'] if x['A']=='a' else(x['C'] if x['A']=='b' else(x['D'] if x['A']=='c' else 1234)), axis=1)

The resulting arr is:


   A  B  C  D  E
0  a  0  3  6  0
1  b  1  4  7  4
2  c  2  5  8  8
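
The nested lambda is hard to read once there are three branches. As an alternative sketch, not what this answer proposes, the same if/elif chain can be written with np.select, passing whole columns as the choices so that values are picked element-wise:

```python
import numpy as np
import pandas as pd

arr = pd.DataFrame({'A': list('abc'), 'B': range(3), 'C': range(3, 6), 'D': range(6, 9)})

# One condition per branch of the if/elif chain, checked in order.
conditions = [arr['A'] == 'a', arr['A'] == 'b', arr['A'] == 'c']
# Each choice is a whole column; np.select picks element-wise from it.
choices = [arr['B'], arr['C'], arr['D']]

# 1234 is the same fallback value used in the lambda version.
arr['E'] = np.select(conditions, choices, default=1234)
print(arr['E'].tolist())  # [0, 4, 8]
```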


A one-liner using the .apply() method:



df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')


After that, df data frame looks like this:




>>> print(df)
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red


The case_when function from pyjanitor is a wrapper around pd.Series.mask and offers a chainable/convenient form for multiple conditions:



For a single condition:



import janitor  # pyjanitor; importing it registers case_when on DataFrame

df.case_when(
    df.Set == "Z",  # condition
    "green",        # value if True
    "red",          # value if False
    column_name = "color"
)

  Type Set  color
1    A   Z  green
2    B   Z  green
3    B   X    red
4    C   Y    red

For multiple conditions:



df.case_when(
    df.Set.eq('Z') & df.Type.eq('A'), 'yellow',  # condition, result
    df.Set.eq('Z') & df.Type.eq('B'), 'blue',    # condition, result
    df.Type.eq('B'), 'purple',                   # condition, result
    'black',  # default if none of the conditions evaluate to True
    column_name = 'color'
)

  Type Set   color
1    A   Z  yellow
2    B   Z    blue
3    B   X  purple
4    C   Y   black

More examples can be found here




Here is an easy one-liner you can use when you have one or several conditions:



df['color'] = np.select(condlist=[df['Set']=="Z", df['Set']=="Y"], choicelist=["green", "yellow"], default="red")

Easy and good to go!



See more here: https://numpy.org/doc/stable/reference/generated/numpy.select.html




If you're working with massive data, a memoized approach would be best:




# First create a dictionary of manually stored values
color_dict = {'Z':'red'}

# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}

# Next, merge the two
color_dict.update(color_dict_other)

# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)


This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when data_size > 10**4 and n_distinct < data_size/4.



E.g., memoize in a case with 10,000 rows and 2,500 or fewer distinct values.
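
The rule of thumb can be turned into a small check. This is only a sketch: should_memoize and map_color are hypothetical helper names, the thresholds are the ones quoted in this answer, and whether memoizing actually pays off depends on the data (the comments at the end of this page report cases where it did not):

```python
import pandas as pd

def should_memoize(series: pd.Series) -> bool:
    """Hypothetical helper encoding the rule of thumb quoted in this answer."""
    data_size = len(series)
    n_distinct = series.nunique()
    return data_size > 10**4 and n_distinct < data_size / 4

def map_color(series: pd.Series) -> pd.Series:
    if should_memoize(series):
        # Compute the colour once per distinct value, then look it up.
        lookup = {x: 'red' if x == 'Z' else 'green' for x in series.unique()}
        return series.map(lookup)
    # Otherwise evaluate the rule for every row directly.
    return series.map(lambda x: 'red' if x == 'Z' else 'green')

small = pd.Series(list('ZZXY'))
large = pd.Series(list('ZZXY') * 5000)  # 20,000 rows, 3 distinct values

print(should_memoize(small), should_memoize(large))  # False True
print(map_color(small).tolist())                     # ['red', 'red', 'green', 'green']
```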




A less verbose approach using np.select:


a = np.array([['A','Z'],['B','Z'],['B','X'],['C','Y']])
df = pd.DataFrame(a,columns=['Type','Set'])

conditions = [
    df['Set'] == 'Z'
]

outputs = [
    'Green'
]
# conditions Z is Green, Red Otherwise.
res = np.select(conditions, outputs, 'Red')
res
array(['Green', 'Green', 'Red', 'Red'], dtype='<U5')
df.insert(2, 'new_column',res)

df
  Type Set new_column
0    A   Z      Green
1    B   Z      Green
2    B   X        Red
3    C   Y        Red

df.to_numpy()

array([['A', 'Z', 'Green'],
['B', 'Z', 'Green'],
['B', 'X', 'Red'],
['C', 'Y', 'Red']], dtype=object)

%%timeit conditions = [df['Set'] == 'Z']
outputs = ['Green']
np.select(conditions, outputs, 'Red')

134 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

df2 = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
%%timeit conditions = [df2['Set'] == 'Z']
outputs = ['Green']
np.select(conditions, outputs, 'Red')

188 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comments

Note that, with much larger dataframes (think pd.DataFrame({'Type':list('ABBC')*100000, 'Set':list('ZZXY')*100000})-size), numpy.where outpaces map, but the list comprehension is king (about 50% faster than numpy.where).


Can the list comprehension method be used if the condition needs information from multiple columns? I am looking for something like this (this does not work): df['color'] = ['red' if (x['Set'] == 'Z') & (x['Type'] == 'B') else 'green' for x in df]


Add iterrows to the dataframe, then you can access multiple columns via row: ['red' if (row['Set'] == 'Z') & (row['Type'] == 'B') else 'green' for index, row in df.iterrows()]

Note this nice solution will not work if you need to take replacement values from another series in the data frame, such as df['color_type'] = np.where(df['Set']=='Z', 'green', df['Type'])


@cheekybastard Or don't, since .iterrows() is notoriously sluggish and the DataFrame shouldn't be modified while iterating.
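
For the multi-column question raised in the comments above, one alternative to iterrows is to zip the columns together inside the list comprehension. This is a minimal sketch that was not posted in the thread; it uses the condition from the question (Set is 'Z' and Type is 'B'):

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# zip walks the two columns in lockstep, giving plain Python values per row.
df['color'] = [
    'red' if (s == 'Z' and t == 'B') else 'green'
    for s, t in zip(df['Set'], df['Type'])
]
print(df['color'].tolist())  # ['green', 'red', 'green', 'green']
```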


Update: On 100,000,000 rows, 52 string values, .apply() takes 47 seconds, versus only 5.91 seconds for .map().


Alright, so with only 2 distinct values to map, 100,000,000 rows, it takes 6.67 seconds to run without "memoization", and 9.86 seconds with.


100,000,000 rows, 52 distinct values, where 1 of those maps to the first output value, and the other 51 all correspond to the other: 7.99 seconds without memoization, 11.1 seconds with.


Are your values in random order? Or are they back to back? High speed of pandas could be due to caching @AMC


Are your values in random order? Or are they back to back? Values are random, selected using random.choices().

