apache-spark - PySpark - Split a column and take n elements

Reposted. Author: 行者123. Updated: 2023-12-02 20:04:09

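I want to take a column and split its string values on a single character. I expected split to return a list, as it conventionally does, but while coding I found that the returned object only has the getItem and getField methods, which the API describes as follows: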

@since(1.3)
def getItem(self, key):
    """
    An expression that gets an item at position ``ordinal`` out of a list,
    or gets an item by key out of a dict.
    """

@since(1.3)
def getField(self, name):
    """
    An expression that gets a field by name in a StructField.
    """

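Clearly this does not do what I need. For example, given the text "A_B_C_D" in a column, I would like to split it into "A_B_C_" and "D" in two different columns.

This is the code I am using: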

from pyspark.sql.functions import regexp_extract, col, split
df_test=spark.sql("SELECT * FROM db_test.table_test")
#Applying the transformations to the data

split_col=split(df_test['Full_text'],'_')
df_split=df_test.withColumn('Last_Item',split_col.getItem(3))

Here is an example:

from pyspark.sql import Row
from pyspark.sql.functions import regexp_extract, col, split
l = [("Item1_Item2_ItemN"),("FirstItem_SecondItem_LastItem"),("ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn")]
rdd = sc.parallelize(l)
datax = rdd.map(lambda x: Row(fullString=x))
df = sqlContext.createDataFrame(datax)
split_col=split(df['fullString'],'_')
df=df.withColumn('LastItemOfSplit',split_col.getItem(2))

Result:

fullString                                                LastItemOfSplit
Item1_Item2_ItemN                                         ItemN
FirstItem_SecondItem_LastItem                             LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn  null

The result I expect is always the last item:

fullString                                                LastItemOfSplit
Item1_Item2_ItemN                                         ItemN
FirstItem_SecondItem_LastItem                             LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn  ThisShouldBeInTheLastColumn

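Best answer

You can use getItem(size - 1) to get the last item of the array, because the array is zero-indexed and its last element sits at index size - 1:

Example: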

df = spark.createDataFrame([[['A', 'B', 'C', 'D']], [['E', 'F']]], ['split'])
df.show()
+------------+
|       split|
+------------+
|[A, B, C, D]|
|      [E, F]|
+------------+

import pyspark.sql.functions as F
df.withColumn('lastItem', df.split.getItem(F.size(df.split) - 1)).show()
+------------+--------+
|       split|lastItem|
+------------+--------+
|[A, B, C, D]|       D|
|      [E, F]|       F|
+------------+--------+
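
To see why a fixed index returns null for shorter rows while size - 1 does not, the per-row behaviour can be modelled in plain Python. This is only a sketch of the semantics and not Spark code: `get_item` is a hypothetical helper, written on the assumption that an out-of-range index yields null (None here), as the question's output shows.

```python
def get_item(items, index):
    """Hypothetical stand-in for getItem on an array column.

    Returns None for an out-of-range index, mirroring the null
    seen in the question's output (an assumption of this sketch).
    """
    if 0 <= index < len(items):
        return items[index]
    return None


rows = [
    "Item1_Item2_ItemN",
    "FirstItem_SecondItem_LastItem",
    "ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn",
]

# Hard-coded index 2: only works when a row has at least three parts
fixed_index = [get_item(s.split("_"), 2) for s in rows]

# Index computed per row as size - 1: always the last part
last_index = [get_item(s.split("_"), len(s.split("_")) - 1) for s in rows]

print(fixed_index)
print(last_index)
```

The third row has only two parts, so index 2 does not exist there, while size - 1 evaluates to 1 and picks the last part. On Spark 2.4 or later, `element_at(split_col, -1)` from `pyspark.sql.functions` is another way to address the last element; that is a suggestion beyond the original answer.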

For your case:

from pyspark.sql.functions import split, size

df_test = spark.sql("SELECT * FROM db_test.table_test")

# Split on "_" and take the element at index size - 1, i.e. the last one
split_col = split(df_test['Full_text'], '_')
df_split = df_test.withColumn('Last_Item', split_col.getItem(size(split_col) - 1))
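
The question also asks for the leading part ("A_B_C_") in its own column, which the answer above does not cover. A minimal plain-Python sketch of splitting on the last delimiter follows; `split_on_last` and `PATTERN` are hypothetical names, and the regular expression is an assumption of this sketch, not something taken from the original answer.

```python
import re

# Greedy group 1: everything up to and including the last "_".
# Group 2: the remainder, which contains no "_".
PATTERN = re.compile(r"^(.*_)([^_]*)$")


def split_on_last(text):
    """Return (prefix_with_trailing_delimiter, last_item)."""
    match = PATTERN.match(text)
    if match is None:
        # No delimiter at all: treat the whole string as the last item
        return "", text
    return match.group(1), match.group(2)


print(split_on_last("A_B_C_D"))
```

The same pattern could presumably be passed to `regexp_extract`, which the question already imports, with group index 1 for the prefix and 2 for the last item; that variant is untested here.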

For "apache-spark - PySpark - Split a column and take n elements", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55143035/
