
python - How to create a DataFrame from another DataFrame (PySpark)?

Reposted. Author: 行者123. Updated: 2023-11-30 22:23:42

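I am using PySpark v1.6.1 and I want to create a DataFrame from another one:

  • Convert a field that holds a struct of three values into separate columns
  • Convert the timestamp from a string to a datetime
  • Create more columns using that timestamp
  • Change the names and types of the remaining columns

Right now I am using .map(func) to create an RDD with that function (it takes a row of the original type and returns a row of the new type). But this creates an RDD, and I don't want that.

Is there a better way?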

Best answer

from pyspark.sql.functions import unix_timestamp, col, to_date, struct

####
#sample data
####
# `sc` is the SparkContext that the pyspark shell provides
df = sc.parallelize([[25, 'Prem', 'M', '12-21-2006 11:00:05', 'abc', '1'],
                     [20, 'Kate', 'F', '05-30-2007 10:05:00', 'asdf', '2'],
                     [40, 'Cheng', 'M', '12-30-2017 01:00:01', 'qwerty', '3']]).\
    toDF(["age", "name", "sex", "datetime_in_strFormat", "initial_col_name", "col_in_strFormat"])

#create a 'struct' type column by combining the first 3 columns of the sample data (built only to set up query #1)
#the columns are dropped one at a time, which works on every Spark version
df = df.withColumn("struct_col", struct('age', 'name', 'sex')).\
    drop('age').\
    drop('name').\
    drop('sex')
df.show()
df.printSchema()

####
#query 1
####
#Convert a field that holds a struct of three values (i.e. 'struct_col') into separate columns (i.e. 'name', 'age' & 'sex')
df = df.withColumn('name', col('struct_col.name')).\
    withColumn('age', col('struct_col.age')).\
    withColumn('sex', col('struct_col.sex')).\
    drop('struct_col')
df.show()
df.printSchema()

####
#query 2
####
#Convert the timestamp from string (i.e. 'datetime_in_strFormat') to timestamp (i.e. 'datetime_in_tsFormat')
#'HH' is the 24-hour hour-of-day; 'hh' is the 12-hour clock and needs an AM/PM marker to be unambiguous
df = df.withColumn('datetime_in_tsFormat',
                   unix_timestamp(col('datetime_in_strFormat'), 'MM-dd-yyyy HH:mm:ss').cast("timestamp"))
df.show()
df.printSchema()

####
#query 3
####
#create more columns using above timestamp (e.g. fetch date value from timestamp column)
df = df.withColumn('datetime_in_dateFormat', to_date(col('datetime_in_tsFormat')))
df.show()

####
#query 4.a
####
#Change column name (e.g. 'initial_col_name' is renamed to 'new_col_name')
df = df.withColumnRenamed('initial_col_name', 'new_col_name')
df.show()

####
#query 4.b
####
#Change column type (e.g. the string type of 'col_in_strFormat' is converted to double type in 'col_in_doubleFormat')
df = df.withColumn("col_in_doubleFormat", col('col_in_strFormat').cast("double"))
df.show()
df.printSchema()
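The pattern letters in query 2 can be checked without a cluster. The sketch below is only an analogy: it uses Python's `datetime.strptime`, where `%H` plays the role of Spark's `HH` (24-hour) and `%I` the role of `hh` (12-hour). The `afternoon` value and the `parses` helper are hypothetical and are not part of the sample data.

```python
from datetime import datetime

# strptime analogues of the Spark patterns (an analogy, not Spark's own parser)
FMT_24H = "%m-%d-%Y %H:%M:%S"   # corresponds to 'MM-dd-yyyy HH:mm:ss'
FMT_12H = "%m-%d-%Y %I:%M:%S"   # corresponds to 'MM-dd-yyyy hh:mm:ss'

# The three timestamp strings from the sample data
samples = ["12-21-2006 11:00:05", "05-30-2007 10:05:00", "12-30-2017 01:00:01"]
parsed = [datetime.strptime(s, FMT_24H) for s in samples]


def parses(value, fmt):
    """Return True if `value` can be parsed with `fmt`, False otherwise."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# Hypothetical afternoon value: only the 24-hour pattern accepts hour 13
afternoon = "12-30-2017 13:00:01"
print(parses(afternoon, FMT_24H))
print(parses(afternoon, FMT_12H))
```

The sample rows all fall between 01:00 and 11:59, so either pattern reads them; the difference only shows up for hours from 13 onward.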

Sample data:

+---------------------+----------------+----------------+------------+
|datetime_in_strFormat|initial_col_name|col_in_strFormat|  struct_col|
+---------------------+----------------+----------------+------------+
|  12-21-2006 11:00:05|             abc|               1| [25,Prem,M]|
|  05-30-2007 10:05:00|            asdf|               2| [20,Kate,F]|
|  12-30-2017 01:00:01|          qwerty|               3|[40,Cheng,M]|
+---------------------+----------------+----------------+------------+

root
 |-- datetime_in_strFormat: string (nullable = true)
 |-- initial_col_name: string (nullable = true)
 |-- col_in_strFormat: string (nullable = true)
 |-- struct_col: struct (nullable = false)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- sex: string (nullable = true)

Final output:

+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
|datetime_in_strFormat|new_col_name|col_in_strFormat| name|age|sex|datetime_in_tsFormat|datetime_in_dateFormat|col_in_doubleFormat|
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
|  12-21-2006 11:00:05|         abc|               1| Prem| 25|  M| 2006-12-21 11:00:05|            2006-12-21|                1.0|
|  05-30-2007 10:05:00|        asdf|               2| Kate| 20|  F| 2007-05-30 10:05:00|            2007-05-30|                2.0|
|  12-30-2017 01:00:01|      qwerty|               3|Cheng| 40|  M| 2017-12-30 01:00:01|            2017-12-30|                3.0|
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+

root
 |-- datetime_in_strFormat: string (nullable = true)
 |-- new_col_name: string (nullable = true)
 |-- col_in_strFormat: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- datetime_in_tsFormat: timestamp (nullable = true)
 |-- datetime_in_dateFormat: date (nullable = true)
 |-- col_in_doubleFormat: double (nullable = true)

Regarding "python - How to create a DataFrame from another DataFrame (PySpark)?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47993060/
