
python - Pandas to PySpark: transforming a column of lists of tuples to separate columns for each tuple item


I need to transform a DataFrame in which one column consists of lists of tuples; each item of every tuple must end up in its own separate column.

Here is an example and the solution in Pandas:

import pandas as pd

df_dict = {
    'a': {
        "1": "stuff", "2": "stuff2"
    },

    "d": {
        "1": [(1, 2), (3, 4)], "2": [(1, 2), (3, 4)]
    }
}

df = pd.DataFrame.from_dict(df_dict)
print(df)  # initial structure

        a                 d
1   stuff  [(1, 2), (3, 4)]
2  stuff2  [(1, 2), (3, 4)]

# first transformation, let's separate each list item into a new row
row_breakdown = df.set_index(["a"])["d"].apply(pd.Series).stack()
print(row_breakdown)

a
stuff   0    (1, 2)
        1    (3, 4)
stuff2  0    (1, 2)
        1    (3, 4)
dtype: object

row_breakdown = row_breakdown.reset_index().drop(columns=["level_1"])
print(row_breakdown)

        a       0
0   stuff  (1, 2)
1   stuff  (3, 4)
2  stuff2  (1, 2)
3  stuff2  (3, 4)

# second transformation, let's get each tuple item into a separate column
row_breakdown.columns = ["a", "d"]
row_breakdown = row_breakdown["d"].apply(pd.Series)
row_breakdown.columns = ["value_1", "value_2"]
print(row_breakdown)

   value_1  value_2
0        1        2
1        3        4
2        1        2
3        3        4
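
As an aside, a minimal sketch of the same two steps in one go, assuming pandas >= 0.25 where DataFrame.explode is available (not part of the original solution):

exploded = df.explode("d").reset_index(drop=True)  # one row per tuple
values = pd.DataFrame(exploded["d"].tolist(), columns=["value_1", "value_2"])
print(values)  # same value_1/value_2 frame as above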

That is the Pandas solution. I need to be able to do the same thing, but with PySpark (2.3). I started working on it, but quickly got stuck:

from pyspark.context import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
import pandas as pd

conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)

spark = SparkSession(sc)

df_dict = {
    'a': {
        "1": "stuff", "2": "stuff2"
    },

    "d": {
        "1": [(1, 2), (3, 4)], "2": [(1, 2), (3, 4)]
    }
}

df = pd.DataFrame(df_dict)
ddf = spark.createDataFrame(df)

row_breakdown = ddf.set_index(["a"])["d"].apply(pd.Series).stack()

AttributeError: 'DataFrame' object has no attribute 'set_index'

Obviously, Spark doesn't support indexing like that. Any pointers would be appreciated.

Best Answer

This might do it:

from pyspark.context import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import functions as F
import pandas as pd

conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)

spark = SparkSession(sc)

df_dict = {
    'a': {
        "1": "stuff", "2": "stuff2"
    },

    "d": {
        "1": [(1, 2), (3, 4)], "2": [(1, 2), (3, 4)]
    }
}

df = pd.DataFrame(df_dict)
ddf = spark.createDataFrame(df)


exploded = ddf.withColumn('d', F.explode("d"))
exploded.show()

Result:

+------+------+
| a| d|
+------+------+
| stuff|[1, 2]|
| stuff|[3, 4]|
|stuff2|[1, 2]|
|stuff2|[3, 4]|
+------+------+
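
Before splitting the tuple items apart, it can help to check what explode actually produced. A minimal check; the commented output below is what I would expect Spark to infer from the pandas tuples, so treat the exact shape as an assumption:

exploded.printSchema()

# root
#  |-- a: string (nullable = true)
#  |-- d: struct (nullable = true)
#  |    |-- _1: long (nullable = true)
#  |    |-- _2: long (nullable = true)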

For this part, I feel more comfortable using SQL:

exploded.createOrReplaceTempView("exploded")
spark.sql("SELECT a, d._1 as value_1, d._2 as value_2 FROM exploded").show()

Important note: the reason this uses the _1 and _2 accessors is that Spark parses the tuples as structs and gives them default field names. If, in your actual implementation, the DataFrame contains an array<int> column instead, you should use the [0] syntax.

The final result is:

+------+-------+-------+
| a|value_1|value_2|
+------+-------+-------+
| stuff| 1| 2|
| stuff| 3| 4|
|stuff2| 1| 2|
|stuff2| 3| 4|
+------+-------+-------+
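
For illustration, here is a small self-contained sketch of the array<int> case mentioned in the note above; the example data and the exploded_arrays view name are made up, not taken from the question:

array_ddf = spark.createDataFrame([("stuff", [1, 2]), ("stuff", [3, 4])], ["a", "d"])
array_ddf.createOrReplaceTempView("exploded_arrays")
# elements of an array column are picked by position, not by struct field name
spark.sql("SELECT a, d[0] AS value_1, d[1] AS value_2 FROM exploded_arrays").show()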

Regarding python - Pandas to PySpark: transforming a column of lists of tuples to separate columns for each tuple item, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52243200/
