scala - How to remove duplicate tuples with Scala? Cartesian Scala Spark

I have an RDD containing protein names and their domains. I used the cartesian function to generate all possible pairs, but unfortunately the result contains duplicate pairs. How can I keep only one tuple per pair and remove the duplicates? Here is an example:

+------------------------------------+------------------------------------+
|Protein1                            |Protein2                            |
+------------------------------------+------------------------------------+
|(P0C2L1,IPR0179)                    |(P0CW05,IPR004372;IPR000890)        |
|(P0CW05,IPR004372;IPR000890)        |(P0C2L1,IPR0179)                    |
|(B2UDV1,IPR0104)                    |(Q4R8P0,IPR029058;IPR000073;IPR0266)|
|(Q4R8P0,IPR029058;IPR000073;IPR0266)|(B2UDV1,IPR0104)                    |
+------------------------------------+------------------------------------+

What I want instead:

+------------------------------------+------------------------------------+
|Protein1                            |Protein2                            |
+------------------------------------+------------------------------------+
|(P0C2L1,IPR0179)                    |(P0CW05,IPR004372;IPR000890)        |
|(B2UDV1,IPR0104)                    |(Q4R8P0,IPR029058;IPR000073;IPR0266)|
+------------------------------------+------------------------------------+
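
For context, the duplicates arise because cartesian of an RDD with itself emits both orderings of every pair (plus self-pairs). Below is a minimal sketch of one way to keep a single ordering while building the pairs; it assumes the source RDD holds (proteinId, domains) string tuples, and all names are illustrative:

import org.apache.spark.rdd.RDD

// Illustrative reconstruction of the source data; the real layout is assumed
val proteins: RDD[(String, String)] = sc.parallelize(Seq(
  ("P0C2L1", "IPR0179"),
  ("P0CW05", "IPR004372;IPR000890")
))

// cartesian emits (a,a), (a,b), (b,a), (b,b); keeping only id1 < id2
// leaves exactly one ordering per pair and drops the self-pairs
val pairs = proteins.cartesian(proteins)
  .filter { case ((id1, _), (id2, _)) => id1 < id2 }
pairs.collect().foreach(println)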

Best Answer

I assumed the input data based on the information provided and implemented the solution below.

It does the following:

  1. Converts the RDD to a Spark DataFrame.
  2. Swaps every second pair, based on length(Protein1) > length(Protein2).
  3. Removes the duplicates with the dropDuplicates method.
  4. Stores the result in a DataFrame and then back in an RDD.

Note: for this to work, "length(Protein1) > length(Protein2)" must hold for the swapped pairs. If the OP provides clearer input data, more solutions can be worked out (a length-independent alternative is sketched after the final output below).

// Creating the paired RDD as provided by the OP
import org.apache.spark.rdd.RDD

var x: RDD[(String, String)] = sc.parallelize(Seq(
  ("P0C2L1,IPR0179", "P0CW05,IPR004372;IPR000890"),
  ("P0CW05,IPR004372;IPR000890", "P0C2L1,IPR0179"),
  ("B2UDV1,IPR0104", "Q4R8P0,IPR029058;IPR000073;IPR0266"),
  ("Q4R8P0,IPR029058;IPR000073;IPR0266", "B2UDV1,IPR0104")
))

// Creating a Spark DataFrame out of this RDD
var combDF = spark.createDataFrame(x).toDF("Protein1","Protein2")
combDF.show(20,false)

//+------------------------------------+------------------------------------+
//|Protein1                            |Protein2                            |
//+------------------------------------+------------------------------------+
//|(P0C2L1,IPR0179)                    |(P0CW05,IPR004372;IPR000890)        |
//|(P0CW05,IPR004372;IPR000890)        |(P0C2L1,IPR0179)                    |
//|(B2UDV1,IPR0104)                    |(Q4R8P0,IPR029058;IPR000073;IPR0266)|
//|(Q4R8P0,IPR029058;IPR000073;IPR0266)|(B2UDV1,IPR0104)                    |
//+------------------------------------+------------------------------------+

// Creating a temporary view
combDF.createOrReplaceTempView("combDF")

// The statement below is only needed for this example, to cast the string columns to structs
combDF = spark.sql("""select named_struct("col1", element_at(split(Protein1,","),1), "col2", element_at(split(Protein1,","),2)) as Protein1,
named_struct("col1", element_at(split(Protein2,","),1), "col2", element_at(split(Protein2,","),2)) as Protein2
from combDF""")
//end

combDF.createOrReplaceTempView("combDF")
combDF.show()
var result = spark.sql("""
|select case when length(Protein1_m) > length(Protein2_m) then element_at(protein_array, 2)
| else element_at(protein_array, 1)
| end as Protein1,
| case when length(Protein1_m) > length(Protein2_m) then element_at(protein_array, 1)
| else element_at(protein_array, 2)
| end as Protein2
|from
|(select Protein1, Protein2, cast(Protein1 as string) as Protein1_m, cast(Protein2 as string) as Protein2_m,
| array(Protein1,Protein2) as protein_array
|from combDF) a
""".stripMargin).dropDuplicates()
// Result in spark dataframe
result.show(20,false)

//+-----------------+-------------------------------------+
//|Protein1         |Protein2                             |
//+-----------------+-------------------------------------+
//|(B2UDV1,IPR0104) |(Q4R8P0,IPR029058;IPR000073;IPR0266) |
//|(P0C2L1,IPR0179) |(P0CW05,IPR004372;IPR000890)         |
//+-----------------+-------------------------------------+

// result in RDD
var resultRDD = result.rdd
resultRDD.collect().foreach(println)

//[(B2UDV1,IPR0104),(Q4R8P0,IPR029058;IPR000073;IPR0266)]
//[(P0C2L1,IPR0179),(P0CW05,IPR004372;IPR000890)]
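
An alternative sketch that does not depend on the length condition, assuming the order within a pair does not matter: canonicalize each pair lexicographically, then deduplicate directly on the RDD with distinct().

// Works on the same paired RDD `x` as above; (a, b) and (b, a) both map
// to the same canonical ordering, so distinct() keeps only one of them
val deduped = x
  .map { case (p1, p2) => if (p1 <= p2) (p1, p2) else (p2, p1) }
  .distinct()
deduped.collect().foreach(println)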

Regarding "scala - How to remove duplicate tuples with Scala? Cartesian Scala Spark", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62608414/
