
scala - How to filter data based on three columns in Scala

Reposted. Author: 可可西里. Updated: 2023-11-01 16:28:56

I am new to Scala, and I want to iterate three loops over a dataset and perform some analysis. For example, my data looks like this:

Sample.csv

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9
1,100,1,NA,0,1,0,Friday,1,5
1,100,2,NA,0,1,0,Friday,1,5
1,101,0,NA,0,1,0,Friday,1,5
1,101,1,NA,0,1,0,Friday,1,5
1,101,2,NA,0,1,0,Friday,1,5
1,102,0,NA,0,1,0,Friday,1,5
1,102,1,NA,0,1,0,Friday,1,5
1,102,2,NA,0,1,0,Friday,1,5

So for now I read it as follows:

val data = sc.textFile("C:/users/ricky/Data.csv")

Now I need to implement a filter in Scala on the first three columns to take subsets of the whole data and run some analysis on each. The first three columns are the ones to filter on: the first column has one value (1), the second column has three values (100, 101, 102), and the third column has three values (0, 1, 2). So I need to run the filter for each combination to get a subset of the whole data. Is it good to use loops like the below?

for {
i <- 1
j <- 100 to 102
k <- 1 to 2
}

This should fetch subset data like
1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9

where i = 1, j = 100, and k = 0

and so on up to

1,102,2,NA,0,1,0,Friday,1,5

where i = 1, j = 102, and k = 2

How can I do this in Scala on the data I read from the CSV?

Best Answer

After reading the CSV text file, you can use filter to select the rows you want:

// split each line on commas, then match on the first three columns
val tempData = data.map(line => line.split(","))
tempData.filter(array => array(0) == "1" && array(1) == "100" && array(2) == "0")
  .foreach(x => println(x.mkString(",")))

This will give you the result

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9

You can do the same for the other cases.
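To cover all the cases in one pass, the asker's three loops can drive the filter directly. A minimal sketch of the idea, using plain Scala collections as a stand-in for the RDD (the sample rows below are taken from the question; on Spark, replace the Seq with tempData and the same filter applies):

```scala
// Sample rows already split on commas (a plain-collections stand-in for tempData)
val rows: Seq[Array[String]] = Seq(
  "1,100,0,NA,0,1,0,Friday,1,5",
  "1,100,0,NA,0,1,0,Wednesday,1,9",
  "1,102,2,NA,0,1,0,Friday,1,5"
).map(_.split(","))

// One subset per (j, k) combination; i is always 1 in this data
val subsets = (for {
  j <- 100 to 102
  k <- 0 to 2
} yield (j, k) -> rows.filter(a =>
  a(0) == "1" && a(1) == j.toString && a(2) == k.toString)).toMap
```

Each entry of subsets is then a subset of the whole data, keyed by the second and third column values, ready for per-group analysis.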

DataFrame API

For simplicity you can use the DataFrame API, which is more optimized than RDDs. The first step is to read the CSV as a DataFrame:

val df = sqlContext.read.format("com.databricks.spark.csv").load("path to csv file")

You will have

+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7 |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1 |100|0 |NA |0 |1 |0 |Friday |1 |5 |
|1 |100|0 |NA |0 |1 |0 |Wednesday|1 |9 |
|1 |100|1 |NA |0 |1 |0 |Friday |1 |5 |
|1 |100|2 |NA |0 |1 |0 |Friday |1 |5 |
|1 |101|0 |NA |0 |1 |0 |Friday |1 |5 |
|1 |101|1 |NA |0 |1 |0 |Friday |1 |5 |
|1 |101|2 |NA |0 |1 |0 |Friday |1 |5 |
|1 |102|0 |NA |0 |1 |0 |Friday |1 |5 |
|1 |102|1 |NA |0 |1 |0 |Friday |1 |5 |
|1 |102|2 |NA |0 |1 |0 |Friday |1 |5 |
+---+---+---+---+---+---+---+---------+---+---+

Then you can use the filter API just as with the RDD:

import sqlContext.implicits._
val df1 = df.filter($"_c0" === "1" && $"_c1" === "100" && $"_c2" === "0")

and you should have

+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7 |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1 |100|0 |NA |0 |1 |0 |Friday |1 |5 |
|1 |100|0 |NA |0 |1 |0 |Wednesday|1 |9 |
+---+---+---+---+---+---+---+---------+---+---+

You can even define a schema to get the column names you want.
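The same naming idea can be sketched outside Spark with a plain case class; all the column names below are made up for illustration, since the CSV carries no header row:

```scala
// Hypothetical column names -- the CSV has no header, so we pick our own
case class Record(id: Int, group: Int, seq: Int, flag: String, c4: Int,
                  c5: Int, c6: Int, day: String, c8: Int, c9: Int)

// Parse one CSV line into the typed record
def parse(line: String): Record = {
  val a = line.split(",")
  Record(a(0).toInt, a(1).toInt, a(2).toInt, a(3), a(4).toInt,
         a(5).toInt, a(6).toInt, a(7), a(8).toInt, a(9).toInt)
}
```

Filtering then reads as records.filter(r => r.group == 100 && r.seq == 0) instead of positional indexing, which is harder to get wrong.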

Edited

Answering your comment below: it entirely depends on what you want to output.

scala> val temp = tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0).map(x => x.mkString(","))
temp: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at map at <console>:28

scala> tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0)
res9: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[13] at filter at <console>:29

I hope this is clear.
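The two REPL lines above differ only in their element type: filter alone keeps RDD[Array[String]], while a trailing map(_.mkString(",")) turns it into RDD[String]. A sketch of the same distinction with plain collections standing in for the RDD:

```scala
// Plain-collections stand-in for the RDD of split rows
val tempData: Seq[Array[String]] = Seq(
  "1,100,0,NA,0,1,0,Friday,1,5",
  "1,101,0,NA,0,1,0,Friday,1,5"
).map(_.split(","))

// filter alone: still arrays of fields
val asArrays: Seq[Array[String]] =
  tempData.filter(a => a(0) == "1" && a(1).toInt == 100 && a(2).toInt == 0)

// map(_.mkString(",")) afterwards: back to CSV lines
val asLines: Seq[String] = asArrays.map(_.mkString(","))
```

Which form you want depends on what comes next: keep the arrays for further column-wise analysis, or join them back into lines for writing out.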

About scala - How to filter data based on three columns in Scala: we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45046704/
