apache-spark - Filtering text file data into columns in a PySpark RDD and DataFrame

Reposted. Author: 行者123. Updated: 2023-12-02 18:46:02

I have data like the following:

It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the release of Image sheets containing Buddy passages, and more recently with desktop publishing
software like

1 long title 1
2 long title 2
3 long title 3
4 long title 4
5 long title 5
6 long title 6
7 long title 7
8 long title 8
9 long title 9
10 long title 10
11 long title 11
12 long title 12
13 long title 13
14 long title 14
15 long title 15
16 long title 16
17 long title 17
18 long title 18
19 long title 19
20 long title 20

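Now, when loading this text file, I have to exclude the junk data (the paragraph) and include only the data starting from "long title 1" (the column data). I am using an RDD, but I cannot get it to load correctly. Once the data is correctly populated in the RDD, I can convert it to a DataFrame. Below is my code: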
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf


sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
load_data=sc.textFile("E://long_sample.txt").filter(lambda x : "title")
load_data.foreach(print())

Even though I tried to filter on "title", I still get all of the data, which is not what I want. Please help me sort this out. No error is shown.

Best answer

Try this in PySpark:


>>> load_data=sc.textFile("file:///home/mahesh/Downloads/line_text.txt")

Filter the data with the in operator and create a DataFrame from the existing RDD:
>>> df = load_data.filter(lambda x: "title" in x).map(lambda x:(x.split(" ")[0],x.split(" ")[1]+" " + x.split(" ")[2],x.split(" ")[3] )).toDF(["Id","Name","Number"])

>>> df.show()
+---+----------+------+
| Id|      Name|Number|
+---+----------+------+
|  1|long title|     1|
|  2|long title|     2|
|  3|long title|     3|
|  4|long title|     4|
|  5|long title|     5|
|  6|long title|     6|
|  7|long title|     7|
|  8|long title|     8|
|  9|long title|     9|
| 10|long title|    10|
| 11|long title|    11|
| 12|long title|    12|
| 13|long title|    13|
| 14|long title|    14|
| 15|long title|    15|
| 16|long title|    16|
| 17|long title|    17|
| 18|long title|    18|
| 19|long title|    19|
| 20|long title|    20|
+---+----------+------+
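
For readers wondering why the question's filter kept every line: the predicate lambda x : "title" returns the string "title" itself, and a non-empty string is always truthy in Python, so nothing is dropped. The sketch below shows the difference in plain Python, without Spark. The sample lines and the variable names are illustrative assumptions, not the asker's actual file.

```python
# Sample lines mirroring the question's file: one paragraph line and two data rows.
lines = [
    "It has survived not only five centuries, but also the leap into electronic typesetting.",
    "1 long title 1",
    "2 long title 2",
]

# The question's predicate returns the string "title", which is always truthy,
# so no line is ever filtered out.
always_true = lambda x: "title"

# The answer's predicate performs an actual substring test on each line.
has_title = lambda x: "title" in x

kept_by_question = [line for line in lines if always_true(line)]
kept_by_answer = [line for line in lines if has_title(line)]

print(len(kept_by_question))  # 3
print(kept_by_answer)         # ['1 long title 1', '2 long title 2']
```

A related detail in the question's code: print() with parentheses is called immediately and evaluates to None, so foreach receives None rather than a function. Passing print without parentheses is likely what was intended.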

Let me know if you need more help.
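
If a paragraph line ever happened to contain the word "title", the substring test would let it through. One possible alternative, sketched here under the assumption that every data row has the shape "integer, name, integer", is to match rows with a regular expression. ROW_PATTERN and parse_row are hypothetical names introduced for illustration and are not part of the original answer.

```python
import re

# Assumed row shape: <integer id> <name words> <integer number>.
ROW_PATTERN = re.compile(r"^(\d+)\s+(.+?)\s+(\d+)$")


def parse_row(line):
    """Return (id, name, number) as strings for a data row, or None otherwise."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    return (match.group(1), match.group(2), match.group(3))


# Illustrative input: a paragraph fragment, a blank line, and one data row.
sample = [
    "software like",
    "",
    "10 long title 10",
]

rows = [parsed for parsed in map(parse_row, sample) if parsed is not None]
print(rows)  # [('10', 'long title', '10')]
```

With Spark, the same function could be applied along the lines of load_data.map(parse_row).filter(lambda r: r is not None).toDF(["Id", "Name", "Number"]), assuming an active SparkSession as in the PySpark shell.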

Regarding apache-spark - Filtering text file data into columns in a PySpark RDD and DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58361748/
