
python - PySpark fails to read a CSV correctly

Reposted. Author: 行者123. Updated: 2023-11-30 21:55:43

I am saving a Pandas DataFrame with 318477 rows to a CSV file using df.to_csv("preprocessed_data.csv"). When I load this file in another notebook:

import pandas as pd

df = pd.read_csv("preprocessed_data.csv")
len(df)

# out: 318477

The row count is as expected. However, when I try to load the dataset with PySpark:

# assumes an existing SparkSession named `spark`
spark_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load("preprocessed_data.csv")
)
spark_df.count()

# out: 6422020

df_test = spark.sql("SELECT * FROM csv.`preprocessed_data.csv`")
df_test.count()

# out: 6422020

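The row count is wrong. The 6422020 rows it reads is the number of physical lines in the CSV file, because some rows have content that spans multiple lines (see https://imgur.com/a/qWd9jtq ).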
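To make the mismatch concrete, here is a minimal sketch using only Python's standard library. The `SAMPLE` string is a hypothetical miniature of the file (invented for illustration, not taken from the real data): a header plus two records, the second of which has a quoted field containing line breaks. Counting newline characters gives the number of physical lines, while a quote-aware parser gives the number of records.

```python
import csv
import io

# Hypothetical miniature of the file: a header and two records. The second
# record has a quoted field that spans three physical lines.
SAMPLE = (
    "id,title,description\n"
    '120,teacher,"plain text"\n'
    '121,fabricator,"<p>first line</p>\n'
    "<p>second line</p>\n"
    '<p>third line</p>"\n'
)

# Physical lines: what a line-based reader sees.
physical_lines = SAMPLE.count("\n")

# Logical records: what a quote-aware CSV parser sees (header included).
records = list(csv.reader(io.StringIO(SAMPLE, newline="")))

print(physical_lines)    # 5
print(len(records) - 1)  # 2 data records once the header is excluded
```

A reader that splits on every newline before parsing quotes will report something close to the first number, which is the behavior described above.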

How can I fix this? Do I need to save the CSV in some way that leaves no line breaks in any text field, or can I configure the CSV reader in PySpark more precisely?

This is a follow-up to my previous question (link); I now understand the problem better.

Rows from the CSV file:

120,teacher industrial design technology mabel park state high school,teach queensland,2018-10-07,brisbane,southern suburbs logan,education training,teaching secondary,mabel park state high school invites applications for a industrial design and technology teacher,,0,30,,0.0,0.03003003003003003
121,fabricatorinstaller,workplace access safety,2018-10-07,melbourne,bayside south eastern suburbs,trades services,welders boilermakers,trade qualified person with skills in welding and fabrication to assist in the manufacturing and installation of our custom height safety products,"<p>&nbsp;</p>
<p><strong><em>*&nbsp; Secure long term role with genuine career path to supervisor</em></strong></p>
<p><strong><em>*&nbsp; Competitive hourly rate with regular opportunity for overtime</em></strong></p>
<p><strong><em>*&nbsp; Full on-the-job training</em></strong></p>
<p><strong>About the&nbsp;role</strong></p>
<p>Having recently won a significant new national contract we are looking for another trade qualified person with welding and fabrication skills to help manage increased demands on our production and installation departments.&nbsp; This role will
see you involved in both manufacturing and on-site installation and there is a genuine career path to supervisor if that is your goal.&nbsp; Initially your role will require you to:-</p>
<ul>
<li>read and interpret drawings&nbsp;</li>
<li>fabricate and assemble orders as required</li>
<li>provide input to enhance factory processes</li>
<li>pack&nbsp;and dispatch orders</li>
<li>perform on-site installations (full training will be given)</li>
</ul>
<p><strong>About you</strong></p>
<p>This role is ideal for a trade qualified person&nbsp;(welder, boilermaker, fabricator etc) with good hands-on skills who will enjoy&nbsp;dividing their time between&nbsp;factory/manufacturing and on-site installations.&nbsp; Because installations
invariably take place on the roof, physical fitness is&nbsp;essential.</p>
<p><strong>What we offer</strong></p>
<ul>
<li>A secure, long-term role with a successful, well-established organisation</li>
<li>Full, ongoing on-the-job training</li>
<li>Opportunity for career progression to supervisor for the right person</li>
<li>Opportunity to work&nbsp;in a safe, supportive and friendly environment</li>
<li>Competitive hourly rate with regular opportunities for overtime</li>
<li>Occasional regional and interstate travel in response to major projects</li>
</ul>
<p><strong>How to apply</strong></p>
<p>Please copy and paste the URL below into your browser (it is <em>not</em> a live link so&nbsp;must be copied and pasted).&nbsp; This will take you to our custom online application form which includes a number of screening questions&nbsp;and a
profiling checklist which is an essential part of our application process.</p>
<p><strong>https://exenet.expr3ss.com/jobDetails?selectJob=296&amp;</strong></p>
<p>If you have any difficulties or would like more information please email <a class=""_2L3qcJ0"" data-contact-match=""true"" href=""mailto:gayle@exhr.com.au"">gayle@exhr.com.au</a> or phone <a class=""_2hhDNI-"" data-contact-match=""true"" href=""tel:0468 336 224"">0468 336 224</a>.</p>",0,30,full time,0.0,0.03003003003003003
122,boilermaker,rpm contracting qld pl,2018-10-07,brisbane,southern suburbs logan,trades services,welders boilermakers,perm rate 30 structural steel fab weld out located southside full time hours ongoing work ot modern clean facility offering great conditions,"<p>One of Australia's best engineering workshops is hiring!</p>
<p>They have ongoing, rolling projects and need good people now.</p>
<p>They are partnered with state and federal governments, international minerals and energy companies, and other market leading entities.</p>
<p>The workshop is state of the art, clean, and well-managed. There is a genuine focus on the safety and wellbeing of their people.</p>
<p>The facility and conditions are truly exceptional.</p>
<p>Secure and long term positions are on offer for forward-thinking, cooperative and professional tradesmen.</p>
<p>We are looking for qualified and/or ticketed boilermakers and 1st class welders that can offer high level trade skills.</p>
<p>Equally important is a cooperative, team-orientated attitude and a willingness to become involved and take ownership of their important role in this company.</p>
<p>They are building on a stable, permanent team, so candidates who step up can look forward to a secure future.</p>
<p>The position is ongoing, offering full-time hours, exceptional conditions, and penalties.</p>
<p>You require own car and licence, PPE and tools, relevant experience and to be available for an immediate start.</p>
<p>Good luck and kind regards,</p>
<p>RPM</p>",0,30,full time,0.0,0.03003003003003003



Best answer

Using the sample provided, I tried the following code, and it returned 3 rows:

>>> df = spark.read.csv('file:///tmp/test.csv', sep=',', multiLine=True)
>>> df.count()
3
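For the full file, a slightly fuller set of reader options may be needed. The sketch below is an assumption layered on top of the answer, not part of it. In the sample rows, quotes inside the HTML field are doubled (`""`), which is how pandas escapes them by default, whereas Spark's CSV reader uses a backslash as its default escape character, so setting `escape` to `"` is a common adjustment. `header=True` assumes the file still carries the header row that pandas wrote. The helper name `read_multiline_csv` and the `CSV_OPTIONS` dict are hypothetical.

```python
# Reader options passed as keyword arguments to DataFrameReader.csv.
# Everything except multiLine is an assumption beyond the original answer.
CSV_OPTIONS = {
    "header": True,     # first record holds the column names
    "multiLine": True,  # allow quoted fields to contain line breaks
    "quote": '"',       # fields are wrapped in double quotes
    "escape": '"',      # embedded quotes are written as "" by pandas
}


def read_multiline_csv(spark, path):
    """Read a CSV whose quoted fields may span several physical lines.

    `spark` is expected to be an existing SparkSession.
    """
    return spark.read.csv(path, **CSV_OPTIONS)
```

With a live session this would be called as `read_multiline_csv(spark, "preprocessed_data.csv").count()`; whether the result matches the pandas row count depends on the actual file contents.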

If that still does not work for you, I would try forcing pandas to write the file with explicit quoting and an explicit delimiter.
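One way to read that suggestion is sketched below, under assumptions: the helper name `save_quoted_csv`, the sample frame, and the temporary path are all hypothetical. `csv.QUOTE_ALL` makes pandas wrap every field in double quotes, so a downstream reader never has to guess where a multi-line field starts or ends, and `index=False` drops the unnamed index column that `to_csv` writes by default.

```python
import csv
import os
import tempfile

import pandas as pd


def save_quoted_csv(df, path):
    """Write df with an explicit delimiter and every field quoted."""
    df.to_csv(
        path,
        index=False,            # do not write the DataFrame index
        sep=",",                # explicit delimiter
        quotechar='"',          # explicit quote character
        quoting=csv.QUOTE_ALL,  # quote every field, not only the risky ones
    )


# Hypothetical two-row frame; the second description contains a line break.
sample = pd.DataFrame(
    {
        "id": [120, 121],
        "description": ["plain text", "<p>line one</p>\n<p>line two</p>"],
    }
)

tmp_path = os.path.join(tempfile.mkdtemp(), "quoted_sample.csv")
save_quoted_csv(sample, tmp_path)

# Round trip through pandas to confirm the record count survives.
reloaded = pd.read_csv(tmp_path)
print(len(reloaded))  # 2
```

Quoting every field makes the file larger, but the record boundaries become unambiguous for any reader that honors quotes.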

For more on this topic (PySpark not reading a CSV correctly), see the similar question on Stack Overflow: https://stackoverflow.com/questions/56258744/
