
python - Conditionally splitting comma-separated values in a PySpark list

Reposted · Author: 行者123 · Updated: 2023-12-01 04:00:19

I am trying to run a job in PySpark. My data is in an RDD created with the PySpark SparkContext (sc), like this:

directory_file = sc.textFile('directory.csv')

*I don't believe Python's csv module can be used on data that is already in an RDD.

This produces a list for each line in the csv. I know it's ugly, but here is an example of one such list (equivalent to one row of the original csv):

[u'14K685,El Puente Academy for Peace and Justice,Brooklyn,K778,718-387-1125,718-387-4229,9,12,,,"B24, B39, B44, B44-SBS, B46, B48, B57, B60, B62, Q54, Q59","G to Broadway ; J, M to Hewes St ; Z to Marcy Ave",250 Hooper Street,Brooklyn,NY,11211,www.elpuente.us,225,N/A,Consortium School,"We are a small, innovative learning community that promotes comprehensive academic excellence for all students while inspiring and nurturing leadership for peace and justice. Our state-of-the-art facility allows for a creative and intellectually challenging environment where every student thrives. Our project-based curriculum is designed to prepare students to be active citizens and independent thinkers who share a passion for transforming their communities and the world into a better place. Our trimester system allows students to complete most of their high school credits by the 11th grade, opening opportunities for exciting internships and college courses during the school day in their senior year.","Accelerated credit accumulation (up to 18 credits per year), iLearn, iZone 360, Year-long SAT (Scholastic Aptitude Test) preparatory course, Individualized college counseling, Early College Awareness & Preparatory Program (ECAPP). 
Visits to college campuses in NYC, Visits to colleges outside NYC in partnership with the El Puente Leadership Center, Internships, Community-based Projects, Portfolio Assessment, Integrated-Arts Projects, Before- and After-school Tutoring; Elective courses include: Drama, Dance (Men\'s and Women\'s Groups), Debate Team partnership with Tufts University, Guitar, Filmmaking, Architecture, Glee",Spanish,,,,"AM and PM Academic Support, B-Boy/B-Girl, Chorus, College and Vocational Counseling and Placement, College Prep, Community Development Project, Computers, Dance Level 1 and 2, Individual Drama; Education for Public Inquiry and International Citizenship (EPIIC), El Puente Leadership Center, Film, Fine Arts, Liberation, Media, Men\u2019s and Women\u2019s Groups, Movement Theater Level 1, Movement Theater Level 2, Music, Music Production, Pre-professional training in Dance, PSAT/SAT Prep, Spoken Word, Student Council, Teatro El Puente, Visual Art",,,,"Boys & Girls Basketball, Baseball, Softball, Volleyball",El Puente Williamsburg Leadership Center; The El Puente Bushwick Center; Leadership Center at Taylor-Wythe Houses; Beacon Leadership Center at MS50.,"Woodhull Medical Center, Governor Hospital","Hunter College (CUNY), Eugene Lang College The New School for Liberal Arts, Pratt College of Design, Tufts University, and Touro College.","El Puente Leadership Center, El Puente Bushwick Center, Beacon Leadership Center at MS50, Leadership Center at Taylor-Wythe Houses, Center for Puerto Rican Studies, Hip- Hop Theatre Festival, Urban Word, and Summer Search.",,,,,Our school requires assessment of an Academic Portfolio for graduation.,,9:00 AM,3:30 PM,This school will provide students with disabilities the supports and services indicated on their IEPs.,ESL,Not Functionally Accessible,1,Priority to Brooklyn students or residents,Then to New York City residents,,,,,,,,,"250 Hooper Street']

I want to split each list item using the comma as a delimiter, except where the comma falls between double quotes (e.g. ", , ,").

parsed = directory_file.map(lambda x: x.split(',')) obviously does not handle the commas between double quotes. Is there a way to do this? I have seen this question, which refers specifically to the csv module, but since in this case the csv has already been loaded into a Spark RDD, I'm fairly sure the csv module doesn't apply here.

Thanks.
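(A side note on the assumption above: the csv module cannot consume an RDD wholesale, but it can be applied to each line individually inside a map, and it does respect quoted commas. A minimal sketch; the helper name parse_csv_line is my own:)

```python
import csv
from io import StringIO

def parse_csv_line(line):
    # csv.reader takes an iterable of lines; wrapping the single string
    # lets it split on commas while leaving quoted commas intact.
    return next(csv.reader(StringIO(line)))

parse_csv_line('col1,"col2,blabla",col3')
# ['col1', 'col2,blabla', 'col3']
```

On the RDD from the question this would be applied as something like directory_file.map(parse_csv_line).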

Best Answer

You can use a regular expression. It runs very fast in PySpark:

import re

rdd = sc.textFile("factbook.csv")

# Get rid of the comma we do not need (the one between double quotes)
pattern = re.compile(r'(.*".*)(,)(.*".*)', re.I)

def clean(line):
    m = pattern.match(line)
    return m.group(1) + " " + m.group(3) if m else line

cleanedRdd = rdd.map(clean)

So, for each line similar to this:

col1,"col2,blabla",col3

the code matches the regex pattern. When the pattern is found, it creates three groups:

  • Group 1: col1,"col2
  • Group 2: ,
  • Group 3: blabla",col3

Finally we concatenate group 1 and group 3, and the output will be:

col1,"col2 blabla",col3
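A quick standalone check of that grouping (the pattern is copied from the answer, applied to the sample line above):

```python
import re

line = 'col1,"col2,blabla",col3'
m = re.match(r'(.*".*)(,)(.*".*)', line, re.I)

print(m.group(1))                     # col1,"col2
print(m.group(3))                     # blabla",col3
print(m.group(1) + " " + m.group(3))  # col1,"col2 blabla",col3
```

Note that a single match removes only one quoted comma, so a line containing several of them (like the sample row in the question) would need the substitution applied repeatedly.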

Regarding "python - Conditionally splitting comma-separated values in a PySpark list", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36680362/
