
apache-spark - pyspark loading multiple partitioned files in a single load


I am trying to load multiple files in a single load. They are all partitioned files. It works when I try it with 1 file, but when I list 24 files it gives me the error below. I can't find any documentation about a limit, or a workaround other than loading the files one by one and unioning them afterwards. Is there another option?

The code below reproduces the problem:

basePath = '/file/'
paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc', ]

df = sqlContext.read.format('orc') \
    .options(header='true', inferschema='true', basePath=basePath) \
    .load(*paths)

The error received:

 TypeError                                 Traceback (most recent call last)
<ipython-input-43-7fb8fade5e19> in <module>()

---> 37 df = sqlContext.read.format('orc') .options(header='true', inferschema='true',basePath=basePath) .load(*paths)
38

TypeError: load() takes at most 4 arguments (24 given)

Best Answer

As stated in the official documentation, to read multiple files you should pass a list:

path – optional string or a list of string for file-system backed data sources.

So in your case:

(sqlContext.read
.format('orc')
.options(basePath=basePath)
.load(path=paths))

Argument unpacking (*) would only make sense if load were defined with variadic arguments, for example:

def load(this, *paths):
...
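
For completeness, here is a minimal self-contained sketch of passing a list of paths to load, using the newer SparkSession entry point rather than the sqlContext from the question; the session name, file paths, and the printed outputs are illustrative assumptions, not taken from the original post:

from pyspark.sql import SparkSession

# Hypothetical session and paths, for illustration only
spark = SparkSession.builder.appName("multi-orc-load").getOrCreate()

basePath = '/file/'
paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc']

# Pass the list itself (no * unpacking); load() accepts a string or a list of strings
df = (spark.read
      .format('orc')
      .option('basePath', basePath)  # lets Spark recover partition columns from the directory layout
      .load(paths))

df.printSchema()
print(df.count())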

Regarding apache-spark - pyspark loading multiple partitioned files in a single load, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48344580/
