gpt4 book ai didi

scala - 使用正则表达式时 Spark S3 访问被拒绝

转载 作者:可可西里 更新时间:2023-11-01 14:52:23 26 4
gpt4 key购买 nike

我想知道为什么尝试使用正则表达式从 S3 使用 Spark 读取数据时会有所不同?

我在“测试”桶中有一些文件:

/test/logs/2016-07-01/a.gz 
/test/logs/2016-07-02/a.gz
/test/logs/2016-07-03/a.gz

这两部作品:

val logRDD = sqlContext.read.json("s3a://test/logs/2016-07-01/*.gz")
or
val logRDD = sqlContext.read.json("s3n://test/logs/2016-07-01/*.gz")

但是当我这样做的时候:

val logRDD = sqlContext.read.json("s3a://test/logs/2016-07-0*/*.gz")

我明白了:

16/09/29 04:35:13 ERROR ApplicationMaster: User class threw exception: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxxx, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: xxx=
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxx, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: xxx=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:952)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:794)
at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1655)
at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:276)
at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:283)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$11.apply(ResolvedDataSource.scala:173)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$11.apply(ResolvedDataSource.scala:169)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:169)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
at com.test.LogParser$.main(LogParser.scala:295)
at com.test.LogParser.main(LogParser.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)

或者如果我使用这个:

val logRDD = sqlContext.read.json("s3n://test/logs/2016-07-0*/*.gz")

然后我明白了:

16/09/29 04:08:57 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: s3n://test/logs
org.apache.hadoop.security.AccessControlException: Permission denied: s3n://test/logs
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:449)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at org.apache.hadoop.fs.s3native.$Proxy42.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:530)
at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1674)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1134)
at org.apache.spark.sql.execution.datasources.json.InferSchema$.infer(InferSchema.scala:65)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:114)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:109)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:108)
at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:288)
at com.test.LogParser$.main(LogParser.scala:294)
at com.test.LogParser.main(LogParser.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
Caused by: org.jets3t.service.impl.rest.HttpException
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:423)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1038)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2250)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2179)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)
... 52 more

为什么它们不同?

最佳答案

像这样提供基本路径:

spark.read.option("basePath", basePath2).json(paths.toSeq:_*)

Base path是你要读取的所有路径中完全不修改的路径中最长的字符串。

关于scala - 使用正则表达式时 Spark S3 访问被拒绝,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/39761622/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com