
Joining multiple subfolders inside an Azure Data Lake using ADF join

Reposted. Author: bug小助手. Updated: 2023-10-22 17:34:29



I am trying to join two folders inside a Gen2 container using the Azure Data Factory join activity. Each folder has multiple subfolders.
The structure of the container is as follows:


container/
    folder1/
        sub folder 1/
            file
        sub folder 2/
            file
    folder2/
        sub folder 1/
            file
        sub folder 2/
            file

When I try to preview the dataset in ADF, I get the following error:



(at Source 'csvparquet': Path abfss://contai[email protected]/directory does not resolve to any file(s). Please make sure the file/folder exists and is not hidden. At the same time, please ensure special character is not included in file/folder name, for example, name starting with _)



I renamed all the folders in the data lake to remove every special character, and I still get the same error.
How can I use a wildcard path to select all the files inside the subfolders of each folder, so that the two folders can be joined together?


Recommended answer


I tried the above scenario in my environment and got the same error in an ADF data flow.


This is the path in my Parquet dataset:

[Screenshot: file path in the Parquet dataset]


It gave the error below:

[Screenshot: error message]


To produce a data preview, an ADF data flow needs the path to end in a file name or in the name of the last child folder, that is, the folder that directly contains the files.
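As a local analogy only (this is not how ADF itself resolves paths), the following minimal Python sketch shows the difference between a parent folder that holds only subfolders and a last child folder that holds files directly. The folder names, the helper `directly_contains_files`, and the temporary directory are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def directly_contains_files(folder: Path) -> bool:
    """Return True if at least one regular file sits directly in `folder`."""
    return any(child.is_file() for child in folder.iterdir())


def build_sample_tree(root: Path) -> None:
    """Create a hypothetical folder1/<subfolder>/file.parquet layout under `root`."""
    for sub in ("subfolder1", "subfolder2"):
        leaf = root / "folder1" / sub
        leaf.mkdir(parents=True)
        (leaf / "file.parquet").touch()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_sample_tree(root)
    # folder1 holds only subfolders, so nothing is found directly inside it.
    parent_has_files = directly_contains_files(root / "folder1")
    # The last child folder holds the file itself.
    leaf_has_files = directly_contains_files(root / "folder1" / "subfolder1")

print(parent_has_files, leaf_has_files)
```

In this sketch the parent folder reports no files of its own, which mirrors why a path that stops at folder1 does not resolve to any file without a wildcard.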



"How would I use a wildcard path to select all the files inside the subfolders of each folder to be joined together?"



Joining is a different concept in ADF data flows: it means an inner join, outer join, and so on, performed with the data flow join transformation.
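The distinction between merging files and joining them can be sketched in plain Python. This is only a conceptual illustration, not ADF code: the row data, the column names, and the helpers `union` and `inner_join` are hypothetical.

```python
def inner_join(left, right, key):
    """Pair up rows from `left` and `right` that share the same `key` value."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def union(*row_sets):
    """Stack rows from several sources on top of each other."""
    return [row for rows in row_sets for row in rows]


# Hypothetical rows standing in for files in two subfolders of folder1.
sub1 = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
sub2 = [{"id": 3, "name": "c"}]
# Hypothetical rows standing in for folder2, with a different set of columns.
folder2 = [{"id": 1, "amount": 10}, {"id": 3, "amount": 30}]

folder1 = union(sub1, sub2)                  # merge: rows are appended
joined = inner_join(folder1, folder2, "id")  # join: rows are matched on a key

print(len(folder1), joined)
```

Merging keeps every row and the same columns, while the join keeps only the rows whose key appears on both sides and widens them with the other side's columns.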


But if what you want is to merge the files from the different subfolders of folder1 and preview the result, keep the file path in the dataset the same as above (data/folder1/) and give the data flow a wildcard path like the one below.


**/*.parquet



[Screenshot: wildcard path in the data flow]
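To get a feel for what a pattern such as `**/*.parquet` selects, here is a minimal local sketch using Python's `pathlib`. It assumes a hypothetical data/folder1 layout in a temporary directory, and Python's glob rules are only an approximation of ADF's wildcard handling, so treat it as an illustration of the idea, not of ADF's exact matching.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: data/folder1/<subfolder>/file.parquet
    folder1 = Path(tmp) / "data" / "folder1"
    for sub in ("subfolder1", "subfolder2"):
        (folder1 / sub).mkdir(parents=True)
        (folder1 / sub / "file.parquet").touch()
    # A non-Parquet file that the pattern should skip.
    (folder1 / "subfolder1" / "notes.txt").touch()

    # "**" descends through the subfolders; "*.parquet" filters by extension.
    matches = sorted(
        p.relative_to(folder1).as_posix() for p in folder1.glob("**/*.parquet")
    )

print(matches)
```

The pattern is evaluated relative to the folder named in the dataset path, so the dataset only has to point at folder1 while the wildcard picks up the files in every subfolder.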


If you want to read the files from all folders (folder1, folder2, ...) in the container data, give the dataset a path that stops at the container. Then, in the wildcard paths of the data flow, change the expression to **/**/*.parquet.


Data preview:

[Screenshot: data preview]


Comments

Thank you for your response. I am now able to preview the datasets, but I am getting null values in one of the data sources. The ADF data flow is unable to read that dataset, although I was able to read the same dataset in Azure Databricks without any problem. What do you think the issue is?

Do your files all have the same schema? If the schemas of the files differ, an ADF data flow returns null values for the extra columns. If possible, can you also share images of your data preview, the file path in the dataset, and the wildcard path?
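The effect described here can be reproduced with a small conceptual sketch in Python. It is not ADF code: the rows, the column names, and the helper `union_by_name` are hypothetical, and the sketch only shows why combining sources with different columns leaves gaps that appear as nulls.

```python
def union_by_name(*row_sets):
    """Stack rows from several sources, filling columns a row lacks with None."""
    columns = []
    for rows in row_sets:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [
        {name: row.get(name) for name in columns}
        for rows in row_sets
        for row in rows
    ]


# Two hypothetical files whose schemas only share the "id" column.
file_a = [{"id": 1, "city": "Oslo"}]
file_b = [{"id": 2, "country": "NO"}]

merged = union_by_name(file_a, file_b)
print(merged)
```

Each row ends up with every column from every source, and the columns it never had are filled with `None`, which is the same symptom as a block of all-null columns in a preview.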

I can't seem to upload the image; it says it's too large to upload.

The data preview contains 5 columns that are all null, but it also contains the 17 columns from the other data source, and those do contain data.

Try importing the projection in the data flow, then check the data preview again.
