
python - Reading/mounting a csv file inside train.py of an Azure ML Pipeline

Reposted. Author: 行者123. Updated: 2023-12-03 18:11:17

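We collect data from Eventhub and AppInsight and store it in Azure Blob storage. Using an AzureML pipeline, I want to pass the dataset into train.py and run it through two different pieces of logic (one for machine learning, the other for fraud analysis).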

However, I cannot read the csv file from inside train.py for further processing.

Here is my train.py, which runs through a PythonScriptStep in the Azure Machine Learning pipeline:

import argparse
import os
import pandas as pd

print("In train.py")

parser = argparse.ArgumentParser("train")

parser.add_argument("--input_data", type=str, help="input data")
parser.add_argument("--output_train", type=str, help="output_train directory")

args = parser.parse_args()

print("Argument 1: %s" % args.input_data)
df = pd.read_csv(args.input_data)
print(df.head())

print("Argument 2: %s" % args.output_train)

# Create the output directory only when the argument was supplied
if args.output_train is not None:
    os.makedirs(args.output_train, exist_ok=True)
    print("%s created" % args.output_train)

Here is the code that runs the pipeline:

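from azureml.core import Datastore, Experiment, Workspace
from azureml.core.compute import AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
def_blob_store = Datastore(ws, "basic_data_store")
aml_compute_target = "test-cluster"
try:
    aml_compute = AmlCompute(ws, aml_compute_target)
    print("found existing compute target.")
except ComputeTargetException:
    print("Error")

source_directory = './train'

# Reference to the csv file on the blob datastore
blob_input_data = DataReference(
    datastore=def_blob_store,
    data_reference_name="device_data",
    path_on_datastore="_fraud_data/test.csv")

# processed_data1 and run_config are defined elsewhere (not shown in the question)
trainStep = PythonScriptStep(
    script_name="train.py",
    arguments=["--input_data", blob_input_data, "--output_train", processed_data1],
    inputs=[blob_input_data],
    outputs=[processed_data1],
    compute_target=aml_compute,
    source_directory=source_directory,
    runconfig=run_config
)

# Build the pipeline from the step defined above and submit it
pipeline1 = Pipeline(workspace=ws, steps=[trainStep])
pipeline_run1 = Experiment(ws, 'Data_dependency').submit(pipeline1)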

In the output trace below, you can see that the Argument 1 line prints the path of the file:

Argument 1: /mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv

So the dataset is passed in successfully, but the file cannot be read inside train.py at the line pd.read_csv(args.input_data). It shows:

FileNotFoundError: [Errno 2] File b'/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv'

Here is the full trace from 70_driver_log.txt, which I downloaded from the azureml logs:

Preparing to call script [ train.py ] with arguments: ['--input_data', '/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv', '--output_train', '/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/processed_data1']
After variable expansion, calling script [ train.py ] with arguments: ['--input_data', '/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv', '--output_train', '/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/processed_data1']

In train.py
Argument 1: /mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv


The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 0.001172780990600586 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 136
Traceback (most recent call last):
File "train.py", line 18, in <module>
df = pd.read_csv(args.input_data) #str()
File "/azureml-envs/azureml_eb042e80b9a6abdb5821a78683153a38/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/azureml-envs/azureml_eb042e80b9a6abdb5821a78683153a38/lib/python3.6/site-packages/pandas/io/parsers.py", line 457, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/azureml-envs/azureml_eb042e80b9a6abdb5821a78683153a38/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "/azureml-envs/azureml_eb042e80b9a6abdb5821a78683153a38/lib/python3.6/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/azureml-envs/azureml_eb042e80b9a6abdb5821a78683153a38/lib/python3.6/site-packages/pandas/io/parsers.py", line 1917, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 689, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv' does not exist: b'/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv'

I have tried a relative path:

azureml/8d2b7bee-6cc5-4c8c-a685-1300a240de8f/mounts/basic_data_store/_fraud_data/test.csv

and also a URI:

wasbs://shohoz-container@shohozds.blob.core.windows.net/azureml/azureml/8d2b7bee-6cc5-4c8c-a685-1300a240de8f/mounts/basic_data_store/_fraud_data/test.csv

But both end with the same FileNotFoundError. I have been banging my head against the wall for the past three or four days. Any help would save my brain.
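
One way to narrow down a FileNotFoundError like this is to check, from inside train.py, how much of the reported path actually exists on the compute node. The helper below is a hypothetical debugging sketch and not part of the original post: the function names `deepest_existing_dir` and `describe_missing_path` are made up for illustration, and it uses only the standard library. It does not fix the mount, it only shows where the path stops being valid.

```python
import os


def deepest_existing_dir(path):
    """Return the deepest ancestor of `path` that exists as a directory."""
    current = os.path.abspath(path)
    while not os.path.isdir(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def describe_missing_path(path):
    """Report whether `path` exists and what its nearest existing directory contains."""
    existing = deepest_existing_dir(path)
    return {
        "exists": os.path.exists(path),
        "is_file": os.path.isfile(path),
        "deepest_existing_dir": existing,
        "entries": sorted(os.listdir(existing)),
    }
```

Calling `print(describe_missing_path(args.input_data))` just before `pd.read_csv` would show in 70_driver_log.txt whether the mount directory is empty, whether `_fraud_data` is missing, or whether the file name differs from what was expected.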

Best answer

You can use a PipelineDataset object to include a registered dataset in a PythonScriptStep. See https://learn.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedataset?view=azure-ml-py for more details and examples.
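
A minimal sketch of what this could look like with the v1 azureml SDK, under several assumptions: the csv has already been registered as a tabular dataset under the hypothetical name "fraud_test_csv", the input name "device_data" is reused from the question, and the helper names `build_train_step` and `load_named_input` are invented for illustration. The answer itself does not spell out this code, so treat it as one possible reading rather than the author's exact method.

```python
def build_train_step(ws, compute_target, run_config, processed_data1,
                     dataset_name="fraud_test_csv"):
    """Pipeline side: pass a registered tabular dataset to train.py as a named input."""
    # Imported here so this module can be loaded without the SDK installed.
    from azureml.core import Dataset
    from azureml.pipeline.steps import PythonScriptStep

    # dataset_name is a hypothetical registration name (assumption).
    dataset = Dataset.get_by_name(ws, name=dataset_name)
    named_input = dataset.as_named_input("device_data")

    return PythonScriptStep(
        script_name="train.py",
        arguments=["--output_train", processed_data1],
        inputs=[named_input],
        outputs=[processed_data1],
        compute_target=compute_target,
        source_directory="./train",
        runconfig=run_config,
    )


def load_named_input(run, name="device_data"):
    """train.py side: materialize the named input as a pandas DataFrame."""
    return run.input_datasets[name].to_pandas_dataframe()
```

Inside train.py, the second helper would be called as `df = load_named_input(Run.get_context())` after `from azureml.core import Run`. With this approach the script no longer needs the `--input_data` path argument, because the data arrives through the run context instead of a mounted file path.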


This post on python - Reading/mounting a csv file inside train.py of an Azure ML Pipeline is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/60202243/
