hadoop - YARN Java processes are not being killed

Reposted. Author: 可可西里. Updated: 2023-11-01 16:30:54

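I have installed Apache Samza, which uses YARN to manage jobs. It runs on two Debian servers in virtual machines. Samza is version 0.9.1 and Hadoop is version 2.6.0. I am seeing two different problems. I am not sure whether they are related, but in both cases YARN does not seem to be doing what it should.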

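  • When I try to kill a job with the script Samza provides (kill-yarn-job.sh), the web interface shows the job's state changing from RUNNING or ACCEPTED to KILLED, but the Java processes keep running. Even after a long time, the only way to kill them is the hard way: kill -9.
  • Although I have been changing the values in yarn-site.xml, I can only run one job. My machine has 4 GB of memory and 4 CPU cores.

Here is the content of yarn-site.xml: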

<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>kfk-samza01</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>3</value>
</property>
</configuration>

I added the following to the job's configuration options file:

yarn.container.memory.mb=256
yarn.am.container.memory.mb=256

task.opts= -Xms128M -Xmx128M

When the job runs, I can see that the -Xms128M -Xmx128M options are ignored and the default values are used.

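I see the following error. It looks like some memory limit is preventing the job from moving from ACCEPTED to RUNNING, but I cannot find a way around it.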

Container [pid=23007,containerID=container_1443454508386_0003_01_000001] is running beyond virtual memory limits. Current usage: 13.9 MB of 256 MB physical memory used; 1.1 GB of 537.6 MB virtual memory used. Killing container

The jobs are really just clean functions, so none of my code should be introducing noise.

Any idea what the problem is?

Update: after staying in the ACCEPTED state for about 10 minutes, the job goes to FAILED. This is part of what I see in the yarn-root-resourcemanager-kfk-samza01.out log:

2015-09-30 14:08:07,000 INFO  [ResourceManager Event Processor] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=root  OPERATION=AM Allocated Container     TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1443613686881_0001    CONTAINERID=container_1443613686881_0001_02_000001
2015-09-30 14:08:07,000 INFO [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(153)) - Assigned container container_1443613686881_0001_02_000001 of capacity <memory:1024, vCores:1> on host kfk-samza01:44816, which has 1 containers, <memory:1024, vCores:1> used and <memory:7168, vCores:7> available after allocation
2015-09-30 14:08:07,001 INFO [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:assignContainer(1580)) - assignedContainer application attempt=appattempt_1443613686881_0001_000002 container=Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 clusterResource=<memory:16384, vCores:16>
2015-09-30 14:08:07,002 INFO [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:assignContainersToChildQueues(559)) - Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>, usedCapacity=0.0625, absoluteUsedCapacity=0.0625, numApps=1, numContainers=1
2015-09-30 14:08:07,002 INFO [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:assignContainers(424)) - assignedContainer queue=root usedCapacity=0.0625 absoluteUsedCapacity=0.0625 used=<memory:1024, vCores:1> cluster=<memory:16384, vCores:16>
2015-09-30 14:08:07,005 INFO [AsyncDispatcher event handler] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : kfk-samza01:44816 for container : container_1443613686881_0001_02_000001
2015-09-30 14:08:07,009 INFO [AsyncDispatcher event handler] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from ALLOCATED to ACQUIRED
2015-09-30 14:08:07,009 INFO [AsyncDispatcher event handler] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,010 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(1830)) - Storing attempt: AppId: application_1443613686881_0001 AttemptId: appattempt_1443613686881_0001_000002 MasterContainer: Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ]
2015-09-30 14:08:07,010 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from SCHEDULED to ALLOCATED_SAVING
2015-09-30 14:08:07,011 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED_SAVING to ALLOCATED
2015-09-30 14:08:07,012 INFO [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:run(253)) - Launching masterappattempt_1443613686881_0001_000002
2015-09-30 14:08:07,018 INFO [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:launch(106)) - Setting up container Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] for AM appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,019 INFO [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:createAMContainerLaunchContext(191)) - Command to launch container container_1443613686881_0001_02_000001 : export SAMZA_LOG_DIR=<LOG_DIR> && ln -sfn <LOG_DIR> logs && exec ./__package/bin/run-am.sh 1>logs/stdout 2>logs/stderr
2015-09-30 14:08:07,020 INFO [pool-1-thread-3] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,020 INFO [pool-1-thread-3] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,064 INFO [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:launch(127)) - Done launching container Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] for AM appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,065 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED to LAUNCHED
2015-09-30 14:08:08,001 INFO [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from ACQUIRED to RUNNING
2015-09-30 14:21:26,930 INFO [Ping Checker] util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(127)) - Expired:appattempt_1443613686881_0001_000002 Timed out after 600 secs
2015-09-30 14:21:26,931 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1125)) - Updating application attempt appattempt_1443613686881_0001_000002 with final state: FAILED, and exit status: -1000
2015-09-30 14:21:26,931 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from LAUNCHED to FINAL_SAVING
2015-09-30 14:21:26,932 INFO [AsyncDispatcher event handler] resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(677)) - Unregistering app attempt : appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,932 INFO [AsyncDispatcher event handler] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,933 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from FINAL_SAVING to FAILED
2015-09-30 14:21:26,933 INFO [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(1208)) - The number of failed attempts is 2. The max attempts is 2
2015-09-30 14:21:26,935 INFO [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:rememberTargetTransitionsAndStoreState(995)) - Updating application application_1443613686881_0001 with final state: FAILED
2015-09-30 14:21:26,937 INFO [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1443613686881_0001 State change from ACCEPTED to FINAL_SAVING
2015-09-30 14:21:26,938 INFO [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(790)) - Application Attempt appattempt_1443613686881_0001_000002 is done. finalState=FAILED
2015-09-30 14:21:26,938 INFO [AsyncDispatcher event handler] recovery.RMStateStore (RMStateStore.java:transition(161)) - Updating info for app: application_1443613686881_0001
2015-09-30 14:21:26,939 INFO [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from RUNNING to KILLED
2015-09-30 14:21:26,939 INFO [ResourceManager Event Processor] fica.FiCaSchedulerApp (FiCaSchedulerApp.java:containerCompleted(113)) - Completed container: container_1443613686881_0001_02_000001 in state: KILLED event:KILL
2015-09-30 14:21:26,939 INFO [ResourceManager Event Processor] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=root OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1443613686881_0001 CONTAINERID=container_1443613686881_0001_02_000001
2015-09-30 14:21:26,940 INFO [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(216)) - Released container container_1443613686881_0001_02_000001 of capacity <memory:1024, vCores:1> on host kfk-samza01:44816, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:8> available, release resources=true
2015-09-30 14:21:26,940 INFO [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(945)) - Application application_1443613686881_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1443613686881_0001_000002 timed out. Failing the application.
2015-09-30 14:21:26,940 INFO [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:releaseResource(1732)) - default used=<memory:0, vCores:0> numContainers=0 user=root user-resources=<memory:0, vCores:0>
2015-09-30 14:21:26,943 INFO [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:completedContainer(1683)) - completedContainer container=Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 cluster=<memory:16384, vCores:16>
2015-09-30 14:21:26,943 INFO [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:completedContainer(604)) - completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:16384, vCores:16>
2015-09-30 14:21:26,944 INFO [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:completedContainer(622)) - Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0
2015-09-30 14:21:26,944 INFO [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1274)) - Application attempt appattempt_1443613686881_0001_000002 released container container_1443613686881_0001_02_000001 on node: host: kfk-samza01:44816 #containers=0 available=8192 used=0 with event: KILL
2015-09-30 14:21:26,945 INFO [ResourceManager Event Processor] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(115)) - Application application_1443613686881_0001 requests cleared
2015-09-30 14:21:26,945 INFO [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(682)) - Application removed - appId: application_1443613686881_0001 user: root queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2015-09-30 14:21:26,946 INFO [pool-1-thread-4] amlauncher.AMLauncher (AMLauncher.java:run(267)) - Cleaning master appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,948 INFO [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1443613686881_0001 State change from FINAL_SAVING to FAILED
2015-09-30 14:21:26,949 INFO [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:removeApplication(372)) - Application removed - appId: application_1443613686881_0001 user: root leaf-queue of parent: root #applications: 0
2015-09-30 14:21:26,951 WARN [AsyncDispatcher event handler] resourcemanager.RMAuditLogger (RMAuditLogger.java:logFailure(263)) - USER=root OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1443613686881_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1443613686881_0001_000002 timed out. Failing the application. APPID=application_1443613686881_0001
2015-09-30 14:21:26,955 INFO [AsyncDispatcher event handler] resourcemanager.RMAppManager$ApplicationSummary (RMAppManager.java:logAppSummary(179)) - appId=application_1443613686881_0001,name=flow.Router_1,user=root,queue=default,state=FAILED,trackingUrl=http://kfk-samza01:8088/cluster/app/application_1443613686881_0001,appMasterHost=N/A,startTime=1443614243319,finishTime=1443615686935,finalStatus=FAILED

Any clues?

Best answer

Try the following job configuration properties to limit the containers' memory allocation.

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb

In your case, both of these properties can be set to 256 MB.

Also configure the following two properties:

mapreduce.map.java.opts
mapreduce.reduce.java.opts

In your case, these two properties should set a heap size of 128 MB (for example, -Xmx128m).

[Note: the heap sizes in the two *.java.opts values above must be somewhat lower than the corresponding *.memory.mb values.]
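As a rough sketch, the four properties above could be written as follows. It assumes they are set in mapred-site.xml, which the answer does not specify; the values are the ones suggested above. The source does not confirm that a Samza job reads these MapReduce properties.

```xml
<!-- Hypothetical mapred-site.xml fragment; the file location is an assumption. -->
<configuration>
  <!-- Container size requested for each map and reduce task, in MB. -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>256</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>256</value>
  </property>
  <!-- JVM heap, kept below the container size to leave room for non-heap memory. -->
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx128m</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx128m</value>
  </property>
</configuration>
```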

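If you still run into the virtual memory problem, try adjusting the ratio of virtual to physical memory allowed per container by configuring the following property.

yarn.nodemanager.vmem-pmem-ratio

The default is 2.1, which is why the 256 MB container in the error above is limited to 537.6 MB of virtual memory (256 × 2.1). Because the limit is the container's physical memory multiplied by this ratio, raise the value (rather than lower it) if containers are still killed for exceeding virtual memory.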
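For illustration, a yarn-site.xml fragment along these lines would change the ratio. The error in the question reports a limit of 537.6 MB for a 256 MB container, which is 256 × 2.1, so a larger ratio gives a larger virtual memory limit. The value 4 is an arbitrary example (256 MB × 4 = 1024 MB, still below the 1.1 GB reported in the log), so treat this as a sketch to tune rather than a recommended setting.

```xml
<!-- Illustrative yarn-site.xml fragment; the value 4 is an assumption, not from the source. -->
<configuration>
  <property>
    <!-- Virtual memory limit per container = container physical memory x this ratio. -->
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
  </property>
  <!-- Alternative, not mentioned in the answer: the virtual memory check can be
       switched off by setting yarn.nodemanager.vmem-check-enabled to false. -->
</configuration>
```

The NodeManager typically needs a restart before a change to yarn-site.xml takes effect.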

Once these properties are set correctly, the containers should be cleaned up after the job completes successfully.

Hope this helps.

Regarding hadoop - YARN Java processes are not being killed, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32838650/
