• [MindSpore Function] Running mindspore.ops operators on Ascend 910: AI Core utilization is 0, and launching multiple processes raises an error


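    When I run operators from mindspore.ops on an Ascend 910A training card, the NPU AI Core utilization stays at 0 during execution. Launching multiple processes does not help: the AI Core utilization is still 0, and an error is raised.
    How can I increase AI Core utilization?
    Operator code example: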
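    import numpy as np
    import mindspore
    from mindspore import Tensor, ops
    from multiprocessing import Process

    def foo():
        # Run a single Conv2D operator on constant inputs.
        x = Tensor(np.ones([10, 32, 32, 32]), mindspore.float32)
        weight = Tensor(np.ones([32, 32, 3, 3]), mindspore.float32)
        conv2d = ops.Conv2D(out_channel=32, kernel_size=3)
        output = conv2d(x, weight)
        print(output.shape)

    if __name__ == "__main__":
        # Start 10 processes that each run the operator once.
        for i in range(10):
            p = Process(target=foo)
            p.start()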

    [Steps and observed problem]

    Process Process-6:
    Traceback (most recent call last):
      File "/usr/local/python3.7.5/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
        self.run()
      File "/usr/local/python3.7.5/lib/python3.7/multiprocessing/process.py", line 99, in run
        self._target(*self._args, **self._kwargs)
      File "testops.py", line 12, in foo
        output = conv2d(x, weight)
      File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 247, in __call__
        return _run_op(self, self.name, args)
      File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 78, in wrapper
        results = fn(*arg, **kwargs)
      File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 682, in _run_op
        output = real_run_op(obj, op_name, args)
    RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:364 Init] Ascend error occurred, error message: EE9999: Inner Error!
    [driver interface] halMemAlloc failed: device_id=0, size=32212254720, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocHugePageManaged][FILE:npu_driver.cc][LINE:700]
    [driver interface] halMemAlloc failed: size=32212254720, deviceId=0, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocManaged][FILE:npu_driver.cc][LINE:739]
    DevMemAlloc huge page failed: deviceId=0, type=2, size=32212254720, retCode=117571606![FUNC:DevMemAllocOnline][FILE:npu_driver.cc][LINE:858]
    Device malloc failed, size=32212254720, type=2.[FUNC:DevMalloc][FILE:logger.cc][LINE:349]
    rtMalloc execute failed, reason=[driver error:out of memory][FUNC:ReportFuncErrorReason][FILE:error_message_manage.cc][LINE:41]

    First error scene API: mindspore/ccsrc/runtime/device/ascend/ascend_memory_manager.cc:64 MallocDeviceMemory] Malloc device memory failed, size[32212254720], ret[207001], Device 0 may be other processes occupying this card, check as: ps -ef|grep python
    #
    #
    [EXCEPTION] DEVICE(118536,ffff66fdf200,python3.7):2021-12-28-11:19:16.296.973 [mindspore/ccsrc/runtime/device/ascend/ascend_memory_manager.cc:64] MallocDeviceMemory] Malloc device memory failed, size[32212254720], ret[207001], Device 0 may be other processes occupying this card, check as: ps -ef|grep python
    [EXCEPTION] DEVICE(118536,ffff66fdf200,python3.7):2021-12-28-11:19:16.297.154 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:364] Init] Ascend error occurred, error message: EE9999: Inner Error!
    [driver interface] halMemAlloc failed: device_id=0, size=32212254720, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocHugePageManaged][FILE:npu_driver.cc][LINE:700]
    [driver interface] halMemAlloc failed: size=32212254720, deviceId=0, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocManaged][FILE:npu_driver.cc][LINE:739]
    DevMemAlloc huge page failed: deviceId=0, type=2, size=32212254720, retCode=117571606![FUNC:DevMemAllocOnline][FILE:npu_driver.cc][LINE:858]
    Device malloc failed, size=32212254720, type=2.[FUNC:DevMalloc][FILE:logger.cc][LINE:349]
    rtMalloc execute failed, reason=[driver error:out of memory][FUNC:ReportFuncErrorReason][FILE:error_message_manage.cc][LINE:41]

    First error scene API: mindspore/ccsrc/runtime/device/ascend/ascend_memory_manager.cc:64 MallocDeviceMemory] Malloc device memory failed, size[32212254720], ret[207001], Device 0 may be other processes occupying this card, check as: ps -ef|grep python
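To read the log above more easily, the requested allocation size can be converted to GiB. The snippet below is only a small sketch: `LOG_LINE` is an excerpt copied from the log, and `requested_gib` is a hypothetical helper name, not part of MindSpore.

```python
import re

# Excerpt copied from the error log above.
LOG_LINE = "Malloc device memory failed, size[32212254720], ret[207001]"


def requested_gib(line):
    """Return the size[...] value found in a log line, converted to GiB."""
    match = re.search(r"size\[(\d+)\]", line)
    if match is None:
        return None
    return int(match.group(1)) / 2**30


print(requested_gib(LOG_LINE))  # 30.0
```

So the failing process asks for 30 GiB on device 0 in a single allocation. Assuming each of the 10 processes attempts the same allocation on the same card, they cannot all succeed, which would be consistent with the "out of memory" and "other processes occupying this card" messages.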

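    Check whether the test case is actually running the operator on Ascend, that is, whether the device target set in the script's context is Ascend. Also, a single operator executes very quickly, so the test case may already have finished by the time you look at the utilization. If you do need to see the utilization, open two terminal windows: run the test case in one and monitor in real time in the other.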
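The two suggestions can be sketched roughly as follows. This is not an official recipe: `run_conv_repeatedly` and `sample_command` are hypothetical helper names, the iteration count is arbitrary, and the sketch assumes a MindSpore 1.x install where `context.set_context(device_target="Ascend")` is available and that the `npu-smi` tool is on the PATH of the monitoring terminal.

```python
import subprocess
import time


def run_conv_repeatedly(iterations=1000, device_id=0):
    """Terminal 1: run Conv2D many times on Ascend so the load lasts long enough to observe."""
    # Imported lazily so the monitoring helper below works without MindSpore installed.
    import numpy as np
    import mindspore
    from mindspore import Tensor, context, ops

    # Set the device target explicitly instead of relying on the default.
    context.set_context(mode=context.PYNATIVE_MODE,
                        device_target="Ascend", device_id=device_id)
    x = Tensor(np.ones([10, 32, 32, 32]), mindspore.float32)
    weight = Tensor(np.ones([32, 32, 3, 3]), mindspore.float32)
    conv2d = ops.Conv2D(out_channel=32, kernel_size=3)
    output = None
    for _ in range(iterations):
        output = conv2d(x, weight)
    # Copying the result to the host waits for the device work to finish.
    return output.asnumpy().shape


def sample_command(cmd, samples=5, interval=1.0):
    """Terminal 2: run a monitoring command repeatedly and collect its stdout."""
    outputs = []
    for _ in range(samples):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        outputs.append(result.stdout)
        time.sleep(interval)
    return outputs
```

For example, one terminal could call `run_conv_repeatedly()` in a single process while the other prints the results of `sample_command(["npu-smi", "info"], samples=10)`. Running the loop in one process also avoids several processes competing for memory on the same card.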

  • Original source: https://blog.csdn.net/weixin_45666880/article/details/126501684