N Pitfalls When Getting Started with Mask-RCNN (balloon dataset, TensorFlow-DirectML)

1. Preparation

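Mask-RCNN:
URL: https://github.com/matterport/Mask_RCNN
You can download the source archive directly (Mask_RCNN-master.zip).
balloon dataset:
URL: https://github.com/matterport/Mask_RCNN/releases/download/v2.1/balloon_dataset.zip
COCO weights:
URL: https://github.com/matterport/Mask_RCNN/releases/download/v2.0/mask_rcnn_coco.h5
Downloading this manually is optional, since the program fetches it automatically. If you do download it, place it in the Mask-RCNN root directory.
Install TensorFlow-DirectML:
Choose the TensorFlow build that matches your GPU. This setup runs Windows 10 on an AMD APU, so TensorFlow-DirectML is used.
Install the dependencies listed in requirements.txt:
Note: remove tensorflow and keras from requirements.txt before installing, then install keras 2.2.5 separately. The resulting versions are: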

tensorflow-directml==1.15.7
keras==2.2.5
numpy==1.18.5
scipy==1.7.3
pillow==9.2.0
cython==0.29.32
matplotlib==3.5.3
scikit-image==0.19.3
opencv-python==4.6.0.66
h5py==2.10.0
imgaug==0.4.0
IPython[all]==7.34.0
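
One way to confirm that an environment matches the pinned list above is to compare the installed versions against it. This is a minimal sketch and not part of the Mask_RCNN repository: the names `EXPECTED`, `installed_version`, and `find_mismatches` are hypothetical, and it assumes the packages were installed with pip so their metadata can be found.

```python
# Pins copied from the version list above (extend with the remaining
# entries as needed).
EXPECTED = {
    "tensorflow-directml": "1.15.7",
    "keras": "2.2.5",
    "numpy": "1.18.5",
    "scikit-image": "0.19.3",
    "h5py": "2.10.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        from importlib import metadata  # Python 3.8 and newer
    except ImportError:
        import pkg_resources  # fallback for Python 3.7 and older
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {name: (wanted, found)} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print("{}: expected {}, found {}".format(
            name, wanted, found or "not installed"))
```

Running it prints one line per package whose version differs from the pin, and nothing when everything matches.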

2. Training on the balloon dataset

Training on the balloon dataset uses samples\balloon\balloon.py. Before training, set the Parameters field of the Run Configuration as follows:

train
--dataset="X:\xxxx\balloon"
--weights=coco

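① dataset is the root directory of the balloon dataset. It can be an absolute path or a path relative to balloon.py.
② weights is the pretrained weights file used to initialize training. It can be an absolute path or a path relative to balloon.py. The value coco stands for mask_rcnn_coco.h5 in the root directory.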
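
The way the special value coco maps to a file can be pictured with a few lines of Python. This is a simplified sketch under the assumptions stated in the text above (coco means mask_rcnn_coco.h5 in the root directory, anything else is used as a path); `resolve_weights` is a hypothetical helper, not a function from balloon.py.

```python
import os


def resolve_weights(arg, root_dir):
    """Map the --weights argument to a weights file path."""
    if arg.lower() == "coco":
        # The keyword stands for the COCO weights in the project root.
        return os.path.join(root_dir, "mask_rcnn_coco.h5")
    # Any other value is treated as a path given by the user.
    return arg
```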

All h5 weights files produced by training are written to the logs directory, one h5 file per epoch.
You can use tensorboard to monitor the training run.
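
Since every epoch leaves another h5 file under logs, a small helper can pick the most recent one for testing. The sketch below is an illustration only: `find_latest_weights` is a hypothetical name, and it assumes that run folders carry a timestamp and checkpoint names carry a zero-padded epoch number, so that sorting by path is chronological.

```python
import os


def find_latest_weights(logs_dir):
    """Return the path of the newest .h5 file under logs_dir, or None."""
    candidates = []
    for folder, _subdirs, files in os.walk(logs_dir):
        for name in files:
            if name.endswith(".h5"):
                candidates.append(os.path.join(folder, name))
    # Assumption: timestamped folders and zero-padded epoch numbers make
    # the lexicographically largest path the most recent checkpoint.
    return max(candidates) if candidates else None
```

If several runs with different name prefixes share the same logs directory, sorting by path no longer reflects time and the folder should be chosen by hand.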

3. Testing

Testing also uses samples\balloon\balloon.py. Before testing, set the Parameters field of the Run Configuration as follows:

splash
--weights="X:\xxxx\xxxx.h5"
--image="X:\xxxx\xxxx.png"

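① weights is the trained weights file (an h5 file produced by training). It can be an absolute path or a path relative to balloon.py.
② image is the path of the test image. It can be an absolute path or a path relative to balloon.py.

The result image is saved in the samples\balloon directory. It is the test image converted to grayscale everywhere except the balloon regions, which keep their original color.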
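
The splash effect described above can be reproduced in a few lines of NumPy. This is a sketch of the idea, not the code from balloon.py: `color_splash_sketch` is a hypothetical name, the luminance weights are a common choice that is assumed here, and the mask is assumed to be a single (H, W) boolean array.

```python
import numpy as np


def color_splash_sketch(image, mask):
    """Keep color where mask is True, use grayscale elsewhere.

    image: (H, W, 3) uint8 RGB array.
    mask:  (H, W) bool array, True inside the detected objects.
    """
    # Luminance-weighted grayscale, truncated back to uint8.
    gray = (image @ np.array([0.2125, 0.7154, 0.0721])).astype(np.uint8)
    # Replicate the gray channel so it has the same shape as the image.
    gray3 = np.stack([gray, gray, gray], axis=-1)
    # Pick the original pixel inside the mask, the gray pixel outside.
    return np.where(mask[..., None], image, gray3)
```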

4. N pitfalls

4.1 resize in mrcnn\utils.py

Error message:

 File "X:\xxx\mrcnn\model.py", line 1709, in data_generator
    use_mini_mask=config.USE_MINI_MASK)
  File "X:\xxx\mrcnn\model.py", line 1280, in load_image_gt
    mask = utils.minimize_mask(bbox, mask, config.MINI_MASK_SHAPE)
  File "X:\xxx\mrcnn\utils.py", line 532, in minimize_mask
    m = resize(m, mini_shape)
  File "X:\xxx\mrcnn\utils.py", line 905, in resize
    anti_aliasing_sigma=anti_aliasing_sigma)
  File "X:\xxxx\lib\site-packages\skimage\transform\_warps.py", line 160, in resize
    order = _validate_interpolation_order(input_type, order)
  File "X:\xxxx\lib\site-packages\skimage\_shared\utils.py", line 725, in _validate_interpolation_order
 "Input image dtype is bool. Interpolation is not defined "
 ValueError: Input image dtype is bool. Interpolation is not defined with bool data type. 
 Please set order to 0 or explicitely cast input image to another data type.

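This is mainly caused by the scikit-image version: newer releases refuse to interpolate arrays of dtype bool. Either downgrade scikit-image to 0.16.2, or change resize in mrcnn\utils.py as follows: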

    # Cast to float first so that bool masks can be interpolated.
    imgf = image.astype(np.float32)

    if LooseVersion(skimage.__version__) >= LooseVersion("0.14"):
        # New in 0.14: anti_aliasing. Default it to False for backward
        # compatibility with skimage 0.13.
        return skimage.transform.resize(
            imgf, output_shape,
            order=order, mode=mode, cval=cval, clip=clip,
            preserve_range=preserve_range, anti_aliasing=anti_aliasing,
            anti_aliasing_sigma=anti_aliasing_sigma)
    else:
        return skimage.transform.resize(
            imgf, output_shape,
            order=order, mode=mode, cval=cval, clip=clip,
            preserve_range=preserve_range)
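
The fix above casts every input to float32. A narrower variant, shown here as a sketch only, casts just the bool arrays that trigger the error and leaves other images untouched. `to_interpolatable` is a hypothetical helper, and the rounding step at the end is an assumption about how a resized float mask is turned back into a bool mask.

```python
import numpy as np


def to_interpolatable(image):
    """Return a float32 copy of bool arrays; pass other dtypes through."""
    if image.dtype == np.bool_:
        return image.astype(np.float32)
    return image


mask = np.array([[True, False], [False, True]])
as_float = to_interpolatable(mask)            # float32 values 1.0 and 0.0
restored = np.around(as_float).astype(bool)   # threshold back to bool
```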

4.2 load_mask in samples\balloon\balloon.py

Error:

  File "X:\xxx\mrcnn\model.py", line 1709, in data_generator
    use_mini_mask=config.USE_MINI_MASK)
  File "X:\xxx\mrcnn\model.py", line 1212, in load_image_gt
    mask, class_ids = dataset.load_mask(image_id)
  File "X:\xxx/samples/balloon/balloon.py", line 176, in load_mask
    mask[rr, cc, i] = 1
  IndexError: index 1024 is out of bounds for axis 1 with size 1024

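This error occurs because some annotated polygon coordinates in the dataset lie outside the image. You can either re-annotate the images or change the code as follows: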

        for i, p in enumerate(info["polygons"]):
            # Get indexes of pixels inside the polygon and set them to 1
            rr, cc = skimage.draw.polygon(p['all_points_y'], p['all_points_x'])

            rr[rr > mask.shape[0] - 1] = mask.shape[0] - 1
            cc[cc > mask.shape[1] - 1] = mask.shape[1] - 1

            mask[rr, cc, i] = 1
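
The clamp above only guards the upper bound. A negative coordinate would not raise an error, because NumPy would silently index from the other end of the array. A sketch that clamps both sides with np.clip follows; `clamp_indices` is a hypothetical helper, not part of balloon.py.

```python
import numpy as np


def clamp_indices(rr, cc, shape):
    """Clamp row and column indices into the valid range of a mask."""
    height, width = shape[:2]
    rr = np.clip(rr, 0, height - 1)
    cc = np.clip(cc, 0, width - 1)
    return rr, cc
```

Clamping moves out-of-range pixels onto the image border. As far as I know, skimage.draw.polygon also accepts a shape argument that drops such pixels instead, which may be preferable when the annotation overshoots by more than a pixel or two.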

4.3 OOM

Error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating a buffer of 178192384 bytes

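OOM (Out Of Memory) is a common error during training: the program has run out of GPU memory. The usual fix is to reduce the batch size. Here, the GPU memory requirement can be lowered by changing the IMAGES_PER_GPU parameter of BalloonConfig in samples\balloon\balloon.py, as follows: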

IMAGES_PER_GPU = 1

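The AMD APU used here is an R5-4650G. After raising its video memory allocation from 512MB to 2GB, it can train Mask-RCNN, only slowly: with IMAGES_PER_GPU = 1 and the other parameters unchanged (STEPS_PER_EPOCH = 100), each epoch takes about 7 to 8 minutes.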
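
To see why IMAGES_PER_GPU is the setting to change, it helps to recall how it feeds into the batch size. The classes below are stand-ins written for this illustration, not the real mrcnn Config. They assume that the batch size is the product of IMAGES_PER_GPU and GPU_COUNT and that the starting value is 2.

```python
class ConfigSketch:
    """Stand-in for a training config (assumed defaults)."""
    GPU_COUNT = 1
    IMAGES_PER_GPU = 2

    def __init__(self):
        # Assumed relation: images per step = images per GPU * GPUs.
        self.BATCH_SIZE = self.IMAGES_PER_GPU * self.GPU_COUNT


class LowMemoryBalloonConfig(ConfigSketch):
    """Same config with one image per GPU to reduce memory use."""
    IMAGES_PER_GPU = 1


# Size of the single buffer from the error message, in MiB.
failed_buffer_mib = 178192384 / 2 ** 20
```

The failed allocation alone is close to 170 MiB, a large share of a 512MB allocation once the model weights and activations are counted, which is consistent with needing both the smaller batch and the larger video memory setting.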

4.4 Error loading the COCO weights file

Error:

  File "X:\xxx\samples\balloon\balloon.py", line 374, in <module>
    model.load_weights(weights_path, by_name=True)
  File "X:\xxx\mrcnn\model.py", line 2130, in load_weights
    saving.load_weights_from_hdf5_group_by_name(f, layers)
  File "X:\xxxx\lib\site-packages\keras\engine\saving.py", line 1290, in load_weights_from_hdf5_group_by_name
    str(weight_values[i].shape) + '.')
  ValueError: Layer #389 (named "mrcnn_bbox_fc"), weight  has shape (1024, 8), but the saved weight has shape (1024, 324).

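This error occurs because COCO has NUM_CLASSES = 1 + 80 while the balloon dataset has NUM_CLASSES = 1 + 1, so the class-dependent layers of the two networks have different shapes (the 324 and 8 in the message are 81 × 4 and 2 × 4). To use the COCO weights file as pretrained weights, it needs special handling: pass --weights=coco and do not give the file as a relative or absolute path. The reason is visible in this fragment of samples/balloon/balloon.py: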

    if args.weights.lower() == "coco":
        # Exclude the last layers because they require a matching
        # number of classes
        model.load_weights(weights_path, by_name=True, exclude=[
            "mrcnn_class_logits", "mrcnn_bbox_fc",
            "mrcnn_bbox", "mrcnn_mask"])
    else:
        model.load_weights(weights_path, by_name=True)
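
The shape mismatch and the effect of exclude can be worked through with plain Python. This is an illustrative sketch: `bbox_fc_units` and `layers_to_load` are hypothetical helpers, and the two backbone layer names in `all_layers` are examples only.

```python
def bbox_fc_units(num_classes):
    """Width of the box-regression head: 4 box values per class."""
    return num_classes * 4


def layers_to_load(layer_names, exclude=()):
    """Layer names that remain after dropping the excluded ones."""
    return [name for name in layer_names if name not in exclude]


# The four class-dependent layers excluded in the fragment above.
HEADS = ["mrcnn_class_logits", "mrcnn_bbox_fc", "mrcnn_bbox", "mrcnn_mask"]
# Example layer list: two backbone names plus the heads.
all_layers = ["conv1", "res2a_branch2a"] + HEADS
```

With 81 classes the head has 324 units and with 2 classes it has 8, which are exactly the two shapes in the error message. Excluding the four head layers leaves only class-independent layers to be loaded by name.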

4.5 "Converting sparse IndexedSlices to a dense Tensor of unknown shape" warning

Warning message:

X:\xxxx\lib\site-packages\tensorflow_core\python\framework\indexed_slices.py:424: 
UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. 
This may consume a large amount of memory."Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

This warning can be ignored. The program runs normally.

4.6 h5py version

Error:

  File "X:\xxxx\lib\site-packages\keras\engine\saving.py", line 1224, in load_weights_from_hdf5_group_by_name
    original_keras_version = f.attrs['keras_version'].decode('utf8')
  AttributeError: 'str' object has no attribute 'decode'

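The immediate cause is the difference in string handling between Python 2 and Python 3. In this setup, though, it is mainly an h5py version issue: h5py 3.x returns string attributes as str rather than bytes, so the .decode call fails. Pinning h5py to 2.10.0 resolves it.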
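
If pinning h5py is not possible, the same failure can in principle be avoided by decoding only when the value is actually bytes. This is a sketch of that idea and not what the fix above does: `as_text` is a hypothetical helper, and using it would mean patching the call site in keras, which is an assumption rather than a tested procedure.

```python
def as_text(value, encoding="utf8"):
    """Return value as str whether it arrives as bytes or as str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```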

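4.7 scipy.misc.imresize in the Mask R-CNN 2.1 release

Some people use the Mask R-CNN 2.1 release (downloadable from https://codeload.github.com/matterport/Mask_RCNN/zip/refs/tags/v2.1). Because of its age, this release depends on older package versions, so when installing, stay as close as possible to the recommended minimum versions.
Pay particular attention with this release: when you hit the scipy.misc.imresize error (below), the simplest fix is to pin scipy to 1.2.1, which also requires downgrading Pillow to 6.0.0.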

AttributeError: module 'scipy.misc' has no attribute 'imresize'

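Many posts online suggest replacing it with skimage.transform.resize. This is best avoided, because skimage.transform.resize has changed too much across skimage versions. In the worst case, training appears to run normally, but every prediction made with the resulting weights file fails at test time.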

Original article: https://blog.csdn.net/yangowen/article/details/126692736