When doing classification with deep learning, you often need cross-validation, and PyTorch currently ships no general-purpose utility for it. You can build it with sklearn's StratifiedKFold or KFold; StratifiedKFold splits the data according to the per-class sample counts, so with 5 folds every class is divided roughly 4:1 between training and validation.
A minimal example:
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
for i, (train_idx, val_idx) in enumerate(skf.split(imgs, labels)):
    trainset, valset = np.array(imgs)[train_idx], np.array(imgs)[val_idx]
    traintag, valtag = np.array(labels)[train_idx], np.array(labels)[val_idx]
```
The example above splits the whole imgs list together with its corresponding labels list: train_idx holds the indices of the training samples and val_idx the indices of the validation samples. The rest of the pipeline only needs to feed the resulting trainset and valset into a Dataset.
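To see the stratification in action, here is a quick self-contained check (the toy labels are made up purely for illustration) that prints the class counts in each validation fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 40 samples of class 0 and 10 of class 1 (imbalanced on purpose).
toy_labels = np.array([0] * 40 + [1] * 10)
toy_data = np.arange(len(toy_labels))

skf = StratifiedKFold(n_splits=5)
for fold, (tr, va) in enumerate(skf.split(toy_data, toy_labels)):
    counts = np.bincount(toy_labels[va])
    print(f"fold {fold}: val size={len(va)}, class counts={counts}")
# Every validation fold keeps the 4:1 class ratio: 8 of class 0, 2 of class 1.
```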
Next, I will walk through the whole process end to end on my own dataset, from reading the data to starting training. If your dataset is stored differently, just adapt the data-reading code; the key is how to obtain imgs and the matching labels.
My data is stored like this (each class is a folder, and the images belonging to that class live inside it):
"""A generic data loader where the images are arranged in this way: ::
root/dog/xxx.png
root/dog/xxy.png
root/dog/xxz.png
root/cat/123.png
root/cat/nsdf3.png
root/cat/asd932_.png
The following code collects imgs and the corresponding labels:
```python
import os

import numpy as np

IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png')


def is_image_file(filename):
    return filename.lower().endswith(IMG_EXTENSIONS)


def find_classes(root):
    # Each subfolder name is a class; map class name -> integer index.
    classes = [d.name for d in os.scandir(root) if d.is_dir()]
    classes.sort()
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx


if __name__ == "__main__":
    root = 'your root path'
    classes, class_to_idx = find_classes(root)
    imgs = []
    labels = []
    for target_class in sorted(class_to_idx.keys()):
        class_index = class_to_idx[target_class]
        target_dir = os.path.join(root, target_class)
        if not os.path.isdir(target_dir):
            continue
        # Walk the class folder (following symlinks) and collect image paths.
        for dirpath, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
            for fname in sorted(fnames):
                path = os.path.join(dirpath, fname)
                if is_image_file(path):
                    imgs.append(path)
                    labels.append(class_index)
```
In the code above you only need to change root to your own root path. Next we split all the data into 5 folds. I also wrote a MyDataset class (shown further below) that you can copy as-is.
```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)  # 5 folds
for i, (train_idx, val_idx) in enumerate(skf.split(imgs, labels)):
    trainset, valset = np.array(imgs)[train_idx], np.array(imgs)[val_idx]
    traintag, valtag = np.array(labels)[train_idx], np.array(labels)[val_idx]
    train_dataset = MyDataset(trainset, traintag, data_transforms['train'])
    val_dataset = MyDataset(valset, valtag, data_transforms['val'])
```
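The data_transforms dict is never defined in the snippets above. Here is one plausible torchvision version; the crop sizes and the ImageNet normalization statistics are my assumptions, not part of the original:

```python
from torchvision import transforms

# Hypothetical transform dict; sizes and ImageNet mean/std are assumptions.
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]),
}
```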
Here is the MyDataset class referenced above:

```python
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):

    def __init__(self, imgs, labels, transform=None, target_transform=None):
        self.imgs = imgs
        self.labels = labels
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        path = self.imgs[idx]
        target = self.labels[idx]

        # Open the file explicitly so the handle is closed after loading.
        with open(path, 'rb') as f:
            img = Image.open(f)
            img = img.convert('RGB')

        if self.transform:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target
```
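As a quick sanity check (assuming imgs, labels, and the data_transforms dict from the earlier snippets are in scope), you can fetch a single item:

```python
# Assumes imgs, labels, and data_transforms are defined as above.
dataset = MyDataset(np.array(imgs), np.array(labels), data_transforms['val'])
img, target = dataset[0]
print(img.shape, target)  # e.g. torch.Size([3, 224, 224]) and the class index
```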
With the datasets in place, you can create the dataloaders; what follows is ordinary training code:
```python
import torch.optim as optim
from sklearn.model_selection import StratifiedKFold
from torchvision.models import resnet18

skf = StratifiedKFold(n_splits=5)  # 5 folds
for i, (train_idx, val_idx) in enumerate(skf.split(imgs, labels)):
    trainset, valset = np.array(imgs)[train_idx], np.array(imgs)[val_idx]
    traintag, valtag = np.array(labels)[train_idx], np.array(labels)[val_idx]
    train_dataset = MyDataset(trainset, traintag, data_transforms['train'])
    val_dataset = MyDataset(valset, valtag, data_transforms['val'])
    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size,
                                                   shuffle=True, num_workers=args.workers)
    # Shuffling is unnecessary for validation, so shuffle=False here.
    test_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=args.batch_size,
                                                  shuffle=False, num_workers=args.workers)

    # Define the model (re-initialized for every fold).
    model = resnet18().cuda()
    # Define the loss.
    criterion = torch.nn.CrossEntropyLoss()
    # Observe that all parameters are being optimized.
    optimizer = optim.SGD(model.parameters(),
                          lr=args.lr,
                          momentum=args.momentum,
                          weight_decay=args.weight_decay)
    # train() and validate() are your own training and evaluation routines.
    for epoch in range(args.epoch):
        train_acc, train_loss = train(train_dataloader, model, criterion, args)
        test_acc, test_acc_top5, test_loss = validate(test_dataloader, model, criterion, args)
```
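The original stops here; if you want a single cross-validated number, one common pattern (my addition, not part of the original code) is to collect each fold's final validation accuracy and average it:

```python
import numpy as np

# Hypothetical per-fold final accuracies, collected inside the fold loop via
# fold_accs.append(test_acc) after the epoch loop; values below are placeholders.
fold_accs = [91.2, 90.5, 92.0, 89.8, 91.5]
print('5-fold mean accuracy: %.2f +/- %.2f' % (np.mean(fold_accs), np.std(fold_accs)))
```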
To guarantee that the split is identical on every run, keep shuffle=False (the default):

```python
StratifiedKFold(n_splits=5, shuffle=False)
```
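If you do want shuffled folds that are still reproducible, sklearn lets you combine shuffle=True with a fixed random_state (the seed value here is arbitrary):

```python
from sklearn.model_selection import StratifiedKFold

# shuffle=True randomizes fold membership; a fixed random_state makes it repeatable.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```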
That covers the basic implementation. The reason for doing k-fold in code rather than at the data level (e.g. pre-splitting the data into 5 fixed parts on disk) is that this approach tolerates samples being added or removed at any time: no one has to re-partition the data by hand, which is very convenient.