Dragon Arrow written by Tatsuya Nakaji, all rights reserved

RuntimeError: CUDA error: device-side assert triggered

Mar 01, 2020



General cause



This error usually has one of two causes:

  1. A mismatch between the number of labels/classes and the number of output units of the model.
  2. An incorrect input to the loss function.


The error message you get may not be very descriptive, because CUDA kernels run asynchronously and the error is often reported at a later, unrelated line. To get a complete and useful stack trace, set the following environment variable before anything else runs:

CUDA_LAUNCH_BLOCKING="1"

Export it as an environment variable in the shell that launches your code:
$ export CUDA_LAUNCH_BLOCKING="1"
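
If exporting a shell variable is inconvenient (for example, inside a notebook), the same variable can be set from Python. This is a minimal sketch, assuming it runs before CUDA is initialized; the safest place is the very first cell or line, before `torch` is imported.

```python
import os

# Set the variable before CUDA is initialized.
# Doing it before importing torch is the safest option.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set
```

If the notebook kernel has already used the GPU, restart it first so the setting takes effect.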



Cause in my case


In my case, a .ipynb_checkpoints directory inside the dataset folder was the cause.

mtcnn_detect_resized/
 ├ train/
 │ ├ REAL/
 │ │  ├ 8537.png
 │ │  └ ...
 │ ├ FAKE/
 │ │  ├ 2857.png
 │ │  └ ...
 │ ├ .ipynb_checkpoints



How I noticed

I checked the class labels of the training images and the validation images with the code below.


# load libraries
import os  # needed for os.path.join below
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import numpy as np
import torchvision
from torchvision import datasets, models, transforms

data_transforms = {
    'train': transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ]),
    'val': transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ]),
}


data_dir = './mtcnn_detect_resized'
image_datasets = {
    x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val']
}


dataloaders = {
    x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4, shuffle=True, num_workers=4) for x in ['train', 'val']
}


dataset_sizes = {
    x: len(image_datasets[x]) for x in ['train', 'val']
}


class_names = image_datasets['train'].classes

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


When I check the training classes, the labels are "FAKE" and "REAL":

image_datasets['train'].classes
['FAKE', 'REAL']


When I check the validation classes, the labels are "FAKE", "REAL", and a strange ".ipynb_checkpoints".

This is not a label I want to classify.

image_datasets['val'].classes
['.ipynb_checkpoints', 'FAKE', 'REAL']
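
Why this triggers the assert: ImageFolder builds its class list from the sorted subdirectory names, so an extra directory adds a class and shifts the label indices. With the three directories above, "REAL" becomes label 2, which is out of range for a model built with two output units. The check can be sketched with the standard library alone. `list_class_dirs` and `unexpected_class_dirs` are hypothetical helpers, not part of torchvision, and they only approximate how ImageFolder discovers classes.

```python
import os


def list_class_dirs(root):
    """Return the sorted subdirectory names of root (hypothetical helper)."""
    return sorted(entry.name for entry in os.scandir(root) if entry.is_dir())


def unexpected_class_dirs(root, expected):
    """Return the subdirectories of root that are not expected class names."""
    expected_set = set(expected)
    return [name for name in list_class_dirs(root) if name not in expected_set]
```

For the layout described in this post, `unexpected_class_dirs('./mtcnn_detect_resized/val', ['FAKE', 'REAL'])` would report the stray `.ipynb_checkpoints` directory before any training starts.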


Solution in my case


Search for .ipynb_checkpoints

mtcnn_detect_resized/val$ sudo find ./ -name .ipynb_checkpoints

./.ipynb_checkpoints


Delete .ipynb_checkpoints

mtcnn_detect_resized/val$ rm -rf ./.ipynb_checkpoints


After this, the error was gone.

This is the solution when a .ipynb_checkpoints directory gets picked up as an extra class by PyTorch.
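
Jupyter may recreate these directories in more than one place, so it can help to clean the whole dataset tree at once. Below is a minimal sketch using only the standard library; `remove_ipynb_checkpoints` is a hypothetical helper name, and it deletes directories permanently, so point it only at the dataset folder.

```python
import shutil
from pathlib import Path


def remove_ipynb_checkpoints(root):
    """Delete every .ipynb_checkpoints directory under root; return removed paths."""
    removed = []
    # Collect the matches first so deleting does not disturb the directory walk.
    for path in list(Path(root).rglob(".ipynb_checkpoints")):
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(str(path))
    return sorted(removed)
```

Calling `remove_ipynb_checkpoints('./mtcnn_detect_resized')` would cover both the train and val folders in one pass.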


What to check if this does not solve the problem


  1. Is the image shape correct?
  2. Are the labels/classes correct?
  3. Is the input to the loss function correct?
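
The second and third checks can be partly automated by validating the labels of a batch before they reach the loss function. The following is a minimal sketch in plain Python; `check_labels` is a hypothetical helper, and it assumes integer class labels that must lie in the range from 0 to the number of output units minus one.

```python
def check_labels(labels, num_outputs):
    """Raise ValueError if any label is outside the range [0, num_outputs)."""
    bad = sorted({int(label) for label in labels if not 0 <= int(label) < num_outputs})
    if bad:
        raise ValueError(
            f"labels {bad} are out of range for {num_outputs} output units"
        )
```

With PyTorch tensors, a call such as `check_labels(labels.tolist(), outputs.shape[1])` inside the training loop would fail with a readable message instead of the device-side assert. Here `labels` and `outputs` are assumed names for the target batch and the model output.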