Kernel not responding


#1

Hi guys,

So I am having an issue with my kernel not responding while training a CNN.

At first I thought the issue was a batch size that was too large, so I reduced it from 64 to 48, to 32, and then to 16, but the kernel still becomes unresponsive.

All of these runs were done with an image size of 256. I have trained the CNN at image size 128 with batch size 64 with no issue. Also, when training at image size 256 with any of the batch sizes above, the kernel always becomes unresponsive at 10% of the way through training.
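
For a rough sense of scale (just back-of-the-envelope pixel counts, ignoring the model itself), a batch of 16 at size 256 carries about the same raw pixel count as a batch of 64 at size 128, which is part of why I'm confused that one works and the other doesn't:

```python
# rough back-of-the-envelope pixel counts per batch (ignores model/activation overhead)
px_128 = 128 * 128               # pixels per image at size 128
px_256 = 256 * 256               # pixels per image at size 256

print(px_256 / px_128)           # 4.0 -> each 256 image has 4x the pixels of a 128 image
print(64 * px_128, 16 * px_256)  # 1048576 1048576 -> batch 64 @ 128 == batch 16 @ 256
```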

I am using the planet-understanding-the-amazon-from-space training dataset.

Hoping someone can provide some insight into this issue!

Thanks!


#2

Hi Isaac,

Can you tell us what your VM setup is and what the error message is?

I consistently ran into out-of-memory errors when I tried to train CNNs on a single-CPU VM; the problem went away when I added more CPUs and increased their memory.


#3

Hi James,

I am using an n1-highmem-4 on Google Cloud (4 vCPUs, 26 GB memory) with an NVIDIA Tesla K80 GPU.

Unfortunately there was no error message; the training progress bar just stopped at 10% every time.

I have had CUDA out-of-memory errors before, but that error did not appear in these runs.

Thanks for replying


#4

Alright, it seems like your vCPUs aren't the issue. If it stopped at 10%, it might be a faulty/corrupt image. Try running it on 5% of your training set and see what happens.
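
If you want to rule out a corrupt file directly first, something like this should flag any image that won't decode (just a rough sketch; I'm assuming the planet jpgs sit in a train-jpg folder, so adjust the path to your setup):

```python
import os
from PIL import Image

train_dir = 'train-jpg'  # assumed location of the planet training jpgs -- adjust to your setup

bad = []
for fname in sorted(os.listdir(train_dir)):
    path = os.path.join(train_dir, fname)
    try:
        with Image.open(path) as img:
            img.verify()            # quick integrity check (no full decode)
        Image.open(path).load()     # full decode to catch truncated files
    except Exception as e:
        bad.append((fname, e))

print(len(bad), 'problematic files')
print(bad[:10])
```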


#5

I did train the whole dataset at img_sz 128, though, and it worked out well, so I'm not sure how the resizing factors in. Do you mean just taking the first 5% of the images, by index, into a separate folder and then training on that?
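
Something like this, maybe? Instead of moving files I could just cut the labels CSV down to 5% and point the loader at that (rough sketch; I'm assuming the labels file is train_v2.csv):

```python
import pandas as pd

df = pd.read_csv('train_v2.csv')                 # planet labels file (assumed name)

small = df.iloc[:int(len(df) * 0.05)]            # first 5% by index
# small = df.sample(frac=0.05, random_state=42)  # or a random 5% instead

small.to_csv('train_v2_small.csv', index=False)  # then train against this smaller csv
```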


#6

Wanted to confirm whether it was a memory issue. Oh, cool that it worked with 128-sized images. If it's the fastai library you're using, I'm not too familiar with the internal workings either :confused:


#7

That sounds weird. Which architecture are you using?
Could you upload the code to a gist and share it here?


#8

Hi there, sorry for the delayed reply; my laptop crashed.

I tried using resnext101, resnet50, and resnet34. Every architecture stopped at around the same point.

It seems that when I remove the val_pct and seed parameters from the get_cv_idxs method and leave them at their defaults, the training works for a while.
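
Roughly what that change looks like in my notebook (old fastai 0.7 style imports; the val_pct/seed values below are just placeholders for what I had before, and the paths are from my setup):

```python
from fastai.dataset import get_cv_idxs   # old fastai (0.7) helper

PATH = 'data/planet/'                             # my data directory
n = len(list(open(PATH + 'train_v2.csv'))) - 1    # rows in the labels csv, minus header

# what I had before (explicit val_pct and seed) -- placeholder values:
# val_idxs = get_cv_idxs(n, val_pct=0.2, seed=42)

# what I'm running now (library defaults only):
val_idxs = get_cv_idxs(n)
```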

The VM hangs after one epoch though.

The following are the GPU and CPU stats when it crashes. After the epoch, the %CPU for python goes to 370 at times; is that normal?

Any ideas why?