Week-1 Jupyter notebook on laptop


#1

Hello All,

I have a laptop with an NVIDIA 940MX graphics card with 4GB of memory, so I thought I could run the code in the notebooks locally. I have Anaconda Python 3.6 + PyTorch + CUDA + cuDNN installed.

When I try to run the lesson-1.ipynb notebook (dogs vs. cats), my browser (and laptop) hangs at the “ConvLearner.pretrained(resnet34, data, precompute=True)” step. I see the message “Web socket timed out after 111585 ms” in the terminal. I tried different browsers (Chrome, Firefox), but the problem persists. I have also tried the fix from the following thread, but I still see it:
http://forums.fast.ai/t/very-slow-loading-of-the-convnet-pretrained-model-on-lesson-1/7170/4

To check whether my GPU is being used, I ran “nvidia-smi” in the terminal, and it shows the process “~/anaconda3/python” using 1289 MB, so I guess the code is NOT running on the CPU.
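A quick way to double-check from Python as well (both calls are standard PyTorch):

import torch
print(torch.cuda.is_available())    # True if PyTorch can use the CUDA device
print(torch.cuda.device_count())    # should be 1 on this laptop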

The dogs-vs-cats notebook runs fine with the small sample dir (10 images each), but I’m unable to run it with the full data, probably due to the web socket timeout issue. Can someone help me with this?


#2

I have a laptop running Linux with an NVIDIA GeForce 840M GPU with CUDA compute capability 5.0. It works fine for the lesson-1 notebook. I think your computer cannot handle the load and ‘hangs’ when you run the training.

The Jupyter notebook (client) uses a web socket (a protocol like HTTP) to communicate with the backend Python process. As your computer is overwhelmed by the training process, it is not responding to the Jupyter notebook’s requests. This is the reason you are seeing the “Web socket timed out …” message in the Jupyter notebook (not the terminal, as you mentioned). I think the Python training process on the full data will still run fine, even if the Jupyter notebook has hung. Maybe you should wait a little longer and see what happens next? Have you tried changing the batch size, bs, to a smaller number (see the example below)? In my case, I switched from a SATA hard drive to an SSD, and it helped immensely in resolving the ‘hang’ problem caused by the bottleneck of loading data from disk to the GPU.
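For example, something like this in the lesson-1 notebook (PATH and sz are the variables from the notebook; bs=16 is only an illustration, so tune it to what your 4GB card can handle):

# Shrink the batch size from the default (bs=64) so the GPU isn't overwhelmed.
arch = resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), bs=16)
learn = ConvLearner.pretrained(arch, data, precompute=True)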


#3

Thanks. I have a GeForce 940MX, which has CUDA compute capability 5.0 as well, and 4GB of memory, so it most likely isn’t a GPU issue AFAIK.

After I posted my query, I tried running the lesson-1 Python code inside a plain Python shell, and it hung as well (~15 min). On pressing Ctrl+C, the stack trace pointed to the function “predict_to_bcolz(m, gen, arr, workers=2)” in fastai/model.py, which also has a threading.Lock(). I guess the issue has something to do with this function. I remember seeing some discussion of this on the fast.ai forums (unable to find it now) :frowning:
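From what I can tell, the function looks roughly like this (a paraphrase from memory, NOT the exact fastai source; only the name, the signature, and the lock are confirmed by the stack trace):

import threading

def predict_to_bcolz(m, gen, arr, workers=2):
    # Precompute activations: run the model over the data loader and
    # append each batch of outputs to an on-disk bcolz array.
    lock = threading.Lock()    # serializes writes to the bcolz array
    m.eval()                   # inference mode
    for x, *_ in gen:          # gen is fed by the loader's worker processes
        y = m(x).data.cpu().numpy()
        with lock:             # only one writer at a time
            arr.append(y)
            arr.flush()        # persist the chunk to disk

If one of the loader’s worker processes stalls, this loop just sits there waiting for the next batch, which would look exactly like the hang I’m seeing.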

Also, I don’t have an SSD, so I’ll have to make do with the SATA drive. I’ll try changing the batch size.


#4

I think you are referring to this:

Hmmm… OK, I’m not sure why you’re seeing this.

According to the replies in the fast.ai forum, the random hang was due to a race condition caused by OpenCV. As far as I can tell, most of those bugs and problems have since been fixed in the fastai repo.

Have you tried updating your environment and the fastai repo to the latest version? You can do that by cd-ing into the fastai directory and pulling the latest fixes with a git pull.

If that doesn’t work, can you try these workarounds?

  • In your notebook, set num_workers to 0 to avoid spawning subprocesses, which circumvents the need for locking (although you’ll find that training takes much longer). Do this by passing num_workers=0 to the ImageClassifierData.from_paths() function:

    ImageClassifierData.from_paths(PATH, bs=2, tfms=tfms, num_workers=0)

  • Delete the tmp folder in the dogscats directory (/data/dogscats/tmp). You have to delete it every time before calling ImageClassifierData and ConvLearner.pretrained, e.g. with the snippet below.
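    For example (assuming PATH is the lesson-1 data path):

    import os, shutil
    tmp_dir = os.path.join(PATH, 'tmp')   # e.g. data/dogscats/tmp
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir)             # forces the precomputed activations to be regenerated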

Let me know how it goes.


#5

Thanks for the help. I went through the fast.ai forums and found that “num_workers” might be causing some issues. In the fastai/ dir, in the file dataset.py, the following line was causing the issue:

def from_paths(cls, path, bs=64, tfms=(None,None), trn_name='train', val_name='valid', test_name=None, num_workers=8):

I changed it to num_workers=2, so now it looks like this:

def from_paths(cls, path, bs=64, tfms=(None,None), trn_name='train', val_name='valid', test_name=None, num_workers=2):
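For anyone who would rather not edit the library source, the same effect can be had by overriding the default at the call site, as suggested above (PATH and tfms are the usual lesson-1 variables):

# Equivalent fix without touching dataset.py: pass num_workers per call.
data = ImageClassifierData.from_paths(PATH, tfms=tfms, num_workers=2)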

Now I see the code running, albeit very slowly (it took 15–16 min). I’ll try to write a post on the issue I faced and how I resolved it, so that other users may find it useful.

Thanks again.