Issue with torchvision::nms using custom Pytorch and TorchVision #181

mmaybeno · 2020-03-12T06:00:04Z

I've been trying to run some training jobs using the pytorch-training:1.4.0-cpu-py3 image and keep running into this error: RuntimeError: No such operator torchvision::nms. From what I can tell, it works if you uninstall the custom torch and torchvision packages and install the ones from PyPI. Comparing the two, it looks like torch is not loading the torchvision library.

https://github.com/aws/sagemaker-pytorch-container/blob/e87ca0714862ccdba4b380944db3d828cb8c7871/docker/1.4.0/py3/Dockerfile.cpu#L101

$ docker run --rm -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-cpu-py3 bash
root@e1e9293e2bd8:/# python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> torch.ops.torchvision.nms
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.6/site-packages/torch/_ops.py", line 61, in __getattr__
    op = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator torchvision::nms
>>> torch.ops.loaded_libraries
set()

After uninstalling the custom packages with pip and installing the ones from PyPI:

root@e1e9293e2bd8:/# python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> torch.ops.torchvision.nms
<built-in method nms of PyCapsule object at 0x7ff2ca8bd9f0>
>>> torch.ops.loaded_libraries
{'/opt/conda/lib/python3.6/site-packages/torchvision/_C.so'}
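
The two interactive sessions above can be folded into one small diagnostic script. This is only a sketch: the function name `probe_nms_operator` and the report keys are made up for illustration, and it relies on nothing beyond the `torch.ops.torchvision.nms` lookup and `torch.ops.loaded_libraries` shown in the transcripts. It only looks the operator up and never calls it.

```python
def probe_nms_operator():
    """Report whether the torchvision::nms custom operator is registered with torch.

    Hypothetical helper: the name and the report keys are illustrative only.
    """
    report = {
        "torch_version": None,
        "torchvision_version": None,
        "torchvision_error": None,
        "nms_registered": False,
        "loaded_libraries": [],
    }

    try:
        import torch
    except ImportError:
        return report  # torch is not installed, nothing more to check
    report["torch_version"] = torch.__version__

    try:
        # Importing torchvision is what normally loads its compiled
        # extension and registers the custom operators with torch.
        import torchvision
        report["torchvision_version"] = torchvision.__version__
    except Exception as exc:  # a broken build can fail with more than ImportError
        report["torchvision_error"] = repr(exc)

    try:
        torch.ops.torchvision.nms  # attribute lookup only, the op is not called
        report["nms_registered"] = True
    except (RuntimeError, AttributeError):
        # The 1.4.0 image raises RuntimeError ("No such operator ...");
        # AttributeError is caught as well in case other releases differ.
        pass

    # Shared libraries that registered operators, as in the sessions above.
    report["loaded_libraries"] = sorted(torch.ops.loaded_libraries)
    return report


if __name__ == "__main__":
    for key, value in probe_nms_operator().items():
        print(f"{key}: {value}")
```

On an affected image one would expect `nms_registered` to be False and `loaded_libraries` to be empty, matching the first session.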

I've been trying to build that image locally and am having some issues related to #141, but that is a separate problem I'm working through.


mmaybeno · 2020-03-12T06:12:41Z

I can confirm after building the image myself I still have the error.

mmaybeno · 2020-03-12T06:33:27Z

By process of elimination, pairing the AWS build of one package with the PyPI build of the other, the issue occurs when you have the AWS torch and the PyPI torchvision. So my guess is that it's something related to the AWS torch build.

harshp8l · 2020-03-31T17:38:47Z

We currently have a proposed fix for this, and we will have this fixed in the next release of PyTorch containers

mmaybeno · 2020-04-07T15:30:14Z

I tried the latest docker image (hash 1ffa39bc0201) and the issue is still there.

docker run -it --rm 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-cpu-py3 "python3.6 -c 'import torch;import torchvision;torch.ops.torchvision.nms'"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.6/site-packages/torch/_ops.py", line 61, in __getattr__
    op = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator torchvision::nms

harshp8l · 2020-04-07T23:27:38Z

The last release was a patch release for a different issue; the fix for this issue has not been merged yet.

mmaybeno · 2020-04-08T00:33:54Z

Got it. Thanks for the reply. Indeed that patch fixed another issue I had so thankful for that :).

mmhealey1 · 2020-04-24T14:52:11Z

@harshp8l Is there an ETA for when this will get pushed? I'm stuck without the ability to train our models, and this is crippling me and my company.

mmaybeno · 2020-04-24T15:04:13Z

@mmhealey1 if you need an immediate solution, I created a custom Docker image based on theirs:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-gpu-py3

RUN pip uninstall torch -y && pip install --no-cache-dir -U torch

harshp8l · 2020-04-24T20:45:56Z

Quick solution to help mitigate this (from within a container):
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
pip uninstall torchvision -y
git clone https://github.com/pytorch/vision.git
cd vision/
git checkout v0.5.0
python setup.py install

Let me know if you have any success with this

harshp8l · 2020-04-24T21:03:21Z

We are prioritizing this fix for our upcoming release for Pytorch 1.5

Vedaad-Shakib · 2020-07-14T20:38:53Z

@harshp8l Has this been fixed? I'm getting this issue as well

harshp8l · 2020-07-14T21:48:00Z

This should be fixed now. Which version of torch are you using? The fix went into later versions.

Vedaad-Shakib · 2020-07-14T21:51:37Z

@harshp8l I am using version 1.4.0. I tried setting the framework version to 1.5.1, and I get RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 64, 3, 3]], which is output 888 of BroadcastBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I assumed there was a backwards compatibility issue with the most recent version (1.5.1) and the model that I'm using, which was initially built using Pytorch 1.1.0.

harshp8l · 2020-07-14T22:02:07Z

PyTorch has noted some backwards compatibility issues with 1.5.1. Are you able to use 1.5.0?

Vedaad-Shakib · 2020-07-14T22:02:48Z

I will run with 1.5.0 and see if it works.

Vedaad-Shakib · 2020-07-15T01:28:09Z

1.5.0 has the same issue, but for some reason, when I use the 1.4.0 Sagemaker Pytorch docker container and then re-install Pytorch 1.1.0 it works fine. It's hacky and unclean but it's the only thing that I've been able to get to work.

mmaybeno · 2020-07-15T02:17:02Z

@Vedaad-Shakib it has something to do with their custom builds of torch and torchvision, so reinstalling just replaces them with the general distribution. I assume they have some custom optimizations in their packages.

harshp8l · 2020-07-15T04:36:03Z

What exactly is the issue you are running into here? Are you able to provide steps to reproduce?
Is it regarding porting code from 1.1.0 -> 1.5.0?

If it is the same error you mentioned above, can you provide the stack trace and run with: torch.autograd.set_detect_anomaly(True).

At first glance, this looks like an issue with in-place operations. Try disabling them on modules that support it (inplace=False), e.g. torch.nn.ReLU(inplace=False), or rewrite in-place expressions in out-of-place form, such as x += y -> x = x + y.
Note: This may impact performance / memory usage.
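
To make the in-place advice concrete, here is a minimal sketch of how this class of error arises and how the out-of-place form avoids it. It is not the reporter's model. The function name `demo_inplace_error` is hypothetical, and `exp()` is used only because its backward pass needs its own output, which makes the failure easy to trigger.

```python
def demo_inplace_error():
    """Show an in-place update breaking backward(), then the out-of-place fix.

    Hypothetical helper for illustration; returns None if torch is missing.
    """
    try:
        import torch
    except ImportError:
        return None

    # Failing variant: exp() saves its output to compute the gradient later,
    # and the in-place add overwrites that saved tensor.
    x = torch.zeros(3, requires_grad=True)
    y = x.exp()
    y += 1
    try:
        y.sum().backward()
        inplace_raised = False
    except RuntimeError:
        # "... has been modified by an inplace operation ..."
        inplace_raised = True

    # Out-of-place variant: y + 1 allocates a new tensor, so the value saved
    # by exp() stays intact and backward() succeeds.
    x = torch.zeros(3, requires_grad=True)
    y = x.exp()
    y = y + 1
    y.sum().backward()

    return {"inplace_raised": inplace_raised, "grad": x.grad.tolist()}


if __name__ == "__main__":
    print(demo_inplace_error())
```

Wrapping the failing backward call in `with torch.autograd.set_detect_anomaly(True):` should additionally point at the forward operation whose saved tensor was overwritten, at the cost of slower execution.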

I am noticing the torchvision op for nms being loaded on my end:

docker run --rm -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04 bash
root@8b0e5eaebf0e:/# python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> torch.ops.torchvision.nms
<built-in method nms of PyCapsule object at 0x7f6dcfda5bd0>
>>> torch.ops.loaded_libraries
{'/opt/conda/lib/python3.6/site-packages/torchvision/_C.so'}
>>> print(torch.__version__)
1.5.0
>>> print(torchvision.__version__)
0.6.0

mmaybeno · 2020-07-16T17:04:05Z

I can confirm that 1.5.0 fixes the torchvision::nms issue but it would be nice if the older image versions were also rebuilt with the fix :).

ydaiming · 2021-09-28T17:33:58Z

This issue is extensively discussed and summarized in this PyTorch issue.

For future reference, here's the quote:

The primary issue is resolved with a simple naming change (below, thanks to @feiyuhuahuo)

OLD (bad):
torch.ops.torchvision.nms(boxes, scores, iou_thres)
NEW (better):
import torchvision # top of file
torchvision.ops.nms(boxes, scores, iou_thres)
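
A self-contained version of the NEW form might look like the sketch below. The function name `nms_via_public_api` and the sample boxes and scores are invented for illustration, and boxes are assumed to be in (x1, y1, x2, y2) format. It only helps when `import torchvision` itself manages to load the compiled extension.

```python
def nms_via_public_api():
    """Run NMS through torchvision.ops.nms on three hand-made boxes.

    Hypothetical helper for illustration; returns None if either
    package is missing.
    """
    try:
        import torch
        import torchvision  # importing torchvision registers its custom ops
    except ImportError:
        return None

    # Boxes 0 and 1 overlap heavily; box 2 is far away from both.
    boxes = torch.tensor([
        [0.0, 0.0, 10.0, 10.0],
        [1.0, 1.0, 11.0, 11.0],
        [20.0, 20.0, 30.0, 30.0],
    ])
    scores = torch.tensor([0.9, 0.8, 0.7])
    iou_thres = 0.5

    # Returns the indices of the boxes that survive suppression,
    # ordered by decreasing score.
    keep = torchvision.ops.nms(boxes, scores, iou_thres)
    return keep.tolist()


if __name__ == "__main__":
    print(nms_via_public_api())
```

Boxes 0 and 1 have an IoU of 81/119 (about 0.68), which is above the 0.5 threshold, so the lower-scoring box 1 should be dropped while boxes 0 and 2 are kept.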

mmaybeno mentioned this issue Mar 15, 2020

Training with aws Sagemaker stuck if more than one epoch aws/sagemaker-python-sdk#1353

Closed

mmaybeno mentioned this issue Mar 30, 2020

Stalled PyTorch training job on SageMaker with custom image aws/sagemaker-python-sdk#1372

Closed

ajaykarpur closed this as completed Dec 17, 2020
