Issue with torchvision::nms using custom Pytorch and TorchVision #181

Closed
mmaybeno opened this issue Mar 12, 2020 · 20 comments

@mmaybeno

mmaybeno commented Mar 12, 2020

I've been trying to run some training jobs using the pytorch-training:1.4.0-cpu-py3 image and keep running into this error: RuntimeError: No such operator torchvision::nms. From what I can tell, it works if you uninstall the custom torch and torchvision packages and install the ones from PyPI. Comparing the two, it looks like torch is not loading the torchvision library.

https://github.com/aws/sagemaker-pytorch-container/blob/e87ca0714862ccdba4b380944db3d828cb8c7871/docker/1.4.0/py3/Dockerfile.cpu#L101

$ docker run --rm -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-cpu-py3 bash
root@e1e9293e2bd8:/# python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> torch.ops.torchvision.nms
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.6/site-packages/torch/_ops.py", line 61, in __getattr__
    op = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator torchvision::nms
>>> torch.ops.loaded_libraries
set()

After pip uninstall and install

root@e1e9293e2bd8:/# python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> torch.ops.torchvision.nms
<built-in method nms of PyCapsule object at 0x7ff2ca8bd9f0>
>>> torch.ops.loaded_libraries
{'/opt/conda/lib/python3.6/site-packages/torchvision/_C.so'}

I've been trying to build that image locally and am running into some problems related to #141, but that is a separate issue I'm working through.
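The interactive check above can be scripted so it is easy to run against different images. This is only a sketch: `nms_op_status` is a hypothetical helper name, and it relies on nothing beyond the `torch.ops.torchvision.nms` lookup and `torch.ops.loaded_libraries` shown in the sessions above. It catches both `RuntimeError` (what torch 1.4 raises here) and `AttributeError`, on the assumption that other torch versions may report a missing operator differently.

```python
import importlib


def nms_op_status():
    """Report whether torchvision's compiled NMS operator is registered with torch."""
    try:
        torch = importlib.import_module("torch")
        # Importing torchvision is what should load torchvision/_C.so.
        importlib.import_module("torchvision")
    except Exception as exc:  # ImportError, or a loader error from a broken build
        return {"available": False, "reason": "import failed: %r" % (exc,), "loaded": []}

    loaded = sorted(torch.ops.loaded_libraries)
    try:
        # Attribute access alone triggers the operator lookup.
        torch.ops.torchvision.nms
    except (RuntimeError, AttributeError) as exc:
        return {"available": False, "reason": str(exc), "loaded": loaded}
    return {"available": True, "reason": "", "loaded": loaded}


if __name__ == "__main__":
    print(nms_op_status())
```

On the failing image one would expect `available` to be False with an empty `loaded` list, matching the `set()` output above.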

@mmaybeno
Author

I can confirm that after building the image myself, I still get the error.

@mmaybeno
Author

By process of elimination (pairing the AWS build of one package with the PyPI build of the other), the issue occurs when you have the AWS torch and the PyPI torchvision. So my guess is that it's something related to the AWS torch build.
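When swapping builds like this, it helps to record which copy of each package is actually being imported. A minimal sketch, assuming only that the packages expose the usual `__version__` and `__file__` attributes; `describe_package` is a hypothetical helper, and the version string alone may not distinguish an AWS build from a PyPI one.

```python
import importlib


def describe_package(name):
    """Return the version and on-disk location of an importable package."""
    try:
        module = importlib.import_module(name)
    except Exception as exc:  # missing package or a build that fails to load
        return {"name": name, "installed": False, "error": repr(exc)}
    return {
        "name": name,
        "installed": True,
        "version": getattr(module, "__version__", None),
        "location": getattr(module, "__file__", None),
    }


if __name__ == "__main__":
    for pkg in ("torch", "torchvision"):
        print(describe_package(pkg))
```

Running it before and after each `pip uninstall` / `pip install` step gives a simple log of which combination was under test.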

@harshp8l

We currently have a proposed fix for this, and it will be included in the next release of the PyTorch containers.

@mmaybeno
Author

mmaybeno commented Apr 7, 2020

I tried the latest Docker image (hash 1ffa39bc0201) and the issue is still there.

docker run -it --rm 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-cpu-py3 "python3.6 -c 'import torch;import torchvision;torch.ops.torchvision.nms'"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.6/site-packages/torch/_ops.py", line 61, in __getattr__
    op = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator torchvision::nms

@harshp8l

harshp8l commented Apr 7, 2020

The last release was a patch release for a different issue; the fix for this issue has not been merged yet.

@mmaybeno
Author

mmaybeno commented Apr 8, 2020

Got it. Thanks for the reply. That patch did fix another issue I had, so I'm thankful for that :).

@mmhealey1

@harshp8l Is there an ETA for when this will get pushed? I'm stuck without the ability to train our models, and this is crippling me and my company.

@mmaybeno
Author

@mmhealey1 if you need an immediate solution, I created a custom Docker image based on theirs:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-gpu-py3

RUN pip uninstall torch -y && pip install --no-cache-dir -U torch

@harshp8l

harshp8l commented Apr 24, 2020

Quick solution to help mitigate this (from within a container):
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
pip uninstall torchvision -y
git clone https://github.com/pytorch/vision.git
cd vision/
git checkout v0.5.0
python setup.py install

Let me know if you have any success with this

@harshp8l

We are prioritizing this fix for our upcoming PyTorch 1.5 release.

@Vedaad-Shakib

@harshp8l Has this been fixed? I'm getting this issue as well

@harshp8l

This should be fixed now. Which version of torch are you using? The fix was made in later versions.

@Vedaad-Shakib

Vedaad-Shakib commented Jul 14, 2020

@harshp8l I am using version 1.4.0. I tried setting the framework version to 1.5.1, and I get RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 64, 3, 3]], which is output 888 of BroadcastBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I assumed there was a backwards compatibility issue between the most recent version (1.5.1) and the model that I'm using, which was initially built using PyTorch 1.1.0.

@harshp8l

PyTorch has noted some backwards compatibility issues with 1.5.1. Are you able to use 1.5.0?

@Vedaad-Shakib

I will run with 1.5.0 and see if it works.

@Vedaad-Shakib

1.5.0 has the same issue, but for some reason, when I use the 1.4.0 SageMaker PyTorch Docker container and then reinstall PyTorch 1.1.0, it works fine. It's hacky and unclean, but it's the only thing that I've been able to get to work.

@mmaybeno
Author

@Vedaad-Shakib it has something to do with their custom builds of torch and torchvision, so reinstalling just replaces them with the general distribution. I assume they have some custom optimizations in their packages.

@harshp8l

harshp8l commented Jul 15, 2020

What exactly is the issue you are running into here? Are you able to provide steps to reproduce?
Is it regarding porting code from 1.1.0 -> 1.5.0?

If it is the same error you mentioned above, can you provide the stack trace and run with: torch.autograd.set_detect_anomaly(True).

At an initial glance, this looks like a problem with in-place operations. Try turning them off where a module supports it (inplace=False, e.g. torch.nn.ReLU(inplace=False)), or rewrite in-place expressions in their out-of-place form (e.g. x += y becomes x = x + y).
Note: This may impact performance / memory usage.
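For models with many activation layers, flipping the flag by hand is tedious. The sketch below is one way to do it in bulk, assuming the model is a `torch.nn.Module` (so it has `.modules()`) and that the relevant layers expose an `inplace` attribute, as `torch.nn.ReLU` does. `disable_inplace` is a hypothetical helper name, and it does not touch in-place expressions such as `x += y` written inside a `forward` method; those still need to be edited manually.

```python
def disable_inplace(model):
    """Set inplace=False on every submodule that currently has inplace=True.

    `model` is expected to behave like torch.nn.Module: `model.modules()`
    must yield the model itself and all of its submodules.
    Returns the number of submodules that were switched.
    """
    changed = 0
    for module in model.modules():
        # Only touch modules that define the flag and have it enabled.
        if getattr(module, "inplace", False) is True:
            module.inplace = False
            changed += 1
    return changed
```

Typical use would be `disable_inplace(model)` once after constructing the model and before training, together with `torch.autograd.set_detect_anomaly(True)` to confirm whether the error goes away.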

I am noticing the torchvision op for nms being loaded on my end:

docker run --rm -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04 bash
root@8b0e5eaebf0e:/# python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> torch.ops.torchvision.nms
<built-in method nms of PyCapsule object at 0x7f6dcfda5bd0>
>>> torch.ops.loaded_libraries
{'/opt/conda/lib/python3.6/site-packages/torchvision/_C.so'}
>>> print(torch.__version__)
1.5.0
>>> print(torchvision.__version__)
0.6.0

@mmaybeno
Author

I can confirm that 1.5.0 fixes the torchvision::nms issue, but it would be nice if the older image versions were also rebuilt with the fix :).

@ydaiming

This issue is extensively discussed and summarized in this PyTorch issue.

For future reference, here's the quote:

The primary issue is resolved with a simple naming change (below, thanks to @feiyuhuahuo)

OLD (bad):
torch.ops.torchvision.nms(boxes, scores, iou_thres)
NEW (better):
import torchvision # top of file
torchvision.ops.nms(boxes, scores, iou_thres)
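For anyone unsure what the `nms(boxes, scores, iou_thres)` call computes, here is a pure-Python sketch of greedy non-maximum suppression. It is not torchvision's implementation, just a reference for the behavior: boxes are assumed to be `(x1, y1, x2, y2)` tuples, a box is dropped when its IoU with an already-kept, higher-scoring box is strictly greater than the threshold, and the result is the list of kept indices in decreasing score order. The names `iou` and `nms_reference` are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_reference(boxes, scores, iou_threshold):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms_reference(boxes, scores, 0.5))
```

In real code, `torchvision.ops.nms` remains the call to use; the sketch is only meant to make the arguments and return value concrete.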
