
Sagemaker Tensorflow_p36 kernel notebook not using GPU #476


Closed
Kuntal-G opened this issue Nov 13, 2018 · 30 comments

Comments

@Kuntal-G

Kuntal-G commented Nov 13, 2018

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
  • Framework Version: N/A
  • Python Version: 3.6.6
  • CPU or GPU: GPU (ml.p2.8xlarge)
  • Python SDK Version: N/A
  • Are you using a custom image: N/A

Describe the problem

I'm running a SageMaker notebook (ml.p2.8xlarge) and checking the number of GPUs available before running my TensorFlow code.
But when I checked the available devices from TensorFlow and Keras, no GPU information was shown; only CPU information was printed.

Running nvidia-smi from the notebook or the shell in SageMaker listed the GPUs properly. The PyTorch environment also detects the GPUs fine:
torch.cuda.get_device_name(0)

Upgrading TensorFlow with conda from the notebook and then restarting the notebook instance solved the problem, and GPU information now shows up correctly.

Now my questions are:

  1. Has the SageMaker notebook platform been properly tested with TensorFlow GPU settings, so that the notebook works with the GPU by default, without a manual upgrade or uninstall/install of the tensorflow-gpu package?

  2. Am I doing anything wrong? I launched a new SageMaker notebook instance, so it should be the latest that AWS provides.

  3. Why is the device placement log from TensorFlow not printed in the SageMaker notebook?
    tf.Session(config=tf.ConfigProto(log_device_placement=True))

  • Exact command to reproduce:
    Error/Issue before manual conda install/upgrade
import tensorflow as tf
tf.__version__

Output
'1.10.0'


from tensorflow.python.client import device_lib
device_lib.list_local_devices()

Output
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}


After the conda upgrade in the notebook (!y|conda install tensorflow-gpu):

import tensorflow as tf
tf.__version__

Output:
'1.11.0'

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

Output

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13414246756793993509, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11279561524
locality {
bus_id: 1
links {
link {
device_id: 1
type: "StreamExecutor"
strength: 1
}
link {
device_id: 2
type: "StreamExecutor"
strength: 1
}
link {
device_id: 3
type: "StreamExecutor"
strength: 1
}
link {
device_id: 4
type: "StreamExecutor"
strength: 1
}
link {
device_id: 5
type: "StreamExecutor"
strength: 1
}
link {
device_id: 6
type: "StreamExecutor"
strength: 1
}
link {
device_id: 7
type: "StreamExecutor"
strength: 1
}
}
}
incarnation: 9978466201706397067
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:17.0, compute capability: 3.7", name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 11279561524
locality {
bus_id: 1
links {
link {
type: "StreamExecutor"
strength: 1
} . . .


Validated the GPU before and after upgrade with nvidia-smi as well.
!nvidia-smi -l
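As a quick TF-free sanity check, the pasted device list can be scanned for real GPU entries (XLA_GPU entries are not the same thing). A minimal sketch, assuming the textual dump from device_lib.list_local_devices() is available as a string; the helper name is hypothetical, not part of any TensorFlow or SageMaker API:

```python
def count_gpus(device_dump):
    """Count real GPU entries in the textual output of device_lib.list_local_devices().

    Matches only device_type: "GPU", so XLA_GPU entries are not counted.
    """
    return device_dump.count('device_type: "GPU"')

dump = '''name: "/device:CPU:0"
device_type: "CPU"
name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
name: "/device:GPU:0"
device_type: "GPU"'''
print(count_gpus(dump))  # 1
```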

@laurenyu
Contributor

Hello @Kuntal-G, can you provide the code you are using to actually run the TensorFlow job?

@Kuntal-G
Author

Kuntal-G commented Nov 14, 2018

Hi @laurenyu

I won't be able to paste or upload the training code as it is confidential. But you can simulate the same issue with the simple TensorFlow example below, which I have just tested.

import tensorflow as tf

#automatic device placement
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

I'm not able to verify the device mapping, as the notebook is not showing the device mapping information.

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))

But when I try to manually assign a device to the same code using /gpu:0 or /gpu:1 in tf.device(), it fails because TensorFlow cannot find the GPU device.

Error:

InvalidArgumentError: Cannot assign a device for operation 'a_1': Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
	 [[Node: a_1 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2,3] values: [1 2 3]...>, _device="/device:GPU:0"]()]]


During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-7-2b4d579367fc> in <module>()
      6 sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
      7 # Runs the op.
----> 8 print(sess.run(c))
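One way to avoid hard failures like this is to choose the device from what is actually visible instead of hard-coding it (TensorFlow's allow_soft_placement=True session option serves a similar purpose). A minimal TF-free sketch of the fallback, with a hypothetical helper name:

```python
def pick_device(available, preferred="/device:GPU:0"):
    """Return the preferred device if it is visible, otherwise fall back to CPU.

    `available` would typically come from
    [d.name for d in device_lib.list_local_devices()].
    """
    return preferred if preferred in available else "/device:CPU:0"

print(pick_device(["/device:CPU:0"]))                   # /device:CPU:0
print(pick_device(["/device:CPU:0", "/device:GPU:0"]))  # /device:GPU:0
```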

Specifying /cpu:0 is working fine.

with tf.device('/cpu:0'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

print(sess.run(c))

Output

[[22. 28.]
 [49. 64.]]

And after I perform the manual conda install/upgrade of tensorflow-gpu, restart the notebook (which changes the version from 1.10.0 to 1.11.0), and then use tf.device("/gpu:0") or tf.device("/gpu:1") (depending on which p2 instance I'm using), the code works fine.

with tf.device('/gpu:1'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))

Output

[[22. 28.]
 [49. 64.]]

Doesn't it seem wrong that the default SageMaker notebook TensorFlow kernel does not recognize the underlying GPUs?

Could you please let me know why this is happening?

Let me know if you need any other information from my side.

@laurenyu
Contributor

@Kuntal-G thanks for sharing. To clarify, are you running this code directly in a notebook on a SageMaker Notebook Instance or are you running this as part of a SageMaker training job that runs remotely?

@Kuntal-G
Author

@laurenyu Running it as a SageMaker TensorFlow estimator training job seems to work well.
But when I try to check my code on the notebook instance itself with a sample dataset, the GPUs are not being used, and the code takes longer because it executes on CPUs only.

@laurenyu
Contributor

@Kuntal-G when you run it locally on the notebook instance, are you still using the SageMaker estimator (and using local mode) or not using any SageMaker/AWS SDKs at all?

@Kuntal-G
Author

Hi @laurenyu

I'm not using the SageMaker SDK estimator in local mode, because using TensorFlow with the SageMaker SDK requires returning an EstimatorSpec from model_fn(), and the EstimatorSpec class doesn't have any parameter/config setting for passing device-specific information through RunConfig (as shown below), which can be done with an Estimator without the SageMaker SDK (i.e., using TensorFlow alone).

run_config = tf.estimator.RunConfig().replace(
        session_config=tf.ConfigProto(log_device_placement=True,
                                      device_count={'gpu': 0}))
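For what it's worth, tf.ConfigProto also accepts allow_soft_placement=True, which lets TensorFlow fall back to the CPU instead of raising a placement error; an untested sketch against the TF 1.x API (not a SageMaker SDK feature):

```python
run_config = tf.estimator.RunConfig().replace(
        session_config=tf.ConfigProto(log_device_placement=True,
                                      allow_soft_placement=True))
```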

Can you please tell me how I can verify whether the SageMaker SDK with TensorFlow is utilizing the GPU, in local mode as well as in remote mode?

Also, shouldn't the SageMaker notebook environment be properly set up irrespective of whether I'm using EstimatorSpec with the SageMaker SDK or running TensorFlow code directly?

It also makes no sense that running the SageMaker SDK with TensorFlow in local mode would change the fact that TensorFlow is not able to identify GPUs in the notebook instance.

Could you please explain in a bit more detail?

@laurenyu
Contributor

@Kuntal-G understood. I don't think there's a built-in way you could do the analysis using the SageMaker Python SDK's local mode; I think you'd just have to run a separate process somewhere else to analyze.

I'm going to forward this issue to the team that owns the AMI and kernels for the SageMaker Notebook Instances and see if they have any insight into the issue you're experiencing. Thanks for your patience!

@scotttag

I'm seeing this too. TensorFlow models I had been running successfully on the GPU in tensorflow_p36 for a long time suddenly stopped working a few days ago; they now run on the CPU only.

A simple test in a new notebook on a p3 instance shows the issue:

from tensorflow.python.client import device_lib

device_lib.list_local_devices()

Running this on the kernel, both in the notebook and in the console, gives:

name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456

No GPU is listed, although nvidia-smi confirms a GPU is available:

sh-4.2$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB

Again, code that previously utilised the GPU no longer uses it.

@neelamgehlot

Hi @Kuntal-G,

Sorry for the inconvenience caused.

There was a bug on our end which has been fixed. Please restart your Notebook Instance to get the fix.

Thanks,
Neelam

@Kuntal-G
Author

Hi @neelamgehlot

Thank you for the information. I have validated it, and it is working fine.

@jasonachonu

jasonachonu commented Mar 22, 2019

@laurenyu
I also have issues running my Keras code on an AWS SageMaker notebook instance, using:
Python Version: 3.6
CPU or GPU: GPU (ml.p2.16xlarge)
Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
Framework Version: N/A
Python SDK Version: N/A
Are you using a custom image: N/A

Training is very slow, which makes me wonder if it's actually using any of the 16 GPUs.
Also, when I tried to use the Keras multi-GPU function to train, it became worse.
Is there a way to resolve this and speed up training?

Thanks

@laurenyu
Contributor

@jasonachonu just to confirm - this is code that you're running directly in your notebook (as opposed to part of a SageMaker training job)? If that is the case, I'll share this with the team that owns SageMaker Notebook Instances.

@jasonachonu

@laurenyu Yes, this is my own code that I wrote and am trying to run on an AWS SageMaker Notebook.

@laurenyu
Contributor

@jasonachonu thanks for confirming! I've reached out to the relevant SageMaker team with your issue.

@mckev-amazon
Contributor

Hi @jasonachonu - can you confirm that Keras can see the GPUs on the Notebook Instance? There was a bug in November 2018 (which was also resolved in November); I just want to confirm that the issue didn't re-occur in SageMaker Notebooks. You can check whether the GPU is in use by following the instructions here.

As for why multi-GPU training is slower: sadly, not all models benefit from multi_gpu. There is a good explanation of this in Keras GitHub issue #9204. You can consider increasing the batch size when enabling multi-GPU training; some users have reported that this improves their training performance.

@ghost

ghost commented Jul 20, 2019

@mckev-amazon Does the SageMaker notebook still not have kernels that support the GPU?
I have a problem similar to the one described above: if I run TensorFlow/Keras in a SageMaker notebook, the kernels do not come with tensorflow-gpu pre-installed.

@devforfu

I created a custom lifecycle configuration to create a new venv and install the tensorflow-gpu package there. But it would be great to have this work out of the box, without manual steps.

@mckev-amazon
Contributor

Hi @devforfu and @jazzman37 - I just tested this on a new p2.xlarge instance type without any custom Lifecycle Configuration, and ran the following cell in the conda_tensorflow_p36 kernel:

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

The response was:

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 16246156114757100454, name: "/device:XLA_GPU:0"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 12681908079210571448
 physical_device_desc: "device: XLA_GPU device", name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 3373592910440755984
 physical_device_desc: "device: XLA_CPU device", name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 11330115994
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 8216100504176947182
 physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7"]

This result shows that the GPUs are being properly detected by TensorFlow from the p2.xlarge's hardware. Are you getting the same or similar response in your SageMaker notebook instance?

@devforfu

devforfu commented Jul 26, 2019

Hey @mckev-amazon, the last time I tried a notebook instance, the GPUs weren't detected with the conda_tensorflow_p36 kernel (or any other kernel). Even when using the with tf.device('/gpu:0') directive to force GPU usage, it was still training on the CPU. Not sure if anything has changed since I last tried it, because now I mostly use a custom lifecycle configuration instead. (And I didn't try to list devices with device_lib.)

I'll try one more time without additional configurations to see if I have the same response as you.

@devforfu

Recently I tried one more time: I created a notebook instance of type ml.p3.2xlarge and picked the conda_amazonei_tensorflow_p36 kernel, using the following code to build a model:

import tensorflow as tf
from tensorflow.keras import layers as L, models  # assumed imports; the original snippet used L and models without defining them

def build_model(input_shape, n_classes):
    with tf.device('/gpu:0'):
        i = L.Input(shape=input_shape)
        x = L.Conv2D(32, 3, activation='relu')(i)
        x = L.Conv2D(64, 3, activation='relu')(x)
        x = L.MaxPool2D()(x)
        x = L.Flatten()(x)
        x = L.Dense(128, activation='relu')(x)
        x = L.Dropout(0.5)(x)
        x = L.Dense(n_classes, activation='softmax')(x)
        m = models.Model(inputs=i, outputs=x)
        m.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return m

However, it ignores the GPU: the nvidia-smi utility reports that no GPU memory is used. Re-installing the package makes things work as expected.

~/anaconda3/envs/amazonei_tensorflow_p36/bin/python -m pip uninstall tensorflow
~/anaconda3/envs/amazonei_tensorflow_p36/bin/python -m pip install tensorflow-gpu

@ghost

ghost commented Jul 26, 2019

Same for me; I can only make the GPU work in a SageMaker notebook if I install a separate env with keras-gpu or tensorflow-gpu.

@mckev-amazon
Contributor

@devforfu thanks for trying this out! Really appreciate your help in debugging this issue. Can you try the same with the conda_tensorflow_p36 kernel? The *amazonei* kernels are specific to the Elastic Inference-enabled notebook instances, which might have unintended effects if you're not using one.

@devforfu

devforfu commented Jul 30, 2019

@mckev-amazon Yeah, you're totally right. The conda_tensorflow_p36 kernel seems to use the GPU properly right out of the box, without any additional changes. I should have done a more thorough investigation before creating a custom environment.

[screenshot: GPU utilization]

@m1sta

m1sta commented Aug 1, 2019

Would anyone here mind posting a quick summary of what is and isn't working when it comes to SageMaker and Tensorflow on GPU?

@ghost

ghost commented Aug 1, 2019

@mckev-amazon it's working.

@mckev-amazon
Contributor

@jazzman37 and @devforfu great to hear! If you continue to have issues with GPU on SageMaker notebooks, please open a new issue as the original issue reported has been resolved.

Thanks for using SageMaker!

@nectario

I am having a similar issue. I opened: #1346

@nectario

> @devforfu thanks for trying this out! Really appreciate your help in debugging this issue. Can you try the same with the conda_tensorflow_p36 kernel? The *amazonei* kernels are specific to the Elastic Inference-enabled notebook instances, which might have unintended effects if you're not using one.

So how do I enable GPU using amazonei?

@dleen

dleen commented Mar 20, 2020

Hi nectario,

You can find a tutorial here: https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html

metrizable pushed a commit to metrizable/sagemaker-python-sdk that referenced this issue Dec 1, 2020
@Tylersuard

I may be having the same problem, but I'm not sure. When I train my model using Google Colab, I complete one epoch every 45 minutes. Now I am using the same code in a SageMaker notebook (just training on the local notebook instance). I am using an ml.p2.xlarge instance, which comes with a GPU, and I have followed the above instructions to make sure that Keras sees a GPU. However, my model is training extremely slowly: it takes over 5.5 hours to train one epoch! I am not sure if AWS just gave me a very slow GPU, or if my GPU is not being used during training.

I ran this code:
tf.config.experimental.list_physical_devices('GPU')

and my output was:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Does that mean my GPU is currently in use, and it is just really slow?
Thank you for your help.
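A device being listed only means it is visible; whether it is actually busy shows up as utilization. One way to check during training is to poll nvidia-smi (the --query-gpu flags are standard nvidia-smi options; the helper itself is a hypothetical sketch, with the command injectable so the parsing can be exercised without a GPU):

```python
import subprocess

def gpu_utilization(cmd=("nvidia-smi", "--query-gpu=utilization.gpu",
                         "--format=csv,noheader,nounits")):
    """Return per-GPU utilization percentages by querying nvidia-smi."""
    out = subprocess.run(list(cmd), capture_output=True,
                         text=True, check=True).stdout
    return [int(tok) for tok in out.split()]

# Poll this in a loop while training: a visible-but-idle GPU stays near 0%.
```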
