
Sagemaker Tensorflow_p36 kernel notebook not using GPU #476


Closed
Kuntal-G opened this issue Nov 13, 2018 · 30 comments

Comments

@Kuntal-G

Kuntal-G commented Nov 13, 2018

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
  • Framework Version: N/A
  • Python Version: 3.6.6
  • CPU or GPU: GPU (ml.p2.8xlarge)
  • Python SDK Version: N/A
  • Are you using a custom image: N/A

Describe the problem

I'm running a SageMaker notebook (ml.p2.8xlarge) and checking the number of GPUs available before running my TensorFlow code.
But when I checked the available devices from TensorFlow and Keras, no GPU information was shown; only CPU information was printed.

Running nvidia-smi from the notebook or the shell in SageMaker listed the GPUs properly. The PyTorch environment also detects the GPUs fine:
torch.cuda.get_device_name(0)

Upgrading TensorFlow with conda from the notebook and then restarting the notebook instance solved the problem, and GPU information now shows up correctly.

Now my questions are:

  1. Has the SageMaker notebook platform been properly tested with TensorFlow GPU settings, so that the notebook works with the GPU by default, without a manual upgrade or uninstall/install of the tensorflow-gpu package?

  2. Am I doing anything wrong? I launched a new SageMaker notebook instance, so it should be the latest that AWS provides.

  3. Why is the device placement log from TensorFlow not printed in the SageMaker notebook?
    tf.Session(config=tf.ConfigProto(log_device_placement=True))

  • Exact command to reproduce:
    Error/Issue before manual conda install/upgrade
import tensorflow as tf
tf.__version__

Output
'1.10.0'


from tensorflow.python.client import device_lib
device_lib.list_local_devices()

Output
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}


After the conda upgrade in the notebook (!y|conda install tensorflow-gpu):

import tensorflow as tf
tf.__version__

Output:
'1.11.0'

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

Output

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13414246756793993509, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11279561524
locality {
bus_id: 1
links {
link {
device_id: 1
type: "StreamExecutor"
strength: 1
}
link {
device_id: 2
type: "StreamExecutor"
strength: 1
}
link {
device_id: 3
type: "StreamExecutor"
strength: 1
}
link {
device_id: 4
type: "StreamExecutor"
strength: 1
}
link {
device_id: 5
type: "StreamExecutor"
strength: 1
}
link {
device_id: 6
type: "StreamExecutor"
strength: 1
}
link {
device_id: 7
type: "StreamExecutor"
strength: 1
}
}
}
incarnation: 9978466201706397067
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:17.0, compute capability: 3.7", name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 11279561524
locality {
bus_id: 1
links {
link {
type: "StreamExecutor"
strength: 1
} . . .


Validated the GPU before and after upgrade with nvidia-smi as well.
!nvidia-smi -l
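As a quick TF-free sanity check, the pasted device list can be scanned for real GPU entries (XLA_GPU entries are not the same thing). A minimal sketch, assuming the textual dump from device_lib.list_local_devices() is available as a string; the helper name is hypothetical, not part of any TensorFlow or SageMaker API:

```python
def count_gpus(device_dump):
    """Count real GPU entries in the textual output of device_lib.list_local_devices().

    Matches only device_type: "GPU", so XLA_GPU entries are not counted.
    """
    return device_dump.count('device_type: "GPU"')

dump = '''name: "/device:CPU:0"
device_type: "CPU"
name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
name: "/device:GPU:0"
device_type: "GPU"'''
print(count_gpus(dump))  # 1
```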

@laurenyu
Contributor

Hello @Kuntal-G, can you provide the code you are using to actually run the TensorFlow job?

@Kuntal-G
Author

Kuntal-G commented Nov 14, 2018

Hi @laurenyu

I won't be able to paste or upload the training code as it is confidential. But you can simulate the same issue with the simple TensorFlow example below, which I have just tested.

import tensorflow as tf

#automatic device placement
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

I'm not able to verify the device mapping, as the notebook is not showing the device mapping information.

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))

But when I try to manually assign a device to the same code using /gpu:0 or /gpu:1 in tf.device(), it fails because TensorFlow cannot find the GPU device.

Error:

InvalidArgumentError: Cannot assign a device for operation 'a_1': Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
	 [[Node: a_1 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2,3] values: [1 2 3]...>, _device="/device:GPU:0"]()]]


During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-7-2b4d579367fc> in <module>()
      6 sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
      7 # Runs the op.
----> 8 print(sess.run(c))
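One way to avoid hard failures like this is to choose the device from what is actually visible instead of hard-coding it (TensorFlow's allow_soft_placement=True session option serves a similar purpose). A minimal TF-free sketch of the fallback, with a hypothetical helper name:

```python
def pick_device(available, preferred="/device:GPU:0"):
    """Return the preferred device if it is visible, otherwise fall back to CPU.

    `available` would typically come from
    [d.name for d in device_lib.list_local_devices()].
    """
    return preferred if preferred in available else "/device:CPU:0"

print(pick_device(["/device:CPU:0"]))                   # /device:CPU:0
print(pick_device(["/device:CPU:0", "/device:GPU:0"]))  # /device:GPU:0
```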

Specifying /cpu:0 is working fine.

with tf.device('/cpu:0'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

print(sess.run(c))

Output

[[22. 28.]
 [49. 64.]]

And after I perform the manual conda install/upgrade of tensorflow-gpu, restart the notebook (which changes the version from 1.10.0 to 1.11.0), and then use tf.device("/gpu:0") or tf.device("/gpu:1") (depending on which p2 instance I'm using), the code works fine.

with tf.device('/gpu:1'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))

Output

[[22. 28.]
 [49. 64.]]

Doesn't it seem wrong that the default SageMaker notebook TensorFlow kernel does not recognize the underlying GPUs?

Could you please let me know why this is happening?

Let me know if you need any other information from my side.

@laurenyu
Contributor

@Kuntal-G thanks for sharing. To clarify, are you running this code directly in a notebook on a SageMaker Notebook Instance or are you running this as part of a SageMaker training job that runs remotely?

@Kuntal-G
Author

@laurenyu Running it as a SageMaker TensorFlow estimator training job seems to work well.
But when I try to check my code on the notebook instance itself with a sample dataset, the GPUs are not being used, and the code takes longer because it executes on CPUs only.

@laurenyu
Contributor

@Kuntal-G when you run it locally on the notebook instance, are you still using the SageMaker estimator (and using local mode) or not using any SageMaker/AWS SDKs at all?

@Kuntal-G
Author

Hi @laurenyu

I'm not using the SageMaker SDK estimator in local mode, because using TensorFlow with the SageMaker SDK requires returning an EstimatorSpec from model_fn(), and the EstimatorSpec class doesn't have any parameter/config setting for passing device-specific information through RunConfig (as shown below), which can be done with an Estimator without the SageMaker SDK (i.e., using TensorFlow alone).

run_config = tf.estimator.RunConfig().replace(
        session_config=tf.ConfigProto(log_device_placement=True,
                                      device_count={'gpu': 0}))
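For what it's worth, tf.ConfigProto also accepts allow_soft_placement=True, which lets TensorFlow fall back to the CPU instead of raising a placement error; an untested sketch against the TF 1.x API (not a SageMaker SDK feature):

```python
run_config = tf.estimator.RunConfig().replace(
        session_config=tf.ConfigProto(log_device_placement=True,
                                      allow_soft_placement=True))
```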

Can you please tell me how I can verify whether the SageMaker SDK with TensorFlow is utilizing the GPU, in local mode as well as in remote mode?

Also, shouldn't the SageMaker notebook environment be properly set up irrespective of whether I'm using EstimatorSpec with the SageMaker SDK or running TensorFlow code directly?

It also makes no sense that running the SageMaker SDK with TensorFlow in local mode would change the fact that TensorFlow is not able to identify GPUs in the notebook instance.

Could you please explain in a bit more detail?

@laurenyu
Contributor

@Kuntal-G understood. I don't think there's a built-in way you could do the analysis using the SageMaker Python SDK's local mode; I think you'd just have to run a separate process somewhere else to analyze.

I'm going to forward this issue to the team that owns the AMI and kernels for the SageMaker Notebook Instances and see if they have any insight into the issue you're experiencing. Thanks for your patience!

@scotttag

I'm seeing this too. TensorFlow models I had been running successfully on the GPU in tensorflow_p36 for a long time suddenly stopped working a few days ago; they now run on the CPU only.

A simple test in a new notebook on a p3 instance shows the issue:

from tensorflow.python.client import device_lib

device_lib.list_local_devices()

Running this on the kernel, both in the notebook and in the console, gives:

name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456

No GPU is listed, although nvidia-smi confirms a GPU is available:

sh-4.2$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB

Again, code that previously utilised the GPU no longer uses it.

@neelamgehlot

Hi @Kuntal-G,

Sorry for the inconvenience caused.

There was a bug on our end which has been fixed. Please restart your Notebook Instance to get the fix.

Thanks,
Neelam

@Kuntal-G
Author

Hi @neelamgehlot

Thank you for the information. I have validated it, and it is working fine.

@jasonachonu

jasonachonu commented Mar 22, 2019

@laurenyu
I also have issues running my Keras code on an AWS SageMaker notebook instance, using:
Python Version: 3.6
CPU or GPU: GPU (ml.p2.16xlarge)
Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
Framework Version: N/A
Python SDK Version: N/A
Are you using a custom image: N/A

Training is very slow, which makes me wonder if it's actually using any of the 16 GPUs.
Also, when I tried to use the Keras multi-GPU function to train, it became worse.
Is there a way to resolve this and speed up training?

Thanks

@laurenyu
Contributor

@jasonachonu just to confirm - this is code that you're running directly in your notebook (as opposed to part of a SageMaker training job)? If that is the case, I'll share this with the team that owns SageMaker Notebook Instances.

@jasonachonu

@laurenyu Yes, this is my own code that I wrote and am trying to run on an AWS SageMaker Notebook.

@laurenyu
Contributor

@jasonachonu thanks for confirming! I've reached out to the relevant SageMaker team with your issue.

@mckev-amazon
Contributor

Hi @jasonachonu - can you confirm that Keras can see the GPUs on the Notebook Instance? There was a bug in November 2018 (which was also resolved in November); I just want to confirm that the issue didn't re-occur in SageMaker Notebooks. You can check whether the GPU is in use by following the instructions here.

As for why multi-GPU training is slower: sadly, not all models benefit from multi_gpu. There is a good explanation of this in Keras GitHub issue #9204. You can consider increasing the batch size when enabling multi-GPU training; some users have reported that this improves their training performance.

@ghost

ghost commented Jul 20, 2019

@mckev-amazon Does the SageMaker notebook still not have kernels that support the GPU?
I have a problem similar to the one described above: if I run TensorFlow/Keras in a SageMaker notebook, the kernels do not come with tensorflow-gpu pre-installed.

@devforfu

I created a custom lifecycle configuration to create a new venv and install the tensorflow-gpu package there. But it would be great to have this work out of the box, without manual steps.

@mckev-amazon
Contributor

Hi @devforfu and @jazzman37 - I just tested this on a new p2.xlarge instance type without any custom Lifecycle Configuration, and ran the following cell in the conda_tensorflow_p36 kernel:

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

The response was:

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 16246156114757100454, name: "/device:XLA_GPU:0"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 12681908079210571448
 physical_device_desc: "device: XLA_GPU device", name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 3373592910440755984
 physical_device_desc: "device: XLA_CPU device", name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 11330115994
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 8216100504176947182
 physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7"]

This result shows that the GPUs are being properly detected by TensorFlow from the p2.xlarge's hardware. Are you getting the same or similar response in your SageMaker notebook instance?

@devforfu

devforfu commented Jul 26, 2019

Hey @mckev-amazon, the last time I tried a notebook instance, the GPUs weren't detected with the conda_tensorflow_p36 kernel (or any other kernel). Even when using the with tf.device('/gpu:0') directive to force GPU usage, it was still training on the CPU. Not sure if anything has changed since I last tried it, because now I mostly use a custom lifecycle configuration instead. (And I didn't try to list devices with device_lib.)

I'll try one more time without additional configurations to see if I have the same response as you.

@devforfu

Recently I tried one more time: I created a notebook instance of type ml.p3.2xlarge and picked the conda_amazonei_tensorflow_p36 kernel, using the following code to build a model:

import tensorflow as tf
from tensorflow.keras import layers as L, models  # assumed imports; the original snippet used L and models without defining them

def build_model(input_shape, n_classes):
    with tf.device('/gpu:0'):
        i = L.Input(shape=input_shape)
        x = L.Conv2D(32, 3, activation='relu')(i)
        x = L.Conv2D(64, 3, activation='relu')(x)
        x = L.MaxPool2D()(x)
        x = L.Flatten()(x)
        x = L.Dense(128, activation='relu')(x)
        x = L.Dropout(0.5)(x)
        x = L.Dense(n_classes, activation='softmax')(x)
        m = models.Model(inputs=i, outputs=x)
        m.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return m

However, it ignores the GPU: the nvidia-smi utility reports that no GPU memory is used. Re-installing the package makes things work as expected.

~/anaconda3/envs/amazonei_tensorflow_p36/bin/python -m pip uninstall tensorflow
~/anaconda3/envs/amazonei_tensorflow_p36/bin/python -m pip install tensorflow-gpu

@ghost

ghost commented Jul 26, 2019

Same for me; I can only make the GPU work in a SageMaker notebook if I install a separate env with keras-gpu or tensorflow-gpu.

@mckev-amazon
Contributor

@devforfu thanks for trying this out! Really appreciate your help in debugging this issue. Can you try the same with the conda_tensorflow_p36 kernel? The *amazonei* kernels are specific to the Elastic Inference-enabled notebook instances, which might have unintended effects if you're not using one.

@devforfu

devforfu commented Jul 30, 2019

@mckev-amazon Yeah, you're totally right. The conda_tensorflow_p36 kernel seems to use the GPU properly right out of the box, without any additional changes. I should have done a more thorough investigation before creating a custom environment.

[screenshot: GPU utilization]

@m1sta

m1sta commented Aug 1, 2019

Would anyone here mind posting a quick summary of what is and isn't working when it comes to SageMaker and Tensorflow on GPU?

@ghost

ghost commented Aug 1, 2019

@mckev-amazon it's working.

@mckev-amazon
Contributor

@jazzman37 and @devforfu great to hear! If you continue to have issues with GPU on SageMaker notebooks, please open a new issue as the original issue reported has been resolved.

Thanks for using SageMaker!

@nectario

I am having a similar issue. I opened: #1346

@nectario

> @devforfu thanks for trying this out! Really appreciate your help in debugging this issue. Can you try the same with the conda_tensorflow_p36 kernel? The *amazonei* kernels are specific to the Elastic Inference-enabled notebook instances, which might have unintended effects if you're not using one.

So how do I enable GPU using amazonei?

@dleen

dleen commented Mar 20, 2020

Hi nectario,

You can find a tutorial here: https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html

metrizable pushed a commit to metrizable/sagemaker-python-sdk that referenced this issue Dec 1, 2020
@Tylersuard

I may be having the same problem, but I'm not sure. When I train my model using Google Colab, I complete one epoch every 45 minutes. Now I am using the same code in a SageMaker notebook (just training on the local notebook instance). I am using an ml.p2.xlarge instance, which comes with a GPU, and I have followed the above instructions to make sure that Keras sees a GPU. However, my model is training extremely slowly: it takes over 5.5 hours to train one epoch! I am not sure if AWS just gave me a very slow GPU, or if my GPU is not being used during training.

I ran this code:
tf.config.experimental.list_physical_devices('GPU')

and my output was:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Does that mean my GPU is currently in use, and it is just really slow?
Thank you for your help.
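A device being listed only means it is visible; whether it is actually busy shows up as utilization. One way to check during training is to poll nvidia-smi (the --query-gpu flags are standard nvidia-smi options; the helper itself is a hypothetical sketch, with the command injectable so the parsing can be exercised without a GPU):

```python
import subprocess

def gpu_utilization(cmd=("nvidia-smi", "--query-gpu=utilization.gpu",
                         "--format=csv,noheader,nounits")):
    """Return per-GPU utilization percentages by querying nvidia-smi."""
    out = subprocess.run(list(cmd), capture_output=True,
                         text=True, check=True).stdout
    return [int(tok) for tok in out.split()]

# Poll this in a loop while training: a visible-but-idle GPU stays near 0%.
```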
