Sagemaker Tensorflow_p36 kernel notebook not using GPU #476
hello @Kuntal-G, can you provide the code you are using to actually run the TensorFlow job? |
Hi @laurenyu I won't be able to paste or upload the training code as it is confidential. But you can simulate the same with a simple tensorflow example below that I have tested just now.
I'm not able to verify the device mapping, as the notebook is not showing the device-mapping information.
But when I try to manually assign a device to the same code using /gpu:0 or /gpu:1 in tf.device(), it fails because TensorFlow is not able to find the GPU device. Error:
Specifying /cpu:0 works fine.
Output
And after I perform the manual
Output
Doesn't it seem odd that the default SageMaker notebook TensorFlow kernel is not recognizing the underlying GPUs? Could you please let me know why this is happening? Let me know if you need any other information from my side. |
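The failing manual-placement test described above can be sketched roughly like this (a hypothetical TF 1.x reconstruction, not the reporter's confidential code; the helper name is made up). With no GPU visible to the runtime, pinning to `/gpu:0` raises an error unless soft placement lets TensorFlow fall back to the CPU:

```python
def run_pinned(device='/gpu:0', soft=False):
    """Pin a trivial op to `device` (TF 1.x style). With no visible GPU,
    '/gpu:0' fails with InvalidArgumentError unless allow_soft_placement
    lets TensorFlow fall back to the CPU; '/cpu:0' always works."""
    import tensorflow as tf  # deferred: requires a TF 1.x environment
    with tf.device(device):
        c = tf.constant([1.0, 2.0]) + tf.constant([3.0, 4.0])
    config = tf.ConfigProto(allow_soft_placement=soft,
                            log_device_placement=True)
    with tf.Session(config=config) as sess:
        return sess.run(c)  # [4., 6.] when placement succeeds
```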
@Kuntal-G thanks for sharing. To clarify, are you running this code directly in a notebook on a SageMaker Notebook Instance or are you running this as part of a SageMaker training job that runs remotely? |
@laurenyu Running it as a SageMaker TensorFlow estimator training job seems to work well. |
@Kuntal-G when you run it locally on the notebook instance, are you still using the SageMaker estimator (and using local mode) or not using any SageMaker/AWS SDKs at all? |
Hi @laurenyu I'm not using the SageMaker SDK estimator in local mode, because using TensorFlow with the SageMaker SDK requires returning an EstimatorSpec from model_fn(), and the EstimatorSpec class doesn't have any parameter or config setting for device-specific information through RunConfig (as shown below), which can be done with a plain Estimator without the SageMaker SDK (i.e., using TensorFlow alone).
Can you please tell me how I can verify whether the SageMaker SDK with TensorFlow is utilizing the GPU in both local mode and remote mode? Also, shouldn't the SageMaker notebook environment be set up properly regardless of whether I'm using EstimatorSpec with the SageMaker SDK or running TensorFlow code directly? It makes no sense that running the SageMaker SDK with TensorFlow in local mode would change the fact that TensorFlow cannot identify the GPUs in the notebook instance. Could you please provide a bit more detail? |
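For context, with plain tf.estimator (no SageMaker SDK), session and device options attach via RunConfig rather than EstimatorSpec; a hedged sketch of what the comment above refers to, using the TF 1.x Estimator API (the function name is mine):

```python
def make_run_config():
    """Session-level device options go on tf.estimator.RunConfig,
    not on EstimatorSpec (TF 1.x Estimator API)."""
    import tensorflow as tf  # deferred: requires a TF 1.x environment
    session_config = tf.ConfigProto(allow_soft_placement=True,
                                    log_device_placement=True)
    return tf.estimator.RunConfig(session_config=session_config)
```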
@Kuntal-G understood. I don't think there's a built-in way you could do the analysis using the SageMaker Python SDK's local mode; I think you'd just have to run a separate process somewhere else to analyze. I'm going to forward this issue onto the team that owns the AMI and kernels for the SageMaker Notebook Instances and see if they have any insight into the issue you're experiencing. Thanks for your patience! |
I'm seeing this too. TensorFlow models I had been running in tensorflow_p36, successfully using the GPU for a long time, suddenly stopped working a few days ago and now run on CPU only. A simple test in a new notebook on a p3 instance shows the issue:
Running on the kernel both in the notebook and the console provides:
No GPU, nvidia-smi confirms GPU is available:
Again, running code that previously utilised the GPU now no longer uses the GPU |
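The symptom in this thread (the driver sees the hardware, but the kernel's TensorFlow does not) can be checked programmatically. A small sketch, with the TensorFlow import deferred so it only runs where TF 1.x is installed; the function names are mine, not from any SDK:

```python
def gpus_visible_to_tf():
    """GPU devices the TF 1.x runtime can see (empty list == the symptom)."""
    from tensorflow.python.client import device_lib  # deferred: TF 1.x only
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == 'GPU']

def driver_sees_gpu():
    """Ask the NVIDIA driver directly, independent of TensorFlow."""
    import subprocess
    try:
        return b'GPU' in subprocess.check_output(['nvidia-smi', '-L'])
    except (OSError, subprocess.CalledProcessError):
        return False

def kernel_is_broken(tf_gpus, driver_has_gpu):
    """The reported symptom: the driver sees hardware, TensorFlow does not."""
    return bool(driver_has_gpu and not tf_gpus)
```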
Hi @Kuntal-G, Sorry for the inconvenience caused. There was a bug on our end which has been fixed. Please restart your Notebook Instance to get the fix. Thanks, |
Thank you for the information. I have validated and it is working fine. |
@laurenyu It is very slow, which makes me wonder if it's actually using any of the 16 GPUs. Thanks |
@jasonachonu just to confirm - this is code that you're running directly in your notebook (as opposed to part of a SageMaker training job)? If that is the case, I'll share this with the team that owns SageMaker Notebook Instances. |
@laurenyu Yes, this is my own code that I wrote and am trying to run on an AWS SageMaker Notebook. |
@jasonachonu thanks for confirming! I've reached out to the relevant SageMaker team with your issue. |
Hi @jasonachonu - can you confirm that Keras can see the GPUs on the Notebook Instance? There was a bug in November 2018 (which was also resolved in November); I just want to confirm that this issue hasn't re-occurred in SageMaker Notebooks. You can check whether the GPU is in use by following the instructions here. As for why multi_gpu makes training slower: sadly, not all models benefit from multi_gpu. There is a good explanation of this in Keras GitHub Issue #9204. You can consider increasing the batch size when enabling multi_gpu; some users have reported that this improves the performance of their training. |
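On the batch-size suggestion: keras.utils.multi_gpu_model splits each incoming batch across the replicas, so the global batch size is usually scaled by the GPU count to keep per-replica work constant. A tiny illustration (the helper function is mine, not a Keras API):

```python
def scaled_batch_size(per_gpu_batch, n_gpus):
    """Global batch size so each replica of multi_gpu_model still
    processes `per_gpu_batch` examples per step."""
    return per_gpu_batch * n_gpus

# Sketch of usage (requires keras with a TF backend, not run here):
#   parallel = keras.utils.multi_gpu_model(model, gpus=8)
#   parallel.fit(x, y, batch_size=scaled_batch_size(32, 8))  # 256 globally
```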
@mckev-amazon SageMaker notebooks still don't have kernels that support GPU? |
I created a custom lifecycle configuration to create a new venv and install |
Hi @devforfu and @jazzman37 - I just tested this on a new p2.xlarge instance type without any custom Lifecycle Configuration, and ran the following cell in the conda_tensorflow_p36 kernel:
The response was:
This result shows that the GPUs are being properly detected by TensorFlow from the p2.xlarge's hardware. Are you getting the same or similar response in your SageMaker notebook instance? |
Hey @mckev-amazon, last time I tried a notebook instance, the GPU(s) weren't detected. I'll try one more time without additional configurations to see if I get the same response as you. |
Recently I tried one more time. I created a notebook instance and defined a simple model:

```python
def build_model(input_shape, n_classes):
    with tf.device('/gpu:0'):
        i = L.Input(shape=input_shape)
        x = L.Conv2D(32, 3, activation='relu')(i)
        x = L.Conv2D(64, 3, activation='relu')(x)
        x = L.MaxPool2D()(x)
        x = L.Flatten()(x)
        x = L.Dense(128, activation='relu')(x)
        x = L.Dropout(0.5)(x)
        x = L.Dense(n_classes, activation='softmax')(x)
        m = models.Model(inputs=i, outputs=x)
        m.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return m
```

However, it ignores the GPU. The only thing that helped was reinstalling TensorFlow in the environment:

```shell
~/anaconda3/envs/amazonei_tensorflow_p36/bin/python -m pip uninstall tensorflow
~/anaconda3/envs/amazonei_tensorflow_p36/bin/python -m pip install tensorflow-gpu
```
|
Same for me. I can only make the GPU work in a SageMaker notebook if I install a separate env with keras-gpu or tensorflow-gpu. |
@devforfu thanks for trying this out! Really appreciate your help in debugging this issue. Can you try the same with the |
@mckev-amazon Yeah, you're totally right. The |
Would anyone here mind posting a quick summary of what is and isn't working when it comes to SageMaker and Tensorflow on GPU? |
@mckev-amazon it's working. |
@jazzman37 and @devforfu great to hear! If you continue to have issues with GPU on SageMaker notebooks, please open a new issue as the original issue reported has been resolved. Thanks for using SageMaker! |
I am having a similar issue. I opened: #1346 |
So how do I enable GPU using amazonei? |
Hi nectario, You can find a tutorial here: https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html |
I may be having the same problem but I'm not sure. When I train my model using Google Colab, I complete one epoch every 45 minutes. Now I am using the same code in a Sagemaker notebook (just training on the local notebook). I am using a ml.p2.xlarge instance which comes with a GPU, and I have followed the above instructions to make sure that Keras sees a gpu. However, my model is training extremely slowly: it takes over 5.5 hours to train one epoch! I am not sure if AWS just gave me a super slow GPU, or if my GPU is not being used during the training. I ran this code: and my output was: Does that mean my GPU is currently in use, and it is just really slow? |
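One way to answer "is the GPU actually in use while training?" is to poll nvidia-smi from a terminal (or a second cell) while an epoch runs; utilization stuck near 0% throughout suggests CPU-only training rather than a slow GPU. A sketch (the parsing helper is mine; the query flags are standard nvidia-smi options):

```python
import subprocess

def parse_utilization(text):
    """Parse output of `nvidia-smi --query-gpu=utilization.gpu
    --format=csv,noheader,nounits` into a list of percentages, one per GPU."""
    return [int(line) for line in text.split()]

def gpu_utilization():
    """Poll instantaneous GPU utilization; call repeatedly during training."""
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=utilization.gpu',
         '--format=csv,noheader,nounits'])
    return parse_utilization(out.decode())
```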
System Information
Describe the problem
I'm running a SageMaker notebook (ml.p2.8xlarge) and checking the number of GPUs available before running my TensorFlow code.
But when I checked the available devices from TensorFlow and Keras, no GPU information was shown; only CPU information was printed.
Using nvidia-smi from the notebook or a shell in SageMaker printed the GPUs properly. The PyTorch environment also sees the GPUs fine:
torch.cuda.get_device_name(0)
Upgrading TensorFlow with conda from the notebook and then restarting the notebook instance solved the problem; TensorFlow now reports the GPU information correctly.
Now my questions are:
Is the SageMaker notebook platform not tested well with TensorFlow GPU settings? Shouldn't the notebook work with the GPU by default, without a manual upgrade or uninstall/install of the tensorflow-gpu package?
Am I doing anything wrong? I launched a new SageMaker notebook instance, so it's the latest that AWS is providing now.
Also, why is the device-placement log information from TensorFlow not printed in the SageMaker notebook?
tf.Session(config=tf.ConfigProto(log_device_placement=True))
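On that last question: in TF 1.x, log_device_placement writes to the process's stderr, so inside Jupyter the placement log typically appears in the notebook server's console/terminal rather than in the cell output, which may be why it seems not to print. A minimal check (TF import deferred; the function name is mine):

```python
def check_placement():
    """Run a tiny matmul with device-placement logging enabled (TF 1.x).
    The placement log goes to stderr, i.e. the Jupyter server console,
    not the notebook cell output."""
    import tensorflow as tf  # deferred: requires a TF 1.x environment
    a = tf.constant([[1.0, 2.0]])
    b = tf.constant([[3.0], [4.0]])
    c = tf.matmul(a, b)
    config = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=config) as sess:
        return sess.run(c)  # [[11.]]
```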
Error/Issue before manual conda install/upgrade
Output
'1.10.0'
Output
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
After the conda upgrade in the notebook:
!y|conda install tensorflow-gpu
Output:
'1.11.0'
Output
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13414246756793993509, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11279561524
locality {
bus_id: 1
links {
link {
device_id: 1
type: "StreamExecutor"
strength: 1
}
link {
device_id: 2
type: "StreamExecutor"
strength: 1
}
link {
device_id: 3
type: "StreamExecutor"
strength: 1
}
link {
device_id: 4
type: "StreamExecutor"
strength: 1
}
link {
device_id: 5
type: "StreamExecutor"
strength: 1
}
link {
device_id: 6
type: "StreamExecutor"
strength: 1
}
link {
device_id: 7
type: "StreamExecutor"
strength: 1
}
}
}
incarnation: 9978466201706397067
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:17.0, compute capability: 3.7", name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 11279561524
locality {
bus_id: 1
links {
link {
type: "StreamExecutor"
strength: 1
} . . .
Validated the GPU before and after upgrade with nvidia-smi as well.
!nvidia-smi -l