Cannot colocate nodes, Cannot merge devices with incompatible jobs: '/job:master/task:0' and '/job:ps/task:1' #328

Closed
gautiese opened this issue Jul 31, 2018 · 4 comments

Comments

@gautiese
System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow (Keras)
  • Framework Version: 1.8
  • Python Version: 3.6
  • CPU or GPU: CPU
  • Python SDK Version:
  • Are you using a custom image: No

Describe the problem

I created a keras_model_fn and am trying to train the model on 3 c4 instances. Unfortunately, I get the error detailed below.
Stack Overflow suggests using soft_placement, but I don't know what that means or how to use it.
Help!

Minimal repro / logs

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'embedding_1/embeddings' and 'training/Adam/gradients/embedding_1/GatherV2_grad/Shape: Cannot merge devices with incompatible jobs: '/job:master/task:0' and '/job:ps/task:1'
#11 [[Node: embedding_1/embeddings = VariableV2[_class=["loc:@embedding_1/embeddings"], container="", dtype=DT_FLOAT, shape=[28,300], shared_name="", _device="/job:ps/task:1"]()]]
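
To make the error message easier to read, here is a minimal pure-Python sketch of what "merging" two device strings means. It is not TensorFlow's placer code: `parse_device`, `merge_devices`, and the field regex are hypothetical names written only for illustration. The `allow_soft_placement` branch encodes an assumption, namely that soft placement relaxes only a conflict in the device field (CPU vs. GPU) and never a conflict between jobs or tasks. If that assumption holds, the flag alone would not clear this particular error.

```python
import re
from typing import Dict

# Hypothetical helpers for illustration only; this is NOT TensorFlow's placer.
_FIELD = re.compile(r"/(job|replica|task|device):([^/]+)")


def parse_device(spec: str) -> Dict[str, str]:
    """Split a device string such as '/job:ps/task:1' into its fields."""
    return {key: value for key, value in _FIELD.findall(spec)}


def merge_devices(a: str, b: str, allow_soft_placement: bool = False) -> Dict[str, str]:
    """Combine two partial device specs, failing when a shared field disagrees."""
    merged = parse_device(a)
    for key, value in parse_device(b).items():
        if key in merged and merged[key] != value:
            if key == "device" and allow_soft_placement:
                # Assumption: soft placement only drops a conflicting
                # device constraint; job and task conflicts stay fatal.
                del merged[key]
                continue
            raise ValueError(
                f"Cannot merge devices with incompatible {key}s: '{a}' and '{b}'"
            )
        merged[key] = value
    return merged


if __name__ == "__main__":
    # The two device strings from the error above cannot be reconciled,
    # with or without the soft-placement flag in this sketch.
    try:
        merge_devices("/job:master/task:0", "/job:ps/task:1", allow_soft_placement=True)
    except ValueError as err:
        print(err)
```

In this sketch the variable is pinned to the `ps` job while the gradient op is pinned to the `master` job, and a colocation constraint asks for both to live on one device, which no single device can satisfy.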

@tsoi2
tsoi2 commented Aug 14, 2018

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False))

@ChoiByungWook
Contributor

Hello,

This will be difficult to diagnose without a minimal repro.

Thanks!

@khu834
khu834 commented Aug 17, 2018

Distributed TensorFlow training is not currently supported if you use keras_model_fn.
You need to convert your model to a TensorFlow estimator by defining a model_fn instead.

See the following:
https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#using-a-keras-model-instead-of-a-model_fn

@ChoiByungWook
Contributor

@khu834
Thanks for the clarification!

I apologize that I wasn't able to recognize that this was the problem for @gautiese.

I'll close this issue, as it doesn't seem we can resolve the problem.
