Where and how may I preprocess input data before making predictions? #100


Closed
mihi-r opened this issue Mar 16, 2018 · 6 comments

mihi-r commented Mar 16, 2018

I'd like to preprocess my data by sending a string input through predictor.predict(data) and turning it into numerical embeddings, just as my train_input_fn does with vocab_processor.fit_transform, before it goes through my model_fn:

def train_input_fn(training_dir, hyperparameters):
    return _input_fn(training_dir, 'meta_data_train.csv')

def _input_fn(training_dir, training_filename):
    training_set = pd.read_csv(os.path.join(training_dir, training_filename), dtype={'Classification class name': object}, encoding='cp1252')
    global n_words
    # Prepare training and testing data
    data = training_set['Features']
    target = pd.Series(training_set['Labels'])

    if training_filename == 'meta_data_test.csv':
        vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor.restore("s3://sagemaker-blah/vocab")
        data = np.array(list(vocab_processor.transform(data)))
        return tf.estimator.inputs.numpy_input_fn(
            x={INPUT_TENSOR_NAME: data},
            y=target,
            num_epochs=100,
            shuffle=False)()
    elif training_filename == 'meta_data_train.csv':
        vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
        data = np.array(list(vocab_processor.fit_transform(data)))
        vocab_processor.save("s3://sagemaker-blah/vocab")
        n_words = len(vocab_processor.vocabulary_)
        return tf.estimator.inputs.numpy_input_fn(
            x={INPUT_TENSOR_NAME: data},
            y=target,
            batch_size=len(data),
            num_epochs=None,
            shuffle=True)()

The documentation says to do it through serving_input_fn, but I'm not sure how to access and manipulate the data from my tensor using vocab_processor.transform. Here's my serving_input_fn for context:

def serving_input_fn(hyperparameters):
    tensor = tf.placeholder(tf.int64, shape=[None, MAX_DOCUMENT_LENGTH])
    return build_raw_serving_input_receiver_fn({INPUT_TENSOR_NAME: tensor})()

I tried doing so through an input_fn instead:

def input_fn(serialized_input, content_type):
    vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
    deserialized_input = pickle.loads(serialized_input)
    deserialized_input = np.array(list(vocab_processor.fit_transform(deserialized_input)))
    return deserialized_input

Here, I got an error while deserializing:

KeyError: '['

What would be the best method to preprocess the data?

@winstonaws
Contributor

input_fn is the correct function to override, so you're on the right track. Here's a simpler example from our tests which uses it in this way:

def input_fn(serialized_data, content_type):

When you define an input_fn, the raw request is passed in as the first argument, so you should have full control over the deserialization logic. Your function can do anything, but it just needs to return a dict of tensor name: tensor of the same shape as defined in your serving_input_fn.
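For example, a minimal sketch of that contract, assuming the client sends a pickled numpy array (INPUT_TENSOR_NAME and MAX_DOCUMENT_LENGTH are the names from the code above; swap the deserialization step for whatever matches your client):

import pickle

import numpy as np

def input_fn(serialized_input, content_type):
    # Deserialize the raw request body however the client serialized it.
    data = pickle.loads(serialized_input)
    # Return a dict of tensor name -> array matching the shape declared
    # in serving_input_fn, i.e. [None, MAX_DOCUMENT_LENGTH].
    return {INPUT_TENSOR_NAME: np.asarray(data, dtype=np.int64)}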

From just the part of the stacktrace you provided, it's hard to tell whether the exception is occurring in your input_fn, or after it. I'd recommend adding some debugging print statements inside your input_fn to determine that. If you're still having the problem after verifying you're deserializing correctly, can you provide the full stacktrace, and any debugging info you've obtained from your print statements, such as what you're returning from your input_fn?
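For instance, something like this inside your input_fn (the print output should show up in the endpoint's CloudWatch logs):

import pickle

def input_fn(serialized_input, content_type):
    # Temporary debugging: log what the container actually received.
    print('content_type: %s' % content_type)
    print('first bytes of request: %r' % serialized_input[:20])
    deserialized_input = pickle.loads(serialized_input)
    print('deserialized value: %r' % (deserialized_input,))
    return deserialized_input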

@mihi-r
Author

mihi-r commented Mar 16, 2018

I did some debugging and it looks like the issue is coming from the deserialization. First, I tested the model without the input_fn and preprocessed the data before sending it in:

data = ["blah blah blah"]
data = np.array(list(vocab_processor.transform(data)))
predictor.predict(data)

This works fine and I get back predictions.

The second time, I added back the input_fn, removed all of the preprocessing that was in it before (so I could test just the unpickling), and preprocessed the data beforehand as above. Here's my modified input_fn:

def input_fn(serialized_input, content_type):
    deserialized_input = pickle.loads(serialized_input)
    return deserialized_input

This didn't work, and I got the same error. Here's the entire stacktrace:

Traceback (most recent call last):
  File "/opt/amazon/lib/python2.7/site-packages/container_support/serving.py", line 165, in _invoke
    self.transformer.transform(content, input_content_type, requested_output_content_type)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/serve.py", line 255, in transform
    return self.transform_fn(data, content_type, accepts), accepts
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/serve.py", line 179, in f
    input = input_fn(serialized_data, content_type)
  File "/opt/ml/code/text_classification_cnn.py", line 156, in input_fn
    deserialized_input = pickle.loads(serialized_input)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
KeyError: '['
[2018-03-16 18:11:25,795] ERROR in serving: '['
2018-03-16 18:11:25,795 ERROR - model server - '['

@mihi-r
Author

mihi-r commented Mar 16, 2018

After further debugging, I found out I wasn't pickling correctly. Once I get that resolved, this method should work. Thanks for your help!

mihi-r closed this as completed Mar 16, 2018
@Brunods1001

Hello @mihi-r, I am having the exact same KeyError message, and I also think it's a deserialization issue. I followed the pickling shown in the AWS documentation, but it isn't working. Have you had any luck?

@mihi-r
Author

mihi-r commented May 25, 2018

@Brunods1001 Yes, simply pickling and deserializing as the documentation states worked for me. When I first tried using input_fn, I thought the model server would serialize it for you, but in fact you need to serialize and deserialize it manually.
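To make that concrete, here is a sketch of the matched pair, assuming a pickle round-trip on both sides (whether you also need to bypass the predictor's default serializer depends on how your predictor is constructed):

import pickle

import numpy as np

# Client side: preprocess and serialize the payload yourself.
data = ["blah blah blah"]
data = np.array(list(vocab_processor.transform(data)))
predictor.predict(pickle.dumps(data))

# Server side: deserialize it yourself in input_fn.
def input_fn(serialized_input, content_type):
    return pickle.loads(serialized_input)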

mihi-r reopened this May 25, 2018
@jesterhazy
Contributor

This issue has been closed due to inactivity. If you still have questions, please re-open it.
