Where and how may I preprocess input data before making predictions? #100


Closed
mihi-r opened this issue Mar 16, 2018 · 6 comments

mihi-r commented Mar 16, 2018

I'd like to preprocess my data by sending a string input through predictor.predict(data) and turning it into numerical embeddings, just as my train_input_fn does with vocab_processor.fit_transform, before it goes through my model_fn:

def train_input_fn(training_dir, hyperparameters):
    return _input_fn(training_dir, 'meta_data_train.csv')

def _input_fn(training_dir, training_filename):
    training_set = pd.read_csv(os.path.join(training_dir, training_filename), dtype={'Classification class name': object}, encoding='cp1252')
    global n_words
    # Prepare training and testing data
    data = training_set['Features']
    target = pd.Series(training_set['Labels'])

    if training_filename == 'meta_data_test.csv':
        vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor.restore("s3://sagemaker-blah/vocab")
        data = np.array(list(vocab_processor.transform(data)))
        return tf.estimator.inputs.numpy_input_fn(
            x={INPUT_TENSOR_NAME: data},
            y=target,
            num_epochs=100,
            shuffle=False)()
    elif training_filename == 'meta_data_train.csv':
        vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
        data = np.array(list(vocab_processor.fit_transform(data)))
        vocab_processor.save("s3://sagemaker-blah/vocab")
        n_words = len(vocab_processor.vocabulary_)
        return tf.estimator.inputs.numpy_input_fn(
            x={INPUT_TENSOR_NAME: data},
            y=target,
            batch_size=len(data),
            num_epochs=None,
            shuffle=True)()

The documentation says to do it through serving_input_fn, but I'm not sure how to access and manipulate the data from my tensor using vocab_processor.transform. Here's my serving_input_fn for context:

def serving_input_fn(hyperparameters):
    tensor = tf.placeholder(tf.int64, shape=[None, MAX_DOCUMENT_LENGTH])
    return build_raw_serving_input_receiver_fn({INPUT_TENSOR_NAME: tensor})()

I tried doing so through an input_fn instead:

def input_fn(serialized_input, content_type):
    vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
    deserialized_input = pickle.loads(serialized_input)
    deserialized_input = np.array(list(vocab_processor.fit_transform(deserialized_input)))
    return deserialized_input

Here, I got an error while deserializing:

KeyError: '['

What would be the best method to preprocess the data?

@winstonaws
Contributor

input_fn is the correct function to override, so you're on the right track. Here's a simpler example from our tests which uses it in this way:

def input_fn(serialized_data, content_type):

When you define an input_fn, the raw request is passed in as the first argument, so you should have full control over the deserialization logic. Your function can do anything, but it just needs to return a dict of tensor name: tensor of the same shape as defined in your serving_input_fn.
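For example, a minimal sketch of that contract, assuming the client sends a pickled numpy array (INPUT_TENSOR_NAME and MAX_DOCUMENT_LENGTH are the names from the code above; swap the deserialization step for whatever matches your client):

import pickle

import numpy as np

def input_fn(serialized_input, content_type):
    # Deserialize the raw request body however the client serialized it.
    data = pickle.loads(serialized_input)
    # Return a dict of tensor name -> array matching the shape declared
    # in serving_input_fn, i.e. [None, MAX_DOCUMENT_LENGTH].
    return {INPUT_TENSOR_NAME: np.asarray(data, dtype=np.int64)}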

From just the part of the stacktrace you provided, it's hard to tell whether the exception is occurring in your input_fn, or after it. I'd recommend adding some debugging print statements inside your input_fn to determine that. If you're still having the problem after verifying you're deserializing correctly, can you provide the full stacktrace, and any debugging info you've obtained from your print statements, such as what you're returning from your input_fn?
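For instance, something like this inside your input_fn (the print output should show up in the endpoint's CloudWatch logs):

import pickle

def input_fn(serialized_input, content_type):
    # Temporary debugging: log what the container actually received.
    print('content_type: %s' % content_type)
    print('first bytes of request: %r' % serialized_input[:20])
    deserialized_input = pickle.loads(serialized_input)
    print('deserialized value: %r' % (deserialized_input,))
    return deserialized_input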

@mihi-r
Author

mihi-r commented Mar 16, 2018

I did some debugging and it looks like the issue is coming from the deserialization. First, I tested the model without the input_fn and preprocessed the data before sending it in:

data = ["blah blah blah"]
data = np.array(list(vocab_processor.transform(data)))
predictor.predict(data)

This works fine and I get back predictions.

The second time, I added back the input_fn, removed all of the preprocessing that was in it before (so I could test just the unpickling), and preprocessed the data beforehand as above. Here's my modified input_fn:

def input_fn(serialized_input, content_type):
    deserialized_input = pickle.loads(serialized_input)
    return deserialized_input

This didn't work, and I got the same error. Here's the entire stacktrace:

Traceback (most recent call last):
  File "/opt/amazon/lib/python2.7/site-packages/container_support/serving.py", line 165, in _invoke
    self.transformer.transform(content, input_content_type, requested_output_content_type)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/serve.py", line 255, in transform
    return self.transform_fn(data, content_type, accepts), accepts
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/serve.py", line 179, in f
    input = input_fn(serialized_data, content_type)
  File "/opt/ml/code/text_classification_cnn.py", line 156, in input_fn
    deserialized_input = pickle.loads(serialized_input)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
KeyError: '['
[2018-03-16 18:11:25,795] ERROR in serving: '['
2018-03-16 18:11:25,795 ERROR - model server - '['

@mihi-r
Author

mihi-r commented Mar 16, 2018

After further debugging, I found out I wasn't pickling correctly. Once I get that resolved, this method should work. Thanks for your help!

mihi-r closed this as completed Mar 16, 2018
@Brunods1001

Hello @mihi-r, I am having the exact same KeyError message, and I also think it's a deserialization issue. I followed the pickling shown in the AWS documentation, but it isn't working. Have you had any luck?

@mihi-r
Author

mihi-r commented May 25, 2018

@Brunods1001 Yes, simply pickling and deserializing as the documentation states worked for me. When I first tried using input_fn, I thought the model server would serialize it for you, but in fact you need to serialize and deserialize it manually.
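To make that concrete, here is a sketch of the matched pair, assuming a pickle round-trip on both sides (whether you also need to bypass the predictor's default serializer depends on how your predictor is constructed):

import pickle

import numpy as np

# Client side: preprocess and serialize the payload yourself.
data = ["blah blah blah"]
data = np.array(list(vocab_processor.transform(data)))
predictor.predict(pickle.dumps(data))

# Server side: deserialize it yourself in input_fn.
def input_fn(serialized_input, content_type):
    return pickle.loads(serialized_input)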

mihi-r reopened this May 25, 2018
@jesterhazy
Contributor

This issue has been closed due to inactivity. If you still have questions, please re-open it.
