
How to use Sagemaker to create an Inference Pipeline that tokenizes and converts text to word indices? #520


Closed
mshseek opened this issue Dec 1, 2018 · 3 comments


@mshseek

mshseek commented Dec 1, 2018

I have a SageMaker endpoint running TensorFlow Serving that expects a serialized TF Example as input.

Now I need to convert a string into a sequence of word indices by doing a lookup against a txt file, convert that into a serialized TF Example, and pass it to the endpoint above.

Is SageMaker the best tool to use to build the above preprocessing step? The closest example I could find is the inference_pipeline_sparkml_blazingtext_dbpedia notebook, but that notebook does not show how to convert the tokenized text into word indices.

I was thinking of using scikit-learn to do the feature transformation, but I am unsure how to do the word index lookup and whether the Scikit-learn estimator in SageMaker will allow me to call TensorFlow functions to create the TF Example.
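
For context, a minimal sketch of the serialization step I mean (the feature name and indices are just illustrative):

```python
import tensorflow as tf

# Hypothetical word indices produced by looking the tokens up in the vocab txt file.
word_indices = [12, 7, 341]

example = tf.train.Example(features=tf.train.Features(feature={
    "words": tf.train.Feature(int64_list=tf.train.Int64List(value=word_indices)),
}))
serialized = example.SerializeToString()  # bytes to send to the TF Serving endpoint
```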

@orchidmajumder
Contributor

orchidmajumder commented Dec 6, 2018

Hi, thanks for using Amazon SageMaker. For your use-case, the example that you looked at is indeed the right approach.

Continue to use Spark

In order to adapt the same example to your use-case, you need the StringIndexer feature transformer from Spark ML. In the Pipeline, after the Tokenizer, add a StringIndexer and a OneHotEncoder to build your feature-processing step.

Take a look at the notebook mentioned below. Though the task is not related to text analytics, the dataset has categorical columns and the example uses the two feature processors I mentioned above.

https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/inference_pipeline_sparkml_xgboost_abalone
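
For illustration, a minimal sketch of a Spark ML pipeline along those lines (assuming Spark 2.x, where OneHotEncoder still takes single inputCol/outputCol arguments; the column names are hypothetical):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StringIndexer, OneHotEncoder

# Hypothetical columns: "text" holds the raw sentence, "category" a categorical feature.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
indexer = StringIndexer(inputCol="category", outputCol="category_index")
encoder = OneHotEncoder(inputCol="category_index", outputCol="category_vec")

pipeline = Pipeline(stages=[tokenizer, indexer, encoder])
# model = pipeline.fit(df)
# transformed = model.transform(df)
```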

Use Scikit-learn instead of Spark

However, if you want to use Scikit-learn instead of Spark, you can use that as well. Here is a notebook on how to use Scikit-learn for an Inference Pipeline:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_inference_pipeline

Again, in that case you would use Scikit-learn's LabelEncoder to map the tokens to integer word indices.
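
As a rough sketch of that idea (the vocabulary file name and tokens are hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical vocabulary file: one token per line, as in the original question.
with open("vocab.txt") as f:
    vocab = [line.strip() for line in f]

encoder = LabelEncoder()
encoder.fit(vocab)

tokens = ["quick", "brown", "fox"]
word_indices = encoder.transform(tokens)  # one integer index per token
# Note: transform() raises on tokens that were not in the vocabulary,
# so unknown words need to be mapped to a reserved index beforehand.
```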

And you can indeed use Scikit-learn together with SageMaker TensorFlow in an Inference Pipeline setup.
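
A minimal sketch of chaining the two with the SageMaker Python SDK's PipelineModel (it assumes sklearn_model, tf_model, and role already exist):

```python
from sagemaker.pipeline import PipelineModel

# Assumes sklearn_model and tf_model are SageMaker Model objects built from the
# trained preprocessing artifact and the TensorFlow Serving artifact, respectively.
pipeline_model = PipelineModel(
    name="preprocess-then-tf-serving",
    role=role,
    models=[sklearn_model, tf_model],
)

predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
)
```

At inference time, a request to the pipeline endpoint is handled by the preprocessing container first and its output is forwarded to the TensorFlow container.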

@laurenyu
Contributor

closing due to inactivity. feel free to reopen if necessary.

@ansh997

ansh997 commented Jun 25, 2019

In my use case, I have to feed a sparse matrix to an algorithm that is already deployed. I used a one-hot encoder to convert an alphanumeric string into a sparse matrix in order to train the model. But once the model is deployed, I can't use that one-hot encoder inside the model. So I tried to use a pipeline to process the input before feeding it into the deployed model, for which I have to deploy another model whose endpoint would act as a feed for my previous model. How should I achieve this?

I have already tried to make a preprocessing model, but after deploying it I am unable to get the desired result. I need my preprocessing model to return a sparse matrix, but I am getting a 500 error. Is there another way to tackle this?
Note: My model needs input in the form of a sparse matrix to make predictions.
ERR: ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "500 Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application."

ChoiByungWook pushed a commit that referenced this issue Dec 8, 2020