
Issues with prediction time not proportional w.r.t. number of trees in RF #681

Closed
soufianekhoudmi opened this issue Mar 4, 2019 · 3 comments

soufianekhoudmi commented Mar 4, 2019


System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): SKLearn/Custom
  • Framework Version: 0.20.0
  • Python Version: 3.5
  • CPU or GPU: CPU
  • Python SDK Version: 1.18.2
  • Are you using a custom image: No

Describe the problem

My prediction time is not proportional to the number of trees in a Random Forest

Minimal repro / logs

My estimation strategy consists of using a set of Random Forest models, each one covering a
subset of the data (e.g. RF_A if feature == A). I mention this for the sake of completeness, as I don't think it affects my issue.

My deployment strategy:

  • Fit: return a pickle that contains a dictionary of fitted sklearn Random Forest models
  • Deploy: load this dictionary into memory.
  • Inference:
    -- map each observation to the correct model in the already-loaded dictionary
    -- for each observation, compute the prediction given by each individual tree, to allow an elementary confidence interval computation (see the sketch after this list):
    http://blog.datadive.net/prediction-intervals-for-random-forests/
    Note that this last operation is the most time-consuming part of inference, and its duration is proportional to the number of trees in my RF (a loop over the trees).
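For context, the per-tree loop inside prediction.predict (the custom code is not shown here) follows the approach from that blog post. A minimal sketch of what it does, assuming a fitted RandomForestRegressor called model and a 2-D array X of observations:

import numpy as np

def predict_with_interval(model, X, lower=5, upper=95):
    # One predict call per tree: the cost of this loop grows linearly
    # with n_estimators, which is why I expect roughly 3x the time for 300 trees.
    per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
    point = per_tree.mean(axis=0)                 # forest prediction
    low = np.percentile(per_tree, lower, axis=0)  # lower bound of the interval
    high = np.percentile(per_tree, upper, axis=0) # upper bound of the interval
    return point, low, high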

My code (my custom code is in lib):

import argparse
import os
import sys

import pandas as pd
from sklearn.externals import joblib

# Make the custom code shipped with the job importable.
module_path = os.path.abspath('/opt/ml/code')
if module_path not in sys.path:
    sys.path.append(module_path)
from lib import training, prediction
from data.transactions import raw

if __name__ == '__main__':
    # Training entry point: SageMaker injects the output/model dirs as env vars.
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    args = parser.parse_args()

    # Fit one Random Forest per data subset and store them all in a dictionary.
    grid_models_dict = training.train_models_in_dict(raw_training_data=raw)
    joblib.dump(grid_models_dict, os.path.join(args.model_dir, "model"))

def model_fn(model_dir):
    # Hosting: load the whole dictionary of fitted models into memory.
    grid_models_dict = joblib.load(os.path.join(model_dir, "model"))
    return grid_models_dict

def predict_fn(input_data, model):
    # Hosting: route each observation to its model and compute per-tree predictions.
    predicted = prediction.predict(input_data, model)
    return predicted
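
For completeness, this script is used as the entry point with the SageMaker Python SDK (v1.x) roughly like the sketch below; the script name, IAM role, S3 path and instance types here are placeholders, not my exact setup:

from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point='entry_point.py',  # the script above (placeholder name)
    role='arn:aws:iam::123456789012:role/SageMakerRole',  # placeholder role
    train_instance_type='ml.c5.4xlarge',
    framework_version='0.20.0',
    py_version='py3',
)
estimator.fit({'train': 's3://my-bucket/training-data'})  # placeholder S3 path
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.c5.4xlarge')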

My problem:

I have two deployment scenarios: one with 100 trees/RF and one with 300 trees/RF.
Fit is performed without issues. On S3, the compressed 100 trees/RF pickle is 261 MB and the compressed 300 trees/RF pickle is 784 MB.
Deploy is done with some issues: some workers time out with the 300 trees/RF model, as already reported in e.g. aws/amazon-sagemaker-examples#556, but it deploys in the end.
Prediction is performed:

  • with the 100 trees/RF: always in around 500 ms, with the same observation
  • with the 300 trees/RF: on paper, with the same observation, since my prediction is a for loop over the trees, I should predict in at most ~1.5 seconds
  • with the 300 trees/RF: in practice, with the same observation,
    -- sometimes (33% of cases) in 700 ms,
    -- sometimes (33% of cases) in 40 to 50 seconds,
    -- and sometimes (33% of cases) I get a timeout error (inference timeout is limited to 60 seconds)
  • This behavior remains when I deploy on a bigger/more recent machine (from ml.t2.xlarge to ml.c5.4xlarge).

My guess is that there is a memory-swapping mechanism, or that the container's memory is not fully privately allocated to me beyond some threshold.
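
In case this really is memory pressure from each worker loading the full 784 MB dictionary, one mitigation I am considering is memory-mapping the joblib file so the serving workers share the trees' arrays instead of each holding a private copy. A minimal sketch, assuming the model is dumped without compression (memory mapping only works on uncompressed joblib files):

import os
from sklearn.externals import joblib

def model_fn(model_dir):
    # mmap_mode='r' memory-maps the numpy arrays inside the forests;
    # the OS page cache can then share them across workers instead of
    # every worker keeping its own full copy in private memory.
    return joblib.load(os.path.join(model_dir, "model"), mmap_mode='r')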

Is there any solution to predict consistently with more than 100 trees/RF?

Thanks in advance.

soufianekhoudmi changed the title from "Issues with prediction time" to "Issues with prediction time not proportional w.r.t. number of trees in RF" on Mar 4, 2019
@laurenyu
Contributor

hi @soufianekhoudmi, thanks for your patience! we've reached out to the team that is responsible for SageMaker's Scikit-learn support to see if they have any insight.

@asadoughi

Hi @soufianekhoudmi, I would suggest profiling your predict function's performance outside of SageMaker (e.g. on SageMaker notebooks or vanilla EC2) to better understand its bottlenecks. If you see different performance characteristics outside of SageMaker, please report what you find along with relevant code and models to assist in reproducing the issue, if possible. Thanks.
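
For example, a self-contained timing sketch along these lines (synthetic data standing in for your actual models) would show how the per-tree loop scales with n_estimators outside of the hosting environment:

import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data just to exercise the per-tree prediction loop.
X = np.random.rand(1000, 20)
y = np.random.rand(1000)

for n_trees in (100, 300):
    model = RandomForestRegressor(n_estimators=n_trees, n_jobs=-1).fit(X, y)
    obs = X[:1]  # a single observation, as in the endpoint
    start = time.perf_counter()
    per_tree = [tree.predict(obs) for tree in model.estimators_]
    elapsed = time.perf_counter() - start
    print('%d trees: %.1f ms' % (n_trees, 1000 * elapsed))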

@laurenyu
Contributor

laurenyu commented Sep 6, 2019

closing due to inactivity

laurenyu closed this as completed on Sep 6, 2019