My prediction time is not proportional to the number of trees in a Random Forest
Minimal repro / logs
My estimation strategy consists of using a set of Random Forest models, each one covering a
subset of the data (e.g. RF_A if feature == A). I mention this for the sake of completeness, as I don't think it affects my issue.
My deployment strategy:
Fit: returns a pickle containing a dictionary of fitted sklearn Random Forest models.
Deploy: loads this dictionary into memory.
Inference:
--maps each observation to the correct model in the already-loaded dictionary
--for each observation, computes the prediction of every tree, to allow an elementary confidence-interval computation (see http://blog.datadive.net/prediction-intervals-for-random-forests/)
Note that this last operation is the most time-consuming part of inference, and its cost is proportional to the number of trees in my RF (it loops over the trees).
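For reference, the routing and per-tree loop described above can be sketched as follows. This is a minimal, dependency-free sketch, not the author's `lib.prediction` code: it assumes each model in the dictionary exposes an sklearn-style `estimators_` list whose trees have a `predict` method (as a fitted `RandomForestRegressor` does), and the names `percentile`, `per_tree_interval` and `route_and_predict` are hypothetical. The percentile uses the same linear-interpolation rule as numpy's default.

```python
from typing import Any, Dict, List, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile (numpy's default rule), q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def per_tree_interval(forest: Any, x: Sequence[float], level: float = 95.0) -> Tuple[float, float, float]:
    """Return (lower, mean, upper) for one observation x.

    Hypothetical helper: it loops over forest.estimators_, so its cost
    grows linearly with the number of trees.
    """
    # One prediction per tree; sklearn trees expect a 2-D input, hence [x].
    preds = [float(tree.predict([x])[0]) for tree in forest.estimators_]
    alpha = (100.0 - level) / 2.0
    mean = sum(preds) / len(preds)  # equals the forest's prediction for a regressor
    return percentile(preds, alpha), mean, percentile(preds, 100.0 - alpha)


def route_and_predict(models: Dict[Any, Any], rows: List[Tuple[Any, Sequence[float]]]) -> List[Tuple[float, float, float]]:
    """Hypothetical dispatcher: each row is (key, features); key selects the RF (e.g. "A" -> RF_A)."""
    return [per_tree_interval(models[key], features) for key, features in rows]
```

With a fitted `RandomForestRegressor` named `rf_a`, `route_and_predict({"A": rf_a}, [("A", row)])` would return one `(lower, mean, upper)` tuple per row. Because every observation triggers one `predict` call per tree, a batch of n observations costs n × n_trees calls; calling each tree once on the whole batch reduces that Python overhead, although the cost still grows with the number of trees.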
My code (my custom code is in lib):
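import argparse
import os
import sys

import joblib  # standalone package; sklearn.externals.joblib is deprecated and was removed in scikit-learn 0.23
import pandas as pd

# Make the custom code under /opt/ml/code (lib, data) importable.
module_path = os.path.abspath('/opt/ml/code')
if module_path not in sys.path:
    sys.path.append(module_path)

from lib import training, prediction
from data.transactions import raw

if __name__ == '__main__':
    # Training entry point: SageMaker passes the output and model directories via environment variables.
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    args = parser.parse_args()

    # Fit one Random Forest per data subset and persist the whole dictionary.
    grid_models_dict = training.train_models_in_dict(raw_training_data=raw)
    joblib.dump(grid_models_dict, os.path.join(args.model_dir, "model"))


def model_fn(model_dir):
    """Load the dictionary of fitted Random Forests when the endpoint starts."""
    grid_models_dict = joblib.load(os.path.join(model_dir, "model"))
    return grid_models_dict


def predict_fn(input_data, model):
    """Route each observation to its model and compute per-tree predictions."""
    predicted = prediction.predict(input_data, model)
    return predicted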
My problem:
I have two deployment scenarios: one with 100 trees/RF and one with 300 trees/RF.
Fit runs without issues. On S3, the compressed 100 trees/RF pickle is 261 MB and the compressed 300 trees/RF pickle is 784 MB.
Deployment has some issues: with 300 trees/RF, some workers time out (already reported, e.g. in aws/amazon-sagemaker-examples#556), but the deployment eventually succeeds.
Prediction times:
with 100 trees/RF: always around 500 ms for the same observation
with 300 trees/RF, in theory: since my prediction is a for loop over the trees, the same observation should take at most 1.5 seconds (3 × 500 ms)
with 300 trees/RF, in practice, for the same observation:
-- in 33% of cases: 700 ms,
-- in 33% of cases: 40 to 50 seconds,
-- and in 33% of cases: a timeout error (the inference timeout is limited to 60 seconds)
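The linear-scaling expectation above can be made explicit with a small check. This is an illustrative sketch using only the timings reported in this issue; `expected_latency_ms` and `flag_outliers` are hypothetical names, and the 1.5× tolerance is an arbitrary assumption.

```python
from typing import List, Tuple


def expected_latency_ms(base_ms: float, base_trees: int, trees: int) -> float:
    """Latency predicted by a purely linear cost in the number of trees."""
    return base_ms * trees / base_trees


def flag_outliers(observed_ms: List[float], expected_ms: float, tolerance: float = 1.5) -> List[Tuple[float, float]]:
    """Return (observed, ratio to expectation) for samples slower than tolerance * expected."""
    return [(t, t / expected_ms) for t in observed_ms if t > tolerance * expected_ms]


expected = expected_latency_ms(base_ms=500, base_trees=100, trees=300)
print(expected)  # 1500.0
for observed, ratio in flag_outliers([700, 40_000, 50_000], expected):
    print(f"{observed:.0f} ms is {ratio:.1f}x the linear expectation")
```

The 700 ms case fits within the linear budget, while the 40 to 50 s cases are roughly 27 to 33 times slower than the tree count alone would explain, which suggests looking for a cause outside the per-tree loop.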
This behavior persists when I deploy on a bigger, more recent instance (moving from ml.t2.xlarge to ml.c5.4xlarge).
My guess is that some memory swapping is happening, or that the container's memory is not fully and privately allocated to me beyond some threshold.
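One way to test the swapping guess is to measure how much memory loading the model dictionary actually takes, then compare it with the instance's RAM. A minimal Unix-only sketch using the standard `resource` module; `peak_rss_mb` and `measure_load` are hypothetical helper names. If the serving container runs several worker processes that each load their own copy of the models (an assumption to verify for this container), the total footprint is roughly that figure times the number of workers.

```python
import resource
import sys
from typing import Any, Callable, Tuple


def peak_rss_mb() -> float:
    """Peak resident set size of the current process in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024


def measure_load(loader: Callable[[], Any]) -> Tuple[Any, float]:
    """Run loader() and return (its result, growth of peak RSS in MB during the call).

    Peak RSS is a high-water mark, so the growth can read as 0 if the
    process had already used more memory earlier.
    """
    before = peak_rss_mb()
    result = loader()
    return result, peak_rss_mb() - before
```

For example, `models, grown_mb = measure_load(lambda: joblib.load(os.path.join(model_dir, "model")))` run in a fresh process gives an estimate of the in-memory size of the 300 trees/RF dictionary.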
Is there any way to get consistent prediction times with more than 100 trees/RF?
Thanks in advance.
soufianekhoudmi changed the title from "Issues with prediction time" to "Issues with prediction time not proportional w.r.t. number of trees in RF" on Mar 4, 2019.
Hi @soufianekhoudmi, thanks for your patience! We've reached out to the team responsible for SageMaker's Scikit-learn support to see if they have any insight.
Hi @soufianekhoudmi, I would suggest profiling your predict function's performance outside of SageMaker (e.g. on SageMaker notebooks or vanilla EC2) to better understand its bottlenecks. If you see different performance characteristics outside of SageMaker, please report what you find along with relevant code and models to assist in reproducing the issue, if possible. Thanks.
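As a starting point for the profiling suggested above, a small stdlib timing harness can show whether latency is stable outside SageMaker. `profile_latency` is a hypothetical helper; it can be given the issue's own `predict_fn` together with a model loaded by `model_fn` and a representative input.

```python
import statistics
import time
from typing import Any, Callable, Dict, List


def profile_latency(fn: Callable[..., Any], *args: Any, repeats: int = 20, warmup: int = 2) -> Dict[str, float]:
    """Call fn(*args) repeatedly and summarize wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # warm caches and lazy imports before measuring
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        "stdev_ms": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }
```

For example, `profile_latency(predict_fn, input_data, model_fn(model_dir))`, where `input_data` and `model_dir` are whatever is available locally. A tight min/max spread locally but a wide one on the endpoint would point at the hosting environment rather than the per-tree loop; for a per-function breakdown, the standard library's `cProfile` module can be run around the same call.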