Unable to exactly reproduce mAP scores reported on the validation dataset for an SSD object detection model from a deployed endpoint #2641
Labels:
component: autogluon (Relates to SageMaker AutoGluon)
component: training (Relates to the SageMaker Training Platform)
type: bug
ISSUE DESCRIPTION
We are unable to reproduce the exact mAP scores reported by the SageMaker platform on the validation dataset for SSD-based object detection when we score with the deployed model and compute the mAP metrics with an external metrics repository.
We score on the same data that was used as the validation set while training the model.
EXPECTED BEHAVIOR
The mAP scores reported by the SageMaker platform on the validation data should exactly, or at least very closely, match the mAP obtained when scoring the same data with the deployed model.
Measurement criteria:
The models were scored at IoU thresholds of 0.5 and 0.45, without filtering predictions on confidence thresholds, using the following repositories for VOC metrics (a scoring sketch follows the list):
1. gluoncv (VOCMApMetric)
2. vision-evaluation
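For reference, a minimal sketch of how we run the gluoncv metric (repo #1); per_image_results is a placeholder for the per-image predictions from the endpoint paired with the ground truth:

```python
from gluoncv.data import VOCDetection
from gluoncv.utils.metrics.voc_detection import VOCMApMetric

# One metric instance per IoU threshold; no confidence filtering is applied.
voc_classes = VOCDetection.CLASSES
metric_50 = VOCMApMetric(iou_thresh=0.50, class_names=voc_classes)
metric_45 = VOCMApMetric(iou_thresh=0.45, class_names=voc_classes)

# per_image_results is a placeholder for (prediction, ground truth) pairs;
# bboxes are (1, N, 4) arrays, labels/scores are (1, N).
for pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels in per_image_results:
    metric_50.update(pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels)
    metric_45.update(pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels)

names, values = metric_50.get()  # per-class APs, with the overall mAP last
print({'mAP_50': values[-1]})
```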
TO REPRODUCE
Train an SSD object detection model with a ResNet-50 backbone using the built-in Object Detection algorithm in SageMaker, and note the validation accuracy reported by the platform. Deploy the model as a SageMaker endpoint. Score the deployed model on the exact same validation dataset and calculate the mAP using the measurement criteria above, for example as sketched below.
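A hedged sketch of how we score the endpoint (the endpoint name and image path are placeholders; the response layout is the one documented for the built-in Object Detection algorithm):

```python
import json
import boto3

runtime = boto3.client('sagemaker-runtime')

# Placeholder endpoint name and image path.
with open('VOCdevkit/VOC2007/JPEGImages/000001.jpg', 'rb') as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName='ssd-resnet50-voc',   # placeholder name
    ContentType='image/jpeg',
    Body=payload,
)

# The built-in algorithm returns JSON of the form
# {'prediction': [[class_id, score, xmin, ymin, xmax, ymax], ...]}
# with corner coordinates normalized to [0, 1].
detections = json.loads(response['Body'].read())['prediction']
```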
Dataset:
Training Dataset: VOC 2007 TrainVal + VOC 2012 TrainVal (16551 records)
Validation Dataset: VOC 2007 Test (4952 records)
Note: the dataset did not contain the gt_difficulty flag, either during the training/validation phase or during the deployment scoring phase.
When training on this dataset on the SageMaker platform (with no gt_difficulty flag set), we get a validation mAP of 0.735.
Deploying this model as a SageMaker endpoint and scoring the same VOC 2007 Test set (4952 records) yields the mAP values below at IoU thresholds of 0.5 and 0.45:
VOCMApMetric from gluoncv (#1 in the repo list mentioned above):
{'mAP_45': 0.7235910398082284}
{'mAP_50': 0.7048126660056392}
VOC mAP values from the vision-evaluation code (#2 in the repo list above):
{'mAP_45': 0.7209656341246536}
{'mAP_50': 0.7011370324989207}
ADDITIONAL DETAILS
We also downloaded the model artifact and ran inference locally using the SSDDefaultValTransform preprocessing logic from https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/data/transforms/presets/ssd.py#L203, and we get exactly the same 0.735 mAP reported as the validation mAP by the SageMaker platform. A sketch of that local setup is below.
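A rough sketch of the local run (the artifact filenames and the 512 input size are assumptions based on our training job; the detections feed into VOCMApMetric exactly as in the scoring sketch above):

```python
import mxnet as mx
import numpy as np
from gluoncv.data.transforms.presets.ssd import SSDDefaultValTransform

# Filenames are what we found inside the downloaded model.tar.gz
# (an assumption; adjust if your artifact differs).
net = mx.gluon.SymbolBlock.imports(
    'model_algo_1-symbol.json', ['data'], 'model_algo_1-0000.params')

transform = SSDDefaultValTransform(width=512, height=512)  # training image_shape

img = mx.image.imread('VOCdevkit/VOC2007/JPEGImages/000001.jpg')
x, _ = transform(img, np.zeros((1, 5)))   # dummy label; unused at inference
det = net(x.expand_dims(axis=0))          # detections, assumed (1, num_boxes, 6)
# Each row: [class_id, score, xmin, ymin, xmax, ymax].
```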
We saw that the preprocessing logic follows different code paths for the validation phase and for the deployed model. Is this the reason for the difference?
Inference:
The deployed endpoint preprocesses with transform_test, which calls resize_short_within (an aspect-ratio-preserving resize): https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/data/transforms/presets/ssd.py#L17
Validation:
During validation, SSDDefaultValTransform is used for preprocessing, which resizes to a fixed square shape by calling imresize: https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/data/transforms/presets/ssd.py#L203
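The shape difference is easy to see side by side (a sketch; the image path and the 512 input size are placeholders):

```python
import mxnet as mx
import numpy as np
from gluoncv.data.transforms.presets.ssd import transform_test, SSDDefaultValTransform

img = mx.image.imread('VOCdevkit/VOC2007/JPEGImages/000001.jpg')

# Endpoint path: resize_short_within keeps the aspect ratio.
x_ep, _ = transform_test(img, short=512)
print(x_ep.shape)   # not square, e.g. (1, 3, 725, 512)

# Validation path: imresize forces a fixed square, distorting the aspect ratio.
x_val, _ = SSDDefaultValTransform(512, 512)(img, np.zeros((1, 5)))
print(x_val.shape)  # (3, 512, 512)
```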
We also noted that the overlap (IoU) threshold is set together with the NMS threshold, as seen in the code at this link: https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/model_zoo/ssd/ssd.py#L188
So both the overlap and the NMS threshold are set to 0.45 during evaluation.
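In other words (a sketch using a gluoncv model-zoo SSD only to illustrate the two knobs; set_nms is the gluoncv API, and the values shown are its defaults):

```python
from gluoncv import model_zoo
from gluoncv.utils.metrics.voc_detection import VOCMApMetric

# A model-zoo SSD net, not the SageMaker artifact, purely for illustration.
net = model_zoo.get_model('ssd_512_resnet50_v1_voc', pretrained=True)

# Detector side: non-maximum suppression at IoU 0.45 (the gluoncv default).
net.set_nms(nms_thresh=0.45, nms_topk=400)

# Metric side: the evaluation overlap threshold ends up at the same 0.45.
metric = VOCMApMetric(iou_thresh=0.45, class_names=net.classes)
```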
Conclusions
When we run inference locally with the exact same preprocessing as SSDDefaultValTransform, and with an IoU (overlap) threshold and an NMS threshold of 0.45, we get the same validation mAP as reported by the SageMaker platform. We therefore attribute the difference in scores from the deployed endpoint to the difference in the preprocessing layer used when scoring against the deployed model.
Can you please confirm this behavior?