
Unable to exactly reproduce MAP scores reported on validation dataset for SSD object detection model from a deployed endpoint #2641


Open
MercyPrasanna opened this issue Sep 15, 2021 · 0 comments
Labels: component: autogluon (Relates to SageMaker AutoGluon), component: training (Relates to the SageMaker Training Platform), type: bug


MercyPrasanna commented Sep 15, 2021

ISSUE DESCRIPTION

We are unable to reproduce the exact mAP scores reported by the SageMaker platform on the validation dataset for SSD-based object detection when we score with the deployed model and compute the mAP metrics using an external metrics implementation.
We score on the same data that was used as the validation set while training the model.

EXPECTED BEHAVIOR
The expectation is that the mAP scores reported by the SageMaker platform on the validation data should exactly, or at least very closely, match the mAP obtained when scoring the same data against the deployed model.

Measurement criteria:
The models were scored at IoU thresholds of 0.5 and 0.45 using the repositories below for the VOC metrics, without filtering predictions on confidence thresholds (a minimal usage sketch of the gluoncv metric follows the list):

  1. https://cv.gluon.ai/_modules/gluoncv/utils/metrics/voc_detection.html (VOCMApMetric, VOC07MApMetric)
  2. https://github.com/microsoft/vision-evaluation/blob/e82f087a4993b8852b8873a2d6c8ec474faa95dd/vision_evaluation/evaluators.py#L373
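
For reference, a minimal sketch, assuming placeholder predictions/ground_truths containers from our own data-loading code, of how the gluoncv VOCMApMetric (#1 above) can be driven with per-image predictions and ground truth:

    import mxnet as mx
    from gluoncv.utils.metrics.voc_detection import VOCMApMetric

    # iou_thresh=0.5 gives mAP_50; rerun with iou_thresh=0.45 for mAP_45
    metric = VOCMApMetric(iou_thresh=0.5)

    for pred, gt in zip(predictions, ground_truths):  # placeholder containers
        metric.update(
            pred_bboxes=[mx.nd.array(pred["bboxes"])],   # (N, 4) xmin, ymin, xmax, ymax
            pred_labels=[mx.nd.array(pred["labels"])],   # (N,) class indices
            pred_scores=[mx.nd.array(pred["scores"])],   # (N,) confidences
            gt_bboxes=[mx.nd.array(gt["bboxes"])],
            gt_labels=[mx.nd.array(gt["labels"])],
        )

    name, value = metric.get()
    print({name: value})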

TO REPRODUCE
Train an SSD object detection model with a ResNet-50 base network using the built-in Object Detection algorithm in SageMaker and note the validation accuracy reported by the platform. Deploy the model as a SageMaker endpoint. Score the deployed model on the exact same validation dataset and calculate the mAP using the measurement criteria mentioned above.
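
For completeness, a rough sketch of that flow with the SageMaker Python SDK, assuming the RecordIO input format; the role ARN, S3 paths, instance types, and most hyperparameter values are placeholders rather than the exact configuration we used:

    import sagemaker
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    region = session.boto_region_name
    image_uri = sagemaker.image_uris.retrieve("object-detection", region)  # built-in algorithm image

    od = Estimator(
        image_uri=image_uri,
        role="arn:aws:iam::<account>:role/<sagemaker-role>",  # placeholder
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        output_path="s3://<bucket>/od-output/",               # placeholder
        sagemaker_session=session,
    )
    od.set_hyperparameters(
        base_network="resnet-50",
        use_pretrained_model=1,
        num_classes=20,
        num_training_samples=16551,
        image_shape=512,        # illustrative; match whatever the training job used
        epochs=240,             # illustrative
        mini_batch_size=32,     # illustrative
        nms_threshold=0.45,
    )
    od.fit({
        "train": "s3://<bucket>/voc/train.rec",        # placeholder
        "validation": "s3://<bucket>/voc/val.rec",     # placeholder
    })

    predictor = od.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")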

Dataset:
Training Dataset: VOC 2007 TrainVal + VOC 2012 TrainVal (16551 records)
Validation Dataset: VOC 2007 Test (4952 records)
Note: The gt_difficulty flag was not set in the dataset, either during the training/validation phase or during the deployment scoring phase.

When training on this dataset on the SageMaker platform (with no gt_difficulty flag set), we get a validation mAP of 0.735.

Deploying this model as a SageMaker endpoint and scoring on the same VOC 2007 Test set (4952 records) yields the following mAP values at IoU thresholds of 0.5 and 0.45:

VOCMApMetric from gluoncv (#1 in the repo list mentioned above):

{'mAP_45': 0.7235910398082284}
{'mAP_50': 0.7048126660056392}

VOC MAP values from the vision-evaluation code (#2 in the above repo list):
{'mAP_45': 0.7209656341246536}
{'mAP_50': 0.7011370324989207}
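
For context, this is roughly how each validation image is scored against the endpoint (the endpoint name and file handling are illustrative); the built-in algorithm returns JSON of the form {"prediction": [[class_id, score, xmin, ymin, xmax, ymax], ...]} with coordinates normalized to [0, 1]:

    import json
    import boto3

    runtime = boto3.client("sagemaker-runtime")

    def score_image(image_path, endpoint_name="od-endpoint"):  # placeholder endpoint name
        with open(image_path, "rb") as f:
            payload = f.read()
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="image/jpeg",
            Body=payload,
        )
        # Each entry: [class_id, confidence, xmin, ymin, xmax, ymax] (normalized)
        return json.loads(response["Body"].read())["prediction"]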

ADDITIONAL DETAILS
We also downloaded the model and performed inference locally using the SSDDefaultValTransform preprocessing logic (https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/data/transforms/presets/ssd.py#L203), and we get exactly the same 0.735 mAP that the SageMaker platform reports as the validation mAP.
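
A condensed sketch of that local check, assuming the model artifact unpacks to an MXNet symbol/params pair (the file names below are placeholders) and a 512x512 input size; the essential point is that SSDDefaultValTransform is applied before inference:

    import mxnet as mx
    from gluoncv.data.transforms.presets.ssd import SSDDefaultValTransform

    # Placeholder artifact names from the downloaded model.tar.gz
    net = mx.gluon.nn.SymbolBlock.imports(
        "model_algo_1-symbol.json", ["data"], "model_algo_1-0000.params")

    transform = SSDDefaultValTransform(width=512, height=512)

    img = mx.image.imread("VOCdevkit/VOC2007/JPEGImages/000001.jpg")
    dummy_label = mx.nd.zeros((1, 5)).asnumpy()   # the transform also resizes labels; we only keep the image
    tensor, _ = transform(img, dummy_label)
    detections = net(tensor.expand_dims(0))       # feed into VOCMApMetric as in the earlier sketch;
                                                  # output layout depends on the exported symbol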

We observed that the preprocessing logic follows different code paths for the validation phase and for the deployed model. Is this the reason for the difference?

Inference:
The deployed model's inference path uses transform_test for preprocessing, which calls resize_short_within (an aspect-preserving resize) - https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/data/transforms/presets/ssd.py#L17

Validation:
During validation, SSDDefaultValTransform is used for preprocessing, which does a square resize by calling imresize - https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/data/transforms/presets/ssd.py#L203
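
To make the difference concrete, a small side-by-side illustration of the two preprocessing paths, assuming a 512x512 trained input size:

    import mxnet as mx
    from gluoncv.data.transforms.presets.ssd import transform_test, SSDDefaultValTransform

    img = mx.image.imread("000001.jpg")  # any non-square VOC image

    # Deployed-endpoint path: shorter side resized to 512, aspect ratio preserved
    tensor_infer, _ = transform_test(img, short=512)

    # Validation path: hard resize to 512x512, aspect ratio not preserved
    tensor_val, _ = SSDDefaultValTransform(width=512, height=512)(
        img, mx.nd.zeros((1, 5)).asnumpy())

    print(tensor_infer.shape, tensor_val.shape)   # e.g. (1, 3, 512, 683) vs (3, 512, 512)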

We also noted that the overlap threshold is set to the NMS threshold, as seen in the code below from https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/model_zoo/ssd/ssd.py#L188:

if self.nms_thresh > 0 and self.nms_thresh < 1:
    result = F.contrib.box_nms(
        result, overlap_thresh=self.nms_thresh, topk=self.nms_topk, valid_thresh=0.01,
        id_index=0, score_index=1, coord_start=2, force_suppress=False)

So both the overlap threshold and the NMS threshold are set to 0.45 during evaluation.

CONCLUSIONS
When we run inference on the model locally with exactly the same preprocessing as SSDDefaultValTransform, and with an IoU (overlap) threshold and NMS threshold of 0.45, we get the same validation mAP as reported by the SageMaker platform. We therefore attribute the difference in scores from the deployed endpoint to the difference in the preprocessing layer used when scoring against the deployed model.

Can you please confirm this behavior?

@MercyPrasanna MercyPrasanna changed the title Unable to exactly reproduce MAP scores reported on validation dataset for SSD object detection model from a deployed model Unable to exactly reproduce MAP scores reported on validation dataset for SSD object detection model from a deployed endpoint Sep 15, 2021
@knikure knikure added component: autogluon Relates to SageMaker AutoGluon and removed Gluon labels Apr 24, 2024
@nargokul nargokul added the component: training Relates to the SageMaker Training Platform label Apr 14, 2025