Unable to exactly reproduce mAP scores reported on the validation dataset for an SSD object detection model from a deployed endpoint #2641
Labels:
component: autogluon (Relates to SageMaker AutoGluon)
component: training (Relates to the SageMaker Training Platform)
type: bug
ISSUE DESCRIPTION
We are unable to reproduce the exact mAP scores reported by the SageMaker platform on the validation dataset for SSD-based object detection when we score with the deployed model and compute the mAP metrics with an external metrics repository.
We score on the same data that was used as the validation set while training the model.
EXPECTED BEHAVIOR
The mAP scores reported by the SageMaker platform on the validation data should exactly, or at least very closely, match the mAP obtained when scoring the same data with the deployed model.
Measurement criteria:
The models were scored at IoU thresholds of 0.5 and 0.45, without filtering predictions on confidence thresholds, using the following repositories for VOC metrics (a scoring sketch follows the list):
1. gluoncv (VOCMApMetric)
2. vision-evaluation
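For reference, a minimal sketch of how we run the gluoncv metric (repo #1); per_image_results is a placeholder for the per-image predictions from the endpoint paired with the ground truth:

```python
from gluoncv.data import VOCDetection
from gluoncv.utils.metrics.voc_detection import VOCMApMetric

# One metric instance per IoU threshold; no confidence filtering is applied.
voc_classes = VOCDetection.CLASSES
metric_50 = VOCMApMetric(iou_thresh=0.50, class_names=voc_classes)
metric_45 = VOCMApMetric(iou_thresh=0.45, class_names=voc_classes)

# per_image_results is a placeholder for (prediction, ground truth) pairs;
# bboxes are (1, N, 4) arrays, labels/scores are (1, N).
for pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels in per_image_results:
    metric_50.update(pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels)
    metric_45.update(pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels)

names, values = metric_50.get()  # per-class APs, with the overall mAP last
print({'mAP_50': values[-1]})
```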
TO REPRODUCE
Train an SSD object detection model with a ResNet-50 backbone using the built-in Object Detection algorithm in SageMaker, and note the validation accuracy reported by the platform. Deploy the model as a SageMaker endpoint. Score the deployed model on the exact same validation dataset and calculate the mAP using the measurement criteria above, for example as sketched below.
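A hedged sketch of how we score the endpoint (the endpoint name and image path are placeholders; the response layout is the one documented for the built-in Object Detection algorithm):

```python
import json
import boto3

runtime = boto3.client('sagemaker-runtime')

# Placeholder endpoint name and image path.
with open('VOCdevkit/VOC2007/JPEGImages/000001.jpg', 'rb') as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName='ssd-resnet50-voc',   # placeholder name
    ContentType='image/jpeg',
    Body=payload,
)

# The built-in algorithm returns JSON of the form
# {'prediction': [[class_id, score, xmin, ymin, xmax, ymax], ...]}
# with corner coordinates normalized to [0, 1].
detections = json.loads(response['Body'].read())['prediction']
```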
Dataset:
Training Dataset: VOC 2007 TrainVal + VOC 2012 TrainVal (16551 records)
Validation Dataset: VOC 2007 Test (4952 records)
Note: the dataset did not contain the gt_difficulty flag, either during the training/validation phase or during the deployment scoring phase.
When training on this dataset on the SageMaker platform (with no gt_difficulty flag set), we get a validation mAP of 0.735.
Deploying this model as a SageMaker endpoint and scoring the same VOC 2007 Test set (4952 records) yields the mAP values below at IoU thresholds of 0.5 and 0.45:
VOCMApMetric from gluoncv (#1 in the repo list mentioned above):
{'mAP_45': 0.7235910398082284}
{'mAP_50': 0.7048126660056392}
VOC mAP values from the vision-evaluation code (#2 in the repo list above):
{'mAP_45': 0.7209656341246536}
{'mAP_50': 0.7011370324989207}
ADDITIONAL DETAILS
We also downloaded the model artifact and ran inference locally using the SSDDefaultValTransform preprocessing logic from https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/data/transforms/presets/ssd.py#L203, and we get exactly the same 0.735 mAP reported as the validation mAP by the SageMaker platform. A sketch of that local setup is below.
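A rough sketch of the local run (the artifact filenames and the 512 input size are assumptions based on our training job; the detections feed into VOCMApMetric exactly as in the scoring sketch above):

```python
import mxnet as mx
import numpy as np
from gluoncv.data.transforms.presets.ssd import SSDDefaultValTransform

# Filenames are what we found inside the downloaded model.tar.gz
# (an assumption; adjust if your artifact differs).
net = mx.gluon.SymbolBlock.imports(
    'model_algo_1-symbol.json', ['data'], 'model_algo_1-0000.params')

transform = SSDDefaultValTransform(width=512, height=512)  # training image_shape

img = mx.image.imread('VOCdevkit/VOC2007/JPEGImages/000001.jpg')
x, _ = transform(img, np.zeros((1, 5)))   # dummy label; unused at inference
det = net(x.expand_dims(axis=0))          # detections, assumed (1, num_boxes, 6)
# Each row: [class_id, score, xmin, ymin, xmax, ymax].
```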
We saw that the preprocessing logic follows different code paths for the validation phase and for the deployed model. Is this the reason for the difference?
Inference:
The deployed endpoint preprocesses with transform_test, which calls resize_short_within (an aspect-ratio-preserving resize): https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/data/transforms/presets/ssd.py#L17
Validation:
During validation, SSDDefaultValTransform is used for preprocessing, which resizes to a fixed square shape by calling imresize: https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/data/transforms/presets/ssd.py#L203
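The shape difference is easy to see side by side (a sketch; the image path and the 512 input size are placeholders):

```python
import mxnet as mx
import numpy as np
from gluoncv.data.transforms.presets.ssd import transform_test, SSDDefaultValTransform

img = mx.image.imread('VOCdevkit/VOC2007/JPEGImages/000001.jpg')

# Endpoint path: resize_short_within keeps the aspect ratio.
x_ep, _ = transform_test(img, short=512)
print(x_ep.shape)   # not square, e.g. (1, 3, 725, 512)

# Validation path: imresize forces a fixed square, distorting the aspect ratio.
x_val, _ = SSDDefaultValTransform(512, 512)(img, np.zeros((1, 5)))
print(x_val.shape)  # (3, 512, 512)
```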
We also noted that the overlap (IoU) threshold is set together with the NMS threshold, as seen in the code at this link: https://github.com/dmlc/gluon-cv/blob/f22650a5d31c31956d9392530a0e619689cdb3c5/gluoncv/model_zoo/ssd/ssd.py#L188
So both the overlap and the NMS threshold are set to 0.45 during evaluation.
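In other words (a sketch using a gluoncv model-zoo SSD only to illustrate the two knobs; set_nms is the gluoncv API, and the values shown are its defaults):

```python
from gluoncv import model_zoo
from gluoncv.utils.metrics.voc_detection import VOCMApMetric

# A model-zoo SSD net, not the SageMaker artifact, purely for illustration.
net = model_zoo.get_model('ssd_512_resnet50_v1_voc', pretrained=True)

# Detector side: non-maximum suppression at IoU 0.45 (the gluoncv default).
net.set_nms(nms_thresh=0.45, nms_topk=400)

# Metric side: the evaluation overlap threshold ends up at the same 0.45.
metric = VOCMApMetric(iou_thresh=0.45, class_names=net.classes)
```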
Conclusions
When we run inference locally with the exact same preprocessing as SSDDefaultValTransform, and with an IoU (overlap) threshold and an NMS threshold of 0.45, we get the same validation mAP as reported by the SageMaker platform. We therefore attribute the difference in scores from the deployed endpoint to the difference in the preprocessing layer used when scoring against the deployed model.
Can you please confirm this behavior?