Published RecordIO (-protobuf?) deserializer & serializer #1994

Closed
athewsey opened this issue Nov 19, 2020 · 3 comments

@athewsey
Collaborator

Describe the feature you'd like

Please can we promote the existing RecordDeserializer and RecordSerializer into the proper documented serializers and deserializers packages?

MXNet RecordIO is not an easy format to read in a Python environment that you don't want to / can't install MXNet on.

  • The PyPI recordio package seems to be defunct - its GitHub repo has been archived
  • This webpage ranks high on Google for searches like "RecordIO format", but trying to implement it in Python doesn't really work because the MXNet implementation has some quirks/additional functionality.

Across both training and inference use cases, I have in the past wasted a lot of time:

  1. Trying to avoid the issue
  2. Looking for alternative libraries that might solve it for me
  3. Eventually digging through the MXNet source code enough to understand that it uses the implementation provided in dmlc-core, and then translating that logic into Python to make this hacky draft deserializer.
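For anyone landing here with the same problem, the record framing that step 3 refers to can be sketched in plain Python. This is a minimal illustration and not the SDK's or MXNet's actual code: the names `write_record` and `read_records` are hypothetical, and little-endian byte order is assumed (dmlc-core writes in native order, which is little-endian on common platforms). As I understand the dmlc-core layout, each record is a 4-byte magic number, a 4-byte word holding a continuation flag in the upper 3 bits and the payload length in the lower 29 bits, then the payload padded with zeros to a 4-byte boundary.

```python
import io
import struct

KMAGIC = 0xCED7230A            # magic number that starts every record header
LENGTH_MASK = (1 << 29) - 1    # lower 29 bits of the second header word


def write_record(stream, payload):
    """Write one complete record (continuation flag 0).

    Simplification: unlike dmlc-core, this sketch does not split payloads
    that happen to contain the magic number.
    """
    if len(payload) > LENGTH_MASK:
        raise ValueError("payload too large for a single record")
    stream.write(struct.pack("<II", KMAGIC, len(payload)))
    stream.write(payload)
    stream.write(b"\x00" * (-len(payload) % 4))  # pad to a 4-byte boundary


def read_records(stream):
    """Yield each logical record payload from a binary stream."""
    parts = []
    while True:
        header = stream.read(8)
        if not header:
            return
        if len(header) < 8:
            raise ValueError("truncated record header")
        magic, lrecord = struct.unpack("<II", header)
        if magic != KMAGIC:
            raise ValueError("bad magic number, not a RecordIO stream?")
        cflag = lrecord >> 29          # 0 = whole record, 1/2/3 = start/middle/end
        length = lrecord & LENGTH_MASK
        data = stream.read(length)
        if len(data) < length:
            raise ValueError("truncated record payload")
        stream.read(-length % 4)       # skip the zero padding
        parts.append(data)
        if cflag in (0, 3):
            # Split records were cut where the payload contained the magic
            # number, so it is re-inserted between the parts.
            yield struct.pack("<I", KMAGIC).join(parts)
            parts = []


buf = io.BytesIO()
write_record(buf, b"hello")
write_record(buf, b"abcd")
buf.seek(0)
records = list(read_records(buf))
```

The continuation-flag handling is the kind of quirk that a naive reading of the format misses. For RecordIO-protobuf data, each payload yielded here would still need to be parsed as a protobuf message, which this sketch does not attempt.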

Imagine my surprise today, when I found a RecordDeserializer buried away in the not-really-documented src/amazon/common.py!

How would this feature be used? Please describe.

Moving these classes to the standard sagemaker.serializers and sagemaker.deserializers modules will greatly increase their discoverability, making it much easier for users to interact with SageMaker-provided algorithms like Semantic Segmentation - and to consider using MXNet RecordIO serializations for custom models too.

Describe alternatives you've considered

I'm not sure why they haven't been made visible in this way already?

We would also need to think carefully about what this means for the underlying RecordIO and protobuf functionality these classes depend on: it is too much to copy over, so it would mean promoting many of these utility functions from the undocumented amazon area to, I guess, some top-level API?

Additional context

I discovered these classes while fixing tests for PR #1993 (adding accept/content_type constructor argument overrides for all serializers & deserializers in the SDK).

Happy to try and help with implementation if needed, but I would need some guidance on where you'd like the utility functions moved to when lifted out of .amazon, because it seems like too much stuff to drag into serializers.py and deserializers.py.

@hinchcliffz

Oh my god I have not been able to find out how to do this anywhere. Thank you so much for writing that up!

@iaroslav-ai

Would be really great if this could be addressed.

@pintaoz-aws
Contributor

Fixed this in #5037
