CLN: Refactor *SON RFC #9166
Comments
I think adding a serializer using monary would be fine. However, trying to combine serializers is not a good idea in general. They are all c/cython based and strive for the best perf possible. No reason to expose the intermediate form at all.
see also this: #9146
The part of the serializer I want to combine is the 'packing' part, so that JSON/BSON/msgpack share the same 'schema'. True, the JSON serializer code is native, but the msgpack code, for example, is still in python (essentially ...)
@lJoublanc In theory this might be nice, but from a practical perspective: JSON is an already established format that is not completely round-trippable (though see the 'schema' in #9146), while msgpack is just a format I made up. What is the purpose here?
The idea is to share code generation for the schema; i.e. the bit that transforms a python object graph into a document/dict, ready for bin/text serialization. That's point 1 in my monograph :/ In the second example I posted above, with ...
The transformation part is a performance issue; e.g. it's the reason why these are implemented in c/cython. So I would be in favor of doing the following:
This would unify the JSON-like schemas and make them round-trippable, as well as making DataFrames (and other pandas objects) easily describable to the world in general. Back-compat is pretty important, so anything that is done should be able to preserve this. I don't think these are possible in a combined manner in an easy way. You will sacrifice performance if you implement the schema in python. So I think that route is too pie-in-the-sky.
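For illustration only, a hedged sketch of what a self-describing, round-trippable JSON layout could look like, loosely in the spirit of the 'schema' idea referenced in #9146; the field names here are hypothetical, not an existing pandas format:

```python
# Hypothetical sketch of a self-describing JSON document for a frame;
# dtype information travels with the data so it can round-trip.
# None of these field names are an existing pandas format.
import json

doc = {
    "schema": {
        "fields": [
            {"name": "index", "type": "datetime"},
            {"name": "bid", "type": "number"},
            {"name": "offer", "type": "number"},
        ],
        "primaryKey": ["index"],
    },
    "data": [
        {"index": "2014-12-15T08:07:00Z", "bid": 0.5628, "offer": 0.3988},
        {"index": "2014-12-15T08:07:01Z", "bid": 0.3994, "offer": 0.8968},
    ],
}
text = json.dumps(doc)  # plain JSON, readable by any client
```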
Closing and adding to the tracker issue #30407 for IO format requests; can re-open if interest is expressed.
n.b. I'm looking at how to store a DataFrame in BSON format (see #4329) in MongoDB. This is my attempt to gather all the relevant information before adding yet another *SON serializer. Please don't treat this as a change request - I'm just trying to get my head around all the relevant bits. Maybe skip to the summary and start from there.

Refactoring *SON to use common schema
There are a number of *SON formats which are similar, e.g. JSON, Msgpack (in pandas), BSON (not in pandas), and the recently announced JBSON. They have some common components:
Currently, I don't think any of the code between JSON and Msgpack is shared. The schema used by Msgpack is entirely different from JSON. I'm guessing this is by design; JSON has been kept as human-readable as possible, whereas Msgpack is more focused on serialization efficiency. To illustrate this, two examples. First, a simple DataFrame:
produces the following document before serialization in msgpack:
I produced this by calling encode recursively in packers.py, replicating what pack does. There have been plenty of discussions regarding storage of metadata in #3525 and #3297. Now the result of calling to_json():

'{"bid":{"1418638020000":0.5628044127,"1418638021000":0.3993987818,"1418638022000":0.7471914537},"offer":{"1418638020000":0.398797779,"1418638021000":0.8968090851,"1418638022000":0.0980482752}}'
Both appear to use blocking (see #9130 ). Also, as described in #3525, dates are stored as integers (in strings) for performance reasons.
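As a point of reference, here is a minimal sketch of the kind of frame used in this first example (the exact construction isn't shown above); the index and values are only meant to resemble the to_json() output:

```python
# A minimal sketch, assuming a bid/offer frame over a one-second
# DatetimeIndex, similar to the one that produced the output above.
import numpy as np
import pandas as pd

idx = pd.date_range("2014-12-15 08:07:00", periods=3, freq="s")
df = pd.DataFrame({"bid": np.random.rand(3), "offer": np.random.rand(3)}, index=idx)

df.to_json()     # column -> {epoch-ms -> value} mapping, as shown above
df.to_msgpack()  # block-based binary document (to_msgpack existed in pandas at the time)
```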
The second example, which has a MultiIndex:

This kind of works for to_msgpack() but not for to_json() (throws an exception):

It would be nice to expose the API of the 'intermediate' representation (i.e. at the end of step 1) to the advanced user, in order to store the DataFrame in multiple 'documents'. If storing to a file, this doesn't make any sense - you want a single document with many embedded documents; this is the JSON paradigm. But if you're storing into a database, you would want the flexibility to store each nested doc/dict in a different collection/table. There are reasons for this, e.g. a technical reason is the MongoDB maximum record/document size; in general, though, it would be driven by the use-case.
What would this intermediate document look like? A dictionary, with only python (and numpy - see below) primitives? Which primitives differ between the *SON formats (e.g. dates)?
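Purely as a strawman (not an existing pandas structure or API), such an intermediate document might look something like the following; the key names are made up for illustration, loosely modelled on the msgpack output above:

```python
# Hypothetical intermediate form built only from Python/NumPy primitives;
# the field names are invented for illustration and are not a pandas API.
intermediate = {
    "typ": "dataframe",
    "index": {
        "typ": "datetime_index",
        "dtype": "datetime64[ns]",
        "data": [1418638020000, 1418638021000, 1418638022000],  # epoch ms
    },
    "columns": ["bid", "offer"],
    "blocks": [
        {
            "dtype": "float64",
            "values": [
                [0.5628044127, 0.3993987818, 0.7471914537],  # bid
                [0.3987977790, 0.8968090851, 0.0980482752],  # offer
            ],
        }
    ],
}
# A JSON, msgpack or BSON encoder could serialize this dict directly, and a
# database writer could route each entry of "blocks" to its own MongoDB
# collection instead of producing one big document.
```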
Serialization of vectorized data
A separate issue, somewhat at odds with the above, is the storage of vectorized data: the underlying numpy array in to_msgpack() is encoded into a string and stored in binary format. While this makes sense for performance reasons (see #3525 (comment)), in my opinion it gives up one of the advantages of using msgpack (or BSON going forward): portability. If you go to http://msgpack.org, you'll see they have APIs for every conceivable language (likewise for MongoDB). Wouldn't it be nice if you could use one of these APIs and load up your pandas data in e.g. Java, Haskell or C? At present this is possible, with the exception of the 'values' fields, which you would have to decode by hand. Sure, you could argue to use JSON for portability, but then you're trading off performance.

This is at odds with adding compression - which we still want. A compromise could be to use the native *SON list type when compress=None, where we assume that speed isn't important, and the current solution (encode as string) when compression is active and speed is important (compressed data is not handled by the abovementioned APIs anyway).

Note also that, as described in the links in #4329 (comment), BSON/MongoDB appears to have some support for vectorised lists that avoids having to convert to a list and then serialize each element separately.
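As a rough illustration of that trade-off, using the standalone msgpack package rather than pandas' internal packer: a native msgpack array is readable by any msgpack client, while a raw-bytes blob is compact but has to be decoded by hand on the other side:

```python
# Sketch of the portability/performance trade-off discussed above, using
# the standalone `msgpack` package (not pandas' internal serializer).
import msgpack
import numpy as np

values = np.random.rand(1000)

# Portable: a native msgpack array of doubles, readable from Java/C/Haskell etc.
portable = msgpack.packb(values.tolist())

# Compact/fast: a raw binary blob, which other clients must decode by hand.
opaque = msgpack.packb(values.tobytes())
recovered = np.frombuffer(msgpack.unpackb(opaque), dtype=np.float64)
```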
Summary
In adding yet another *SON serializer, it would not make sense to have a third codebase that handles the issues discussed above in a new way:

- Could the native *SON list type be used for 'values', e.g. depending on the compress argument in to_msgpack, to allow portability? Ideally also usable together with compress?
- Could we somehow replicate what monary is doing for BSON in the msgpack serializer, i.e. altogether avoiding conversion to lists and memory copies?

Again, I want to stress I'm just trying to create some discussion here rather than opening a specific change request... Sorry if raising an issue wasn't the correct procedure.