
CLN: Refactor *SON RFC #9166


Closed
lJoublanc opened this issue Dec 29, 2014 · 7 comments
Labels
API Design · Compat · IO Data · IO JSON · Performance

Comments

@lJoublanc

n.b. I'm looking at how to store a DataFrame in BSON format (see #4329) in MongoDB. This is my attempt to gather all the relevant information before adding yet another *SON serializer. Please don't treat this as a change request - I'm just trying to get my head around all the relevant bits. Maybe skip to the summary and start from there ...

Refactoring *SON to use a common schema

There are a number of similar *SON formats: e.g. JSON, Msgpack (in pandas), BSON (not in pandas), and the recently announced JBSON. They have some common components:

  1. Conversion of a Python object graph into document(s) (think dict) made up only of JSON/Msgpack/BSON primitives. You can think of this in terms of a schema, as discussed in ENH: support for msgpack serialization/deserialization #3525, WIP: Support metadata at de/serialization time #3297, Allow custom metadata to be attached to panel/df/series? #2485. In an ideal world, this schema would be shared by all formats.
  2. Serialization into a binary format (Msgpack/BSON only) or text (JSON). This is generally done by an external library, separately for each format (a minimal sketch of this two-stage split follows the list).
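
To make the split concrete, here is a minimal sketch of the two stages. The to_document helper and its keys are invented for illustration - pandas does not expose this split as a public API:

# Minimal sketch of the two-stage *SON pipeline described above.
# `to_document` and its keys are hypothetical, not a pandas API.
import json

import pandas as pd


def to_document(df):
    """Stage 1: reduce a DataFrame to a dict of *SON-friendly primitives."""
    return {
        "typ": "dataframe",
        "columns": list(df.columns),
        "index": [str(i) for i in df.index],
        "data": df.values.tolist(),
    }


def to_json_text(df):
    """Stage 2 (text): hand the intermediate document to the stdlib json encoder."""
    return json.dumps(to_document(df))


def to_msgpack_bytes(df):
    """Stage 2 (binary): same intermediate document, different external encoder."""
    import msgpack  # one external library per format
    return msgpack.packb(to_document(df), use_bin_type=True)


df = pd.DataFrame({"bid": [0.56, 0.40], "offer": [0.40, 0.90]})
print(to_json_text(df))

The point is only that stage 1 could in principle be shared, while stage 2 stays format-specific.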

Currently, I don't think any code is shared between JSON and Msgpack. The schema used by Msgpack is entirely different from JSON's. I'm guessing this is by design: JSON has been kept as human-readable as possible, whereas Msgpack is more focused on being a serialization format. To illustrate this, here are two examples. First, a simple DataFrame:

                          bid     offer
2014-12-15 10:07:00  0.562804  0.398798
2014-12-15 10:07:01  0.399399  0.896809
2014-12-15 10:07:02  0.747191  0.098048

produces the following document before serialization in msgpack:

{'axes': [{'data': ['bid', 'offer'],
   'dtype': 17,
   'klass': 'Index',
   'name': None,
   'typ': 'index'},
  {'data': '...',
   'dtype': 21,
   'freq': 'S',
   'klass': 'DatetimeIndex',
   'name': None,
   'typ': 'datetime_index',
   'tz': None}],
 'blocks': [{'compress': 'blosc',
   'dtype': 12,
   'items': {'data': ['bid', 'offer'],
    'dtype': 17,
    'klass': 'Index',
    'name': None,
    'typ': 'index'},
   'klass': 'FloatBlock',
   'shape': (2, 3),
   'values': '...'}],
 'klass': 'DataFrame',
 'typ': 'block_manager'}

I produced this by calling encode recursively in packers.py, replicating what pack does. There have been plenty of discussions regarding storage of metadata in #3525 and #3297. Now the result of calling to_json():

'{"bid":{"1418638020000":0.5628044127,"1418638021000":0.3993987818,"1418638022000":0.7471914537},"offer":{"1418638020000":0.398797779,"1418638021000":0.8968090851,"1418638022000":0.0980482752}}'

Both appear to use blocking (see #9130). Also, as described in #3525, dates are stored as integers (in strings) for performance reasons.

The second example uses a MultiIndex:

                       bid                                ask                \
                      95.0   95.5   96.0   96.5   97.0   97.5   98.0   98.5   
2014-12-15 10:07:00  25030  16800  42580  58560  75110  10400  89240   3990   
2014-12-15 10:07:01  42620  57860  80010  81500  98880  99610  65770  83930   
2014-12-15 10:07:02  94170  98040  74840  41690  48960  76510  88530  48770   
2014-12-15 10:07:03  65090  23700  16390  45700    500  29290  32370  68350  

This kind of works for to_msgpack() but not for to_json(), which throws an exception. The msgpack document:

{'axes': [{'data': [('bid', 95.0),
    ('bid', 95.5),
    ('bid', 96.0),
    ...
    ('ask', 99.5)],
   'dtype': 17,
   'klass': 'MultiIndex',
   'names': FrozenList([None, None]),
   'typ': 'multi_index'},
  {'data': '...',
   'dtype': 21,
   'freq': 'S',
   'klass': 'DatetimeIndex',
   'name': None,
   'typ': 'datetime_index',
   'tz': None}],
 'blocks': [{'compress': 'blosc',
   'dtype': 12,
   'items': {'data': [('bid', 95.0),
     ('bid', 95.5),
     ('bid', 96.0), 
     ...
     ('ask', 99.5)],
    'dtype': 17,
    'klass': 'MultiIndex',
    'names': FrozenList([None, None]),
    'typ': 'multi_index'},
   'klass': 'FloatBlock',
   'shape': (10, 4),
   'values': '...'}],
 'klass': 'DataFrame',
 'typ': 'block_manager'}
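
A rough reproduction of this second example, again assuming a pandas version of this era (the exact exception from to_json depends on the version):

# MultiIndex columns: to_msgpack packs them via its 'multi_index' schema,
# while to_json has no JSON object-key representation for labels like
# ('bid', 95.0) and raises.
import numpy as np
import pandas as pd

prices = [95.0, 95.5, 96.0, 96.5, 97.0]
cols = pd.MultiIndex.from_product([["bid", "ask"], prices])
idx = pd.date_range("2014-12-15 10:07:00", periods=4, freq="S")
book = pd.DataFrame(np.random.randint(0, 100000, (4, 10)),
                    index=idx, columns=cols)

packed = book.to_msgpack()   # works
try:
    book.to_json()           # fails, as noted above
except Exception as exc:
    print("to_json failed:", exc)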

It would be nice to expose the API of the 'intermediate' representation (i.e. the output of step 1) to advanced users, in order to store the DataFrame across multiple 'documents'. If storing to a file, this doesn't make sense - you want a single document with many embedded documents; this is the JSON paradigm. But if you're storing into a database, you would want the flexibility to store each nested doc/dict in a different collection/table. There are technical reasons for this, e.g. MongoDB's maximum record/document size, but it would generally be driven by a use case.

What would this intermediate document look like? A dictionary containing only Python (and numpy - see below) primitives? Which primitives differ between the *SON formats (e.g. dates)?
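
As a purely hypothetical illustration of the multi-document idea: the document layout below is invented, and only the pymongo calls are real (assuming pymongo 3+ and a local MongoDB instance).

# Hypothetical: store frame metadata in one MongoDB collection and each
# column as its own document in another, instead of one monolithic record.
import numpy as np
import pandas as pd
from pymongo import MongoClient

idx = pd.date_range("2014-12-15 10:07:00", periods=3, freq="S")
df = pd.DataFrame(np.random.rand(3, 2), index=idx, columns=["bid", "offer"])

db = MongoClient()["market_data"]

meta = {
    "typ": "dataframe",
    "columns": list(df.columns),
    "index": [ts.to_pydatetime() for ts in df.index],  # BSON datetimes
}
frame_id = db.frames.insert_one(meta).inserted_id

for col in df.columns:
    db.columns.insert_one({
        "frame_id": frame_id,
        "name": col,
        "values": df[col].tolist(),   # native BSON array, portable
    })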

Serialization of vectorized data

A separate issue, somewhat at odds with the above, is the storage of vectorized data: the underlying numpy array in to_msgpack() is encoded into a string and stored in binary form. While this makes sense for performance reasons (see #3525 (comment)), in my opinion it gives up one of the advantages of using msgpack (or BSON going forward): portability. If you go to http://msgpack.org, you'll see they have APIs for every conceivable language (likewise for MongoDB). Wouldn't it be nice if you could use one of those APIs and load your pandas data in e.g. Java, Haskell or C? At present this is possible, with the exception of the 'values' fields, which you would have to decode by hand. Sure, you could argue for using JSON for portability, but then you're trading off performance.

This is at odds with compression - which we still want. A compromise could be to use the native *SON list type when compress=None (when we assume speed isn't important), and the current solution (encode as a string) when compression is active and speed is important (compressed data can't be handled by the above-mentioned APIs anyway).
Note also, as described in the links in #4329 (comment), BSON/MongoDB appears to have some support for vectorised lists that avoids converting to a list and then serializing each element separately.
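
A small sketch of the two representations side by side, using the standalone msgpack package rather than pandas:

# The portability trade-off discussed above: ndarray values as raw bytes
# (fast, but opaque to other msgpack clients) vs. a native msgpack array
# (readable from any language binding, but slower to encode/decode).
import msgpack
import numpy as np

values = np.random.rand(3, 2)

opaque = msgpack.packb(
    {"dtype": str(values.dtype), "shape": list(values.shape),
     "data": values.tobytes()},           # must be decoded by hand elsewhere
    use_bin_type=True)

portable = msgpack.packb(
    {"dtype": str(values.dtype), "shape": list(values.shape),
     "data": values.ravel().tolist()},    # plain msgpack array of doubles
    use_bin_type=True)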

Summary

In adding yet another *SON serializer, it would not make sense to have a third codebase that handles the issues discussed above in a new way:

  • Would it make sense to combine the existing msgpack/JSON 'packing' code into a standard intermediate 'schema' shared by all *SON formats? Cons: breaks backward compatibility, makes the JSON document much more complex. Pros: supports more complex pandas structures (DataFrame, Panel, sparse DataFrame etc.).
  • Could the code for storing vector data be shared? Could we allow native storage of vector data (e.g. as supported by BSON), for example when the compress argument of to_msgpack isn't used, to allow portability? Ideally also with compress? Could we somehow replicate what monary is doing for BSON in the msgpack serializer, i.e. avoiding conversion to lists and memory copies altogether?

Again, I want to stress that I'm just trying to create some discussion here rather than open a specific change request. Sorry if raising an issue wasn't the correct procedure.

@jreback
Contributor

jreback commented Dec 29, 2014

I think adding a serializer using monary would be fine.

However, trying to combine serializers is not a good idea in general. They are all c/cython based and strive for the best performance possible. No reason to expose the intermediate form at all.

@jreback
Contributor

jreback commented Dec 29, 2014

see also this: #9146

@lJoublanc
Author

The part of the serializer I want to combine is the 'packing' part, so that JSON/BSON/msgpack share the same 'schema'. True, the JSON serializer code is native, but the msgpack code, for example, is still in Python (essentially encode in pandas.io.packers).

@jreback
Contributor

jreback commented Dec 29, 2014

@lJoublanc In theory this might be nice. But from a practical perspective, JSON is an already-established format that is not completely round-trippable (though see the 'schema' in #9146), while msgpack is just a format I made up.

What is the purpose here?

@lJoublanc
Author

The idea is to share the code that generates the schema, i.e. the bit that transforms a Python object graph into a document/dict ready for binary/text serialization. That's point 1 in my monograph :/ In the second example I posted above, with a MultiIndex, to_json() fails (I don't think this is a bug - the JSON converter just doesn't handle it), but to_msgpack() works. Msgpack handles a lot of other type conversions (DatetimeIndex, date types etc.), so why not leverage that to allow to_json (and other JSON-like serializers) to handle other pandas primitives?

@jreback
Contributor

jreback commented Dec 29, 2014

@lJoublanc

The transformation part is a performance issue, e.g. it's the reason why these are implemented in c/cython.

So I would be in favor of doing the following:

  • adopt a 'pandas' schema (a la Roundtrip orient for JSON #9146); it may need some tweaks, but I think it is capable of fully round-tripping all pandas objects
  • implement the schema for JSON (prob most/all will have to be in c, but shouldn't be too difficult). This needs to preserve back-compat, so the parser needs to remain intact.
  • implement the same schema in msgpack (prob easy to have back-compat here). The current format is 'ad-hoc' and is not that general, but close. This should be straightforward.
  • implement the same schema in a new to_bson (possibly using monary or msgpack directly if possible).

This would unify the JSON-like schemas and make them round-trippable as well as making DataFrames (and other pandas objects) easily describable to the world in general.
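
For concreteness, a hypothetical sketch of what such a shared document might look like for the simple bid/offer example above - none of these keys are an actual pandas format:

# Hypothetical shared *SON schema (invented for illustration). The idea is
# that to_json, to_msgpack and a future to_bson would all emit/consume the
# same nested structure, differing only in the final encoding step.
schema_doc = {
    "typ": "dataframe",
    "version": 1,
    "axes": [
        {"typ": "index", "klass": "Index", "name": None,
         "dtype": "object", "data": ["bid", "offer"]},
        {"typ": "datetime_index", "klass": "DatetimeIndex", "name": None,
         "dtype": "datetime64[ns]", "freq": "S", "tz": None,
         "data": [1418638020000, 1418638021000, 1418638022000]},
    ],
    "blocks": [
        {"klass": "FloatBlock", "dtype": "float64", "compress": None,
         "items": ["bid", "offer"],
         "values": [[0.562804, 0.399399, 0.747191],
                    [0.398798, 0.896809, 0.098048]]},
    ],
}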

back-compat is pretty important, so anything that is done should be able to preserve this.

I don't think these are possible in a combined manner in an easy way. You will sacrifice performance if you implement the schema in python. So I think that route is too pie-in-the-sky.

cc @Komnomnomnom

@shoyer @jorisvandenbossche @cpcloud

@jreback added the API Design, Compat, IO Data, Msgpack, IO JSON and Performance labels on Dec 29, 2014
@jreback added this to the Someday milestone on Dec 29, 2014
@jbrockmendel
Member

Closing and adding to tracker issue #30407 for IO format requests; can re-open if interest is expressed.
