Add JSON export option for DataFrame #631

Closed
wesm opened this issue Jan 14, 2012 · 22 comments

@wesm
Member

wesm commented Jan 14, 2012

No description provided.

@aman-thakral
Contributor

I actually need to do this for a current project I'm working on. I'll get started on tackling this if it is an open issue. I will probably use the gviz API as a reference (http://code.google.com/p/google-visualization-python/).

@wesm
Member Author

wesm commented Jan 18, 2012

By all means go right ahead. @mikedewar may also be interested for his project https://github.com/mikedewar/D3py

@mikedewar

Would be happy to see this exist! In fact I made a gist a while ago to do it:

https://gist.github.com/1486027

Please feel free to use as a starting point! Probably could do with a bit more consideration in terms of multiple levels of keys and other stuff about data frames that I don't know about yet.

@wesm
Member Author

wesm commented Jan 18, 2012

Now if we want to be truly hardcore (and why wouldn't we be?), we should fork UltraJSON and make it DataFrame-specific to get the best performance.

@aman-thakral
Contributor

An interesting idea. I'll have to examine the code, although my experience with C is somewhat limited. I may need to do some serious review, but it will be excellent practice nonetheless. Also, I had a look at the google-visualization-python API and I like the use of a "table description" that you can pass to it to define the desired structure of the JSON string. This provides a great deal of flexibility that would be really useful, and would make using the string in something like Google Charts really easy.

@Komnomnomnom
Contributor

Hi all,

I've done some preliminary work in this direction. In my fork of ujson I've added some basic support for numpy. Right now it just handles some of the basic numpy scalars and 1D arrays. The implementation isn't perfect (I'm a bit concerned with casting everything) but it seems to work ok. The goal is to eventually add support for numpy N-dimensional arrays (possibly with a max limit on N) and pandas data types, specifically Series and DataFrame.

It's my first time dealing with the Python and Numpy C-APIs so any comments are welcome!

Komnomnomnom/ultrajson@511ec03

@Komnomnomnom
Contributor

Encoding support for DataFrame, Series and Index is now committed, as well as proper support for encoding numpy arrays. Still not sure how to properly handle decoding, right now I'm just passing the decoded dict / list to the relevant data-type's constructor.

I decided to encode the DataFrame index and column labels separately (it suits my purposes and I think it's more efficient to work on the underlying numpy arrays). So you end up with something like:

>>> df = DataFrame([[1,2,3], [4,5,6]], columns=['x', 'y', 'z'], index=['a', 'b'])
>>> ujson.encode(df)
'{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'
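The layout above can be reproduced with the standard library alone. This is a minimal sketch, not the C implementation in the ujson fork: `encode_split` is a hypothetical helper, and the table is passed as plain Python lists rather than a DataFrame.

```python
import json


def encode_split(columns, index, data):
    """Serialize labels and values as three separate JSON arrays.

    Hypothetical helper for illustration; it mirrors the layout shown
    above but works on plain lists instead of a DataFrame.
    """
    payload = {"columns": columns, "index": index, "data": data}
    # Compact separators match the whitespace-free output shown above.
    return json.dumps(payload, separators=(",", ":"))


encoded = encode_split(["x", "y", "z"], ["a", "b"], [[1, 2, 3], [4, 5, 6]])
print(encoded)
```

Keeping the labels apart from the values means each label is written once, rather than once per cell.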

@wesm
Member Author

wesm commented Mar 28, 2012

I think what's needed for @mikedewar's needs and others would be:

'[{"a":1,"b":2,"c":3}, {"a":4,"b":5,"c":6}]'

when you deserialize that and pass it to DataFrame, you get back the same DataFrame:

In [3]: DataFrame(json.loads('[{"a":1,"b":2,"c":3}, {"a":4,"b":5,"c":6}]'))
Out[3]: 
   a  b  c
0  1  2  3
1  4  5  6

However, this doesn't give you the row index, but that's not a big deal for the particular use case (feeding a DataFrame into d3 or something else)
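To make the trade-off concrete, here is a rough sketch of a records round trip using only the standard library. `to_records` and `from_records` are hypothetical names, and the table is modelled as plain lists; the point is that the row labels never reach the JSON text, so the decoder can only fall back to positional labels.

```python
import json


def to_records(columns, data):
    """Turn row lists into one JSON object per row (row labels are dropped)."""
    return json.dumps([dict(zip(columns, row)) for row in data])


def from_records(text):
    """Rebuild columns and rows; the index falls back to 0..n-1."""
    records = json.loads(text)
    columns = list(records[0]) if records else []
    data = [[rec[col] for col in columns] for rec in records]
    index = list(range(len(records)))
    return columns, index, data


text = to_records(["a", "b", "c"], [[1, 2, 3], [4, 5, 6]])
columns, index, data = from_records(text)
print(text)
print(index)
```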

@Komnomnomnom
Contributor

Ok, I was initially going to match the output of the to_dict() method but preferred the output above for my purposes. Note you can still recreate the DataFrame using:

>>> DataFrame(**ujson.loads('{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'))
   x  y  z
a  1  2  3
b  4  5  6
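The round trip above works because the JSON keys match the constructor's keyword names, so `**` passes each decoded value by name. A small sketch of the same mechanism, with a hypothetical `Table` dataclass standing in for DataFrame and the standard `json` module standing in for ujson:

```python
import json
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical stand-in for DataFrame with matching keyword names."""

    data: list
    index: list
    columns: list


text = '{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'
# json.loads returns a dict whose keys line up with the constructor's
# keyword arguments, so ** unpacking passes each one by name.
table = Table(**json.loads(text))
print(table)
```

The order of the keys in the JSON text does not matter here, since the arguments are matched by name rather than by position.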

That said I don't think it would be too difficult to add an option to produce output like you mentioned. How about a labelled option where the output would be identical to the to_dict() method. e.g.

>>> df = DataFrame([[1,2,3], [4,5,6]], columns=['x', 'y', 'z'], index=['a', 'b'])
>>> ujson.dumps(df, labelled=True)
'{"x":{"a":1,"b":4},"y":{"a":2,"b":5},"z":{"a":3,"b":6}}'

Or is it absolutely necessary to suppress the index labels?

@wesm
Member Author

wesm commented Mar 28, 2012

I'm thinking it might be preferable to ship the relevant ultrajson code in pandas and use it to implement Series.to_json and DataFrame.to_json. But having multiple output options makes sense, including the "records format" where the index is ignored, or could be put in each JSON object in the list
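The two variants of the records format mentioned here could look roughly like the following. This is only a sketch under assumptions: `to_records` and its `index_key` parameter are hypothetical and are not part of ujson or pandas.

```python
import json


def to_records(columns, index, data, index_key=None):
    """Encode rows as a list of JSON objects.

    index_key is a hypothetical parameter: when given, each row's label
    is stored under that key; when None, the labels are dropped.
    """
    records = []
    for label, row in zip(index, data):
        record = dict(zip(columns, row))
        if index_key is not None:
            record[index_key] = label
        records.append(record)
    return json.dumps(records, separators=(",", ":"))


columns, index, data = ["x", "y", "z"], ["a", "b"], [[1, 2, 3], [4, 5, 6]]
without_index = to_records(columns, index, data)
with_index = to_records(columns, index, data, index_key="index")
print(without_index)
print(with_index)
```

One caveat with the second variant: a column that happens to share the name given as `index_key` would be overwritten by the row label.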

@Komnomnomnom
Contributor

Agreed, it would make sense for it to be included in pandas.

I think all the ujson code is required (as it will still have to deal with basic types), albeit tailored for numpy and pandas types. I can fork and attempt to introduce it into pandas if you point me in the right direction. Ujson is composed of several different C files, and I'm not sure where to put them or how to include them in the build process.

@wesm
Member Author

wesm commented Mar 28, 2012

You would want to put it in a subdirectory of pandas/src and co-opt the extension configuration from the UltraJSON setup.py file

@Komnomnomnom
Contributor

I've finally got around to revisiting this. I've added support to my fork of ujson for different output formats when encoding pandas data types:

In [4]: df = DataFrame([[1,2,3], [4,5,6]], index=['a', 'b'], columns=['x', 'y', 'z']) 

In [5]: ujson.encode(df, format="headers")
Out[5]: '{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'

In [6]: ujson.encode(df, format="records")
Out[6]: '[{"x":1,"y":2,"z":3},{"x":4,"y":5,"z":6}]'

In [7]: ujson.encode(df, format="indexed")
Out[7]: '{"a":{"x":1,"y":2,"z":3},"b":{"x":4,"y":5,"z":6}}'

In [8]: ujson.encode(df, format="column_indexed")
Out[8]: '{"x":{"a":1,"b":4},"y":{"a":2,"b":5},"z":{"a":3,"b":6}}'

If format isn't specified encoding defaults to the column_indexed format as it matches the output of to_dict() and it can be given straight to the DataFrame constructor. All of the encoding / iteration is performed in ujson in C.
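For readers who want to see how the four layouts relate, the sketch below reproduces them in pure Python with a name-to-builder lookup and the same default. It is an illustration only: the fork does this work in C, and `encode`, `FORMATS`, and the builder functions are hypothetical names operating on plain lists.

```python
import json


def _headers(columns, index, data):
    return {"columns": columns, "index": index, "data": data}


def _records(columns, index, data):
    return [dict(zip(columns, row)) for row in data]


def _indexed(columns, index, data):
    # Outer keys are row labels, inner keys are column labels.
    return {label: dict(zip(columns, row)) for label, row in zip(index, data)}


def _column_indexed(columns, index, data):
    # Outer keys are column labels, inner keys are row labels.
    return {
        col: {label: row[pos] for label, row in zip(index, data)}
        for pos, col in enumerate(columns)
    }


FORMATS = {
    "headers": _headers,
    "records": _records,
    "indexed": _indexed,
    "column_indexed": _column_indexed,
}


def encode(columns, index, data, format="column_indexed"):
    """Encode a table in the named layout (keyword name mirrors the thread)."""
    try:
        build = FORMATS[format]
    except KeyError:
        raise ValueError(f"unknown format: {format!r}") from None
    return json.dumps(build(columns, index, data), separators=(",", ":"))


table = (["x", "y", "z"], ["a", "b"], [[1, 2, 3], [4, 5, 6]])
default_output = encode(*table)
indexed_output = encode(*table, format="indexed")
print(default_output)
print(indexed_output)
```

The "indexed" and "column_indexed" layouts are transposes of each other: the same cells, with the nesting order of row and column labels swapped.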

I've added similar support for Series and Index (although some of the formats don't suit them it tries to handle them sensibly)

In [9]: s = Series([10, 20, 30, 40, 50, 60], name="myseries", index=[6,7,8,9,10,15])

In [10]: ujson.encode(s, format="headers")
Out[10]: '{"name":"myseries","index":[6,7,8,9,10,15],"data":[10,20,30,40,50,60]}'

In [11]: ujson.encode(s, format="records")
Out[11]: '[10,20,30,40,50,60]'

In [12]: ujson.encode(s, format="indexed")
Out[12]: '{"6":10,"7":20,"8":30,"9":40,"10":50,"15":60}'

In [13]: ujson.encode(s, format="column_indexed")
Out[13]: '{"6":10,"7":20,"8":30,"9":40,"10":50,"15":60}'

In [14]: i = Index([23, 45, 18, 98, 43, 11], name="myindex")

In [15]: ujson.encode(i, format="headers")
Out[15]: '{"name":"myindex","data":[23,45,18,98,43,11]}'

In [16]: ujson.encode(i, format="records")
Out[16]: '[23,45,18,98,43,11]'

In [17]: ujson.encode(i, format="indexed")
Out[17]: '[23,45,18,98,43,11]'

In [18]: ujson.encode(i, format="column_indexed")
Out[18]: '[23,45,18,98,43,11]'
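One detail worth noting in the "indexed" Series output above is that the integer labels appear as quoted strings, because JSON object keys must be strings. The sketch below shows the same effect with the standard `json` module (used here in place of ujson) and one way a caller might restore the integer labels after decoding.

```python
import json

index = [6, 7, 8, 9, 10, 15]
values = [10, 20, 30, 40, 50, 60]

# JSON object keys are always strings, so the integer labels are
# converted on the way out ...
encoded = json.dumps(dict(zip(index, values)), separators=(",", ":"))

# ... and come back as strings unless the caller converts them again.
decoded = json.loads(encoded)
restored = {int(label): value for label, value in decoded.items()}

print(encoded)
print(list(decoded))
print(list(restored))
```

The "headers" layout avoids this conversion, since the labels travel in a JSON array and keep their numeric type.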

My next step is to integrate this into pandas but I'd welcome any comments. Are there values for the format argument that would fit better with existing pandas code?

@wesm
Member Author

wesm commented May 3, 2012

Hm, I'll think about the API. What you propose looks pretty good and you could just go for that for now, adding a to_json method to Series and DataFrame. I think it would make sense to ship a pared down version of ujson in pandas (and have lots of tests, of course). Could put the source code in pandas.io or somewhere like that.

@Komnomnomnom
Contributor

ujson is pure C, with no Python files except for setup.py and some test classes. I think all of it is required though (apart from its test code and metafiles) so it can properly handle whatever type happens to be in the DataFrame etc.

@wesm
Member Author

wesm commented May 3, 2012

Right, so you would just need to set it up to build as a submodule inside pandas and wire it up with the new object instance methods, and write appropriate tests. If you do some of the heavy lifting to set this up and make a pull request I can integrate and round things out in a few weeks

@Komnomnomnom
Contributor

Hi Wes,

I've improved the performance a bit and made some other tweaks and improvements, most notably I've added support for direct decoding to numpy arrays which gets rid of the list to numpy array conversion step.

I've updated the README on my fork with more information and some simple benchmarks, https://github.com/Komnomnomnom/ultrajson. Although there were a couple of surprises I'm pretty happy with the overall performance.

Integrating with pandas and the pandas build was a lot more straightforward than I expected. I should send through a pull request later on today (I'll attach it to this issue if I can).

Oh, and I've changed the format argument to 'orient', which seems to fit better with other DataFrame methods; 'format' also clashes with a Python built-in. I also added the 'values' format, which only encodes the DataFrame values array, ignoring column and index labels.
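A rough sketch of what the renamed keyword and the 'values' layout imply, using plain lists and the standard `json` module. `to_json` here is a hypothetical free function, and "headers" is carried over from the earlier comment as an assumption; the thread does not say whether the other format names were kept.

```python
import json


def to_json(columns, index, data, orient="headers"):
    """Hypothetical encoder using the keyword name adopted in the thread.

    Only two orientations are sketched. "values" comes from the comment
    above; "headers" is assumed to be unchanged from the earlier proposal.
    """
    if orient == "values":
        # Labels are ignored: only the 2-D values array is emitted.
        payload = data
    elif orient == "headers":
        payload = {"columns": columns, "index": index, "data": data}
    else:
        raise ValueError(f"unsupported orient: {orient!r}")
    return json.dumps(payload, separators=(",", ":"))


table = (["x", "y", "z"], ["a", "b"], [[1, 2, 3], [4, 5, 6]])
values_only = to_json(*table, orient="values")
print(values_only)
```

Naming the keyword `orient` also leaves the built-in `format()` function usable inside the encoder without shadowing.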

@wesm
Member Author

wesm commented May 25, 2012

Addressed by #1263, #1309

@wesm wesm closed this as completed May 25, 2012
@PhE

PhE commented Aug 16, 2012

All issues related to DataFrame.to_json() seem to be closed, but in version 0.8.1 there is no DataFrame.to_json() method.
Has this feature been released?

@changhiskhan
Contributor

It's not part of pandas for now due to issues with MinGW. It's in a separate project for now and we will revisit this issue when we can.
Thanks.

@PhE

PhE commented Aug 16, 2012

MinGW issues are Windows related, I suppose.
I'm on Linux; can you point me to the project/branch? (I am not a git/GitHub master.)
Thanks.

@wesm
Member Author

wesm commented Aug 18, 2012

It's pydata/pandasjson

dan-nadler pushed a commit to dan-nadler/pandas that referenced this issue Sep 23, 2019
* Update numpy_arrays.py

* test for fix

* update changelog

* Remove circle build status since arctic still doesnt support circle 2.0