DataFrame with 'list of dicts' behaviour proposal #526

Closed
gregglind opened this issue Dec 22, 2011 · 6 comments

@gregglind
Contributor

Sketch of proposed behaviour: passing a 'list of dicts' should create a (potentially) 'ragged' DataFrame, with auto-guessed column names and sensible default values for keys that don't exist in all dicts.

Current behaviour:

In [215]: pandas.DataFrame([dict(a=1),dict(a=2)],columns=['a'])
Out[215]:
          a
0  {'a': 1}
1  {'a': 2}

(I happen to find this very surprising/useless behaviour!)

(one) Proposed behaviour...

print DataFrame2([dict(a=1,c=1,d=True),dict(b=2,c='abc')])
     a    c     d
0    1    1  True
1  NaN  abc   NaN

I have a straw implementation at: https://gist.github.com/1511578

(There is a lot to comment on! Should it use the set of keys? Do we need more args? Documentation? Is this just a recipe?)
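
As an editorial illustration of the "which keys?" question, here is a minimal pure-Python sketch. The helper name `rows_from_records` is hypothetical, and taking the columns from the first dict is an assumption chosen only because it matches the printed output above; the gist itself may work differently.

```python
import math


def rows_from_records(records, columns=None, fill=math.nan):
    """Hypothetical helper: turn a list of dicts into (columns, rows)."""
    if columns is None:
        # Assumption: guess the columns from the first record only.
        columns = sorted(records[0]) if records else []
    # Keys missing from a record are replaced by the fill value.
    rows = [[record.get(col, fill) for col in columns] for record in records]
    return columns, rows


columns, rows = rows_from_records([dict(a=1, c=1, d=True), dict(b=2, c="abc")])
print(columns)  # ['a', 'c', 'd']
print(rows[0])  # [1, 1, True]
```

Note that with this strategy the key `b`, which appears only in the second dict, is silently dropped. That is the trade-off behind asking whether the set of all keys should be used instead.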

@wesm
Member

wesm commented Dec 22, 2011

Let me guess...you've got lists of JSON objects? =P

This works, for example:

In [4]: DataFrame.from_dict(dict(zip(range(2), [dict(a=1,c=1,d=True),dict(b=2,c='abc')])), orient='index')
Out[4]: 
   a    b    c    d  
0  1    NaN  1    1  
1  NaN  2    abc  NaN

but I agree with you that the constructor should be able to figure out a list of dicts without having to type so much. I'll look at your impl and cook up something similar and as fast as possible.
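
For readers on a current pandas release, the workaround and the requested constructor behaviour can be compared side by side. This is a hedged sketch, not a record of how pandas behaved in 2011: column order and dtypes may differ between versions, so the example only inspects the sorted column names.

```python
import pandas as pd

records = [dict(a=1, c=1, d=True), dict(b=2, c="abc")]

# The workaround from the comment: key each record by its row label.
via_from_dict = pd.DataFrame.from_dict(dict(enumerate(records)), orient="index")

# In current pandas releases the constructor accepts the list directly.
direct = pd.DataFrame(records)

print(sorted(via_from_dict.columns))  # ['a', 'b', 'c', 'd']
print(sorted(direct.columns))         # ['a', 'b', 'c', 'd']
```

Here `dict(enumerate(records))` is simply a shorter spelling of `dict(zip(range(2), records))`.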

@gregglind
Contributor Author

My data does mostly come in from JSON, and I have to transform it. I
wrote a mutant jsonpath / jquery sort of way to 'flatten' out
json/mongo structures into 'table-ish' (row-column) things, which is
already gross enough! My current temptation is to use R (because, for
the life of me, I don't grok numpy indexing / slicing), but pandas
DataFrame feels right :)

Eventually, I want to make my exploratory stuff as simple as possible,
as described in previous rants!

Note: If the default 'use all columns that appear in any' is desired
(which feels 'more right' to me):

        from itertools import chain
        columns = sorted(
            set(chain(*(x.keys() for x in data)))
        )
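
To see what this snippet produces, here is a self-contained run on the two example dicts from the top of the thread. The `data` list is filled in purely for illustration. `chain.from_iterable` is an equivalent spelling that avoids unpacking the generator into arguments.

```python
from itertools import chain

# Example records reused from the proposal at the top of the thread.
data = [dict(a=1, c=1, d=True), dict(b=2, c="abc")]

# Union of every key that appears in any record, in sorted order.
columns = sorted(set(chain(*(x.keys() for x in data))))

# Equivalent spelling that does not unpack the generator into arguments.
columns_alt = sorted(set(chain.from_iterable(data)))

print(columns)      # ['a', 'b', 'c', 'd']
print(columns_alt)  # ['a', 'b', 'c', 'd']
```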

It's worth thinking about whether this is something you actually want
to include. I think this is fixable by patching only
pandas/core/frame.py:DataFrame, docs, and tests. That
whole set of code could stand an interface / behavior / documentation
review. Lots of isinstance and other hidden assumptions (like
tending to privilege whatever goes on in row[0]).

Thanks for reviewing the idea! (and sorry that the iget is so
gross! It should have to hide in utils as punishment)

GL

@wesm
Member

wesm commented Dec 22, 2011

I think 'use all columns that appear in any' is the right default behavior unless a set of columns is explicitly passed (in which case obviously just use those). This would probably also be a good time to review all the dict-creation routines and set up some vbench action for them too (http://pandas.sourceforge.net/vbench.html). I'm kind of performance obsessed (!) if that hasn't come through yet, so I suspect I can come up with a pretty performant way of processing the data into the right form.

As far as giving privilege to the first element of a list...well, if a user passes a list of differently-typed objects, that is most likely going to blow up. In practice that is pretty rare so I'm willing to live with it.
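
A small sketch of the rule described in this comment, under stated assumptions: `resolve_columns` is a hypothetical name, not a pandas function. It only mirrors the described behaviour, in which explicit columns win, the union of all keys is the default, and only the first element's type is inspected.

```python
def resolve_columns(data, columns=None):
    """Hypothetical helper, not part of pandas."""
    if columns is not None:
        # Explicitly passed columns are used as given.
        return list(columns)
    if data and isinstance(data[0], dict):
        # Only the first element's type is checked; the rest are trusted.
        return sorted(set().union(*data))
    return []


records = [dict(a=1, c=1, d=True), dict(b=2, c="abc")]
print(resolve_columns(records))                      # ['a', 'b', 'c', 'd']
print(resolve_columns(records, columns=["c", "a"]))  # ['c', 'a']

try:
    # A differently-typed element slips past the first-element check.
    resolve_columns([dict(a=1), 5])
except TypeError as exc:
    print("blows up:", exc)
```

The last call shows the failure mode accepted above: a list of mixed types passes the first-element check and then fails when the non-dict element is reached.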

@gregglind
Contributor Author

Let me know if you want design or code review on any of it! I will be
posting my jsonpath-ish stuff soon, which is allied to pandas.

(eventually, I want to write bridge code to use DataFrames in orange as well)

GL

@wesm
Member

wesm commented Dec 22, 2011

Cool. I think that would be very valuable (on both fronts). I'd be happy to have JSON-related tools in pandas; I'm eventually going to need to write up DataFrame with JS data visualization in the browser.

@wesm
Member

wesm commented Dec 23, 2011

I implemented this in the above commit. I guess you piqued my interest :) By the way, the implementation above (utilizing Cython routines) is roughly 6x faster than the one in the gist. The Cython routine I have that implements

from itertools import chain
columns = sorted(
    set(chain(*(x.keys() for x in data)))
)

beats it by about 35%. Though I do love the simple elegance of itertools and generators.

2 participants