DataFrame with 'list of dicts' behaviour proposal #526

gregglind · 2011-12-22T19:48:24Z

Sketch of proposed behaviour... make 'list of dicts' create a (potentially) 'ragged' array, with autoguessed column names, and sensible default values, when the keys don't exist in all dicts.

Current behaviour:

In [215]: pandas.DataFrame([dict(a=1),dict(a=2)],columns=['a'])
Out[215]:
          a
0  {'a': 1}
1  {'a': 2}

(I happen to find this very surprising/useless behaviour!)

(one) Proposed behaviour...

print DataFrame2([dict(a=1,c=1,d=True),dict(b=2,c='abc')])
     a    c     d
0    1    1  True
1  NaN  abc   NaN

I have a straw implementation at: https://gist.github.com/1511578

(there is a lot to comment on!... should it use the set of keys? Do we need more args? Documentation? Is this just a recipe?)

wesm · 2011-12-22T20:07:17Z

Let me guess...you've got lists of JSON objects? =P

this works for example:

In [4]: DataFrame.from_dict(dict(zip(range(2), [dict(a=1,c=1,d=True),dict(b=2,c='abc')])), orient='index')
Out[4]: 
   a    b    c    d  
0  1    NaN  1    1  
1  NaN  2    abc  NaN

but I agree with you that the constructor should be able to figure out a list of dicts without having to type so much. I'll look at your impl and cook up something similar / fast as possible.

gregglind · 2011-12-22T20:19:42Z

My data does mostly come in from JSON, and I have to transform it. I
wrote a mutant jsonpath / jquery sort of way to 'flatten' out
json/mongo structures into 'table-ish' (row-column) things, which is
already gross enough! My current temptation is to use R (because, for
the life of me, I don't grok numpy indexing / slicing), but pandas
DataFrame feels right :)

Eventually, I want to make my exploratory stuff as simple as possible,
as described in previous rants!

Note: If the default 'use all columns that appear in any' is desired
(which feels 'more right' to me):

        from itertools import chain
        columns = sorted(
            set(chain(*(x.keys() for x in data)))
        )
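For readers following along, here is a minimal pure-Python sketch of how that union of keys could be used to flatten a ragged list of dicts into rows. The helper name `ragged_to_rows` and its `default` parameter are made up for illustration; this is not the gist's code and not what pandas does internally.

```python
from itertools import chain


def ragged_to_rows(data, default=None):
    """Return (columns, rows) for a list of dicts whose keys may differ.

    Hypothetical helper, for illustration only.
    """
    # Union of every key that appears in any dict, sorted for a stable order.
    columns = sorted(set(chain(*(x.keys() for x in data))))
    # dict.get supplies the default wherever a dict lacks a key.
    rows = [[x.get(col, default) for col in columns] for x in data]
    return columns, rows


columns, rows = ragged_to_rows([dict(a=1, c=1, d=True), dict(b=2, c='abc')])
print(columns)  # ['a', 'b', 'c', 'd']
print(rows)     # [[1, None, 1, True], [None, 2, 'abc', None]]
```

Sorting the key set gives a deterministic column order, at the cost of discarding the order in which keys first appeared.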

It's worth thinking about whether this is something you want to actually
include. I think this is fixable by patching only
pandas/core/frame.py:DataFrame, docs, and tests. That
whole set of code could stand an interface / behavior / documentation
review. Lots of isinstance and other hidden assumptions (like
tending to privilege whatever goes on in row[0]).

Thanks for reviewing the idea! (and sorry that the iget is so
gross! It should have to hide in utils as punishment)

GL

wesm · 2011-12-22T20:32:41Z

I think 'use all columns that appear in any' is the right default behavior unless a set of columns is explicitly passed (in which case obviously just use those). This would probably also be a good time to review all the dict-creation routines and set up some vbench action for them too (http://pandas.sourceforge.net/vbench.html). I'm kind of performance obsessed (!) if that hasn't come through yet, so I suspect I can come up with a pretty performant way of processing the data into the right form.
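A short sketch of the two cases described above, assuming a current pandas release in which the constructor accepts a list of dicts directly. The variable names are illustrative; column order in the default case is not asserted here because it has varied between versions.

```python
import pandas as pd

records = [dict(a=1, c=1, d=True), dict(b=2, c='abc')]

# No columns passed: every key seen in any dict becomes a column,
# and missing entries are filled with NaN.
df_all = pd.DataFrame(records)
print(sorted(df_all.columns))        # ['a', 'b', 'c', 'd']

# Columns passed explicitly: only those are used, in the order given.
df_some = pd.DataFrame(records, columns=['c', 'a'])
print(list(df_some.columns))         # ['c', 'a']
print(df_some['a'].isna().tolist())  # [False, True]
```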

As far as giving privilege to the first element of a list...well, if a user passes a list of differently-typed objects, that is most likely going to blow up. In practice that is pretty rare so I'm willing to live with it.

gregglind · 2011-12-22T20:39:17Z

Let me know if you want design or code review on any of it! I will be
posting my jsonpath-ish stuff soon, which is allied to pandas.

(eventually, I want to write bridge code to use DataFrames in orange as well)

GL

wesm · 2011-12-22T20:59:25Z

Cool. I think that would be very valuable (on both fronts). I'd be happy to have JSON-related tools in pandas; I'm eventually going to need to write up DataFrame with JS data visualization in the browser.

wesm · 2011-12-23T00:41:17Z

I implemented this in the above commit. I guess you piqued my interest :) btw the implementation (utilizing Cython routines) above is roughly 6x faster than the one in the gist above. The Cython routine I have that implements

from itertools import chain
columns = sorted(
    set(chain(*(x.keys() for x in data)))
)

beats it by about 35%. Though I do love the simple elegance of itertools and generators
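As an aside for anyone staying in pure Python: `chain(*gen)` has to unpack the whole generator into an argument tuple before chaining starts, while `chain.from_iterable` pulls one item at a time. The sketch below uses that to accept any iterable of dicts, including a generator. The function name `union_of_keys` is hypothetical, this is not the Cython routine mentioned above, and no timing claim is made for it.

```python
from itertools import chain


def union_of_keys(dicts):
    """Sorted union of keys over any iterable of dicts (hypothetical helper)."""
    # Iterating a dict yields its keys, so each dict can be chained directly.
    # from_iterable consumes `dicts` lazily instead of unpacking it up front.
    return sorted(set(chain.from_iterable(dicts)))


data = [dict(a=1, c=1, d=True), dict(b=2, c='abc')]
print(union_of_keys(d for d in data))  # ['a', 'b', 'c', 'd']
```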

wesm added a commit that referenced this issue Dec 23, 2011

ENH: can pass list of dicts to DataFrame constructor, support Cython …

5a38dca

…code, #526

wesm closed this as completed Dec 23, 2011

wesm mentioned this issue Dec 23, 2011

fast_unique_multiple that consumes generator, smaller memory footprint #530

Closed
