ENH: namedtuple's fields as columns #11416

max-sixty · 2015-10-23T00:44:21Z

Resolves #11181

Is this testing OK? Or do we need to test with differing lengths of tuples etc?

shoyer · 2015-10-23T02:18:59Z

pandas/core/frame.py

@@ -261,6 +261,8 @@ def __init__(self, data=None, index=None, columns=None, dtype=None,
                data = list(data)
            if len(data) > 0:
                if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
+                    if hasattr(data[0], '_fields') and columns is None: # is namedtuple


Let's also check for isinstance(data[0], tuple). I can imagine lots of other classes with _fields attributes.

jreback · 2015-10-23T15:58:19Z

pandas/core/common.py

@@ -2672,6 +2672,9 @@ def is_list_like(arg):
     return (hasattr(arg, '__iter__') and
            not isinstance(arg, compat.string_and_binary_types))

+def is_named_tuple(arg):
+    return isinstance(arg, tuple) and hasattr(arg, '_fields')


let's import namedtuple at the top and just check isinstance here?

namedtuple is not a type, so that wouldn't work, unfortunately

FYI: it's a factory function which builds and evals a string to create a class inherited from tuple. Example code here: https://docs.python.org/2/library/collections.html#collections.namedtuple

https://bugs.python.org/issue7796
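To illustrate the point above, here is a minimal sketch (standard library only; the `Point` class is a made-up example) showing that `namedtuple` is a function rather than a type, so it cannot be used as the second argument to `isinstance`:

```python
import inspect
from collections import namedtuple

# Point is a hypothetical example class, not something from the PR.
Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)

# namedtuple itself is a plain function that returns a new class.
namedtuple_is_function = inspect.isfunction(namedtuple)

# The class it returns is a real type and a subclass of tuple.
point_is_class = isinstance(Point, type)
point_is_tuple_subclass = issubclass(Point, tuple)

# isinstance() needs a type (or tuple of types) as its second argument,
# so passing the factory function raises TypeError.
try:
    isinstance(p, namedtuple)
    isinstance_failed = False
except TypeError:
    isinstance_failed = True

print(namedtuple_is_function, point_is_class,
      point_is_tuple_subclass, isinstance_failed)
```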

ok!

It seems a perfect case for the "duck typing" style of programming. All namedtuple classes:
- inherit from tuple
- have a "_fields" class attribute

These two properties could be the "duck test" for namedtuples, regardless of the actual implementation.
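A small sketch of why both halves of the duck test matter. The `is_named_tuple` function is copied from the diff above; the `Row` class and the use of an `ast` node as a non-tuple object with `_fields` are illustrative assumptions, not part of the PR:

```python
import ast
from collections import namedtuple


def is_named_tuple(arg):
    # Same duck test as in the diff: a tuple instance that exposes _fields.
    return isinstance(arg, tuple) and hasattr(arg, "_fields")


# Row is a hypothetical namedtuple used only for this example.
Row = namedtuple("Row", ["a", "b"])

# ast nodes also carry a _fields attribute but are not tuples, so a bare
# hasattr(obj, '_fields') check would misclassify them.
node = ast.parse("x = 1").body[0]

checks = {
    "namedtuple": is_named_tuple(Row(1, 2)),
    "plain tuple": is_named_tuple((1, 2)),
    "ast node": is_named_tuple(node),
}
print(checks)
```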

Ah cool!
Lots of discussion on those boards about a better way of doing namedtuple. It's not perfect at the moment, but it's very functional (we use them a lot)

cool

we might have named tuples elsewhere that could use the is_ function; can you check?

I had a look - while namedtuple is in half a dozen places, there's nowhere that checks its type.

max-sixty · 2015-10-23T18:04:46Z

@jreback green

jreback · 2015-10-23T18:05:27Z

pandas/tests/test_frame.py

+        from collections import namedtuple
+        named_tuple = namedtuple("Pandas", list('ab'))
+        tuples = [named_tuple(1,3), named_tuple(2,4)]
+        expected = DataFrame({'a':[1,2], 'b':[3,4]})


add a test where you pass columns as well
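A rough sketch of the behaviour such a test would pin down, written without pandas so it runs standalone. `resolve_columns` is a hypothetical helper that mirrors the `columns is None` condition in the diff; it is not the actual constructor code:

```python
from collections import namedtuple


def is_named_tuple(arg):
    return isinstance(arg, tuple) and hasattr(arg, "_fields")


def resolve_columns(data, columns=None):
    # Hypothetical helper: fall back to _fields of the first row only
    # when the caller did not pass columns explicitly.
    if columns is None and len(data) > 0 and is_named_tuple(data[0]):
        return list(data[0]._fields)
    return columns


named_tuple = namedtuple("Pandas", list("ab"))
tuples = [named_tuple(1, 3), named_tuple(2, 4)]

inferred = resolve_columns(tuples)                      # no columns passed
explicit = resolve_columns(tuples, columns=["y", "z"])  # columns passed
plain = resolve_columns([(1, 3), (2, 4)])               # ordinary tuples

print(inferred, explicit, plain)
```

Under this reading, explicitly passed columns always win and plain tuples are left untouched.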

jreback · 2015-10-23T18:06:03Z

couple comments....ping when green

jreback · 2015-10-23T18:59:36Z

ok, ping on green!

max-sixty · 2015-10-23T20:28:59Z

@jreback green

ENH: namedtuple's fields as columns

jreback · 2015-10-23T20:30:37Z

@MaximilianR thanks!

jreback · 2015-10-23T20:31:55Z

pandas/tests/test_frame.py

+        expected = DataFrame({'a': [1, 2], 'b': [3, 4]})
+        result = DataFrame(tuples)
+        assert_frame_equal(result, expected)
+


Actually, I just realized that if you have DIFFERENT namedtuples (e.g. different fields), this code will break. Can you do a PR to assert that in a test? Pretty pathological but possible.

Yes, will do now.

To be clear, is the case you're suggesting: DataFrame receives a list of namedtuples, each with the same number of items, but with different _fields?
The intended outcome here is that it takes _fields from the first. Is that the outcome you want to test for?

I think the correct way would be to compare the columns with each of the namedtuples and raise a ValueError if they differ. This might be expensive, so what I would do instead is keep track of the type of the namedtuple and just compare that. If the types differ, it is easiest to just raise a ValueError (in theory you could just discard the columns at this point, but I think this is an actual error).
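For concreteness, a minimal sketch of the type-tracking check described above. `check_single_namedtuple_type` is a hypothetical name, and this is only one way the idea could be written, not code from the PR:

```python
from collections import namedtuple


def check_single_namedtuple_type(rows):
    # Hypothetical helper: remember the class of the first row and raise
    # ValueError as soon as a row of a different class shows up.
    first_type = type(rows[0])
    for position, row in enumerate(rows):
        if type(row) is not first_type:
            raise ValueError(
                "row %d is %s, expected %s"
                % (position, type(row).__name__, first_type.__name__)
            )
    return first_type._fields


# Two made-up namedtuple classes with the same length but different fields.
AB = namedtuple("AB", list("ab"))
XY = namedtuple("XY", list("xy"))

same_fields = check_single_namedtuple_type([AB(1, 3), AB(2, 4)])

try:
    check_single_namedtuple_type([AB(1, 3), XY(2, 4)])
    mixed_raised = False
except ValueError:
    mixed_raised = True

print(same_fields, mixed_raised)
```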

That's pretty expensive:

    In [1]: from collections import namedtuple
            nt = namedtuple('NT', list('abc'))
            tuples = [nt(0,1,2) for i in range(int(1e7))]

    In [2]: t=tuples[0]
            correct_type=type(t)

    In [3]: %timeit all(type(tup)==correct_type for tup in tuples)
    1 loops, best of 3: 1.16 s per loop

Given this is a 'best efforts' check - i.e. the alternative is 'useless' columns of (0, 1, 2) - is taking the _fields from the first item OK? If the user really cares, she can supply columns, if not, she gets something reasonable...

I bet the == actually is doing a lot of work
use is

very odd that this is slow, though I guess it's a lot of tuples
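A sketch for comparing the two checks outside IPython, using `timeit` from the standard library. The list is shrunk to 100,000 items (an assumption, to keep the run short), and no particular result is claimed since timings vary by machine and Python version:

```python
import timeit
from collections import namedtuple

NT = namedtuple("NT", list("abc"))
# Smaller than the 1e7 rows in the session above, purely to keep this quick.
tuples = [NT(0, 1, 2) for _ in range(100000)]
correct_type = type(tuples[0])


def check_eq():
    # Equality comparison between type objects.
    return all(type(tup) == correct_type for tup in tuples)


def check_is():
    # Identity comparison, as suggested in the comment above.
    return all(type(tup) is correct_type for tup in tuples)


eq_seconds = timeit.timeit(check_eq, number=5)
is_seconds = timeit.timeit(check_is, number=5)
print("==: %.4f s, is: %.4f s" % (eq_seconds, is_seconds))
```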

ok guess just go with first namedtuple then.

What happens if the tuples have different lengths? Does this break (e.g. with the current code in master)? I think yes.

No, you get NaNs, which IIRC is what always happens if you pass lists in with unequal lengths

    In [2]: pd.DataFrame([(1,2,3),(4,5)])
    Out[2]:
       0  1    2
    0  1  2    3
    1  4  5  NaN

If you pass columns, they need to be the max length:

    In [4]: pd.DataFrame([(1,2),(3,4,5)], columns=['a','b'])
    ---------------------------------------------------------------------------
    AssertionError                            Traceback (most recent call last)
    <ipython-input-4-e2f410a01184> in <module>()
    ----> 1 pd.DataFrame([(1,2),(3,4,5)], columns=['a','b'])

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
        262             if len(data) > 0:
        263                 if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
    --> 264                     arrays, columns = _to_arrays(data, columns, dtype=dtype)
        265                     columns = _ensure_index(columns)
        266

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
       5211     if isinstance(data[0], (list, tuple)):
       5212         return _list_to_arrays(data, columns, coerce_float=coerce_float,
    -> 5213                                dtype=dtype)
       5214     elif isinstance(data[0], collections.Mapping):
       5215         return _list_of_dict_to_arrays(data, columns,

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _list_to_arrays(data, columns, coerce_float, dtype)
       5294         content = list(lib.to_object_array(data).T)
       5295     return _convert_object_array(content, columns, dtype=dtype,
    -> 5296                                  coerce_float=coerce_float)
       5297
       5298

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _convert_object_array(content, columns, coerce_float, dtype)
       5352             # caller's responsibility to check for this...
       5353             raise AssertionError('%d columns passed, passed data had %s '
    -> 5354                                  'columns' % (len(columns), len(content)))
       5355
       5356     # provide soft conversion of object dtypes

    AssertionError: 2 columns passed, passed data had 3 columns

ok then
I guess the only issue is if you have different tuples with different columns,
but that's just user error and not worth detecting
ok then
thanks!

shoyer reviewed Oct 23, 2015

max-sixty force-pushed the namedtuple-fields-as-columns branch 5 times, most recently from 9511a7c to 1e90f54 Compare October 23, 2015 14:42

jreback reviewed Oct 23, 2015

jreback added the Compat pandas objects compatability with Numpy or Python functions label Oct 23, 2015

jreback added this to the 0.17.1 milestone Oct 23, 2015

jreback reviewed Oct 23, 2015

max-sixty force-pushed the namedtuple-fields-as-columns branch from 1e90f54 to b45e3a2 Compare October 23, 2015 18:20

use a namedtuple's fields as column names in df constructor

0e1da54

max-sixty force-pushed the namedtuple-fields-as-columns branch from b45e3a2 to 0e1da54 Compare October 23, 2015 19:35

jreback added a commit that referenced this pull request Oct 23, 2015

Merge pull request #11416 from SixtyCapital/namedtuple-fields-as-columns

37a80bc

ENH: namedtuple's fields as columns

jreback merged commit 37a80bc into pandas-dev:master Oct 23, 2015

max-sixty deleted the namedtuple-fields-as-columns branch October 23, 2015 20:30

jreback reviewed Oct 23, 2015

This was referenced Jul 9, 2019

ENH: Preserve key order when passing list of dicts to DataFrame on py 3.6+ #27309

Merged

BUG?: namedtuples fields not checked on DataFrame constructor #27329

Open
