ENH: namedtuple's fields as columns #11416
@@ -261,6 +261,8 @@ def __init__(self, data=None, index=None, columns=None, dtype=None,
        data = list(data)
    if len(data) > 0:
        if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
            if hasattr(data[0], '_fields') and columns is None:  # is namedtuple
Let's also check for isinstance(data[0], tuple). I can imagine lots of other classes with _fields attributes.
9511a7c to 1e90f54
@@ -2672,6 +2672,9 @@ def is_list_like(arg):
    return (hasattr(arg, '__iter__') and
            not isinstance(arg, compat.string_and_binary_types))


def is_named_tuple(arg):
    return isinstance(arg, tuple) and hasattr(arg, '_fields')
Let's import namedtuple at the top and just check isinstance here?
namedtuple is not a type, so that wouldn't work, unfortunately.
FYI: it's a factory function which builds and evals a string to create a class inherited from tuple. Example code here: https://docs.python.org/2/library/collections.html#collections.namedtuple
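A small sketch of the point above, using only the standard library: collections.namedtuple is a function that returns a new tuple subclass, so it cannot be used as the second argument to isinstance. Point is a made-up class name for this illustration.

```python
from collections import namedtuple

# Point is a hypothetical class created only for this illustration.
Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)

# namedtuple is a plain function (a class factory), not a class.
factory_is_a_type = isinstance(namedtuple, type)

# The class it returns is a real type and a subclass of tuple.
result_is_tuple_subclass = issubclass(Point, tuple)

# Passing the factory itself to isinstance raises TypeError.
try:
    isinstance(p, namedtuple)
    isinstance_failed = False
except TypeError:
    isinstance_failed = True

print(factory_is_a_type, result_is_tuple_subclass, isinstance_failed)
```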
https://bugs.python.org/issue7796
ok!
It seems a perfect case for "duck typing" style of programming:
All namedtuple classes:
- inherit from tuple
- have a "_fields" class attribute
These two properties could be the "duck test" for namedtuples, regardless of the actual implementation.
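The duck test described above can be tried out in isolation. The function body mirrors the is_named_tuple check in the diff; Row and HasFieldsOnly are hypothetical classes that exist only to exercise it.

```python
from collections import namedtuple


def is_named_tuple(arg):
    # Duck test from the discussion: a tuple subclass that exposes _fields.
    return isinstance(arg, tuple) and hasattr(arg, "_fields")


# Hypothetical namedtuple class used only for this example.
Row = namedtuple("Row", ["a", "b"])


class HasFieldsOnly(object):
    # Not a tuple, but it happens to define _fields (ast nodes do this too).
    _fields = ("a", "b")


checks = [
    is_named_tuple(Row(1, 2)),        # namedtuple instance
    is_named_tuple((1, 2)),           # plain tuple, no _fields
    is_named_tuple(HasFieldsOnly()),  # _fields but not a tuple
]
print(checks)
```

Checking hasattr alone would accept the third case, which is why the isinstance(arg, tuple) half of the test matters.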
Ah cool!
Lots of discussion on those boards about a better way of doing namedtuple. It's not perfect at the moment, but it's very functional (we use them a lot)
cool
We might have named tuples elsewhere that could use the is_ function. Can you give it a check?
I had a look - while namedtuple is in half a dozen places, there's nowhere that checks its type.
@jreback green
from collections import namedtuple
named_tuple = namedtuple("Pandas", list('ab'))
tuples = [named_tuple(1,3), named_tuple(2,4)]
expected = DataFrame({'a':[1,2], 'b':[3,4]})
add a test where you pass columns as well
done
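To make the two cases under test concrete, here is a pure-Python sketch of the column-selection rule the diff implies: explicit columns win, otherwise _fields of the first item is used. infer_columns is a hypothetical helper written for this illustration, not a pandas function, and it only models the naming decision.

```python
from collections import namedtuple


def infer_columns(data, columns=None):
    # Hypothetical helper: models only the naming rule discussed here.
    if columns is not None:
        # Explicit columns always take precedence.
        return list(columns)
    first = data[0]
    if isinstance(first, tuple) and hasattr(first, "_fields"):
        # namedtuple: use the field names of the first item.
        return list(first._fields)
    # Plain sequences fall back to positional labels.
    return list(range(len(first)))


Pandas = namedtuple("Pandas", list("ab"))
tuples = [Pandas(1, 3), Pandas(2, 4)]

from_fields = infer_columns(tuples)
explicit = infer_columns(tuples, columns=["y", "z"])
plain = infer_columns([(1, 3), (2, 4)])
print(from_fields, explicit, plain)
```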
couple comments....ping when green
1e90f54 to b45e3a2
ok, ping on green.!
b45e3a2 to 0e1da54
@jreback green
ENH: namedtuple's fields as columns
@MaximilianR thanks!
expected = DataFrame({'a': [1, 2], 'b': [3, 4]})
result = DataFrame(tuples)
assert_frame_equal(result, expected)
Actually I just realized that if you have DIFFERENT named tuples this code will break (e.g. different fields). Can you do a PR to assert that test? Pretty pathological but possible.
Yes, will do now.
To be clear, is the case you're suggesting: DataFrame receives a list of namedtuples, each with the same number of items, but with different _fields?
The intended outcome here is that it takes _fields from the first. Is that the outcome you want to test for?
I think the correct way would be to compare the columns with each of the namedtuples and if they differ then raise a ValueError. This might be expensive, so what I would do instead is keep track of the type of the namedtuple and just compare that. If they differ then easiest to just raise a ValueError (in theory you could just discard columns at this point, but I think this is an actual error).
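A minimal sketch of the type-tracking check proposed above, in plain Python. check_same_namedtuple_type is a hypothetical helper name, and A and B are made-up namedtuple classes; this is not code from the PR.

```python
from collections import namedtuple


def check_same_namedtuple_type(rows):
    # Hypothetical helper: remember the type of the first row and
    # require every other row to be an instance of exactly that type.
    expected = type(rows[0])
    for i, row in enumerate(rows):
        if type(row) is not expected:
            raise ValueError(
                "row %d is %s, expected %s"
                % (i, type(row).__name__, expected.__name__)
            )
    return list(expected._fields)


# Two made-up namedtuple classes with the same length but different fields.
A = namedtuple("A", ["a", "b"])
B = namedtuple("B", ["x", "y"])

same = check_same_namedtuple_type([A(1, 2), A(3, 4)])

try:
    check_same_namedtuple_type([A(1, 2), B(3, 4)])
    mixed_raised = False
except ValueError:
    mixed_raised = True

print(same, mixed_raised)
```

Comparing types rather than _fields avoids a tuple comparison per row, but it still visits every row once, which is the cost weighed in the rest of the thread.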
That's pretty expensive:
In [1]:
from collections import namedtuple
nt = namedtuple('NT', list('abc'))
tuples = [nt(0,1,2) for i in range(int(1e7))]
In [2]:
t=tuples[0]
correct_type=type(t)
In [3]:
%timeit all(type(tup)==correct_type for tup in tuples)
1 loops, best of 3: 1.16 s per loop
Given this is a 'best efforts' check - i.e. the alternative is 'useless' columns of (0, 1, 2) - is taking the _fields from the first item OK? If the user really cares, she can supply columns, if not, she gets something reasonable...
I bet the == actually is doing a lot of work
use is
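One way to try this suggestion is to time both comparisons side by side. The sketch below uses the standard timeit module and a smaller list than the 1e7 rows above so it finishes quickly; the function names are made up for the example, and no particular speed-up is assumed, since the numbers depend on the interpreter and machine.

```python
import timeit
from collections import namedtuple

NT = namedtuple("NT", list("abc"))
# Smaller than the 1e7 rows in the session above, to keep the run short.
tuples = [NT(0, 1, 2) for _ in range(100000)]
correct_type = type(tuples[0])


def all_equal():
    # Equality comparison of type objects, as in the original timing.
    return all(type(tup) == correct_type for tup in tuples)


def all_identical():
    # Identity comparison, as suggested.
    return all(type(tup) is correct_type for tup in tuples)


# Both checks agree on the answer; only the cost may differ.
same_answer = all_equal() and all_identical()

t_eq = timeit.timeit(all_equal, number=5)
t_is = timeit.timeit(all_identical, number=5)
print("==: %.3fs  is: %.3fs" % (t_eq, t_is))
```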
very odd that this is slow, though I guess it's a lot of tuples
ok guess just go with first namedtuple then.
What happens if tuples have different lengths? Does this break (e.g. the current code in master)? I think yes.
No, you get NaNs, which IIRC is what always happens if you pass lists in with unequal lengths
In [2]:
pd.DataFrame([(1,2,3),(4,5)])
Out[2]:
   0  1    2
0  1  2    3
1  4  5  NaN
If you pass columns, they need to be the max length:
In [4]:
pd.DataFrame([(1,2),(3,4,5)], columns=['a','b'])
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-4-e2f410a01184> in <module>()
----> 1 pd.DataFrame([(1,2),(3,4,5)], columns=['a','b'])
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
262 if len(data) > 0:
263 if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
--> 264 arrays, columns = _to_arrays(data, columns, dtype=dtype)
265 columns = _ensure_index(columns)
266
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
5211 if isinstance(data[0], (list, tuple)):
5212 return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 5213 dtype=dtype)
5214 elif isinstance(data[0], collections.Mapping):
5215 return _list_of_dict_to_arrays(data, columns,
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _list_to_arrays(data, columns, coerce_float, dtype)
5294 content = list(lib.to_object_array(data).T)
5295 return _convert_object_array(content, columns, dtype=dtype,
-> 5296 coerce_float=coerce_float)
5297
5298
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _convert_object_array(content, columns, coerce_float, dtype)
5352 # caller's responsibility to check for this...
5353 raise AssertionError('%d columns passed, passed data had %s '
-> 5354 'columns' % (len(columns), len(content)))
5355
5356 # provide soft conversion of object dtypes
AssertionError: 2 columns passed, passed data had 3 columns
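The two behaviours shown above (short rows padded with NaN, and an error when the number of column names does not match the widest row) can be modelled in plain Python. rows_to_columns is a hypothetical stand-in written for illustration; it is not how pandas implements this, and it raises ValueError where the traceback above shows an AssertionError.

```python
from itertools import zip_longest


def rows_to_columns(rows, columns=None):
    # Hypothetical stand-in for the row-to-column conversion.
    width = max(len(row) for row in rows)
    if columns is None:
        columns = list(range(width))
    elif len(columns) != width:
        # The sketch's own choice of exception type.
        raise ValueError(
            "%d columns passed, passed data had %d columns"
            % (len(columns), width)
        )
    # zip_longest transposes the rows and pads short ones with NaN.
    nan = float("nan")
    transposed = zip_longest(*rows, fillvalue=nan)
    return dict(zip(columns, (list(col) for col in transposed)))


padded = rows_to_columns([(1, 2, 3), (4, 5)])

try:
    rows_to_columns([(1, 2), (3, 4, 5)], columns=["a", "b"])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(padded, mismatch_raised)
```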
ok then
I guess the only issue is if you have different tuples with different columns
but that's just user error and not worth it to detect
ok then
thanks!
Cheers!
Resolves #11181
Is this testing OK? Or do we need to test with differing lengths of tuples etc?