ENH: namedtuple's fields as columns #11416

Merged 1 commit into pandas-dev:master on Oct 23, 2015

max-sixty
Contributor

Resolves #11181

Is this testing OK? Or do we need to test with differing lengths of tuples etc?

@@ -261,6 +261,8 @@ def __init__(self, data=None, index=None, columns=None, dtype=None,
                data = list(data)
            if len(data) > 0:
                if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
                    if hasattr(data[0], '_fields') and columns is None:  # is namedtuple
Member

Let's also check for isinstance(data[0], tuple). I can imagine lots of other classes with _fields attributes.
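
To illustrate the concern, here is a minimal sketch using only the standard library: `ast` nodes expose a `_fields` attribute without being tuples, so a bare `hasattr` check would misfire on them. The helper name `looks_like_namedtuple_naive` is hypothetical and only mirrors the first draft of the check; it is not part of pandas.

```python
import ast
from collections import namedtuple


def looks_like_namedtuple_naive(obj):
    # Hypothetical helper mirroring the hasattr-only check in the first draft
    return hasattr(obj, '_fields')


Point = namedtuple('Point', ['x', 'y'])
node = ast.parse('x').body[0]  # an ast.Expr node, which also defines _fields

naive_on_point = looks_like_namedtuple_naive(Point(1, 2))  # True, as intended
naive_on_node = looks_like_namedtuple_naive(node)          # True, a false positive
node_is_tuple = isinstance(node, tuple)                    # False
```

Adding the `isinstance(..., tuple)` condition rules out objects like `node` while still accepting `Point(1, 2)`.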

@max-sixty max-sixty force-pushed the namedtuple-fields-as-columns branch 5 times, most recently from 9511a7c to 1e90f54 Compare October 23, 2015 14:42
@@ -2672,6 +2672,9 @@ def is_list_like(arg):
    return (hasattr(arg, '__iter__') and
            not isinstance(arg, compat.string_and_binary_types))

def is_named_tuple(arg):
    return isinstance(arg, tuple) and hasattr(arg, '_fields')
Contributor

let's import namedtuple at the top and just check isinstance here?

Contributor Author

namedtuple is not a type, so that wouldn't work, unfortunately

FYI: it's a factory function which builds and evals a string to create a class inherited from tuple. Example code here: https://docs.python.org/2/library/collections.html#collections.namedtuple
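
A small sketch of why the `isinstance` approach fails, using only the standard library. The names `Point` and `p` are arbitrary examples.

```python
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(1, 2)

# namedtuple itself is a plain function, not a class
factory_is_class = isinstance(namedtuple, type)  # False

# The generated class is a tuple subclass
point_is_tuple_subclass = issubclass(Point, tuple)  # True

# isinstance needs a class (or tuple of classes) as its second argument,
# so passing the factory function raises TypeError
try:
    isinstance(p, namedtuple)
    raised = False
except TypeError:
    raised = True
```

Because every call to the factory produces a new, unrelated class, there is no common namedtuple base class to test against other than `tuple` itself.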

Contributor

https://bugs.python.org/issue7796

ok!

It seems a perfect case for "duck typing" style of programming:
All namedtuple classes:
- inherit from tuple
- have a "_fields" class attribute
These two properties could be the "duck test" for namedtuples, regardless of the actual implementation.
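
The two-part duck test quoted above can be tried out in isolation. This is a sketch that reuses the helper's definition from this PR's diff; the `checks` dictionary and the sample values are illustrative only.

```python
from collections import namedtuple


def is_named_tuple(arg):
    # Same two-part test as the helper added in this PR
    return isinstance(arg, tuple) and hasattr(arg, '_fields')


Pandas = namedtuple('Pandas', list('ab'))

checks = {
    'namedtuple instance': is_named_tuple(Pandas(1, 3)),
    'plain tuple': is_named_tuple((1, 3)),
    'list': is_named_tuple([1, 3]),
}

# _fields lives on the class, so it is available without an instance
class_fields = Pandas._fields
```

Only the namedtuple instance passes both conditions; a plain tuple lacks `_fields` and a list is not a tuple.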

Contributor Author

Ah cool!
Lots of discussion on those boards about a better way of doing namedtuple. It's not perfect at the moment, but it's very functional (we use them a lot)

Contributor

cool

we might have named tuples elsewhere that could use the is_named_tuple function; can you check?

Contributor Author

I had a look - while namedtuple is in half a dozen places, there's nowhere that checks its type.

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Oct 23, 2015
@jreback jreback added this to the 0.17.1 milestone Oct 23, 2015
@max-sixty
Contributor Author

@jreback green

from collections import namedtuple
named_tuple = namedtuple("Pandas", list('ab'))
tuples = [named_tuple(1,3), named_tuple(2,4)]
expected = DataFrame({'a':[1,2], 'b':[3,4]})
Contributor

add a test where you pass columns as well

Contributor Author

done
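
The test that was added is not shown in this thread. As a sketch of what passing columns alongside namedtuples might look like, assuming a recent pandas where `pandas.testing.assert_frame_equal` is available (the column names `y` and `z` are arbitrary):

```python
from collections import namedtuple

import pandas as pd
from pandas.testing import assert_frame_equal

named_tuple = namedtuple('Pandas', list('ab'))
tuples = [named_tuple(1, 3), named_tuple(2, 4)]

# Explicit columns take precedence over the namedtuple's _fields
result = pd.DataFrame(tuples, columns=['y', 'z'])
expected = pd.DataFrame({'y': [1, 2], 'z': [3, 4]})
assert_frame_equal(result, expected)
```

This exercises the `columns is None` guard in the constructor change: `_fields` is consulted only when the caller did not supply columns.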

@jreback
Contributor

jreback commented Oct 23, 2015

couple comments....ping when green

@max-sixty max-sixty force-pushed the namedtuple-fields-as-columns branch from 1e90f54 to b45e3a2 Compare October 23, 2015 18:20
@jreback
Contributor

jreback commented Oct 23, 2015

ok, ping on green!

@max-sixty max-sixty force-pushed the namedtuple-fields-as-columns branch from b45e3a2 to 0e1da54 Compare October 23, 2015 19:35
@max-sixty
Contributor Author

@jreback green

jreback added a commit that referenced this pull request Oct 23, 2015
@jreback jreback merged commit 37a80bc into pandas-dev:master Oct 23, 2015
@jreback
Contributor

jreback commented Oct 23, 2015

@MaximilianR thanks!

@max-sixty max-sixty deleted the namedtuple-fields-as-columns branch October 23, 2015 20:30
expected = DataFrame({'a': [1, 2], 'b': [3, 4]})
result = DataFrame(tuples)
assert_frame_equal(result, expected)

Contributor

Actually, I just realized that if you have DIFFERENT named tuples (e.g. different fields), this code will break. Can you do a PR with a test asserting that? Pretty pathological, but possible.

Contributor Author

Yes, will do now.

To be clear, is the case you're suggesting: DataFrame receives a list of namedtuples, each with the same number of items, but with different _fields?
The intended outcome here is that it takes _fields from the first. Is that the outcome you want to test for?

Contributor

I think the correct way would be to compare the columns with each of the namedtuples and raise a ValueError if they differ. This might be expensive, so what I would do instead is keep track of the type of the first namedtuple and just compare that. If the types differ, it's easiest to just raise a ValueError (in theory you could just discard columns at this point, but I think this is an actual error).
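
The type-tracking idea suggested here could be sketched as follows. This is an illustration only and not code from the PR; the function name `fields_if_same_type` and the error message are hypothetical.

```python
from collections import namedtuple


def fields_if_same_type(rows):
    """Return the shared _fields, or raise if the namedtuple types differ."""
    first_type = type(rows[0])
    for row in rows:
        # Identity comparison of classes; no per-field comparison needed
        if type(row) is not first_type:
            raise ValueError('mixed namedtuple types: %s and %s'
                             % (first_type.__name__, type(row).__name__))
    return first_type._fields


A = namedtuple('A', ['a', 'b'])
B = namedtuple('B', ['c', 'd'])

same = fields_if_same_type([A(1, 2), A(3, 4)])

try:
    fields_if_same_type([A(1, 2), B(3, 4)])
    mixed_error = None
except ValueError as exc:
    mixed_error = str(exc)
```

The loop still touches every row, so its cost grows linearly with the length of the input.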

Contributor Author

That's pretty expensive:

In [1]: from collections import namedtuple
   ...: nt = namedtuple('NT', list('abc'))
   ...: tuples = [nt(0,1,2) for i in range(int(1e7))]

In [2]: t = tuples[0]
   ...: correct_type = type(t)

In [3]: %timeit all(type(tup)==correct_type for tup in tuples)
1 loops, best of 3: 1.16 s per loop

Given this is a 'best efforts' check - i.e. the alternative is 'useless' columns of (0, 1, 2) - is taking the _fields from the first item OK? If the user really cares, she can supply columns, if not, she gets something reasonable...
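
The "take `_fields` from the first item" behaviour argued for above can be sketched in plain Python. `infer_columns` is a hypothetical stand-in for the constructor logic, not a pandas function.

```python
from collections import namedtuple


def infer_columns(data, columns=None):
    # Hypothetical stand-in for the constructor logic discussed here:
    # only look at the first row, and only when columns were not supplied
    first = data[0]
    if columns is None and isinstance(first, tuple) and hasattr(first, '_fields'):
        return list(first._fields)
    return columns


A = namedtuple('A', ['a', 'b'])
B = namedtuple('B', ['c', 'd'])

from_first = infer_columns([A(1, 2), B(3, 4)])                    # ['a', 'b']
explicit = infer_columns([A(1, 2), B(3, 4)], columns=['x', 'y'])  # ['x', 'y']
plain = infer_columns([(1, 2), (3, 4)])                           # None
```

Inspecting only the first row keeps the check constant-time regardless of how many rows are passed.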

Contributor

I bet the == is actually doing a lot of work; use is instead.
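
One way to compare the two operators is a `timeit` sketch like the one below. It uses a smaller list than the 1e7 tuples in the session above, and no timings are claimed here since they depend on the machine and Python version.

```python
import timeit
from collections import namedtuple

NT = namedtuple('NT', list('abc'))
tuples = [NT(0, 1, 2) for _ in range(100000)]  # smaller than the 1e7 above
correct_type = type(tuples[0])


def check_eq():
    # Equality comparison of the classes
    return all(type(tup) == correct_type for tup in tuples)


def check_is():
    # Identity comparison of the classes
    return all(type(tup) is correct_type for tup in tuples)


eq_seconds = timeit.timeit(check_eq, number=5)
is_seconds = timeit.timeit(check_is, number=5)
print('== : %.4f s, is : %.4f s' % (eq_seconds, is_seconds))
```

Both variants still iterate over every element, so whichever operator is used, the generator expression and `type()` calls contribute to the total.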

Contributor

Very odd that this is slow, though I guess it's a lot of tuples.

OK, I guess just go with the first namedtuple then.

Contributor

What happens if the tuples have different lengths? Does this break (e.g. with the current code in master)? I think yes.

Contributor Author

No, you get NaNs, which IIRC is what always happens if you pass lists in with unequal lengths

In [2]: pd.DataFrame([(1,2,3),(4,5)])
Out[2]:
   0  1    2
0  1  2    3
1  4  5  NaN

If you pass columns, they need to be the max length:

In [4]: pd.DataFrame([(1,2),(3,4,5)], columns=['a','b'])
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-e2f410a01184> in <module>()
----> 1 pd.DataFrame([(1,2),(3,4,5)], columns=['a','b'])

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    262             if len(data) > 0:
    263                 if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
--> 264                     arrays, columns = _to_arrays(data, columns, dtype=dtype)
    265                     columns = _ensure_index(columns)
    266 

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
   5211     if isinstance(data[0], (list, tuple)):
   5212         return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 5213                                dtype=dtype)
   5214     elif isinstance(data[0], collections.Mapping):
   5215         return _list_of_dict_to_arrays(data, columns,

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _list_to_arrays(data, columns, coerce_float, dtype)
   5294         content = list(lib.to_object_array(data).T)
   5295     return _convert_object_array(content, columns, dtype=dtype,
-> 5296                                  coerce_float=coerce_float)
   5297 
   5298 

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _convert_object_array(content, columns, coerce_float, dtype)
   5352             # caller's responsibility to check for this...
   5353             raise AssertionError('%d columns passed, passed data had %s '
-> 5354                                  'columns' % (len(columns), len(content)))
   5355 
   5356     # provide soft conversion of object dtypes

AssertionError: 2 columns passed, passed data had 3 columns

Contributor

OK then. I guess the only issue is if you have different tuples with different columns, but that's just user error and not worth detecting.
Thanks!

Contributor Author

Cheers!
