ENH: namedtuple's fields as columns #11416

Merged (1 commit, Oct 23, 2015)
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.17.1.txt
@@ -30,6 +30,7 @@ Other Enhancements

- ``pd.read_*`` functions can now also accept :class:`python:pathlib.Path`, or :class:`py:py._path.local.LocalPath`
  objects for the ``filepath_or_buffer`` argument. (:issue:`11033`)
- ``DataFrame`` now uses the fields of a ``namedtuple`` as columns, if columns are not supplied (:issue:`11181`)
- Improve the error message displayed in :func:`pandas.io.gbq.to_gbq` when the DataFrame does not match the schema of the destination table (:issue:`11359`)

.. _whatsnew_0171.api:
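As a quick illustration of the whatsnew entry above, a minimal sketch of the new constructor behaviour (the names and the printed output are illustrative, not taken from this PR):

from collections import namedtuple
import pandas as pd

Point = namedtuple('Point', ['x', 'y'])
points = [Point(1, 3), Point(2, 4)]

# columns are taken from the namedtuple's _fields when none are supplied
df = pd.DataFrame(points)
# df.columns -> Index(['x', 'y'], dtype='object')

# explicitly supplied columns still take precedence over the fields
df2 = pd.DataFrame(points, columns=['a', 'b'])
# df2.columns -> Index(['a', 'b'], dtype='object')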
3 changes: 3 additions & 0 deletions pandas/core/common.py
@@ -2676,6 +2676,9 @@ def is_list_like(arg):
    return (hasattr(arg, '__iter__') and
            not isinstance(arg, compat.string_and_binary_types))

def is_named_tuple(arg):
    return isinstance(arg, tuple) and hasattr(arg, '_fields')
Contributor

let's import namedtuple at the top and just check isinstance here?

Contributor Author

namedtuple is not a type, so that wouldn't work, unfortunately

FYI: it's a factory function which builds and evals a string to create a class inherited from tuple. Example code here: https://docs.python.org/2/library/collections.html#collections.namedtuple
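For reference, a minimal sketch of why a direct isinstance check against namedtuple itself cannot work (illustrative only, not from the thread):

from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])  # namedtuple() is a factory: it returns a brand-new class
print(issubclass(Point, tuple))          # True -- the generated class inherits from tuple
print(type(Point) is type)               # True -- Point is an ordinary class; there is no shared namedtuple base class
# isinstance(Point(1, 2), namedtuple) raises TypeError: namedtuple is a function, not a type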

Contributor

https://bugs.python.org/issue7796

ok!

It seems a perfect case for the "duck typing" style of programming:
All namedtuple classes:
- inherit from tuple
- have a "_fields" class attribute
These two properties could be the "duck test" for namedtuples, regardless of the actual implementation.
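A minimal sketch of that duck test, mirroring the is_named_tuple helper added in this diff (the demo values are illustrative):

from collections import namedtuple

def is_named_tuple(arg):
    # duck test: a namedtuple instance is a tuple that also carries a _fields attribute
    return isinstance(arg, tuple) and hasattr(arg, '_fields')

Point = namedtuple('Point', ['x', 'y'])
print(is_named_tuple(Point(1, 2)))  # True
print(is_named_tuple((1, 2)))       # False -- a plain tuple has no _fields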

Contributor Author

Ah cool!
Lots of discussion on those boards about a better way of doing namedtuple. It's not perfect at the moment, but it's very functional (we use them a lot)

Contributor

cool

we might have named tuples elsewhere that could use the is_ function; can you give it a check?

Contributor Author

I had a look - while namedtuple is in half a dozen places, there's nowhere that checks its type.


def is_null_slice(obj):
""" we have a null slice """
return (isinstance(obj, slice) and obj.start is None and
2 changes: 2 additions & 0 deletions pandas/core/frame.py
@@ -261,6 +261,8 @@ def __init__(self, data=None, index=None, columns=None, dtype=None,
                data = list(data)
            if len(data) > 0:
                if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
                    if com.is_named_tuple(data[0]) and columns is None:
                        columns = data[0]._fields
                    arrays, columns = _to_arrays(data, columns, dtype=dtype)
                    columns = _ensure_index(columns)

9 changes: 9 additions & 0 deletions pandas/tests/test_common.py
@@ -538,6 +538,15 @@ def test_is_list_like():
    for f in fails:
        assert not com.is_list_like(f)

def test_is_named_tuple():
    passes = (collections.namedtuple('Test',list('abc'))(1,2,3),)
    fails = ((1,2,3), 'a', Series({'pi':3.14}))

Contributor

remove 1 line

Contributor Author

done

    for p in passes:
        assert com.is_named_tuple(p)

    for f in fails:
        assert not com.is_named_tuple(f)

def test_is_hashable():

30 changes: 17 additions & 13 deletions pandas/tests/test_frame.py
@@ -16,8 +16,7 @@

from pandas.compat import(
    map, zip, range, long, lrange, lmap, lzip,
    OrderedDict, u, StringIO, string_types,
    is_platform_windows
    OrderedDict, u, StringIO, is_platform_windows
)
from pandas import compat

@@ -33,8 +32,7 @@
import pandas.core.datetools as datetools
from pandas import (DataFrame, Index, Series, Panel, notnull, isnull,
                    MultiIndex, DatetimeIndex, Timestamp, date_range,
                    read_csv, timedelta_range, Timedelta, CategoricalIndex,
                    option_context, period_range)
                    read_csv, timedelta_range, Timedelta, option_context, period_range)
from pandas.core.dtypes import DatetimeTZDtype
import pandas as pd
from pandas.parser import CParserError
@@ -2239,7 +2237,6 @@ class TestDataFrame(tm.TestCase, CheckIndexing,
    _multiprocess_can_split_ = True

    def setUp(self):
        import warnings

        self.frame = _frame.copy()
        self.frame2 = _frame2.copy()
@@ -3568,6 +3565,20 @@ def test_constructor_tuples(self):
        expected = DataFrame({'A': Series([(1, 2), (3, 4)])})
        assert_frame_equal(result, expected)

    def test_constructor_namedtuples(self):
        # GH11181
        from collections import namedtuple
        named_tuple = namedtuple("Pandas", list('ab'))
        tuples = [named_tuple(1, 3), named_tuple(2, 4)]
        expected = DataFrame({'a': [1, 2], 'b': [3, 4]})
        result = DataFrame(tuples)
        assert_frame_equal(result, expected)

Contributor

actually, I just realized that if you have DIFFERENT namedtuples (e.g. different fields) this code will break. Can you do a PR to add a test for that? pretty pathological, but possible

Contributor Author

Yes, will do now.

To be clear, is the case you're suggesting: DataFrame receives a list of namedtuples, each with the same number of items, but with different _fields?
The intended outcome here is that it takes _fields from the first. Is that the outcome you want to test for?
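For reference, a minimal sketch of the case being discussed, assuming the behaviour described above (columns taken from the first namedtuple; illustrative, not a test from this PR):

from collections import namedtuple
import pandas as pd

A = namedtuple('A', ['x', 'y'])
B = namedtuple('B', ['p', 'q'])

# same number of items but different _fields: the columns come from the first tuple
df = pd.DataFrame([A(1, 2), B(3, 4)])
# df.columns -> Index(['x', 'y'], dtype='object'); B's field names are silently ignored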

Contributor

I think the correct way would be to compare the columns with each of the namedtuples and, if they differ, raise a ValueError. That might be expensive, so what I would do instead is keep track of the type of the namedtuple and just compare that. If the types differ, it's easiest to just raise a ValueError (in theory you could just discard columns at this point, but I think this is an actual error).

Contributor Author

That's pretty expensive:

In [1]: from collections import namedtuple
   ...: nt = namedtuple('NT', list('abc'))
   ...: tuples = [nt(0,1,2) for i in range(int(1e7))]

In [2]: t = tuples[0]
   ...: correct_type = type(t)

In [3]: %timeit all(type(tup)==correct_type for tup in tuples)
1 loops, best of 3: 1.16 s per loop

Given this is a 'best effort' check - i.e. the alternative is 'useless' columns of (0, 1, 2) - is taking the _fields from the first item OK? If the user really cares, she can supply columns; if not, she gets something reasonable...

Contributor

I bet the == is actually doing a lot of work; use is instead.
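For reference, a minimal sketch of the identity-based variant being suggested, written as a standalone script with the timeit module rather than the %timeit magic (sizes are illustrative and no timings are implied):

import timeit
from collections import namedtuple

NT = namedtuple('NT', list('abc'))
tuples = [NT(0, 1, 2) for _ in range(int(1e6))]
correct_type = type(tuples[0])

# compare the class object by identity (is) instead of equality (==)
print(timeit.timeit(
    lambda: all(type(tup) is correct_type for tup in tuples),
    number=1))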

Contributor

very odd that this is slow, though I guess it's a lot of tuples

ok, guess just go with the first namedtuple then.

Contributor

what happens if the tuples have different lengths? does this break (e.g. the current code in master)? I think yes...

Contributor Author

No, you get NaNs, which IIRC is what always happens if you pass in lists with unequal lengths:

In [2]: pd.DataFrame([(1,2,3),(4,5)])
Out[2]:
   0  1    2
0  1  2    3
1  4  5  NaN

If you pass columns, they need to be the max length:

In [4]: pd.DataFrame([(1,2),(3,4,5)], columns=['a','b'])
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-e2f410a01184> in <module>()
----> 1 pd.DataFrame([(1,2),(3,4,5)], columns=['a','b'])

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    262             if len(data) > 0:
    263                 if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
--> 264                     arrays, columns = _to_arrays(data, columns, dtype=dtype)
    265                     columns = _ensure_index(columns)
    266 

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
   5211     if isinstance(data[0], (list, tuple)):
   5212         return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 5213                                dtype=dtype)
   5214     elif isinstance(data[0], collections.Mapping):
   5215         return _list_of_dict_to_arrays(data, columns,

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _list_to_arrays(data, columns, coerce_float, dtype)
   5294         content = list(lib.to_object_array(data).T)
   5295     return _convert_object_array(content, columns, dtype=dtype,
-> 5296                                  coerce_float=coerce_float)
   5297 
   5298 

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _convert_object_array(content, columns, coerce_float, dtype)
   5352             # caller's responsibility to check for this...
   5353             raise AssertionError('%d columns passed, passed data had %s '
-> 5354                                  'columns' % (len(columns), len(content)))
   5355 
   5356     # provide soft conversion of object dtypes

AssertionError: 2 columns passed, passed data had 3 columns

Contributor

ok then. I guess the only issue is if you have different namedtuples with different columns, but that's just user error and not worth detecting. Thanks!

Contributor Author

Cheers!

        # with columns
        expected = DataFrame({'y': [1, 2], 'z': [3, 4]})
        result = DataFrame(tuples, columns=['y', 'z'])
        assert_frame_equal(result, expected)

    def test_constructor_orient(self):
        data_dict = self.mixed_frame.T._series
        recons = DataFrame.from_dict(data_dict, orient='index')
@@ -4418,7 +4429,7 @@ def test_timedeltas(self):

    def test_operators_timedelta64(self):

        from datetime import datetime, timedelta
        from datetime import timedelta
        df = DataFrame(dict(A = date_range('2012-1-1', periods=3, freq='D'),
                            B = date_range('2012-1-2', periods=3, freq='D'),
                            C = Timestamp('20120101')-timedelta(minutes=5,seconds=5)))
@@ -9645,7 +9656,6 @@ def test_replace_mixed(self):
        assert_frame_equal(result,expected)

        # test case from
        from pandas.util.testing import makeCustomDataframe as mkdf
        df = DataFrame({'A' : Series([3,0],dtype='int64'), 'B' : Series([0,3],dtype='int64') })
        result = df.replace(3, df.mean().to_dict())
        expected = df.copy().astype('float64')
@@ -12227,7 +12237,6 @@ def test_sort_index_inplace(self):
        assert_frame_equal(df, expected)

    def test_sort_index_different_sortorder(self):
        import random
        A = np.arange(20).repeat(5)
        B = np.tile(np.arange(5), 20)

@@ -13301,7 +13310,6 @@ def test_quantile(self):

    def test_quantile_axis_parameter(self):
        # GH 9543/9544
        from numpy import percentile

        df = DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]}, index=[1, 2, 3])

@@ -16093,8 +16101,6 @@ def test_query_doesnt_pickup_local(self):
        n = m = 10
        df = DataFrame(np.random.randint(m, size=(n, 3)), columns=list('abc'))

        from numpy import sin

        # we don't pick up the local 'sin'
        with tm.assertRaises(UndefinedVariableError):
            df.query('sin > 5', engine=engine, parser=parser)
@@ -16392,7 +16398,6 @@ def setUpClass(cls):
        cls.frame = _frame.copy()

    def test_query_builtin(self):
        from pandas.computation.engines import NumExprClobberingError
        engine, parser = self.engine, self.parser

        n = m = 10
@@ -16413,7 +16418,6 @@ def setUpClass(cls):
        cls.frame = _frame.copy()

    def test_query_builtin(self):
        from pandas.computation.engines import NumExprClobberingError
        engine, parser = self.engine, self.parser

        n = m = 10