Skip to content

Arguably better handling of input data in constructor for DataFrame (fix for #4297) #4317

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from
Closed
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 18 additions & 5 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -5680,32 +5680,45 @@ def _arrays_to_mgr(arrays, arr_names, index, columns, dtype=None):
return create_block_manager_from_arrays(arrays, arr_names, axes)

def extract_index(data):
# Slightly misleading name.
# Indexes are only extracted for elements in the iterable
# `data` inheriting from Series.
from pandas.core.index import _union_indexes

index = None
if len(data) == 0:
index = Index([])
elif len(data) > 0 and index is None:
elif len(data) > 0:
raw_lengths = []
indexes = []

have_raw_arrays = False
have_series = False
have_dicts = False

# Loop over the element, such as vectors `v` corresponding
# to columns in the DataFrame
for v in data:
if isinstance(v, Series):
have_series = True
indexes.append(v.index)
elif isinstance(v, dict):
have_dicts = True
indexes.append(v.keys())
elif isinstance(v, (list, tuple, np.ndarray)):
elif com._is_sequence(v):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh....Iyes....either way is prob fine

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unordered sequences were not allowed here previously, why the change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where was the test that the sequences are ordered ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list, tuple, and ndarray objects are ordered sequences meaning if you put element X at index i it will remain there for the lifetime of the object (modulo any operations that you do to remove it or modify it). dict objects and Python Mapping objects cannot generally make the same guarantee (specific objects may enforce an ordering such as collections.OrderedDict). So, in the following code

def _is_sequence(x):
    try:
        iter(x)
        len(x) # it has a length
        return not isinstance(x, basestring) and True
    except Exception:
        return False

a dict object is both iterable and has a length. The orderedness is guaranteed by the type of objects that test True in the conditional that you changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I had understood it as /sorted/ sequence.

While the original code seemingly (see below) guarantees that all items have an immutable order (when modifications of the order are not explicitly made), it does so by also excluding items satisfying this property. Similarly, your comment is leading to the observation that dict is not the only possible mapping. collections.OrderedDict would go as a "sequence", since mappings are accompanied by a warning. [note: I am unsure that the persistence of the structure is required to make an object ordered, which would mean that xrange (Python 2) and range (Python 3) would qualify)].

While I initially looked into this because of array.array, an sequence in the Python stdlib, I felt like generalization would be beneficial.

At the C level there is PySequence_Check() and PyMapping_Check(). Interestingly I don't think that there is an equivalent at the Python level... may be design decision made to encourage duck-typing. Anyway, while PyMapping_Check() is the most interesting source of information about what to check: https://github.com/python-git/python/blob/master/Objects/abstract.c#L2387 . PySequence_Check() is here: https://github.com/python-git/python/blob/master/Objects/abstract.c#L1806 there is no way to ensure that the property is satisfied.

I'd argue that the requirement for orderedness should be mentioned in the documention, and that extract_index() (should be called build_index(), really) relaxes its test for mapping and sequences in order to allow other objects with the right properties. The argument about guarantee does not hold, as shown right below (before the crowd starts brandishing torches and pitchforks, I'd like to stress that the argumentation is moving to the ground of proofs and certainties, and that the sole purpose of the example is to show that there is none - the real argumentation should be around practicality):

class MyWeirdList(list):
    def __getitem__(self, i):
        return super(MyWeirdList, self).__getitem__(0)    
>>> l = MyWeirdList(range(5))
>>> l[3]
0
>>> isinstance(l, list)
True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this in numpy-land that you need to know the size of the object ahead of time to efficiently allocate space for it. There's no standard, widely-used way for a generator to express this information.

Generators can have a __len__, as I wrote.

import sys
if sys.version_info[0] == 2:
    range = xrange # make python 2 behave
r = range(10)
print(len(r)) # prints 10

If you pass a generator to pandas, it treats it just like numpy does, which is to say just considers it a Python object.

# using r as defined above
import numpy
print(numpy.array(r))
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Moreover, if the rationale is to treat objects "just like numpy does" we shall agree that the current behaviour of the constructors for DataFrame is not accommodating enough.

That said, using collections isn't compatible with using the array module in 2.7, e.g. the following

I had no doubt about that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgautier right, but (x for x in range(10)) would fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can compute the space necessary for range/xrange arithmetically

For arbitrary generator expressions you cannot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud

You can compute the space necessary for range/xrange arithmetically

For arbitrary generator expressions you cannot.

We are very explicitly talking about generators with a __len__. How the length is calculated (implemented in __len__) is pretty much irrelevant.

On a related note, the question of the size of iterators motivated a PEP 424 and might be more frequent than one thinks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtratner

right, but (x for x in range(10)) would fail.

Yes it would fail. The length is not passed to downstream iterators when chained. Thinking of it, may be it could for generator expressions, but that's a question for the python-dev list.

# This is a sequence-but-not-a-string
# Although strings have a __len__,
# they will be considered scalar.
have_raw_arrays = True
raw_lengths.append(len(v))
else:
# Item v is silently ignored,
# as it is not anything an index can be inferred
# from.
pass

if not indexes and not raw_lengths:
raise ValueError('If using all scalar values, you must must pass'
raise ValueError('If using all scalar values, you must pass'
' an index')

if have_series or have_dicts:
Expand All @@ -5714,15 +5727,15 @@ def extract_index(data):
if have_raw_arrays:
lengths = list(set(raw_lengths))
if len(lengths) > 1:
raise ValueError('arrays must all be same length')
raise ValueError('Arrays must all be same length')

if have_dicts:
raise ValueError('Mixing dicts with non-Series may lead to '
'ambiguous ordering.')

if have_series:
if lengths[0] != len(index):
msg = ('array length %d does not match index length %d'
msg = ('Array length %d does not match index length %d'
% (lengths[0], len(index)))
raise ValueError(msg)
else:
Expand Down
13 changes: 12 additions & 1 deletion pandas/tests/test_frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -2236,7 +2236,7 @@ def testit():

def testit():
DataFrame({'a': False, 'b': True})
assertRaisesRegexp(ValueError, 'If using all scalar values, you must must pass an index', testit)
assertRaisesRegexp(ValueError, 'If using all scalar values, you must pass an index', testit)

def test_insert_error_msmgs(self):

Expand Down Expand Up @@ -2774,6 +2774,17 @@ def test_constructor_from_items(self):
# pass some columns
recons = DataFrame.from_items(items, columns=['C', 'B', 'A'])
assert_frame_equal(recons, self.frame.ix[:, ['C', 'B', 'A']])
# not any column either a dict, a list, a tuple, or a numpy.ndarray
import array
recons_ar = DataFrame.from_items([('A', array.array('i', range(10)))])
recons_rg = DataFrame.from_items([('A', range(10))])
recons_np = DataFrame.from_items([('A', np.array(range(10)))])
self.assertEquals(tuple(recons_ar['A']),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better to use assert_frame_equals(recons_ar, recons_rg) here. where assert_frame_equals is from pandas.util.testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.
I'll have to put all this on hold until time again (may be as long as in few weeks).

In the unlikely case someone is in a hurry to see this merged, or irrevocably rejected, he/she should feel free to edit further.

tuple(recons_rg['A']))
self.assertEquals(tuple(recons_ar['A']),
tuple(recons_np['A']))



# orient='index'

Expand Down