-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Arguably better handling of input data in constructor for DataFrame (fix for #4297) #4317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 5 commits
d2b85ff
e6c3cd6
209c01c
04732e0
9172436
4f71e98
f131b71
8787ef5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5680,6 +5680,9 @@ def _arrays_to_mgr(arrays, arr_names, index, columns, dtype=None): | |
return create_block_manager_from_arrays(arrays, arr_names, axes) | ||
|
||
def extract_index(data): | ||
# Slightly misleading name. | ||
# Indexes are only extracted for elements in the iterable | ||
# `data` inheriting from Series. | ||
from pandas.core.index import _union_indexes | ||
|
||
index = None | ||
|
@@ -5693,19 +5696,31 @@ def extract_index(data): | |
have_series = False | ||
have_dicts = False | ||
|
||
# Loop over the element, such as vectors `v` corresponding | ||
# to columns in the DataFrame | ||
for v in data: | ||
if isinstance(v, Series): | ||
have_series = True | ||
indexes.append(v.index) | ||
elif isinstance(v, dict): | ||
have_dicts = True | ||
indexes.append(v.keys()) | ||
elif isinstance(v, (list, tuple, np.ndarray)): | ||
elif com._is_sequence(v): | ||
# This is a sequence-but-not-a-string | ||
# Although strings have a __len__, | ||
# they will be considered scalar. | ||
have_raw_arrays = True | ||
raw_lengths.append(len(v)) | ||
else: | ||
# Item v silently ignored (to conserve | ||
# the original behaviour - see also | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. didn't realize that this happens.....can you see if you can gett a case to fail (so we can see why?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On 07/22/2013 10:17 PM, jreback wrote:
The item is silently ignored, that's for certain. Consider the following example: import pandas
dataf = pandas.DataFrame.from_items([('A', range(5)), ('B', None)]) The second dataf['B'][2] = 3 Now I do not have sufficient knowledge / overview of places where
|
||
# test of __getitem__ below). | ||
# This behaviour is kept, but I think | ||
# that an exception (TypeError) should be raised instead. | ||
pass | ||
|
||
if not indexes and not raw_lengths: | ||
raise ValueError('If using all scalar values, you must must pass' | ||
raise ValueError('If using all scalar values, you must pass' | ||
' an index') | ||
|
||
if have_series or have_dicts: | ||
|
@@ -5714,15 +5729,15 @@ def extract_index(data): | |
if have_raw_arrays: | ||
lengths = list(set(raw_lengths)) | ||
if len(lengths) > 1: | ||
raise ValueError('arrays must all be same length') | ||
raise ValueError('Arrays must all be same length') | ||
|
||
if have_dicts: | ||
raise ValueError('Mixing dicts with non-Series may lead to ' | ||
'ambiguous ordering.') | ||
|
||
if have_series: | ||
if lengths[0] != len(index): | ||
msg = ('array length %d does not match index length %d' | ||
msg = ('Array length %d does not match index length %d' | ||
% (lengths[0], len(index))) | ||
raise ValueError(msg) | ||
else: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2236,7 +2236,7 @@ def testit(): | |
|
||
def testit(): | ||
DataFrame({'a': False, 'b': True}) | ||
assertRaisesRegexp(ValueError, 'If using all scalar values, you must must pass an index', testit) | ||
assertRaisesRegexp(ValueError, 'If using all scalar values, you must pass an index', testit) | ||
|
||
def test_insert_error_msmgs(self): | ||
|
||
|
@@ -2774,6 +2774,17 @@ def test_constructor_from_items(self): | |
# pass some columns | ||
recons = DataFrame.from_items(items, columns=['C', 'B', 'A']) | ||
assert_frame_equal(recons, self.frame.ix[:, ['C', 'B', 'A']]) | ||
# not any column either a dict, a list, a tuple, or a numpy.ndarray | ||
import array | ||
recons_ar = DataFrame.from_items([('A', array.array('i', range(10)))]) | ||
recons_rg = DataFrame.from_items([('A', range(10))]) | ||
recons_np = DataFrame.from_items([('A', np.array(range(10)))]) | ||
self.assertEquals(tuple(recons_ar['A']), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. probably better to use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indeed. In the unlikely case someone is in a hurry to see this merged, or irrevocably rejected, he/she should feel free to edit further. |
||
tuple(recons_rg['A'])) | ||
self.assertEquals(tuple(recons_ar['A']), | ||
tuple(recons_np['A'])) | ||
|
||
|
||
|
||
# orient='index' | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh....Iyes....either way is prob fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unordered sequences were not allowed here previously, why the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where was the test that the sequences are ordered ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list
,tuple
, andndarray
objects are ordered sequences meaning if you put elementX
at indexi
it will remain there for the lifetime of the object (modulo any operations that you do to remove it or modify it).dict
objects and PythonMapping
objects cannot generally make the same guarantee (specific objects may enforce an ordering such ascollections.OrderedDict
). So, in the following codea
dict
object is both iterable and has a length. The orderedness is guaranteed by the type of objects that testTrue
in the conditional that you changed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I had understood it as /sorted/ sequence.
While the original code seemingly (see below) guarantees that all items have an immutable order (when modifications of the order are not explicitly made), it does so by also excluding items satisfying this property. Similarly, your comment is leading to the observation that
dict
is not the only possible mapping.collections.OrderedDict
would go as a "sequence", since mappings are accompanied by a warning. [note: I am unsure that the persistence of the structure is required to make an object ordered, which would mean thatxrange
(Python 2) andrange
(Python 3) would qualify)].While I initially looked into this because of
array.array
, an sequence in the Python stdlib, I felt like generalization would be beneficial.At the C level there is
PySequence_Check()
andPyMapping_Check()
. Interestingly I don't think that there is an equivalent at the Python level... may be design decision made to encourage duck-typing. Anyway, whilePyMapping_Check()
is the most interesting source of information about what to check: https://github.com/python-git/python/blob/master/Objects/abstract.c#L2387 .PySequence_Check()
is here: https://github.com/python-git/python/blob/master/Objects/abstract.c#L1806 there is no way to ensure that the property is satisfied.I'd argue that the requirement for orderedness should be mentioned in the documention, and that
extract_index()
(should be calledbuild_index()
, really) relaxes its test for mapping and sequences in order to allow other objects with the right properties. The argument about guarantee does not hold, as shown right below (before the crowd starts brandishing torches and pitchforks, I'd like to stress that the argumentation is moving to the ground of proofs and certainties, and that the sole purpose of the example is to show that there is none - the real argumentation should be around practicality):There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generators can have a
__len__
, as I wrote.Moreover, if the rationale is to treat objects "just like numpy does" we shall agree that the current behaviour of the constructors for
DataFrame
is not accommodating enough.I had no doubt about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgautier right, but
(x for x in range(10))
would fail.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can compute the space necessary for
range
/xrange
arithmeticallyFor arbitrary generator expressions you cannot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud
We are very explicitly talking about generators with a
__len__
. How the length is calculated (implemented in__len__
) is pretty much irrelevant.On a related note, the question of the size of iterators motivated a PEP 424 and might be more frequent than one thinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtratner
Yes it would fail. The length is not passed to downstream iterators when chained. Thinking of it, may be it could for generator expressions, but that's a question for the python-dev list.