Skip to content

Maintain Dict Ordering with Concat #21512

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v0.23.2.txt
Original file line number Diff line number Diff line change
Expand Up @@ -83,4 +83,5 @@ Bug Fixes
**Other**

- Bug in :meth:`Series.nlargest` for signed and unsigned integer dtypes when the minimum value is present (:issue:`21426`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you picked up some extra text here. also move to 0.24

-
- Bug in :class:`_Concatenator` should not sort Ordered Dictionaries (:issue:`21510`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you come up with some less technical? This is too focused on the technical aspect of the change, but you really should be writing something that a casual user could read and understand in the whatsnew

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I would just say that the bug was causing pd.concat to not respect order present in an ordered dictionary object (something along those lines).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, check my last commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting closer but you should remove any reference to _Concatenator as it is a private class. I would be fine with almost a copy/paste of what @gfyoung suggested above as well

-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to go into 0.24.0 as even though its a bug fix, its a jaring one, should be in reshaping.

7 changes: 5 additions & 2 deletions pandas/core/reshape/concat.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
"""
concat routines
"""

from collections import OrderedDict
import numpy as np
from pandas import compat, DataFrame, Series, Index, MultiIndex
from pandas.core.index import (_get_objs_combined_axis,
Expand Down Expand Up @@ -250,7 +250,10 @@ def __init__(self, objs, axis=0, join='outer', join_axes=None,

if isinstance(objs, dict):
if keys is None:
keys = sorted(objs)
if not isinstance(objs, OrderedDict):
keys = sorted(objs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normal dicts are ordered too in python3.6+, that must be checked too. There is a PY36 constant somewhere, that you can use for this check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we already took a stance on this as a project, but the orderedness of dicts in 3.6 is considered an implementation detail of cpython, while it will be a language feature starting from 3.7 . So in principle in 3.6 users should not rely on it, and we should not assume they do.

@topper-123 's comment still holds I think, but for 3.7

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we've already relied on that implementation detail in other changes (see #19830, #19859, #19884) so I'd be +1 on relying on that here as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, given that Python 3.6 is at 3.6.5 now, and its's being formalised in 3.7, I think this is quite safe. They're not going to change implementation now...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very simliar to in structure to pandas.core.common._dict_keys_to_ordered_list. structure like this. see if we can consolidate all of these try/sorting routings.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback you mean use: keys = com._dict_keys_to_ordered_list(objs) ? If so, I'm afraid we'll need to refactor _dict_keys_to_ordered_list because it assumes that PY36 has sorted dict which I think its not always true, and will break test_concat_dict test case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's the point i am raising, I want a refactor here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, from _dict_keys_to_ordered_list method, i suggest change the conditionif PY36 or isinstance(mapping, OrderedDict): to if PY37 or isinstance(mapping, OrderedDict): then, remove test_constructor_dict_order_insertion and test_constructor_dict_order. What do you think about it?

else:
keys = objs
objs = [objs[k] for k in keys]
else:
objs = list(objs)
Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/reshape/test_concat.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
from collections import OrderedDict
from warnings import catch_warnings
from itertools import combinations, product

Expand Down Expand Up @@ -1294,6 +1295,16 @@ def test_concat_rename_index(self):
tm.assert_frame_equal(result, exp)
assert result.index.names == exp.index.names

def test_concat_with_ordered_dict(self):
# GH 21510
ps = pd.concat(OrderedDict([('First', pd.Series(range(3))),
('Another', pd.Series(range(4)))]))
ps_list = list(ps.keys())
exp_list = [('First', 0), ('First', 1), ('First', 2),
('Another', 0), ('Another', 1),
('Another', 2), ('Another', 3)]
assert ps_list == exp_list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically with test cases we have a pattern of using result and expected as the variable names. In this case, you should create an expected variable that is exactly the Series you are looking for and then us tm.assert_series_equal(result, expected) to ensure the proper outcome

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is ok if I just change the variables names (expected and result), or I should also create the Series object?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create the Series

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to figure it out how I can create the Series with different sizes without using the concat by itself.

Copy link
Author

@SaturninoMateus SaturninoMateus Jun 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to create like this:

a = pd.Series(range(3),range(3))
b = pd.Series(range(4),range(4))
expected = pd.Series([a,b],index=['First','Another'])

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


def test_crossed_dtypes_weird_corner(self):
columns = ['A', 'B', 'C', 'D']
df1 = DataFrame({'A': np.array([1, 2, 3, 4], dtype='f8'),
Expand Down