Skip to content

BUG: from_dict ignored order of OrderedDict (#8425) #26875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 8, 2019

Conversation

mazayo
Copy link
Contributor

@mazayo mazayo commented Jun 16, 2019

Updated some existing test codes because index order in some DataFrames constructed by from_dict have changed after this fix.

@codecov
Copy link

codecov bot commented Jun 16, 2019

Codecov Report

Merging #26875 into master will decrease coverage by 1.41%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26875      +/-   ##
==========================================
- Coverage   91.88%   90.46%   -1.42%     
==========================================
  Files         179      179              
  Lines       50696    50699       +3     
==========================================
- Hits        46581    45867     -714     
- Misses       4115     4832     +717
Flag Coverage Δ
#multiple 90.46% <100%> (ø) ⬆️
#single ?
Impacted Files Coverage Δ
pandas/core/indexes/api.py 99% <100%> (ø) ⬆️
pandas/core/internals/construction.py 95.96% <100%> (+0.03%) ⬆️
pandas/core/computation/pytables.py 62.5% <0%> (-27.75%) ⬇️
pandas/io/pytables.py 63.82% <0%> (-26.48%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/computation/common.py 84.21% <0%> (-5.27%) ⬇️
pandas/core/computation/expr.py 94.78% <0%> (-3.03%) ⬇️
pandas/io/clipboard/clipboards.py 31.88% <0%> (-2.9%) ⬇️
pandas/io/formats/printing.py 84.49% <0%> (-1.07%) ⬇️
pandas/core/indexes/datetimes.py 96.21% <0%> (-0.17%) ⬇️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 430f0fd...d765439. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 16, 2019

Codecov Report

Merging #26875 into master will decrease coverage by 0.8%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26875      +/-   ##
==========================================
- Coverage   92.79%   91.99%   -0.81%     
==========================================
  Files         180      180              
  Lines       50417    50776     +359     
==========================================
- Hits        46784    46709      -75     
- Misses       3633     4067     +434
Flag Coverage Δ
#multiple 90.63% <100%> (-0.84%) ⬇️
#single 41.83% <100%> (-0.6%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals/construction.py 95.97% <100%> (-0.78%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/io/excel/_openpyxl.py 84.71% <0%> (-3.23%) ⬇️
pandas/core/internals/blocks.py 94.38% <0%> (-1.01%) ⬇️
pandas/core/tools/datetimes.py 85.05% <0%> (-0.74%) ⬇️
pandas/core/arrays/integer.py 96.3% <0%> (-0.59%) ⬇️
pandas/io/formats/printing.py 86.72% <0%> (-0.48%) ⬇️
pandas/core/dtypes/concat.py 96.58% <0%> (-0.46%) ⬇️
pandas/core/config_init.py 95.8% <0%> (-0.4%) ⬇️
... and 154 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2efb607...b379d8a. Read the comment docs.

@@ -125,7 +125,7 @@ def _get_combined_index(indexes, intersect=False, sort=False):
return index


def _union_indexes(indexes, sort=True):
def _union_indexes(indexes, sort=True, ordered=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong: why not set sort=False at the caller?

@topper-123 topper-123 added the DataFrame DataFrame data structure label Jun 16, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @topper-123 , we already pass thru sort=, so pls edit to use that.

@mazayo mazayo force-pushed the from_ordered_dict branch from d765439 to 18ccb11 Compare June 16, 2019 15:25
@mazayo
Copy link
Contributor Author

mazayo commented Jun 16, 2019

not sure what caused this error.
I’ll work on this later tonight.

@mazayo mazayo force-pushed the from_ordered_dict branch from 18ccb11 to 3929945 Compare June 17, 2019 12:57
@mazayo
Copy link
Contributor Author

mazayo commented Jun 20, 2019

Azure pipelines raised some test errors in Python 3.5 and I checked some of them. The change in this PR made the difference in order of DataFrame index if it is constructed from a dict. I suppose this is expected behavior, because from_dict no longer sorts index and a dict is not ordered in Python 3.5.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. can you merge master and ping on green.

@@ -633,6 +633,7 @@ Indexing
- Bug in which :meth:`DataFrame.to_csv` caused a segfault for a reindexed data frame, when the indices were single-level :class:`MultiIndex` (:issue:`26303`).
- Fixed bug where assigning a :class:`arrays.PandasArray` to a :class:`pandas.core.frame.DataFrame` would raise error (:issue:`26390`)
- Allow keyword arguments for callable local reference used in the :meth:`DataFrame.query` string (:issue:`26426`)
- Bug in which :meth:`DataFrame.from_dict` ignored order of OrderedDict when orient='index' (:issue:`8425`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use double backticks around OrderedDict, also around orient='index'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this to Reshaping section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated whatsnew and merged upstream/master.

@jreback jreback added this to the 0.25.0 milestone Jun 21, 2019
@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 21, 2019
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test failure: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=13216&view=logs&jobId=521b7dfd-2989-5ff8-bc8c-7481906480fa&taskId=07b8d9d4-6363-5e2d-bc2b-146a30521256&lineStart=70&lineEnd=70&colStart=1&colEnd=38

You may want to set up a python 3.5-based environment locally to debug. You can use, e.g.

conda env create -n pandas-azure-macos-35  --file=ci/deps/azure-macos-35.yaml 

I'm not sure whether the failure is because the test was relying on that being / not being sorted, or whether concat was relying on that. It'd be good to match the behavior of master, unless it's clearly a bug.

@mazayo
Copy link
Contributor Author

mazayo commented Jun 24, 2019

@TomAugspurger, thank you.
I followed your instruction to create a new env, but it raised below error when importing pandas.

>>> import pandas as pd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/masayo/Documents/work/contributions/pandas/pandas/__init__.py", line 25, in <module>
    from pandas._libs import (hashtable as _hashtable,
  File "/Users/masayo/Documents/work/contributions/pandas/pandas/_libs/__init__.py", line 3, in <module>
    from .tslibs import (
  File "/Users/masayo/Documents/work/contributions/pandas/pandas/_libs/tslibs/__init__.py", line 3, in <module>
    from .conversion import normalize_date, localize_pydatetime
  File "__init__.pxd", line 918, in init pandas._libs.tslibs.conversion
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

FYR, I created another env and installed Python 3.5.6 and pandas (python -m pip install -e). expected in test function test_concat_tuple_keys returns a DataFrame whose index is not sorted.

>>> expected = pd.DataFrame(
...             {'A': {('bee', 'bah', 0): 1.0,
...                    ('bee', 'bah', 1): 1.0,
...                    ('bee', 'boo', 0): 2.0,
...                    ('bee', 'boo', 1): 2.0,
...                    ('bee', 'boo', 2): 2.0},
...              'B': {('bee', 'bah', 0): 1.0,
...                    ('bee', 'bah', 1): 1.0,
...                    ('bee', 'boo', 0): 2.0,
...                    ('bee', 'boo', 1): 2.0,
...                    ('bee', 'boo', 2): 2.0}})
>>> expected
             A    B
bee boo 1  2.0  2.0
        2  2.0  2.0
    bah 1  1.0  1.0
        0  1.0  1.0
    boo 0  2.0  2.0

@TomAugspurger
Copy link
Contributor

I followed your instruction to create a new env, but it raised below error when importing pandas.

Not sure, but if you're in the same git repo, you would need to rebuild the C extensions for that new environment. Locally I have my main pandas git repo, and a second pandas-35 repo.

returns a DataFrame whose index is not sorted.

Can you clarify: is that on this branch? When I run that in a 3.5 env on master (from a few weeks ago) I get

Python 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:30:03)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.DataFrame(
...  {'A': {('bee', 'bah', 0): 1.0,
...         ('bee', 'bah', 1): 1.0,
...         ('bee', 'boo', 0): 2.0,
...         ('bee', 'boo', 1): 2.0,
...         ('bee', 'boo', 2): 2.0},
...   'B': {('bee', 'bah', 0): 1.0,
...         ('bee', 'bah', 1): 1.0,
...         ('bee', 'boo', 0): 2.0,
...         ('bee', 'boo', 1): 2.0,
...         ('bee', 'boo', 2): 2.0}})
             A    B
bee bah 0  1.0  1.0
        1  1.0  1.0
    boo 0  2.0  2.0
        1  2.0  2.0
        2  2.0  2.0

@mazayo
Copy link
Contributor Author

mazayo commented Jun 25, 2019

returns a DataFrame whose index is not sorted.

Can you clarify: is that on this branch?

Yes it is. No change have been made after commit ae19e6c.

You’re right. I didn’t rebuild pandas after merging master.
But even after python setup.py build_ext --inplace -j 4, I still get below error when importing pandas. Maybe it it because the files (pandas
/_libs/tslibs/c_timestamp.*) related to this error is not updated after the build. Could you tell me how to full build?

>>> import pandas as pd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/masayo/Documents/work/contributions/pandas/pandas/__init__.py", line 25, in <module>
    from pandas._libs import (hashtable as _hashtable,
  File "/Users/masayo/Documents/work/contributions/pandas/pandas/_libs/__init__.py", line 3, in <module>
    from .tslibs import (
  File "/Users/masayo/Documents/work/contributions/pandas/pandas/_libs/tslibs/__init__.py", line 3, in <module>
    from .conversion import normalize_date, localize_pydatetime
  File "pandas/_libs/tslibs/c_timestamp.pxd", line 7, in init pandas._libs.tslibs.conversion
    cdef class _Timestamp(datetime):
  File "__init__.pxd", line 918, in init pandas._libs.tslibs.c_timestamp
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 25, 2019 via email

@mazayo
Copy link
Contributor Author

mazayo commented Jun 26, 2019

Thank you. I could successfully full build pandas by python setup.py clean followed by python setup.py build_ext --inplace -j 4.

Here is what I got in this branch and in the the environment created by conda env create -n pandas-azure-macos-35 --file=ci/deps/azure-macos-35.yaml.

>>> import pandas as pd
>>> pd.__version__
'0.25.0.dev0+784.gae19e6c85'
>>> expected = pd.DataFrame(
...             {'A': {('bee', 'bah', 0): 1.0,
...                    ('bee', 'bah', 1): 1.0,
...                    ('bee', 'boo', 0): 2.0,
...                    ('bee', 'boo', 1): 2.0,
...                    ('bee', 'boo', 2): 2.0},
...              'B': {('bee', 'bah', 0): 1.0,
...                    ('bee', 'bah', 1): 1.0,
...                    ('bee', 'boo', 0): 2.0,
...                    ('bee', 'boo', 1): 2.0,
...                    ('bee', 'boo', 2): 2.0}})
>>> expected
             A    B
bee boo 1  2.0  2.0
        2  2.0  2.0
    bah 1  1.0  1.0
        0  1.0  1.0
    boo 0  2.0  2.0
>>>
>>>
>>> import numpy as np
>>> df1 = pd.DataFrame(np.ones((2, 2)), columns=list('AB'))
>>> df2 = pd.DataFrame(np.ones((3, 2)) * 2, columns=list('AB'))
>>> results = pd.concat((df1, df2), keys=[('bee', 'bah'), ('bee', 'boo')])
>>> results
             A    B
bee bah 0  1.0  1.0
        1  1.0  1.0
    boo 0  2.0  2.0
        1  2.0  2.0
        2  2.0  2.0

Sorry it took me a while to get back to you. I was about to go work when I received you message.

@TomAugspurger
Copy link
Contributor

OK, does that output make sense to out? I haven't looked to closely, but it seems like the old dict-of-dicts constructor sorted unordered dictionaries?

@mazayo
Copy link
Contributor Author

mazayo commented Jun 26, 2019

Right. extract_index is called when constructing a DataFrame from a dict. It used to sort keys before the update in this branch.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mazayo really sorry about the delay. Hopefully my suggested change will help. This looks quite close.

index = _union_indexes(indexes)
elif have_dicts:
index = _union_indexes(indexes, sort=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I suppose this is the key one. IIUC, we want to sort for Python=3.5. So this should be

index = _union_indexes(indexes, sort=not compat.PY36)

That will be True for python 3.5 and False for 3.6 and newer.

You'll need to import pandas.compat at the top.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked yet if you need to inspect the contents of the dicts to see if they're ordered or not. Hopefully you won't have to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll take care of this later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of

index = _union_indexes(indexes, sort=not compat.PY36)

changed to

index = _union_indexes(indexes, sort=not (compat.PY36 or have_ordered)) 

to prevent sorting a OrderedDict in Python3.5

@jreback jreback removed this from the 0.25.0 milestone Jul 3, 2019
@mazayo mazayo force-pushed the from_ordered_dict branch from a9ec3fa to de4406b Compare July 4, 2019 19:55
@mazayo mazayo force-pushed the from_ordered_dict branch from b379d8a to 8720697 Compare July 4, 2019 21:14
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small tests comments, otherwise lgtm. ping on green.

@@ -517,7 +517,7 @@ def test_constructor_subclass_dict(self, float_frame):
dct.update(v.to_dict())
data[k] = dct
frame = DataFrame(data)
tm.assert_frame_equal(float_frame.sort_index(), frame)
tm.assert_frame_equal(float_frame, frame.reindex(float_frame.index))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expected = frame.reindex(index=float_frame.index)

@@ -1342,15 +1342,28 @@ def test_constructor_list_of_namedtuples(self):
def test_constructor_orient(self, float_string_frame):
data_dict = float_string_frame.T._series
recons = DataFrame.from_dict(data_dict, orient="index")
expected = float_string_frame.sort_index()
tm.assert_frame_equal(recons, expected)
expected = float_string_frame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the reindex here

@jreback jreback added this to the 0.25.0 milestone Jul 5, 2019
@jreback jreback merged commit 5422807 into pandas-dev:master Jul 8, 2019
@jreback
Copy link
Contributor

jreback commented Jul 8, 2019

thanks @mazayo

@mazayo
Copy link
Contributor Author

mazayo commented Jul 8, 2019

Thanks for offering good first issues here and your support

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DataFrame DataFrame data structure Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame from_dict constructor ignores Ordered dict when orient='index'
4 participants