BUG: .loc with duplicated label may have incorrect index dtype #11497

sinhrks · 2015-11-01T12:05:05Z

.loc result with duplicated keys may have incorrect Index dtype.

import pandas as pd

ser = pd.Series([0.1, 0.2], index=pd.Index([1, 2], name='idx'))

# OK
ser.loc[[2, 2, 1]].index
# Int64Index([2, 2, 1], dtype='int64', name=u'idx')

# NG, Int64Index(dtype=object) 
ser.loc[[3, 2, 3]].index 
# Int64Index([3, 2, 3], dtype='object', name=u'idx')
ser.loc[[3, 2, 3, 'x']].index 
# Int64Index([3, 2, 3, u'x'], dtype='object', name=u'idx')

idx = pd.date_range('2011-01-01', '2011-01-02', freq='D', name='idx')
ser = pd.Series([0.1, 0.2], index=idx, name='s')

# OK
ser.loc[[pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-01')]].index
# DatetimeIndex(['2011-01-02', '2011-01-02', '2011-01-01'], dtype='datetime64[ns]', name=u'idx', freq=None)

# NG, ValueError
ser.loc[[pd.Timestamp('2011-01-03'), pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-03')]].index
# ValueError: Inferred frequency None from passed dates does not conform to passed frequency D

After the PR:

Above OK results are unchanged.

import pandas as pd
ser = pd.Series([0.1, 0.2], index=pd.Index([1, 2], name='idx'))

ser.loc[[3, 2, 3]].index 
# Int64Index([3, 2, 3], dtype='int64', name=u'idx')
ser.loc[[3, 2, 3, 'x']].index 
# Index([3, 2, 3, u'x'], dtype='object', name=u'idx')

idx = pd.date_range('2011-01-01', '2011-01-02', freq='D', name='idx')
ser = pd.Series([0.1, 0.2], index=idx, name='s')

ser.loc[[pd.Timestamp('2011-01-03'), pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-03')]].index
# DatetimeIndex(['2011-01-03', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', name=u'idx', freq=None)
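The ValueError in the "before" case arises because freq='D' was carried over to labels that are no longer evenly spaced. As a rough illustration only (plain Python, not the pandas code; `conforms_to_daily_freq` is a hypothetical helper), the conformity check amounts to something like:

```python
from datetime import date, timedelta


def conforms_to_daily_freq(labels):
    """Return True if consecutive labels are exactly one day apart."""
    one_day = timedelta(days=1)
    return all(b - a == one_day for a, b in zip(labels, labels[1:]))


# The original index: two consecutive days.
original = [date(2011, 1, 1), date(2011, 1, 2)]
# Labels requested through .loc with a duplicated key, as in the example above.
selected = [date(2011, 1, 3), date(2011, 1, 2), date(2011, 1, 3)]

print(conforms_to_daily_freq(original))  # True
print(conforms_to_daily_freq(selected))  # False
```

Because the selected labels cannot conform to a daily frequency, the result has to be built with freq=None rather than inheriting freq from the original index.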

jreback · 2015-11-01T14:48:31Z

pandas/core/index.py

+        # non-unique slicing must reset freq
+        attrs.pop('freq', None)
+        try:
+            return self._constructor(new_labels, **attrs), indexer, new_indexer


I think this logic should go in Index._shallow_copy itself.

jreback · 2015-11-07T03:19:47Z

pandas/core/index.py

@@ -382,6 +390,8 @@ def _shallow_copy(self, values=None, infer=False, **kwargs):
        values : the values to create the new Index, optional
        infer : boolean, default False
            if True, infer the new type of the passed values
+        reset_attributes : boolean, default False
+            if True, reset attributes specified in _reset_attributes
        kwargs : updates the default attributes for this Index


this is pretty convoluted

why is this needed?

I tried to handle resetting freq for DatetimeIndex while keeping it for PeriodIndex operations inside the function. I will consider a better way and remove this from the PR.

sinhrks · 2015-11-07T14:45:32Z

Updated and now green.

jreback · 2015-11-07T14:51:17Z

pandas/core/index.py

@@ -391,6 +395,11 @@ def _shallow_copy(self, values=None, infer=False, **kwargs):

        if infer:
            attributes['copy'] = False
+            if self._infer_as_myclass:


why is infer not good enough here? Or maybe just repurpose it and make infer a class attribute (like what you are calling _infer_as_myclass)? It is confusing that we need/have 2.

This is needed to achieve the result below. .loc against a DatetimeIndex internally converts the index to Int64Index, so there must be logic to convert it back to DatetimeIndex.

pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3], infer=True)
# Int64Index([1, 2, 3], dtype='int64')

idx = pd.date_range('2011-01-01', freq='D', periods=3)
idx._shallow_copy(idx.asi8, infer=True)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')

In current master, the below case results in:

idx._shallow_copy(idx.asi8, infer=True)
# Int64Index([1293840000000000000, 1293926400000000000, 1294012800000000000], dtype='int64')

I understand. The point is that if we have a class attribute, maybe we don't need the instance one (i.e. passing infer=True at all). Or, if you think we need both, I would name them the same and only use the one passed to the instance if it's not None.

where is the inference in the above example?

Below is the behavior of current master.

pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3], infer=True)
# Int64Index([1, 2, 3], dtype='int64')
pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3])
# Float64Index([1, 2, 3], dtype='int64')

idx = pd.date_range('2011-01-01', freq='D', periods=3)
idx._shallow_copy(idx.asi8, infer=True)
# Int64Index([1293840000000000000, 1293926400000000000, 1294012800000000000], dtype='int64')
idx._shallow_copy(idx.asi8)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')

.loc must infer because the user may pass labels of a different dtype.

pd.Index([1., 2., 3.])._shallow_copy(['a', 'b', 'c'])
# Float64Index([u'a', u'b', u'c'], dtype='|S1')
# -> NG, incorrect dtype

Thus, my understanding is that _shallow_copy should cover the following cases:

1. Return the same type of Index if values are guaranteed to have the same dtype as the original (without inference).
2. Return the correct type of Index if values can have a different dtype from the original (with inference):
   - Inference prioritizing the original class (used by datetime-like Index accepting an internal representation such as asi8).
   - Normal inference (simply pass values to Index).

The class attribute _infer_as_myclass is intended to split the logic as below (the alternative is to override _shallow_copy in tseries/base).

pd.Index([1., 2., 3.])._shallow_copy(['2011-01', '2011-02', '2011-02'], infer=True)
# Index([u'2011-01', u'2011-02', u'2011-02'], dtype='object')
idx._shallow_copy(['2011-01', '2011-02', '2011-02'], infer=True)
# Index([u'2011-01', u'2011-02', u'2011-02'], dtype='object')
# -> want DatetimeIndex in this case.
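To make the two inference modes concrete, here is a toy sketch in plain Python. It is not the pandas implementation: `infer_kind` and `infer_kind_for` are hypothetical helpers, the "kinds" are just strings, and parsing with `strptime` merely stands in for whatever conversion a datetime-like index would attempt.

```python
from datetime import datetime


def infer_kind(values):
    """Normal inference: pick a kind from the values alone."""
    if all(isinstance(v, int) for v in values):
        return "int64"
    if all(isinstance(v, float) for v in values):
        return "float64"
    return "object"


def infer_kind_for(original_kind, values, infer_as_myclass=False):
    """Inference that optionally tries the caller's own kind first."""
    if infer_as_myclass and original_kind == "datetime":
        try:
            # Hypothetical conversion step: accept 'YYYY-MM' strings as dates.
            [datetime.strptime(v, "%Y-%m") for v in values]
            return "datetime"
        except (TypeError, ValueError):
            pass  # conversion failed, fall through to normal inference
    return infer_kind(values)


labels = ["2011-01", "2011-02", "2011-02"]
print(infer_kind_for("float64", labels))                              # object
print(infer_kind_for("datetime", labels))                             # object
print(infer_kind_for("datetime", labels, infer_as_myclass=True))      # datetime
print(infer_kind_for("datetime", ["a", "b"], infer_as_myclass=True))  # object
```

With the flag off, both callers get an object result, which matches the behavior shown above; with the flag on, the datetime-like caller gets its own kind back whenever the values can be converted.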

@sinhrks not questioning the need for this! Just whether we need infer AND _infer_as_myclass, or whether we could use a single class variable for this (rather than the passed kw infer). It is just confusing to have 2 sort-of-similar ways of doing this.

Ah I see. One idea is splitting the method. Based on the current impl:

Use _simple_new when inference is not needed.

Use _shallow_copy when inference is needed.

I tried this once, but a simple change doesn't work because of some implementation differences; for example, MultiIndex doesn't have _simple_new.

Let me try it in a separate PR, or change the milestone to 0.18.0.

jreback · 2015-11-13T18:49:37Z

@sinhrks can you update

jreback · 2015-11-14T15:25:16Z

@sinhrks yeh, prob requires some playing around to get this to work nicely. Like your separation of concerns though. We should document this prob at top of core/index.py

sinhrks · 2015-11-24T20:26:17Z

How about adding _infer_new as below?

_infer_new: The same as current _shallow_copy(infer=True). Always tries to infer correct type.
_shallow_copy: Remove infer kw. Always use the same type as the caller.

We could replace _shallow_copy with _get_attribute_dict + _simple_new completely, but I didn't, because having to call _get_attribute_dict every time is redundant and easy to miss.

If OK, I'll squash.

jreback · 2015-11-25T14:36:03Z

@sinhrks seems reasonable. Call it _shallow_copy_with_infer, to make the connection with _shallow_copy?

sinhrks · 2015-11-28T02:21:08Z

@jreback OK, renamed. Also added a general explanation of _simple_new, _shallow_copy and _shallow_copy_with_infer. Could you check?
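For readers following along, the division of labour between the three constructors agreed on in this thread can be sketched with toy classes. This is an illustration under assumptions, not the code in core/index.py: `ToyIndex`, `ToyIntIndex` and `ToyFloatIndex` are hypothetical, and only `name` is treated as an attribute to carry over.

```python
class ToyIndex:
    """Toy stand-in for an index; not the pandas implementation."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name

    @classmethod
    def _simple_new(cls, values, name=None):
        # No inference and no attribute copying: build exactly this class.
        return cls(values, name=name)

    def _shallow_copy(self, values):
        # Always the same class as the caller, with its attributes carried over.
        return type(self)._simple_new(values, name=self.name)

    def _shallow_copy_with_infer(self, values):
        # Pick the class from the values, then carry the attributes over.
        values = list(values)
        if values and all(isinstance(v, int) for v in values):
            cls = ToyIntIndex
        elif values and all(isinstance(v, float) for v in values):
            cls = ToyFloatIndex
        else:
            cls = ToyIndex
        return cls._simple_new(values, name=self.name)


class ToyIntIndex(ToyIndex):
    pass


class ToyFloatIndex(ToyIndex):
    pass


idx = ToyFloatIndex([1.0, 2.0], name="idx")
same = idx._shallow_copy([1, 2, 3])
inferred = idx._shallow_copy_with_infer([1, 2, 3])
mixed = idx._shallow_copy_with_infer([3, 2, 3, "x"])

print(type(same).__name__)      # ToyFloatIndex
print(type(inferred).__name__)  # ToyIntIndex
print(type(mixed).__name__)     # ToyIndex
print(mixed.name)               # idx
```

In this sketch a caller that may receive labels of another dtype, as .loc does, would use `_shallow_copy_with_infer`, while callers that know the dtype is unchanged would use `_shallow_copy`.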

jreback · 2015-11-29T18:01:24Z

thanks @sinhrks and nice docs!

sinhrks added the Bug and Indexing labels Nov 1, 2015

sinhrks added this to the 0.17.1 milestone Nov 1, 2015

jreback reviewed Nov 1, 2015
sinhrks force-pushed the loc_dtype branch from b2bd9f6 to 2d523c9 Compare November 7, 2015 02:12

sinhrks mentioned this pull request Nov 7, 2015

BUG/API: Clarify the behaviour of fillna downcasting #11537

Closed

1 task

sinhrks force-pushed the loc_dtype branch from 2d523c9 to a28efab Compare November 7, 2015 02:53

jreback reviewed Nov 7, 2015
sinhrks force-pushed the loc_dtype branch 3 times, most recently from eee51c5 to 3dc6db0 Compare November 7, 2015 05:57

jreback reviewed Nov 7, 2015
sinhrks mentioned this pull request Nov 13, 2015

TST: Enable Index dtype comparison by default #11588

Merged

sinhrks force-pushed the loc_dtype branch from 3dc6db0 to 2956e4a Compare November 14, 2015 05:27

jreback modified the milestones: Next Major Release, 0.17.1 Nov 14, 2015

sinhrks force-pushed the loc_dtype branch 5 times, most recently from 4a232a9 to 6e3ccb0 Compare November 23, 2015 20:53

sinhrks modified the milestones: 0.18.0, Next Major Release Nov 24, 2015

max-sixty mentioned this pull request Nov 24, 2015

BUG: Index does not inherit existing Index or DatatetimeIndex object … #11695

Closed

BUG: .loc with duplicated label may have incorrect index dtype

5a4ba71

sinhrks force-pushed the loc_dtype branch from a8ff530 to 5a4ba71 Compare November 28, 2015 01:26

jreback added a commit that referenced this pull request Nov 29, 2015

Merge pull request #11497 from sinhrks/loc_dtype

431e224

BUG: .loc with duplicated label may have incorrect index dtype

jreback merged commit 431e224 into pandas-dev:master Nov 29, 2015

sinhrks deleted the loc_dtype branch November 29, 2015 19:42
