Skip to content

BUG: .loc with duplicated label may have incorrect index dtype #11497

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 29, 2015

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Nov 1, 2015

.loc result with duplicated keys may have incorred Index dtype.

import pandas as pd

ser = pd.Series([0.1, 0.2], index=pd.Index([1, 2], name='idx'))

# OK
ser.loc[[2, 2, 1]].index
# Int64Index([2, 2, 1], dtype='int64', name=u'idx')

# NG, Int64Index(dtype=object) 
ser.loc[[3, 2, 3]].index 
# Int64Index([3, 2, 3], dtype='object', name=u'idx')
ser.loc[[3, 2, 3, 'x']].index 
# Int64Index([3, 2, 3, u'x'], dtype='object', name=u'idx')

idx = pd.date_range('2011-01-01', '2011-01-02', freq='D', name='idx')
ser = pd.Series([0.1, 0.2], index=idx, name='s')

# OK
ser.loc[[pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-01')]].index
# DatetimeIndex(['2011-01-02', '2011-01-02', '2011-01-01'], dtype='datetime64[ns]', name=u'idx', freq=None)

# NG, ValueError
ser.loc[[pd.Timestamp('2011-01-03'), pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-03')]].index
# ValueError: Inferred frequency None from passed dates does not conform to passed frequency D

After the PR:

Above OK results are unchanged.

import pandas as pd
ser = pd.Series([0.1, 0.2], index=pd.Index([1, 2], name='idx'))

ser.loc[[3, 2, 3]].index 
# Int64Index([3, 2, 3], dtype='int64', name=u'idx')
ser.loc[[3, 2, 3, 'x']].index 
# Index([3, 2, 3, u'x'], dtype='object', name=u'idx')

idx = pd.date_range('2011-01-01', '2011-01-02', freq='D', name='idx')
ser = pd.Series([0.1, 0.2], index=idx, name='s')

ser.loc[[pd.Timestamp('2011-01-03'), pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-03')]].index
# DatetimeIndex(['2011-01-03', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', name=u'idx', freq=None)

@sinhrks sinhrks added Bug Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 1, 2015
@sinhrks sinhrks added this to the 0.17.1 milestone Nov 1, 2015
# non-unique slicing must reset freq
attrs.pop('freq', None)
try:
return self._constructor(new_labels, **attrs), indexer, new_indexer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this logic should go in Index._shallow_copy itself.

@@ -382,6 +390,8 @@ def _shallow_copy(self, values=None, infer=False, **kwargs):
values : the values to create the new Index, optional
infer : boolean, default False
if True, infer the new type of the passed values
reset_attributes : boolean, default False
if True, reset attributes specified in _reset_attributes
kwargs : updates the default attributes for this Index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty convoluted

why is this needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to handle freq reset for DTI / keep for PeriodIndex operation in the function. Will consider better way and remove from this PR.

@sinhrks sinhrks force-pushed the loc_dtype branch 3 times, most recently from eee51c5 to 3dc6db0 Compare November 7, 2015 05:57
@sinhrks
Copy link
Member Author

sinhrks commented Nov 7, 2015

Updated and now green.

@@ -391,6 +395,11 @@ def _shallow_copy(self, values=None, infer=False, **kwargs):

if infer:
attributes['copy'] = False
if self._infer_as_myclass:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is infer not good enough here? (or maybe just reprupose and make infer a class attribute? (lke what you are calling _infer_as_myclass). e.g. confusing that we need/have 2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To achieve below result. .loc against DatetimeIndex internally converts the index to Int64Index. Thus, there should be a logic to revert to DatetimeIndex.

pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3], infer=True)
# Int64Index([1, 2, 3], dtype='int64')

idx = pd.date_range('2011-01-01', freq='D', periods=3)
idx._shallow_copy(idx.asi8, infer=True)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')

In current master, the below case results in:

idx._shallow_copy(idx.asi8, infer=True)
Int64Index([1293840000000000000, 1293926400000000000, 1294012800000000000], dtype='int64')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. The point is that maybe if we have a class attribute, we don't need the instance one (e.g. passing infer=True at all). OR (if you think we need both) I would name them the same, and only use the instance passed one if its not None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is the inference of the above example?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below is the behavior of current master.

pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3], infer=True)
# Int64Index([1, 2, 3], dtype='int64')

pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3])
# Float64Index([1, 2, 3], dtype='int64')
idx = pd.date_range('2011-01-01', freq='D', periods=3)
idx._shallow_copy(idx.asi8, infer=True)
# Int64Index([1293840000000000000, 1293926400000000000, 1294012800000000000], dtype='int64')

idx._shallow_copy(idx.asi8)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')

.loc must infer because user may pass different dtype.

pd.Index([1., 2., 3.])._shallow_copy(['a', 'b', 'c'])
# Float64Index([u'a', u'b', u'c'], dtype='|S1')
# -> NG, incorrect dtype

Thus, I understand _shallow_copy should cover followings.

  • Return the same Index if values are guaranteed to be the same dtype as the original (without inference)
  • Return the correct type of Index if values can have different dtypes from the original (with inference)
    • Inference prioritizing original class (used in datetime-like Index accepting internal repr like asi8)
    • Normal inference (simply pass values to Index)

Class attribute _infer_as_myclass intends to split the logic as below (or needs to override _shallow_copy in tseries/base)

pd.Index([1., 2., 3.])._shallow_copy(['2011-01', '2011-02', '2011-02'], infer=True)
# Index([u'2011-01', u'2011-02', u'2011-02'], dtype='object')

idx._shallow_copy(['2011-01', '2011-02', '2011-02'], infer=True)
# Index([u'2011-01', u'2011-02', u'2011-02'], dtype='object')
# -> want DatetimeIndex in this case. 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks not questing the need for this! just whether we need infer AND _infer_as_myclass or could we just use a single class variable for this (rather than the passed kw infer). Just confusing to have 2 sort-of-similar ways of doing this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. One idea is splitting the method. Based on the current impl:

  • Use _simple_new when inference is not needed.
  • Use _shallow_copy when inference is needed.

I've once tried it, but it isn't work by the simple change because of some impl differences, for example MultiIndex doesn't have _simple_new.

Let me try it in a separate PR, or change the milestone to 0.18.0.

@jreback
Copy link
Contributor

jreback commented Nov 13, 2015

@sinhrks can you update

@jreback jreback modified the milestones: Next Major Release, 0.17.1 Nov 14, 2015
@jreback
Copy link
Contributor

jreback commented Nov 14, 2015

@sinhrks yeh, prob requires some playing around to get this to work nicely. Like your separation of concerns though. We should document this prob at top of core/index.py

@sinhrks sinhrks force-pushed the loc_dtype branch 5 times, most recently from 4a232a9 to 6e3ccb0 Compare November 23, 2015 20:53
@sinhrks
Copy link
Member Author

sinhrks commented Nov 24, 2015

How about adding _infer_new as below?

  • _infer_new: The same as current _shallow_copy(infer=True). Always tries to infer correct type.
  • _shallow_copy: Remove infer kw. Always use the same type as the caller.

We can replace _shallow_copy by _get_attribute_dict + _simple_new completely, but I didn't do because adding _get_attribute_dict is redundant / easy to be missed.

If OK, I'll squash.

@jreback
Copy link
Contributor

jreback commented Nov 25, 2015

@sinhrks seems reasonable call it _shallow_copy_with_infer? to make the connection with _shallow_copy

@sinhrks
Copy link
Member Author

sinhrks commented Nov 28, 2015

@jreback OK, renamed. Also added general explanation about simple_new, shallow_copy and shallow_copy_with_infer. Could you check?

jreback added a commit that referenced this pull request Nov 29, 2015
BUG: .loc with duplicated label may have incorrect index dtype
@jreback jreback merged commit 431e224 into pandas-dev:master Nov 29, 2015
@jreback
Copy link
Contributor

jreback commented Nov 29, 2015

thanks @sinhrks and nice docs!

@sinhrks sinhrks deleted the loc_dtype branch November 29, 2015 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants