-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: .loc with duplicated label may have incorrect index dtype #11497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# non-unique slicing must reset freq | ||
attrs.pop('freq', None) | ||
try: | ||
return self._constructor(new_labels, **attrs), indexer, new_indexer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this logic should go in Index._shallow_copy
itself.
@@ -382,6 +390,8 @@ def _shallow_copy(self, values=None, infer=False, **kwargs): | |||
values : the values to create the new Index, optional | |||
infer : boolean, default False | |||
if True, infer the new type of the passed values | |||
reset_attributes : boolean, default False | |||
if True, reset attributes specified in _reset_attributes | |||
kwargs : updates the default attributes for this Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty convoluted
why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to handle freq
reset for DTI / keep for PeriodIndex operation in the function. Will consider better way and remove from this PR.
eee51c5
to
3dc6db0
Compare
Updated and now green. |
@@ -391,6 +395,11 @@ def _shallow_copy(self, values=None, infer=False, **kwargs): | |||
|
|||
if infer: | |||
attributes['copy'] = False | |||
if self._infer_as_myclass: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is infer
not good enough here? (or maybe just reprupose and make infer
a class attribute? (lke what you are calling _infer_as_myclass
). e.g. confusing that we need/have 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To achieve below result. .loc
against DatetimeIndex
internally converts the index to Int64Index
. Thus, there should be a logic to revert to DatetimeIndex
.
pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3], infer=True)
# Int64Index([1, 2, 3], dtype='int64')
idx = pd.date_range('2011-01-01', freq='D', periods=3)
idx._shallow_copy(idx.asi8, infer=True)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')
In current master, the below case results in:
idx._shallow_copy(idx.asi8, infer=True)
Int64Index([1293840000000000000, 1293926400000000000, 1294012800000000000], dtype='int64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand. The point is that maybe if we have a class attribute, we don't need the instance one (e.g. passing infer=True
at all). OR (if you think we need both) I would name them the same, and only use the instance passed one if its not None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is the inference of the above example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below is the behavior of current master.
pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3], infer=True)
# Int64Index([1, 2, 3], dtype='int64')
pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3])
# Float64Index([1, 2, 3], dtype='int64')
idx = pd.date_range('2011-01-01', freq='D', periods=3)
idx._shallow_copy(idx.asi8, infer=True)
# Int64Index([1293840000000000000, 1293926400000000000, 1294012800000000000], dtype='int64')
idx._shallow_copy(idx.asi8)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')
.loc
must infer
because user may pass different dtype.
pd.Index([1., 2., 3.])._shallow_copy(['a', 'b', 'c'])
# Float64Index([u'a', u'b', u'c'], dtype='|S1')
# -> NG, incorrect dtype
Thus, I understand _shallow_copy
should cover followings.
- Return the same
Index
if values are guaranteed to be the same dtype as the original (without inference) - Return the correct type of
Index
if values can have different dtypes from the original (with inference)- Inference prioritizing original class (used in datetime-like
Index
accepting internal repr likeasi8
) - Normal inference (simply pass values to
Index
)
- Inference prioritizing original class (used in datetime-like
Class attribute _infer_as_myclass
intends to split the logic as below (or needs to override _shallow_copy
in tseries/base
)
pd.Index([1., 2., 3.])._shallow_copy(['2011-01', '2011-02', '2011-02'], infer=True)
# Index([u'2011-01', u'2011-02', u'2011-02'], dtype='object')
idx._shallow_copy(['2011-01', '2011-02', '2011-02'], infer=True)
# Index([u'2011-01', u'2011-02', u'2011-02'], dtype='object')
# -> want DatetimeIndex in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sinhrks not questing the need for this! just whether we need infer
AND _infer_as_myclass
or could we just use a single class variable for this (rather than the passed kw infer
). Just confusing to have 2 sort-of-similar ways of doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. One idea is splitting the method. Based on the current impl:
- Use
_simple_new
when inference is not needed. - Use
_shallow_copy
when inference is needed.
I've once tried it, but it isn't work by the simple change because of some impl differences, for example MultiIndex
doesn't have _simple_new
.
Let me try it in a separate PR, or change the milestone to 0.18.0.
@sinhrks can you update |
@sinhrks yeh, prob requires some playing around to get this to work nicely. Like your separation of concerns though. We should document this prob at top of core/index.py |
4a232a9
to
6e3ccb0
Compare
How about adding
We can replace If OK, I'll squash. |
@sinhrks seems reasonable call it |
@jreback OK, renamed. Also added general explanation about |
BUG: .loc with duplicated label may have incorrect index dtype
thanks @sinhrks and nice docs! |
.loc
result with duplicated keys may have incorredIndex
dtype.After the PR:
Above OK results are unchanged.