-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix #13149 and ENH: 'copy' param in Index.astype() #13209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
tests pls. |
cb58d56
to
ade4e95
Compare
Added tests. (I hope I'm doing this right.) |
@@ -259,6 +259,12 @@ def test_astype(self): | |||
for dtype in ['M8[ns]', 'm8[ns]']: | |||
self.assertRaises(TypeError, lambda: i.astype(dtype)) | |||
|
|||
# GH 13149 | |||
# a NaN astype int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add tests for datetimelikes as well to cover the bases; I think your change might be changing behavior there as well (which is not correct)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean this conversion:
pd.DatetimeIndex([1,2,np.nan]).astype(int)
Out[16]: array([ 1, 2, -9223372036854775808])
Its behaviour is unchanged. I didn't touch other indexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - just want to make sure that is tested as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
(BTW, DatetimeIndex.astype(int), TimedeltaIndex.astype(int) return a numpy.array. Isn't it a bug? PeriodIndex and Float64Index return Int64Index.)
4d310e9
to
07e27e1
Compare
def test_astype_int(self): | ||
# GH 13149 (not changed, added explicit tests) | ||
idx = DatetimeIndex([0, 1, np.NAN]) | ||
expected = np.array([0, 1, -9223372036854775808], dtype='int64') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is wrong ; see if u can fix so these are returned as index types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've fixed this, I guess.
@@ -152,3 +152,4 @@ Bug Fixes | |||
- Bug in ``Period`` addition raises ``TypeError`` if ``Period`` is on right hand side (:issue:`13069`) | |||
- Bug in ``Peirod`` and ``Series`` or ``Index`` comparison raises ``TypeError`` (:issue:`13200`) | |||
- Bug in ``pd.set_eng_float_format()`` that would prevent NaN's from formatting (:issue:`11981`) | |||
- Bug in ``Float64Index.astype(int)`` raises ``ValueError`` if ``NaN`` is converted to integer; furthermore ``TimedeltaIndex``, ``DatetimeIndex`` return ``Index64Int`` on ``.astype(int)`` (:issue:`13149`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you can list the second point as a separate bug fix issue and use the PR number as the issue number
7243bb6
to
3d3a5b3
Compare
@@ -832,8 +832,8 @@ def astype(self, dtype): | |||
|
|||
if dtype == np.object_: | |||
return self.asobject | |||
elif dtype == _INT64_DTYPE: | |||
return self.asi8.copy() | |||
elif issubclass(np.dtype(dtype).type, np.integer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, better to use is_integer_dtype(dtype)
more generic her
(and change dtype == np.object_
to is_object_dtype(dtype)
)
and dtype == _NS_DTYPE
to is_int64_dtype(dtype)
and dtype == str
to is_string_dtype(dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and
dtype == _NS_DTYPE
tois_int64_dtype(dtype)
I guess it's is_datetime64_ns_dtype(dtype)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep - typing fast !
looks pretty good. a couple of syntax changes. ping when green. |
@@ -154,3 +154,5 @@ Bug Fixes | |||
- Bug in ``Period`` addition raises ``TypeError`` if ``Period`` is on right hand side (:issue:`13069`) | |||
- Bug in ``Peirod`` and ``Series`` or ``Index`` comparison raises ``TypeError`` (:issue:`13200`) | |||
- Bug in ``pd.set_eng_float_format()`` that would prevent NaN's from formatting (:issue:`11981`) | |||
- Bug in ``Float64Index.astype(int)`` raises ``ValueError`` if ``NaN`` is converted to integer (:issue:`13149`) | |||
- Bug in ``TimedeltaIndex``, ``DatetimeIndex.astype(int)`` where ``.astype(int)`` returned ``np.array`` instead of ``Index64Int`` (:issue:`13209`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int64Index
expected = Index([Period('2016-05-16', freq='D')] + | ||
[Period(NaT, freq='D')] * 3, dtype='object') | ||
# Hack because of lack of support for Period null checking | ||
# tm.assert_index_equal(result, expected) # it raises |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I ran into an issue with the null Period Period(NaT, 'D")
not being seen as null by pd.isnull()
. It's also impossible to directly check array or index equivalence if they contain null Periods.
I can elaborate on this and possible fixes (the fixes I tried are rather beyond the scope of this commit). Unless we don't want Period(NaT, 'D")
to be treated as null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #12759 this will be addressed, so for now use Period('NaT', 'D')
until this is fixed.
You can add a TODO / comment with this issue number and can be updated later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Added TODO.
@@ -184,3 +189,4 @@ Bug Fixes | |||
|
|||
|
|||
- Bug in ``groupby`` where ``apply`` returns different result depending on whether first result is ``None`` or not (:issue:`12824`) | |||
- Bug in ``core.common.is_timedelta64_ns_dtype()`` which always returned ``False`` (:issue:`13209`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove this; this is an internal function
@pijucha looks pretty good. There are some existing
|
[-9223372036854775808] * 3, dtype=np.int64) | ||
tm.assert_index_equal(result, expected) | ||
|
||
result = idx.astype(str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, Python2 returns an Index with mixed type: Date '2016-05-16' is a string, and NaT's are unicode. Is it all right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in py2 these should all be strings not Unicode
never should be mixed
can u show an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example from tests (nb. it failed assert_index_equal
with pyhton2.7):
In [2]: idx = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN])
In [3]: idx
Out[3]: DatetimeIndex(['2016-05-16', 'NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
In [4]: idx2 = idx.astype(str)
In [5]: idx2
Out[5]: Index([u'2016-05-16', u'NaT', u'NaT', u'NaT'], dtype='object')
In [7]: type(idx2[0])
Out[7]: str
In [8]: type(idx2[1])
Out[8]: unicode
I think this u('NaT')
may be responsible:
# class DatetimeIndex:
def _format_native_types(self, na_rep=u('NaT'),
date_format=None, **kwargs):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm looks like we are always returning Unicode
we should be returning a string here under py2 (and the display formatters will make them Unicode when displaying)
see if returning as a string breaks anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to be ok. It's just passed Python2.7 tests with "not network and not slow and not disabled".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok great. make sure we add this as a test (and comment)
@@ -814,8 +815,7 @@ def _add_offset(self, offset): | |||
"or DatetimeIndex", PerformanceWarning) | |||
return self.astype('O') + offset | |||
|
|||
def _format_native_types(self, na_rep=u('NaT'), | |||
date_format=None, **kwargs): | |||
def _format_native_types(self, na_rep='NaT', date_format=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this default na_rep
. No breaks observed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k thanks
I don't know if you added that test (below) which discovered this bug? (if not pls add with a nice comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The test is among astype tests for DatetimeIndex (just an assert_index_equal()
, which previously discovered the Indexes had different inferred types).
I'll comment. Or should I rather make it more explicit and place somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep I saw that. its fine
@pijucha looks really good! some test moving around / slight refactoring. ping when green. |
1. Float64Index.astype(int) raises ValueError if a NaN is present. Previously, it converted NaN's to the smallest negative integer. 2. TimedeltaIndex.astype(int) and DatetimeIndex.astype(int) return Int64Index, which is consistent with behavior of other Indexes. Previously, they returned a numpy.array of ints. 3. Added: - bool parameter 'copy' to Index.astype() - shared doc string to .astype() - tests on .astype() (consolidated and added new) - bool parameter 'copy' to Categorical.astype() 4. Internals: - Fixed core.common.is_timedelta64_ns_dtype(). - Set a default NaT representation to a string type in a parameter of DatetimeIndex._format_native_types(). Previously, it produced a unicode u'NaT' in Python2.
@jreback It's green. |
@pijucha thanks! very nice PR. you really dove deep into some internals! keep em coming! |
@jreback Thanks for the guidelines! They helped a lot. |
git diff upstream/master | flake8 --diff
Update
Previously, it converted NaN's to the smallest negative integer.
Int64Index, which is consistent with behavior of other Indexes.
Previously, they returned a numpy.array of ints.
Previously, it produced a unicode u'NaT' in Python2.