Skip to content

BUG: Fix #13149 and ENH: 'copy' param in Index.astype() #13209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

pijucha
Copy link
Contributor

@pijucha pijucha commented May 17, 2016

Update

  1. Float64Index.astype(int) raises ValueError if a NaN is present.
    Previously, it converted NaN's to the smallest negative integer.
  2. TimedeltaIndex.astype(int) and DatetimeIndex.astype(int) return.
    Int64Index, which is consistent with behavior of other Indexes.
    Previously, they returned a numpy.array of ints.
  3. Added:
    • bool parameter 'copy' to Index.astype()
    • shared doc string to .astype()
    • tests on .astype() (consolidated and added new)
    • bool parameter 'copy' to Categorical.astype()
  4. Fixed core.common.is_timedelta64_ns_dtype().
  5. Set a default NaT representation to a string type in a parameter of DatetimeIndex._format_native_types().
    Previously, it produced a unicode u'NaT' in Python2.

@jreback
Copy link
Contributor

jreback commented May 17, 2016

tests pls.

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions labels May 17, 2016
@pijucha pijucha force-pushed the bug13149 branch 2 times, most recently from cb58d56 to ade4e95 Compare May 17, 2016 22:27
@pijucha
Copy link
Contributor Author

pijucha commented May 17, 2016

Added tests. (I hope I'm doing this right.)

@@ -259,6 +259,12 @@ def test_astype(self):
for dtype in ['M8[ns]', 'm8[ns]']:
self.assertRaises(TypeError, lambda: i.astype(dtype))

# GH 13149
# a NaN astype int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add tests for datetimelikes as well to cover the bases; I think your change might be changing behavior there as well (which is not correct)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean this conversion:

pd.DatetimeIndex([1,2,np.nan]).astype(int)
Out[16]: array([                   1,                    2, -9223372036854775808])

Its behaviour is unchanged. I didn't touch other indexes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - just want to make sure that is tested as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

(BTW, DatetimeIndex.astype(int), TimedeltaIndex.astype(int) return a numpy.array. Isn't it a bug? PeriodIndex and Float64Index return Int64Index.)

@pijucha pijucha force-pushed the bug13149 branch 2 times, most recently from 4d310e9 to 07e27e1 Compare May 18, 2016 00:36
def test_astype_int(self):
# GH 13149 (not changed, added explicit tests)
idx = DatetimeIndex([0, 1, np.NAN])
expected = np.array([0, 1, -9223372036854775808], dtype='int64')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is wrong ; see if u can fix so these are returned as index types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed this, I guess.

@@ -152,3 +152,4 @@ Bug Fixes
- Bug in ``Period`` addition raises ``TypeError`` if ``Period`` is on right hand side (:issue:`13069`)
- Bug in ``Peirod`` and ``Series`` or ``Index`` comparison raises ``TypeError`` (:issue:`13200`)
- Bug in ``pd.set_eng_float_format()`` that would prevent NaN's from formatting (:issue:`11981`)
- Bug in ``Float64Index.astype(int)`` raises ``ValueError`` if ``NaN`` is converted to integer; furthermore ``TimedeltaIndex``, ``DatetimeIndex`` return ``Index64Int`` on ``.astype(int)`` (:issue:`13149`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you can list the second point as a separate bug fix issue and use the PR number as the issue number

@pijucha pijucha force-pushed the bug13149 branch 3 times, most recently from 7243bb6 to 3d3a5b3 Compare May 19, 2016 15:50
@@ -832,8 +832,8 @@ def astype(self, dtype):

if dtype == np.object_:
return self.asobject
elif dtype == _INT64_DTYPE:
return self.asi8.copy()
elif issubclass(np.dtype(dtype).type, np.integer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, better to use is_integer_dtype(dtype) more generic her

(and change dtype == np.object_ to is_object_dtype(dtype))
and dtype == _NS_DTYPE to is_int64_dtype(dtype)
and dtype == str to is_string_dtype(dtype)

Copy link
Contributor Author

@pijucha pijucha May 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and dtype == _NS_DTYPE to is_int64_dtype(dtype)

I guess it's is_datetime64_ns_dtype(dtype).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep - typing fast !

@jreback jreback added this to the 0.18.2 milestone May 19, 2016
@jreback
Copy link
Contributor

jreback commented May 19, 2016

looks pretty good. a couple of syntax changes. ping when green.

@@ -154,3 +154,5 @@ Bug Fixes
- Bug in ``Period`` addition raises ``TypeError`` if ``Period`` is on right hand side (:issue:`13069`)
- Bug in ``Peirod`` and ``Series`` or ``Index`` comparison raises ``TypeError`` (:issue:`13200`)
- Bug in ``pd.set_eng_float_format()`` that would prevent NaN's from formatting (:issue:`11981`)
- Bug in ``Float64Index.astype(int)`` raises ``ValueError`` if ``NaN`` is converted to integer (:issue:`13149`)
- Bug in ``TimedeltaIndex``, ``DatetimeIndex.astype(int)`` where ``.astype(int)`` returned ``np.array`` instead of ``Index64Int`` (:issue:`13209`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int64Index

expected = Index([Period('2016-05-16', freq='D')] +
[Period(NaT, freq='D')] * 3, dtype='object')
# Hack because of lack of support for Period null checking
# tm.assert_index_equal(result, expected) # it raises
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I ran into an issue with the null Period Period(NaT, 'D") not being seen as null by pd.isnull(). It's also impossible to directly check array or index equivalence if they contain null Periods.

I can elaborate on this and possible fixes (the fixes I tried are rather beyond the scope of this commit). Unless we don't want Period(NaT, 'D") to be treated as null.

Copy link
Contributor

@jreback jreback May 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #12759 this will be addressed, so for now use Period('NaT', 'D') until this is fixed.
You can add a TODO / comment with this issue number and can be updated later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Added TODO.

@@ -184,3 +189,4 @@ Bug Fixes


- Bug in ``groupby`` where ``apply`` returns different result depending on whether first result is ``None`` or not (:issue:`12824`)
- Bug in ``core.common.is_timedelta64_ns_dtype()`` which always returned ``False`` (:issue:`13209`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove this; this is an internal function

@jreback
Copy link
Contributor

jreback commented May 21, 2016

@pijucha looks pretty good.

There are some existing test_astype tests that would like you to consolidate with the new ones

[Sat May 21 15:29:01 ~/pandas]$ grin 'test_astype' pandas

pandas/tseries/tests/test_base.py:
   53 :     def test_astype_str(self):
pandas/tseries/tests/test_period.py:
 1630 :     def test_astype(self):
pandas/tseries/tests/test_timedeltas.py:
 1194 :     def test_astype(self):
pandas/tseries/tests/test_timeseries.py:
 1809 :     def test_astype_object(self):
 2566 :     def test_astype(self):

[-9223372036854775808] * 3, dtype=np.int64)
tm.assert_index_equal(result, expected)

result = idx.astype(str)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, Python2 returns an Index with mixed type: Date '2016-05-16' is a string, and NaT's are unicode. Is it all right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in py2 these should all be strings not Unicode
never should be mixed

can u show an example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example from tests (nb. it failed assert_index_equal with pyhton2.7):

In [2]: idx = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN])

In [3]: idx
Out[3]: DatetimeIndex(['2016-05-16', 'NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

In [4]: idx2 = idx.astype(str)

In [5]: idx2
Out[5]: Index([u'2016-05-16', u'NaT', u'NaT', u'NaT'], dtype='object')

In [7]: type(idx2[0])
Out[7]: str

In [8]: type(idx2[1])
Out[8]: unicode

I think this u('NaT') may be responsible:

    # class DatetimeIndex:
    def _format_native_types(self, na_rep=u('NaT'),
                             date_format=None, **kwargs):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm looks like we are always returning Unicode

we should be returning a string here under py2 (and the display formatters will make them Unicode when displaying)

see if returning as a string breaks anything

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be ok. It's just passed Python2.7 tests with "not network and not slow and not disabled".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great. make sure we add this as a test (and comment)

@pijucha pijucha changed the title BUG: Fix #13149 NaN to int conv. in Float64Index BUG: Fix #13149 and ENH: 'copy' param in Index.astype() May 23, 2016
@@ -814,8 +815,7 @@ def _add_offset(self, offset):
"or DatetimeIndex", PerformanceWarning)
return self.astype('O') + offset

def _format_native_types(self, na_rep=u('NaT'),
date_format=None, **kwargs):
def _format_native_types(self, na_rep='NaT', date_format=None, **kwargs):
Copy link
Contributor Author

@pijucha pijucha May 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this default na_rep. No breaks observed.

Copy link
Contributor

@jreback jreback May 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k thanks

I don't know if you added that test (below) which discovered this bug? (if not pls add with a nice comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The test is among astype tests for DatetimeIndex (just an assert_index_equal(), which previously discovered the Indexes had different inferred types).
I'll comment. Or should I rather make it more explicit and place somewhere else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep I saw that. its fine

@jreback
Copy link
Contributor

jreback commented May 23, 2016

@pijucha looks really good!

some test moving around / slight refactoring.

ping when green.

1. Float64Index.astype(int) raises ValueError if a NaN is present.
Previously, it converted NaN's to the smallest negative integer.

2. TimedeltaIndex.astype(int) and DatetimeIndex.astype(int) return
Int64Index, which is consistent with behavior of other Indexes.
Previously, they returned a numpy.array of ints.

3. Added:
  - bool parameter 'copy' to Index.astype()
  - shared doc string to .astype()
  - tests on .astype() (consolidated and added new)
  - bool parameter 'copy' to Categorical.astype()

4. Internals:
  - Fixed core.common.is_timedelta64_ns_dtype().
  - Set a default NaT representation to a string type in a parameter
    of DatetimeIndex._format_native_types().
    Previously, it produced a unicode u'NaT' in Python2.
@pijucha
Copy link
Contributor Author

pijucha commented May 23, 2016

@jreback It's green.

@jreback jreback closed this in afde718 May 23, 2016
@jreback
Copy link
Contributor

jreback commented May 23, 2016

@pijucha thanks!

very nice PR. you really dove deep into some internals! keep em coming!

@pijucha
Copy link
Contributor Author

pijucha commented May 24, 2016

@jreback Thanks for the guidelines! They helped a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NaNs in Float64Index are converted to silly integers using index.astype('int')
3 participants