NaNs in Float64Index are converted to silly integers using index.astype('int') #13149

Closed
mao-liu opened this issue May 12, 2016 · 6 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions

Comments

@mao-liu

mao-liu commented May 12, 2016

Code Sample, a copy-pastable example if possible

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.index = [None, 1]
>>> df
      a  b
NaN   1  3
 1.0  2  4
>>> df.index = df.index.astype('int')
>>> df
                      a  b
-9223372036854775808  1  3
 1                    2  4

output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.13-100.fc21.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.utf8

pandas: 0.18.1
nose: None
pip: 8.1.1
setuptools: 20.2.2
Cython: None
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: 0.7.2
IPython: 4.2.0
sphinx: 1.3.5
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
@jorisvandenbossche
Member

This is numpy behaviour:

In [22]: np.array([np.nan, 1.]).astype(int)
Out[22]: array([-2147483648,           1])

But we should probably check for the occurrence of NaNs, just as we do for Series:

In [29]: df.iloc[0,0] = np.nan

In [30]: df.a
Out[30]:
NaN   NaN
 1      2
Name: a, dtype: float64

In [31]: df.a.astype(int)
...

C:\Anaconda\lib\site-packages\pandas\core\common.pyc in _astype_nansafe(arr, dtype, copy)
   2726
   2727         if np.isnan(arr).any():
-> 2728             raise ValueError('Cannot convert NA to integer')
   2729     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
   2730         # work around NumPy brokenness, #1987

ValueError: Cannot convert NA to integer
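
The check in the traceback above can be illustrated outside pandas. The following is a minimal sketch using only NumPy; the helper name `astype_int_nansafe` is hypothetical and is not the pandas internal function, it only mirrors the idea of testing for NaN before casting.

```python
import numpy as np


def astype_int_nansafe(values):
    """Cast float values to int64, refusing to silently convert NaN.

    Hypothetical helper for illustration; not part of pandas or NumPy.
    """
    arr = np.asarray(values, dtype="float64")
    # A raw NumPy cast would turn NaN into an arbitrary integer,
    # so check for missing values first and fail loudly.
    if np.isnan(arr).any():
        raise ValueError("Cannot convert NA to integer")
    return arr.astype("int64")


# NaN-free input casts normally.
print(astype_int_nansafe([1.0, 2.0]))

# Input containing NaN is rejected instead of producing a sentinel-like value.
try:
    astype_int_nansafe([np.nan, 1.0])
except ValueError as exc:
    print("refused:", exc)
```

Raising here matches what the Series path already does, which is the behaviour proposed for the index case.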

@jorisvandenbossche jorisvandenbossche added Bug Dtype Conversions Unexpected or buggy dtype conversions labels May 12, 2016
@jorisvandenbossche jorisvandenbossche added this to the Next Major Release milestone May 12, 2016
@pijucha
Contributor

pijucha commented May 17, 2016

I wanted to fix this bug but noticed a similar behaviour of other objects: DatetimeIndex, TimedeltaIndex, Categorical, CategoricalIndex. Namely (all four of them behave identically):

A = pd.DatetimeIndex([1e10,2e10,None])
A
Out[76]: DatetimeIndex(['1970-01-01 00:00:10', '1970-01-01 00:00:20', 'NaT'], dtype='datetime64[ns]', freq=None)
A.astype(int)
Out[77]: array([         10000000000,          20000000000, -9223372036854775808])

However, unlike with Float64Index, this is invertible:

pd.DatetimeIndex(A.astype(int))
Out[78]: DatetimeIndex(['1970-01-01 00:00:10', '1970-01-01 00:00:20', 'NaT'], dtype='datetime64[ns]', freq=None)

My question: is this behaviour also a bug that should be fixed the same way (by raising a ValueError)? And if so, should all the fixes be placed into one commit/pull request?

By the way, there might be other objects that call numpy.ndarray.astype() and have the same issue. And numpy is also a bit inconsistent here:

np.array([1,np.nan]).astype(int)
Out[84]: array([                   1, -9223372036854775808])
np.array([1,np.nan], dtype = int)
Traceback...
ValueError: cannot convert float NaN to integer
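
The reason the datetime case round-trips while the float case does not can be sketched with plain NumPy. This assumes only NumPy's own `datetime64` storage, where NaT is kept as the minimum int64 value; it does not exercise the pandas index classes themselves.

```python
import numpy as np

# Same three values as the DatetimeIndex example, as a raw NumPy array.
stamps = np.array(
    ["1970-01-01T00:00:10", "1970-01-01T00:00:20", "NaT"],
    dtype="datetime64[ns]",
)

# Reinterpret the stored nanosecond counts as int64 without copying.
as_int = stamps.view("int64")
int64_min = np.iinfo(np.int64).min

# NaT is stored as the minimum int64, so the "silly" integer is a real sentinel here.
print(as_int[2] == int64_min)

# Viewing the integers as datetimes again restores NaT in the last slot.
restored = as_int.view("datetime64[ns]")
print(np.isnat(restored))
```

For a float NaN there is no such reserved integer, so whatever value the cast produces cannot be mapped back to "missing".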

@jreback
Contributor

jreback commented May 17, 2016

@ch41rmn these are all as expected: converting to int converts to the underlying integer-based representation.

The only issue is that Float64Index.astype(int) should raise (as it's effectively non-convertible).

@jorisvandenbossche
Member

@jreback I actually think we should raise in the DatetimeIndex case as well (ideally). A NaT cannot be converted to int (just as a float NaN cannot). There is the asi8 attribute if you want this.
But, of course, that is not really backwards compatible. Internally, I think we consistently use asi8? But I'm not sure about external use, of course.

@jorisvandenbossche
Member

Raising for CategoricalIndex seems less of a problem (not a common thing to do)

@jreback
Contributor

jreback commented May 17, 2016

This is exactly what should be returned (and is useful). Yes, it's equivalent to the internal .asi8, but I don't see a good reason NOT to do this.

In [20]: pd.DatetimeIndex([1e10,2e10,None]).astype(int)
Out[20]: array([         10000000000,          20000000000, -9223372036854775808])

pijucha added a commit to pijucha/pandas that referenced this issue May 23, 2016
1. Float64Index.astype(int) raises ValueError if a NaN is present.
Previously, it converted NaNs to the smallest negative integer.

2. TimedeltaIndex.astype(int) and DatetimeIndex.astype(int) return
Int64Index, which is consistent with behavior of other Indexes.
Previously, they returned a numpy.array of ints.

3. Added:
  - bool parameter 'copy' to Index.astype()
  - shared doc string to .astype()
  - tests on .astype() (consolidated and added new)
  - bool parameter 'copy' to Categorical.astype()

4. Internals:
  - Fixed core.common.is_timedelta64_ns_dtype().
  - Set a default NaT representation to a string type in a parameter
    of DatetimeIndex._format_native_types().
    Previously, it produced a unicode u'NaT' in Python2.
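
The user-visible effect of point 1, together with the asi8 attribute mentioned earlier in the thread, might look like the sketch below. It assumes a pandas version that includes this change; later releases dropped the Float64Index class, so a plain `pd.Index` of floats stands in for it here, and the exact exception subclass and message may differ between versions.

```python
import numpy as np
import pandas as pd

# A float index containing NaN, standing in for the Float64Index in the report.
float_idx = pd.Index([np.nan, 1.0])

# With the fix, casting to an integer dtype raises instead of inventing a value.
try:
    float_idx.astype("int64")
    raised = False
except ValueError:
    raised = True
print("astype raised ValueError:", raised)

# For datetimes, asi8 exposes the underlying int64 values explicitly.
dt_idx = pd.DatetimeIndex(["1970-01-01 00:00:10", None])
nat_slot = dt_idx.asi8[-1]
print(nat_slot == np.iinfo(np.int64).min)  # NaT is stored as the minimum int64
```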
4 participants