Series of object/strings cannot be converted to Int64Dtype() #28599

Closed
mar-ses opened this issue Sep 24, 2019 · 18 comments
@mar-ses

mar-ses commented Sep 24, 2019

Edited to add information.

Code Sample, a copy-pastable example if possible

import pandas as pd

a = pd.Series(['123', '345', '456'])
a.astype(int)      # works
a.astype('Int64')  # doesn't work

Problem description

Currently, converting an object dtype Series (containing strings) to Int64 doesn't work, even though it should. It raises a long error (see the traceback at the end).

Important to note: the above is trying to convert to Int64 with a capital I. That is the new nullable integer dtype added to pandas. pandas supports it elsewhere, yet I think something inside astype wasn't updated to reflect that.

In essence, the above should work; there is no reason for it to fail, and it is quite simply a bug (in answer to some comments). Moreover, to_numeric is not a sufficient replacement here: when there are missing values it converts to float automatically instead of Int64 (this is actually a non-trivial problem when dealing with long integer identifiers, such as GAIA target identifiers).
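The precision problem with the float fallback is easy to demonstrate. A sketch, using a made-up 19-digit identifier of the kind GAIA assigns (the number itself is illustrative):

```python
import pandas as pd

# A made-up 19-digit identifier alongside a missing value.
s = pd.Series(['4472832130942575872', None])

# Because of the missing value, to_numeric falls back to float64,
# whose 53-bit significand cannot hold all 19 digits exactly.
as_float = pd.to_numeric(s)
print(int(as_float[0]) == 4472832130942575872)  # False: last digits changed
```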

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-5778d49bfa2e> in <module>()
      2 a  = pd.Series(['123', '345', '456'])
      3 print(a.astype(int))
----> 4 print(a.astype('Int64'))

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    648     # dispatch on extension dtype if needed
    649     if is_extension_array_dtype(dtype):
--> 650         return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
    651 
    652     if not isinstance(dtype, np.dtype):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in _from_sequence(cls, scalars, dtype, copy)
    321     @classmethod
    322     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 323         return integer_array(scalars, dtype=dtype, copy=copy)
    324 
    325     @classmethod

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in integer_array(values, dtype, copy)
    105     TypeError if incompatible types
    106     """
--> 107     values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
    108     return IntegerArray(values, mask)
    109 

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    190         ]:
    191             raise TypeError(
--> 192                 "{} cannot be converted to an IntegerDtype".format(values.dtype)
    193             )
    194 

TypeError: object cannot be converted to an IntegerDtype

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.16-041816-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 0.25.1
numpy : 1.14.3
pytz : 2018.4
dateutil : 2.7.3
pip : 19.1.1
setuptools : 39.1.0
Cython : 0.28.2
pytest : 3.5.1
hypothesis : None
sphinx : 1.7.4
blosc : None
feather : None
xlsxwriter : 1.0.4
lxml.etree : 4.2.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 6.4.0
pandas_datareader: None
bs4 : 4.6.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.2.1
matplotlib : 3.1.1
numexpr : 2.6.5
odfpy : None
openpyxl : 2.5.3
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.7
tables : 3.4.3
xarray : None
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : 1.0.4

@TomAugspurger
Contributor

I don't know if we want to support that automatically, since there's some ambiguity: converting 'NaN' to NA is probably fine, but what about something like 'some text'?

I'd recommend using pd.to_numeric to get numeric values, and converting to nullable integer after that.
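The recommended two-step route looks roughly like this (a sketch; the float intermediate converts cleanly to Int64 as long as the values are small enough not to lose precision):

```python
import pandas as pd

s = pd.Series(['1', '2', None])

# Step 1: parse to numeric (float64 here, because of the missing value).
# Step 2: convert the float result to the nullable Int64 dtype.
out = pd.to_numeric(s).astype('Int64')
print(out.dtype)  # Int64
```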

@mar-ses
Author

mar-ses commented Sep 24, 2019

The issue is that with missing data, to_numeric will convert to float first, right? In that case, if you have very long integers as identifiers, converting to double precision will approximate and change the last few digits of the identifier. This is actually the problem I was dealing with and why I started looking into Int64. It's not as uncommon as it might seem: in astrophysics, for example, GAIA has identifiers long enough that floating-point conversion modifies them.

Moreover, as far as I can see, shouldn't .astype('Int64') and .to_numeric handle these cases identically? I don't know the in-depth details of what .to_numeric does off the top of my head, but couldn't .astype('Int64') follow the same rules for ambiguous cases?

@nrebena
Contributor

nrebena commented Oct 3, 2019

Some follow-up questions:

  • What is the dtype of a.astype(int)? On my system it is int64, but it may differ on yours. Is that part of the problem?
  • What about using a.astype(np.int64)?
  • Also, is there a difference between int64 and Int64 (other than capitalization)?
  • If we are talking about identifiers, isn't uint64 even better?

Please tell me if I really didn't understand the issue and am out of my depth.

@jreback
Contributor

jreback commented Oct 3, 2019

The issue is that with missing data, to_numeric will convert to float first, right? In that case, if you have very long integers as identifiers, converting to double precision will approximate and change the last few digits of the identifier. This is actually the problem I was dealing with and why I started looking into Int64. It's not as uncommon as it might seem: in astrophysics, for example, GAIA has identifiers long enough that floating-point conversion modifies them.

Moreover, as far as I can see, shouldn't .astype('Int64') and .to_numeric handle these cases identically? I don't know the in-depth details of what .to_numeric does off the top of my head, but couldn't .astype('Int64') follow the same rules for ambiguous cases?

we don’t convert to float first

to_numeric is the workhorse

astype doesn’t have any options, meaning all values must be convertible, like in numpy

@mroeschke
Member

Doesn't appear we have much appetite to support this. Thanks for the suggestion, but we'd recommend using to_numeric first. Closing.

@mar-ses
Author

mar-ses commented Nov 4, 2019

I must say I disagree on both points @mroeschke.

  1. Regarding the "appetite": I'm not sure how you measure that; there were people commenting and some likes.

  2. As I mentioned in my previous comment, to_numeric isn't sufficient. If there are any NaNs, it will first convert to float. That can potentially erase information, as a float cannot hold all the bits of a 64-bit integer.

  3. I gave an example of a situation where this is a problem, namely GAIA identifiers in astronomy, though there are probably other use cases; in any case, this is quite simply a bug. The code in the opening post should work, yet it doesn't. I think something within astype simply wasn't updated to reflect the fact that pandas now supports the new Int64 dtype. If pandas doesn't work as expected, people using it will need to spend a lot of time figuring out why and how to get around it. And before you say this is not a common use case: GAIA is essentially the biggest astronomical survey to date. Its data is already being used extensively, and because this happens with the target/star identifiers, this issue will potentially affect almost everyone using that data who prefers pandas over astropy.Table.

I think this should be reopened.
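Point 2 is easy to verify with plain Python, no pandas needed: float64 has a 53-bit significand, so not every 64-bit integer survives a round trip through float.

```python
# 2**53 is the largest power of two up to which every integer is
# exactly representable as a float64.
n = 2 ** 53
print(float(n) == n)      # True
print(float(n + 1) == n)  # True: n + 1 rounds back down to n
```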

@mar-ses
Author

mar-ses commented Nov 4, 2019

Some follow-up questions:

  • What is the dtype of a.astype(int)? On my system it is int64, but it may differ on yours. Is that part of the problem?
  • What about using a.astype(np.int64)?
  • Also, is there a difference between int64 and Int64 (other than capitalization)?
  • If we are talking about identifiers, isn't uint64 even better?

Please tell me if I really didn't understand the issue and am out of my depth.

Sorry for the late reply. I think I explained my issue poorly.

On my system I also have int64 by default. However, the issue is that int64 cannot hold missing/NaN values. That's why I need to use Int64 (with the capital I), the new dtype that allows integer arrays to hold missing/NaN values:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

This is also why to_numeric doesn't work as it currently stands: if it finds missing/NaN values - even if all the other values are integers - it will convert to float. That's alright unless you're dealing with something like very long 64-bit integers, where the float significand can't hold all the digits of the integer.

My current workaround is to convert to float128 first, then to Int64, but the above is simply a bug. There is no reason why .astype('Int64') shouldn't work, yet it produces the error above when it tries to convert from strings.

@mar-ses
Author

mar-ses commented Nov 4, 2019

The issue is that with missing data, to_numeric will convert to float first, right? In that case, if you have very long integers as identifiers, converting to double precision will approximate and change the last few digits of the identifier. This is actually the problem I was dealing with and why I started looking into Int64. It's not as uncommon as it might seem: in astrophysics, for example, GAIA has identifiers long enough that floating-point conversion modifies them.
Moreover, as far as I can see, shouldn't .astype('Int64') and .to_numeric handle these cases identically? I don't know the in-depth details of what .to_numeric does off the top of my head, but couldn't .astype('Int64') follow the same rules for ambiguous cases?

we don’t convert to float first

to_numeric is the workhorse

astype doesn’t have any options meaning all values must be convertible like in numpy

What I mean is that to_numeric first converts to float if it detects missing values, and it doesn't seem to want to convert to Int64. Also, there is no reason astype shouldn't work here; the array of strings above can be converted to Int64.

@jreback
Contributor

jreback commented Nov 4, 2019

@mar-ses if you'd like to contribute tests / a patch to .to_numeric that would be great; we would / should support nullable integer type conversion there

@mar-ses
Author

mar-ses commented Nov 4, 2019

I've never contributed to these big projects, and I assume I would need to understand the internals and the standard way things are done inside pandas, so any recommendations on where to start reading, etc.?

Additionally, would it not also make sense to do it in .astype too? If this is done in to_numeric, would that be via an argument, or would it have to figure out automatically that these are all ints with certain values missing? To me it makes more sense in astype, since there you directly specify what you want the final dtype to be, whereas to_numeric has to guess, right?
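(An aside for anyone reading this later: pandas 2.0 added a dtype_backend argument that lets to_numeric target the nullable dtypes directly, with no astype step. A sketch, requiring pandas >= 2.0:)

```python
import pandas as pd

s = pd.Series(['1', '2', None])

# dtype_backend='numpy_nullable' makes to_numeric return Int64
# directly instead of falling back to float64.
out = pd.to_numeric(s, dtype_backend='numpy_nullable')
print(out.dtype)  # Int64
```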

@mar-ses
Author

mar-ses commented Nov 4, 2019

I'm looking into it and wouldn't mind doing this. Looking at to_numeric, I believe the change would be in `_libs.lib`, in the function `maybe_convert_numeric` here:

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx

As far as I can figure out, if it can't immediately convert to a normal int, it tries to figure out what it can convert to on a value-by-value basis. It allocates an array for each candidate type, e.g. `complexes`, `floats`, `uints`, etc. Then it goes through the values, and if it finds a null, for example, it flags that a null was seen and puts the value into the `floats` and `complexes` arrays but not the `ints` array.

For this to work, it would also need a nullable integer array, but since this is done in cython, is that even possible? Is there a version of the nullable integer array in cython? Or otherwise, can the following object hold a NaN value: `ndarray[int64_t] ints = np.empty(n, dtype='i8')`?

Or should I create another array like:

ndarray[Int64_t] null_ints = np.empty(n, dtype='Int64')  # or something similar?
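For context on the question above: nullable Int64 isn't a special C-level integer type. pandas stores it as an ordinary int64 data array plus a boolean mask, both of which cython code can fill. A sketch using the public IntegerArray constructor:

```python
import numpy as np
import pandas as pd

# Int64 = plain int64 values + boolean mask; True marks a missing slot
# (the value stored under a masked slot is arbitrary).
arr = pd.arrays.IntegerArray(
    np.array([123, 345, 0], dtype='int64'),
    np.array([False, False, True]),
)
print(arr.dtype)  # Int64
```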

@jreback
Contributor

jreback commented Nov 4, 2019

once this is merged (soon); #27335

this will be relatively straightforward to patch

@maresb

maresb commented Nov 18, 2019

Bumping this issue now since #27335 has been merged. I'd really like to see this, but I personally don't have time at the moment. @mar-ses, are you still up for looking into this?

@jreback
Contributor

jreback commented Nov 18, 2019

Bumping this issue now since #27335 has been merged. I'd really like to see this, but I personally don't have time at the moment. @mar-ses, are you still up for looking into this?

I never understood what good bumping an issue in an open-source, all-volunteer project actually does

@maresb

maresb commented Nov 18, 2019

I hope I didn't commit a faux pas. Since the anticipated merge recently took place, patching this issue is no longer blocked. I was trying to be helpful by drawing attention to this fact as a "bump". Sorry if that came across as pushy/annoying. I'm newly active on GitHub and still figuring out the social norms.

@mar-ses
Author

mar-ses commented Nov 18, 2019

I'm also a newbie here. So I looked a bit at this other issue (the thing that got merged), and won't the update to maybe_convert_numeric fix the issue here too? I need to have a deeper look; sorry, I've been very busy.

@jreback
Contributor

jreback commented Nov 18, 2019

@maresb there are 3000 issues and all volunteer
so issues get solved when folks contribute PRs

@maresb

maresb commented Nov 19, 2019

@jreback yes, that is so obvious that I'm surprised that you feel the need to point it out to me. I'd love to contribute, but it'll be several weeks before that's even possible.

In case you have a problem with my previous comment, I would appreciate some constructive feedback. I thought that I was being helpful and polite by alerting @mar-ses, since he previously expressed interest in contributing.
