Series of object/strings cannot be converted to Int64Dtype() #28599

Closed
mar-ses opened this issue Sep 24, 2019 · 18 comments
@mar-ses

mar-ses commented Sep 24, 2019

Edited to add information.

Code Sample, a copy-pastable example if possible

import pandas as pd

a = pd.Series(['123', '345', '456'])
a.astype(int)      # works
a.astype('Int64')  # doesn't work

Problem description

Currently, converting an object dtype Series (containing strings) to Int64 doesn't work, even though it should. It raises a long error (see the traceback at the end).

Important to note: the above is trying to convert to Int64 with a capital I. That is the new nullable integer dtype added to pandas. pandas supports it elsewhere, yet I think something inside astype wasn't updated to reflect that.

In essence, the above should work; there is no reason for it to fail, and it is quite simply a bug (in answer to some comments). Moreover, to_numeric is not a sufficient replacement here: when there are missing values it converts to float automatically instead of Int64 (this is actually a non-trivial problem when dealing with long integer identifiers, such as GAIA target identifiers).
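The precision problem with the float fallback is easy to demonstrate. A sketch, using a made-up 19-digit identifier of the kind GAIA assigns (the number itself is illustrative):

```python
import pandas as pd

# A made-up 19-digit identifier alongside a missing value.
s = pd.Series(['4472832130942575872', None])

# Because of the missing value, to_numeric falls back to float64,
# whose 53-bit significand cannot hold all 19 digits exactly.
as_float = pd.to_numeric(s)
print(int(as_float[0]) == 4472832130942575872)  # False: last digits changed
```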

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-5778d49bfa2e> in <module>()
      2 a  = pd.Series(['123', '345', '456'])
      3 print(a.astype(int))
----> 4 print(a.astype('Int64'))

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    648     # dispatch on extension dtype if needed
    649     if is_extension_array_dtype(dtype):
--> 650         return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
    651 
    652     if not isinstance(dtype, np.dtype):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in _from_sequence(cls, scalars, dtype, copy)
    321     @classmethod
    322     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 323         return integer_array(scalars, dtype=dtype, copy=copy)
    324 
    325     @classmethod

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in integer_array(values, dtype, copy)
    105     TypeError if incompatible types
    106     """
--> 107     values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
    108     return IntegerArray(values, mask)
    109 

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    190         ]:
    191             raise TypeError(
--> 192                 "{} cannot be converted to an IntegerDtype".format(values.dtype)
    193             )
    194 

TypeError: object cannot be converted to an IntegerDtype

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.16-041816-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 0.25.1
numpy : 1.14.3
pytz : 2018.4
dateutil : 2.7.3
pip : 19.1.1
setuptools : 39.1.0
Cython : 0.28.2
pytest : 3.5.1
hypothesis : None
sphinx : 1.7.4
blosc : None
feather : None
xlsxwriter : 1.0.4
lxml.etree : 4.2.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 6.4.0
pandas_datareader: None
bs4 : 4.6.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.2.1
matplotlib : 3.1.1
numexpr : 2.6.5
odfpy : None
openpyxl : 2.5.3
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.7
tables : 3.4.3
xarray : None
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : 1.0.4

@TomAugspurger
Contributor

I don't know if we want to support that automatically, since there's some ambiguity: converting 'NaN' to NA is probably fine, but what about something like 'some text'?

I'd recommend using pd.to_numeric to get numeric values, and converting to nullable integer after that.
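The recommended two-step route looks roughly like this (a sketch; the float intermediate converts cleanly to Int64 as long as the values are small enough not to lose precision):

```python
import pandas as pd

s = pd.Series(['1', '2', None])

# Step 1: parse to numeric (float64 here, because of the missing value).
# Step 2: convert the float result to the nullable Int64 dtype.
out = pd.to_numeric(s).astype('Int64')
print(out.dtype)  # Int64
```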

@mar-ses
Author

mar-ses commented Sep 24, 2019

The issue is that with missing data, to_numeric will convert to float first, right? In that case, if you have very long integers as identifiers, converting to double precision will approximate and change the last few digits of the identifier. This is actually the problem I was dealing with and why I started looking into Int64. It's not as uncommon as it might seem: in astrophysics, for example, GAIA has identifiers long enough that floating-point conversion modifies them.

Moreover, as far as I can see, shouldn't .astype('Int64') and .to_numeric handle these cases identically? I don't know the in-depth details of what .to_numeric does off the top of my head, but couldn't .astype('Int64') follow the same rules for ambiguous cases?

@nrebena
Contributor

nrebena commented Oct 3, 2019

Some follow-up questions:

  • What is the dtype of a.astype(int)? On my system it is int64, but it may differ on yours. Is that part of the problem?
  • What about using a.astype(np.int64)?
  • Also, is there a difference between int64 and Int64 (other than capitalization)?
  • If we are talking about identifiers, isn't uint64 even better?

Please tell me if I really didn't understand the issue and am out of my depth.

@jreback
Contributor

jreback commented Oct 3, 2019

The issue is that with missing data, to_numeric will convert to float first, right? In that case, if you have very long integers as identifiers, converting to double precision will approximate and change the last few digits of the identifier. This is actually the problem I was dealing with and why I started looking into Int64. It's not as uncommon as it might seem: in astrophysics, for example, GAIA has identifiers long enough that floating-point conversion modifies them.

Moreover, as far as I can see, shouldn't .astype('Int64') and .to_numeric handle these cases identically? I don't know the in-depth details of what .to_numeric does off the top of my head, but couldn't .astype('Int64') follow the same rules for ambiguous cases?

we don’t convert to float first

to_numeric is the workhorse

astype doesn’t have any options, meaning all values must be convertible, like in numpy

@mroeschke
Member

Doesn't appear we have much appetite to support this. Thanks for the suggestion, but we'd recommend using to_numeric first. Closing.

@mar-ses
Author

mar-ses commented Nov 4, 2019

I must say I disagree on both points @mroeschke.

  1. Regarding the "appetite": I'm not sure how you measure that; there were people commenting and some likes.

  2. As I mentioned in my previous comment, to_numeric isn't sufficient. If there are any NaNs, it will first convert to float. That can potentially erase information, as a float cannot hold all the bits of a 64-bit integer.

  3. I gave an example of a situation where this is a problem, namely GAIA identifiers in astronomy, though there are probably other use cases; in any case, this is quite simply a bug. The code in the opening post should work, yet it doesn't. I think something within astype simply wasn't updated to reflect the fact that pandas now supports the new Int64 dtype. If pandas doesn't work as expected, people using it will need to spend a lot of time figuring out why and how to get around it. And before you say this is not a common use case: GAIA is essentially the biggest astronomical survey to date. Its data is already being used extensively, and because this happens with the target/star identifiers, this issue will potentially affect almost everyone using that data who prefers pandas over astropy.Table.

I think this should be reopened.
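Point 2 is easy to verify with plain Python, no pandas needed: float64 has a 53-bit significand, so not every 64-bit integer survives a round trip through float.

```python
# 2**53 is the largest power of two up to which every integer is
# exactly representable as a float64.
n = 2 ** 53
print(float(n) == n)      # True
print(float(n + 1) == n)  # True: n + 1 rounds back down to n
```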

@mar-ses
Author

mar-ses commented Nov 4, 2019

Some follow-up questions:

  • What is the dtype of a.astype(int)? On my system it is int64, but it may differ on yours. Is that part of the problem?
  • What about using a.astype(np.int64)?
  • Also, is there a difference between int64 and Int64 (other than capitalization)?
  • If we are talking about identifiers, isn't uint64 even better?

Please tell me if I really didn't understand the issue and am out of my depth.

Sorry for the late reply. I think I explained my issue poorly.

On my system I also have int64 by default. However, the issue is that int64 cannot hold missing/NaN values. That's why I need to use Int64 (with the capital I), the new dtype that allows integer arrays to hold missing/NaN values:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

This is also why to_numeric doesn't work as it currently stands: if it finds missing/NaN values - even if all the other values are integers - it will convert to float. That's alright unless you're dealing with something like very long 64-bit integers, where the float significand can't hold all the digits of the integer.

My current workaround is to convert to float128 first, then to Int64, but the above is simply a bug. There is no reason why .astype('Int64') shouldn't work, yet it produces the error above when it tries to convert from strings.

@mar-ses
Author

mar-ses commented Nov 4, 2019

The issue is that with missing data, to_numeric will convert to float first, right? In that case, if you have very long integers as identifiers, converting to double precision will approximate and change the last few digits of the identifier. This is actually the problem I was dealing with and why I started looking into Int64. It's not as uncommon as it might seem: in astrophysics, for example, GAIA has identifiers long enough that floating-point conversion modifies them.
Moreover, as far as I can see, shouldn't .astype('Int64') and .to_numeric handle these cases identically? I don't know the in-depth details of what .to_numeric does off the top of my head, but couldn't .astype('Int64') follow the same rules for ambiguous cases?

we don’t convert to float first

to_numeric is the workhorse

astype doesn’t have any options meaning all values must be convertible like in numpy

What I mean is that to_numeric first converts to float if it detects missing values, and it doesn't seem to want to convert to Int64. Also, there is no reason astype shouldn't work here; the array of strings above can be converted to Int64.

@jreback
Contributor

jreback commented Nov 4, 2019

@mar-ses if you'd like to contribute tests / a patch to .to_numeric that would be great; we would / should support nullable integer type conversion there

@mar-ses
Author

mar-ses commented Nov 4, 2019

I've never contributed to these big projects, and I assume I would need to understand the internals and the standard way things are done inside pandas, so any recommendations on where to start reading, etc.?

Additionally, would it not also make sense to do it in .astype too? If this is done in to_numeric, would that be via an argument, or would it have to figure out automatically that these are all ints with certain values missing? To me it makes more sense in astype, since there you directly specify what you want the final dtype to be, whereas to_numeric has to guess, right?
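(An aside for anyone reading this later: pandas 2.0 added a dtype_backend argument that lets to_numeric target the nullable dtypes directly, with no astype step. A sketch, requiring pandas >= 2.0:)

```python
import pandas as pd

s = pd.Series(['1', '2', None])

# dtype_backend='numpy_nullable' makes to_numeric return Int64
# directly instead of falling back to float64.
out = pd.to_numeric(s, dtype_backend='numpy_nullable')
print(out.dtype)  # Int64
```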

@mar-ses
Author

mar-ses commented Nov 4, 2019

I'm looking into it and wouldn't mind doing this. Looking at to_numeric, I believe the change would be in `_libs.lib`, in the function `maybe_convert_numeric` here:

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx

As far as I can figure out, if it can't immediately convert to a normal int, it tries to figure out what it can convert to on a value-by-value basis. It allocates an array for each candidate type, e.g. `complexes`, `floats`, `uints`, etc. Then it goes through the values, and if it finds a null, for example, it flags that a null was seen and puts the value into the `floats` and `complexes` arrays but not the `ints` array.

For this to work, it would also need a nullable integer array, but since this is done in cython, is that even possible? Is there a version of the nullable integer array in cython? Or otherwise, can the following object hold a NaN value: `ndarray[int64_t] ints = np.empty(n, dtype='i8')`?

Or should I create another array like:

ndarray[Int64_t] null_ints = np.empty(n, dtype='Int64')  # or something similar?
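For context on the question above: nullable Int64 isn't a special C-level integer type. pandas stores it as an ordinary int64 data array plus a boolean mask, both of which cython code can fill. A sketch using the public IntegerArray constructor:

```python
import numpy as np
import pandas as pd

# Int64 = plain int64 values + boolean mask; True marks a missing slot
# (the value stored under a masked slot is arbitrary).
arr = pd.arrays.IntegerArray(
    np.array([123, 345, 0], dtype='int64'),
    np.array([False, False, True]),
)
print(arr.dtype)  # Int64
```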

@jreback
Contributor

jreback commented Nov 4, 2019

once this is merged (soon); #27335

this will be relatively straightforward to patch

@maresb

maresb commented Nov 18, 2019

Bumping this issue now since #27335 has been merged. I'd really like to see this, but I personally don't have time at the moment. @mar-ses, are you still up for looking into this?

@jreback
Contributor

jreback commented Nov 18, 2019

Bumping this issue now since #27335 has been merged. I'd really like to see this, but I personally don't have time at the moment. @mar-ses, are you still up for looking into this?

I never understood what good bumping an issue in an open-source, all-volunteer project actually does

@maresb

maresb commented Nov 18, 2019

I hope I didn't commit a faux pas. Since the anticipated merge recently took place, patching this issue is no longer blocked. I was trying to be helpful by drawing attention to this fact as a "bump". Sorry if that came across as pushy/annoying. I'm newly active on GitHub and still figuring out the social norms.

@mar-ses
Author

mar-ses commented Nov 18, 2019

I'm also a newbie here. So I looked a bit at this other issue (the thing that got merged), and won't the update to maybe_convert_numeric fix the issue here too? I need to have a deeper look; sorry, I've been very busy.

@jreback
Contributor

jreback commented Nov 18, 2019

@maresb there are 3000 issues and all volunteer
so issues get solved when folks contribute PRs

@maresb

maresb commented Nov 19, 2019

@jreback yes, that is so obvious that I'm surprised that you feel the need to point it out to me. I'd love to contribute, but it'll be several weeks before that's even possible.

In case you have a problem with my previous comment, I would appreciate some constructive feedback. I thought that I was being helpful and polite by alerting @mar-ses, since he previously expressed interest in contributing.
