BUG: int64 and uint64 values converted to float64 when concatenated #34356


Closed
2 of 3 tasks
bergkvist opened this issue May 24, 2020 · 14 comments

Comments

@bergkvist

bergkvist commented May 24, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import pandas as pd

pd.concat([
    pd.Series([1,2], dtype='uint64'),
    pd.Series([3,4], dtype='int64')
])

# Output:
0    1.0
1    2.0
0    3.0
1    4.0
dtype: float64

Problem description

The output datatype becomes float when the input datatypes are different integer types.

Since there are no NaN-values or decimals involved, it can be confusing that you suddenly get float output.

Expected Output

0    1
1    2
0    3
1    4
dtype: int64

union(int64, uint64) ⊆ int65, which is the same as int64 unless you have numbers close to "the edge".
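For reference, the ranges that rule out a common 64-bit integer dtype, and what numpy promotes the pair to instead:

```python
import numpy as np

# int64 and uint64 each contain values the other cannot represent,
# so no 64-bit integer dtype covers their union:
print(np.iinfo(np.int64).min)   # -9223372036854775808 (not a valid uint64)
print(np.iinfo(np.uint64).max)  # 18446744073709551615 (not a valid int64)

# NumPy therefore promotes the pair to float64:
print(np.result_type(np.int64, np.uint64))  # float64
```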

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.125-linuxkit
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.17.0
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.1
setuptools : 41.0.1
Cython : 0.29.13
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.6
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

@bergkvist added the Bug and Needs Triage labels May 24, 2020
@bergkvist
Author

For reference, with int8 and uint8:

import pandas as pd
pd.concat([
    pd.Series([1,2], dtype='int8'),
    pd.Series([3,4], dtype='uint8')
])

Output:

0    1
1    2
0    3
1    4
dtype: int16
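
Every mixed pair below 64 bits has a wider signed integer to land in; a quick sketch with np.promote_types:

```python
import numpy as np

# Mixed signed/unsigned pairs below 64 bits promote to the next wider
# signed integer, which holds both input ranges:
print(np.promote_types('int8',  'uint8'))   # int16
print(np.promote_types('int16', 'uint16'))  # int32
print(np.promote_types('int32', 'uint32'))  # int64
# At 64 bits there is no wider integer available, so:
print(np.promote_types('int64', 'uint64'))  # float64
```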

@vampypandya
Contributor

I am a newbie here. Can I take up this issue? @bergkvist

@vampypandya
Contributor

vampypandya commented May 25, 2020

@bergkvist This is not a bug. Try doing this.

import pandas as pd

pd.concat([
    pd.Series([1,2], dtype='UInt64'),
    pd.Series([3,4], dtype='Int64')
])

# Output
0    1
1    2
0    3
1    4

I checked the documentation and code. The dtype argument is case-sensitive.
If the dtype is invalid or missing, pandas infers the dtype on its own.
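
The case-sensitivity exists because the two spellings name different dtype systems; a quick check:

```python
import numpy as np
import pandas as pd

# Lowercase 'uint64' selects a plain NumPy dtype.
s_np = pd.Series([1, 2], dtype='uint64')

# Capitalized 'UInt64' selects pandas' nullable extension dtype,
# which supports missing values (pd.NA) without casting to float.
s_ext = pd.Series([1, 2], dtype='UInt64')

print(s_np.dtype)   # uint64  (numpy dtype)
print(s_ext.dtype)  # UInt64  (pandas ExtensionDtype)
```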

Maybe you can raise an issue about why it defaults to float instead of int. I am looking into it and will be happy to help.

Let me know if I have cleared the issue.

Thanks.

@bergkvist
Author

@vampypandya Yes, dtype is case-sensitive (the numpy dtype names are all lower-case). dtype=object is used for mixed data types that pandas does not understand internally.

pd.concat([
    pd.Series([1,2], dtype='UInt64'),
    pd.Series([3,4], dtype='Int64')
])

# Output
0    1
1    2
0    3
1    4
dtype: object
# ^ Notice that the dtype is object here

@vampypandya
Contributor

@bergkvist
I checked the code. Here is an explanation for all cases. I am new to all this, so please let me know if I am making a mistake anywhere.

Let's trace through this input:

pd.concat([
    pd.Series([1,2], dtype='UInt64'),
    pd.Series([3,4], dtype='Int64')
])

The code snippet which handles the final dtype is the following:

    if any(isinstance(t, ExtensionDtype) for t in types):
        for t in types:
            if isinstance(t, ExtensionDtype):
                res = t._get_common_dtype(types)
                if res is not None:
                    return res
        return np.dtype("object")

    # take lowest unit
    if all(is_datetime64_dtype(t) for t in types):
        return np.dtype("datetime64[ns]")
    if all(is_timedelta64_dtype(t) for t in types):
        return np.dtype("timedelta64[ns]")

    # don't mix bool / int or float or complex
    # this is different from numpy, which casts bool with float/int as int
    has_bools = any(is_bool_dtype(t) for t in types)
    if has_bools:
        for t in types:
            if is_integer_dtype(t) or is_float_dtype(t) or is_complex_dtype(t):
                return np.object

    return np.find_common_type(types, [])

Let's check for various cases.

  • Case 1:
    When both inputs have the same dtype (same case and same value), the final dtype matches the input.
    Eg.
    dtype1 = 'Int64'
    dtype2 = 'Int64'
    Final dtype = 'Int64'
    Eg.
    dtype1 = 'int64'
    dtype2 = 'int64'
    Final dtype = 'int64'

  • Case 2:
    When one dtype is uppercase and the other is lowercase, the output dtype is always 'object'.
    Explanation:
    The uppercase dtype is an ExtensionDtype, as can be seen in the code. Since no common dtype is found for the mix, the final dtype falls back to 'object'.
    Eg.
    dtype1 = 'int64'
    dtype2 = 'Int64'
    Final dtype = 'object'

  • Case 3:
    If no input dtype is an ExtensionDtype, then the final dtype is computed as follows.
    np.find_common_type(types, [])
    Eg.
    dtype1 = 'int64'
    dtype2 = 'uint64'
    Final dtype = 'float64'

I have not understood 'ExtensionDtype' well. I would be grateful if anyone can explain this.
The issue you have mentioned is Case 3: pandas is calling a NumPy function to find the common type.
Let me know if this is an issue. I can take this issue.

Also, do let me know in case of any mistakes.
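
The Case 3 promotion can be reproduced with NumPy directly (note that np.find_common_type was later deprecated, in NumPy 1.25, in favour of np.result_type, which agrees for plain dtypes):

```python
import numpy as np

# The pair (int64, uint64) has no common integer dtype, so NumPy falls
# back to float64; smaller mixed pairs still fit in a wider integer.
print(np.result_type(np.dtype('int64'), np.dtype('uint64')))  # float64
print(np.result_type(np.dtype('int32'), np.dtype('uint32')))  # int64
```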

@jorisvandenbossche
Member

The output datatype becomes float when the input datatypes are different integer types.

Since there are no NaN-values or decimals involved, it can be confusing that you suddenly get float output.

@bergkvist this is the expected behaviour, according to numpy casting rules:

In [14]: np.result_type(np.int64, np.uint64) 
Out[14]: dtype('float64')

In [15]: np.result_type(np.int32, np.uint32) 
Out[15]: dtype('int64')

This is because you cannot represent all integers from int64 or uint64 as either one of those dtypes (uint64 cannot hold the negative values from int64, and int64 cannot hold the largest integers from uint64). So the only dtype that can store all the possible values, approximately, is a float, since there is no integer dtype larger than 64 bits available.
But you can also see that this depends on the actual dtypes. E.g., as shown above, the range of int32 and uint32 does fit in int64, so in that case you don't get a float.
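
"Approximately" is the key word: float64 has only a 53-bit mantissa, so the cast silently loses precision near the top of the int64 range:

```python
import numpy as np

big = np.iinfo(np.int64).max      # 9223372036854775807
as_float = np.float64(big)        # nearest representable float64
print(int(as_float))              # 9223372036854775808 -- off by one
```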


For the extension dtypes (with upper case), it's the same story, except that in the released version of pandas, you still get "object" dtype as the result of concatenating different dtypes. This has been fixed in the meantime:

In [18]: pd.concat([ 
    ...:     pd.Series([1,2], dtype='UInt32'), 
    ...:     pd.Series([3,4], dtype='Int32') 
    ...: ]) 
Out[18]: 
0    1
1    2
0    3
1    4
dtype: Int64

@jorisvandenbossche removed the Bug and Needs Triage labels May 25, 2020
@vampypandya
Contributor

@jorisvandenbossche
Could you please explain the extended dtypes or direct me to some documentation? I tried understanding the code but did not get it.

@jorisvandenbossche
Member

@jorisvandenbossche
Member

@bergkvist
Author

@jorisvandenbossche - If you are using integers between 2^63 and 2^64 as well as negative integers, it would probably be better to throw an error (or a warning) than to convert to float64 implicitly.

A reason for using integers is for the precision it gives, and the ability to set it as an index.

Imagine if your computer started compressing your images (lossy) when running out of space instead of throwing an error that you are out of disk space.

@jorisvandenbossche
Member

If you are using integers between 2^63 and 2^64 as well as negative integers

In general, both numpy and pandas try to avoid "value-based" casting rules, i.e. rules that depend on the actual values in the array rather than just the dtype. So in general, the casting is decided from the dtypes: what should the combination of "int64" and "uint64" give? (regardless of whether negative or big integers are present).
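
A small check of that dtype-based behaviour: even values that fit comfortably in int64 still promote to float64:

```python
import pandas as pd

# Promotion is decided from the dtypes alone, not the values:
result = pd.concat([
    pd.Series([1], dtype='uint64'),
    pd.Series([2], dtype='int64'),
])
print(result.dtype)  # float64, even though both values fit in int64
```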

There are certainly arguments for having it raise (or return object dtype) instead of resulting in float64. But it's also a change that would have quite some implications (eg also arithmetic or comparison operations use those casting rules).
And, that's actually a discussion for numpy (since pandas uses numpy here under the hood). See eg numpy/numpy#11801 (comment), numpy/numpy#5745, numpy/numpy#7126, ..

@bergkvist
Author

bergkvist commented May 26, 2020

An alternative to converting to float could be to drop support for the uint64 type (who needs integers between 2**63 and 2**64 anyway?)

Allowing only: uint8, int8, uint16, int16, uint32, int32, int64 would allow for well-behaved dtype-merging for all integer-types.
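
A sketch of that claim: with uint64 excluded, every pairwise promotion among the remaining integer dtypes stays an integer:

```python
import numpy as np
from itertools import combinations

dtypes = ['uint8', 'int8', 'uint16', 'int16', 'uint32', 'int32', 'int64']

# Every pair promotes to a signed or unsigned integer dtype -- never float.
for a, b in combinations(dtypes, 2):
    common = np.result_type(a, b)
    assert common.kind in 'iu'
    print(f'{a} + {b} -> {common}')
```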

The reason I personally ended up with a uint64 is that pandas does not support uint32 indices (nor int32 indices, for that matter), and converts a row of uint32 values to uint64 when it is set as the index. Here, an alternative could be for the index to be converted from uint32 to int64 instead.

Essentially, due to this accumulating dtype inflation, int32 and uint32 values were inflated to float64 when concatenating after reindexing.

Is there really any practical reason to use uint64 instead of int64?

@jorisvandenbossche
Member

I think we should rather try to allow lower-bit dtypes as indices than prohibit uint64 because of that. Feel free to open an issue specifically about that (if there is not yet one).

Is there really any practical reason to use uint64 instead of int64?

I suppose because it preserves the "unsigned" aspect of it?

@jorisvandenbossche
Member

@bergkvist I am going to close this because the original issue (concatenating int64 and uint64 giving float64) is not something we are going to change (that discussion needs to happen in numpy, where some of it already has).

But we can certainly further discuss the usability problems regarding unsigned integers in pandas, like the indexing you mentioned.
