BUG: int64 and uint64 values converted to float64 when concatenated #34356


Closed
2 of 3 tasks
bergkvist opened this issue May 24, 2020 · 14 comments

Comments

@bergkvist

bergkvist commented May 24, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import pandas as pd

pd.concat([
    pd.Series([1,2], dtype='uint64'),
    pd.Series([3,4], dtype='int64')
])

# Output:
0    1.0
1    2.0
0    3.0
1    4.0
dtype: float64

Problem description

The output datatype becomes float when the input datatypes are different integer types.

Since there are no NaN-values or decimals involved, it can be confusing that you suddenly get float output.

Expected Output

0    1
1    2
0    3
1    4
dtype: int64

union(int64, uint64) ⊆ int65, which is the same as int64 unless you have numbers close to "the edge".
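For reference, the ranges that rule out a common 64-bit integer dtype, and what numpy promotes the pair to instead:

```python
import numpy as np

# int64 and uint64 each contain values the other cannot represent,
# so no 64-bit integer dtype covers their union:
print(np.iinfo(np.int64).min)   # -9223372036854775808 (not a valid uint64)
print(np.iinfo(np.uint64).max)  # 18446744073709551615 (not a valid int64)

# NumPy therefore promotes the pair to float64:
print(np.result_type(np.int64, np.uint64))  # float64
```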

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.125-linuxkit
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.17.0
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.1
setuptools : 41.0.1
Cython : 0.29.13
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.6
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

@bergkvist added the Bug and Needs Triage labels May 24, 2020
@bergkvist
Author

For reference, with int8 and uint8:

import pandas as pd
pd.concat([
    pd.Series([1,2], dtype='int8'),
    pd.Series([3,4], dtype='uint8')
])

Output:

0    1
1    2
0    3
1    4
dtype: int16
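
Every mixed pair below 64 bits has a wider signed integer to land in; a quick sketch with np.promote_types:

```python
import numpy as np

# Mixed signed/unsigned pairs below 64 bits promote to the next wider
# signed integer, which holds both input ranges:
print(np.promote_types('int8',  'uint8'))   # int16
print(np.promote_types('int16', 'uint16'))  # int32
print(np.promote_types('int32', 'uint32'))  # int64
# At 64 bits there is no wider integer available, so:
print(np.promote_types('int64', 'uint64'))  # float64
```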

@vampypandya
Contributor

I am a newbie here. Can I take up this issue? @bergkvist

@vampypandya
Contributor

vampypandya commented May 25, 2020

@bergkvist This is not a bug. Try doing this.

import pandas as pd

pd.concat([
    pd.Series([1,2], dtype='UInt64'),
    pd.Series([3,4], dtype='Int64')
])

# Output
0    1
1    2
0    3
1    4

I checked the documentation and code. The dtype argument is case-sensitive.
If the dtype is invalid or missing, pandas infers the dtype on its own.
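
The case-sensitivity exists because the two spellings name different dtype systems; a quick check:

```python
import numpy as np
import pandas as pd

# Lowercase 'uint64' selects a plain NumPy dtype.
s_np = pd.Series([1, 2], dtype='uint64')

# Capitalized 'UInt64' selects pandas' nullable extension dtype,
# which supports missing values (pd.NA) without casting to float.
s_ext = pd.Series([1, 2], dtype='UInt64')

print(s_np.dtype)   # uint64  (numpy dtype)
print(s_ext.dtype)  # UInt64  (pandas ExtensionDtype)
```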

Maybe you can raise an issue about why it defaults to float instead of int. I am looking into it and will be happy to help.

Let me know if I have cleared the issue.

Thanks.

@bergkvist
Author

@vampypandya Yes, dtype is case-sensitive (the numpy dtype names are all lower-case). dtype=object is used for mixed data types that pandas does not understand internally.

pd.concat([
    pd.Series([1,2], dtype='UInt64'),
    pd.Series([3,4], dtype='Int64')
])

# Output
0    1
1    2
0    3
1    4
dtype: object
# ^ Notice that the dtype is object here

@vampypandya
Contributor

@bergkvist
I checked the code. Here is an explanation for all cases. I am new to all this, so please let me know if I am making a mistake anywhere.

Let's trace through this input:

pd.concat([
    pd.Series([1,2], dtype='UInt64'),
    pd.Series([3,4], dtype='Int64')
])

The code snippet which handles the final dtype is the following:

    if any(isinstance(t, ExtensionDtype) for t in types):
        for t in types:
            if isinstance(t, ExtensionDtype):
                res = t._get_common_dtype(types)
                if res is not None:
                    return res
        return np.dtype("object")

    # take lowest unit
    if all(is_datetime64_dtype(t) for t in types):
        return np.dtype("datetime64[ns]")
    if all(is_timedelta64_dtype(t) for t in types):
        return np.dtype("timedelta64[ns]")

    # don't mix bool / int or float or complex
    # this is different from numpy, which casts bool with float/int as int
    has_bools = any(is_bool_dtype(t) for t in types)
    if has_bools:
        for t in types:
            if is_integer_dtype(t) or is_float_dtype(t) or is_complex_dtype(t):
                return np.object

    return np.find_common_type(types, [])

Let's check for various cases.

  • Case 1:
    When both inputs have the same dtype (same case and same value), the final dtype matches the input.
    Eg.
    dtype1 = 'Int64'
    dtype2 = 'Int64'
    Final dtype = 'Int64'
    Eg.
    dtype1 = 'int64'
    dtype2 = 'int64'
    Final dtype = 'int64'

  • Case 2:
    When one dtype is uppercase and the other is lowercase, the output dtype is always 'object'.
    Explanation:
    The uppercase dtype is an ExtensionDtype, as can be seen in the code. Since no common dtype is found for the mix, the final dtype falls back to 'object'.
    Eg.
    dtype1 = 'int64'
    dtype2 = 'Int64'
    Final dtype = 'object'

  • Case 3:
    If no input dtype is an ExtensionDtype, then the final dtype is computed as follows.
    np.find_common_type(types, [])
    Eg.
    dtype1 = 'int64'
    dtype2 = 'uint64'
    Final dtype = 'float64'

I have not understood 'ExtensionDtype' well. I would be grateful if anyone can explain this.
The issue you have mentioned is Case 3: pandas is calling a NumPy function to find the common type.
Let me know if this is an issue. I can take this issue.

Also, do let me know in case of any mistakes.
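
The Case 3 promotion can be reproduced with NumPy directly (note that np.find_common_type was later deprecated, in NumPy 1.25, in favour of np.result_type, which agrees for plain dtypes):

```python
import numpy as np

# The pair (int64, uint64) has no common integer dtype, so NumPy falls
# back to float64; smaller mixed pairs still fit in a wider integer.
print(np.result_type(np.dtype('int64'), np.dtype('uint64')))  # float64
print(np.result_type(np.dtype('int32'), np.dtype('uint32')))  # int64
```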

@jorisvandenbossche
Member

The output datatype becomes float when the input datatypes are different integer types.

Since there are no NaN-values or decimals involved, it can be confusing that you suddenly get float output.

@bergkvist this is the expected behaviour, according to numpy casting rules:

In [14]: np.result_type(np.int64, np.uint64) 
Out[14]: dtype('float64')

In [15]: np.result_type(np.int32, np.uint32) 
Out[15]: dtype('int64')

This is because you cannot represent all integers from int64 or uint64 as either one of those dtypes (uint64 cannot hold the negative values from int64, and int64 cannot hold the largest integers from uint64). So the only dtype that can store all the possible values, approximately, is a float, since there is no integer dtype larger than 64 bits available.
But you can also see that this depends on the actual dtypes. E.g., as shown above, the range of int32 and uint32 does fit in int64, so in that case you don't get a float.
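
"Approximately" is the key word: float64 has only a 53-bit mantissa, so the cast silently loses precision near the top of the int64 range:

```python
import numpy as np

big = np.iinfo(np.int64).max      # 9223372036854775807
as_float = np.float64(big)        # nearest representable float64
print(int(as_float))              # 9223372036854775808 -- off by one
```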


For the extension dtypes (with upper case), it's the same story, except that in the released version of pandas, you still get "object" dtype as the result of concatenating different dtypes. This has been fixed in the meantime:

In [18]: pd.concat([ 
    ...:     pd.Series([1,2], dtype='UInt32'), 
    ...:     pd.Series([3,4], dtype='Int32') 
    ...: ]) 
Out[18]: 
0    1
1    2
0    3
1    4
dtype: Int64

@jorisvandenbossche removed the Bug and Needs Triage labels May 25, 2020
@vampypandya
Contributor

@jorisvandenbossche
Could you please explain the extended dtypes or direct me to some documentation? I tried understanding the code but did not get it.

@jorisvandenbossche
Member

@jorisvandenbossche
Member

@bergkvist
Author

@jorisvandenbossche - If you are using integers between 2^63 and 2^64 as well as negative integers, it would probably be better to throw an error (or a warning) than to convert to float64 implicitly.

A reason for using integers is for the precision it gives, and the ability to set it as an index.

Imagine if your computer started compressing your images (lossy) when running out of space instead of throwing an error that you are out of disk space.

@jorisvandenbossche
Member

If you are using integers between 2^63 and 2^64 as well as negative integers

In general, both numpy and pandas try to avoid "value-based" casting rules, i.e. rules that depend on the actual values in the array rather than just the dtype. So in general, the casting is decided from the dtypes: what should the combination of "int64" and "uint64" give? (regardless of whether negative or big integers are present).
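
A small check of that dtype-based behaviour: even values that fit comfortably in int64 still promote to float64:

```python
import pandas as pd

# Promotion is decided from the dtypes alone, not the values:
result = pd.concat([
    pd.Series([1], dtype='uint64'),
    pd.Series([2], dtype='int64'),
])
print(result.dtype)  # float64, even though both values fit in int64
```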

There are certainly arguments for having it raise (or return object dtype) instead of resulting in float64. But it's also a change that would have quite some implications (eg also arithmetic or comparison operations use those casting rules).
And, that's actually a discussion for numpy (since pandas uses numpy here under the hood). See eg numpy/numpy#11801 (comment), numpy/numpy#5745, numpy/numpy#7126, ..

@bergkvist
Author

bergkvist commented May 26, 2020

An alternative to converting to float could be to drop support for the uint64 type (who needs integers between 2**63 and 2**64 anyway?)

Allowing only: uint8, int8, uint16, int16, uint32, int32, int64 would allow for well-behaved dtype-merging for all integer-types.
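
A sketch of that claim: with uint64 excluded, every pairwise promotion among the remaining integer dtypes stays an integer:

```python
import numpy as np
from itertools import combinations

dtypes = ['uint8', 'int8', 'uint16', 'int16', 'uint32', 'int32', 'int64']

# Every pair promotes to a signed or unsigned integer dtype -- never float.
for a, b in combinations(dtypes, 2):
    common = np.result_type(a, b)
    assert common.kind in 'iu'
    print(f'{a} + {b} -> {common}')
```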

The reason I personally ended up with a uint64 is that pandas does not support uint32 indices (nor int32 indices, for that matter), and converts a row of uint32 values to uint64 when it is set as the index. Here, an alternative could be for the index to be converted from uint32 to int64 instead.

Essentially, due to this accumulating dtype inflation, int32 and uint32 values were inflated to float64 when concatenating after reindexing.

Is there really any practical reason to use uint64 instead of int64?

@jorisvandenbossche
Member

I think we should rather try to allow lower-bit dtypes as indices than prohibit uint64 because of that. Feel free to open an issue specifically about that (if there is not yet one).

Is there really any practical reason to use uint64 instead of int64?

I suppose because it preserves the "unsigned" aspect of it?

@jorisvandenbossche
Member

@bergkvist I am going to close this because the original issue (concatenating int64 and uint64 giving float64) is not something we are going to change (that discussion needs to happen in numpy, where some of it already has).

But we can certainly further discuss the usability problems regarding unsigned integers in pandas, like the indexing you mentioned.
