BUG: int64 and uint64 values converted to float64 when concatenated #34356
Comments
For reference, with int8 and uint8:

```python
import pandas as pd

pd.concat([
    pd.Series([1, 2], dtype='int8'),
    pd.Series([3, 4], dtype='uint8')
])
```

Output:

```
0    1
1    2
0    3
1    4
dtype: int16
```
I am a newbie here. Can I take up this issue? @bergkvist
@bergkvist This is not a bug. Try doing this.

I checked the documentation and code. The attribute dtype is case-sensitive (all lower-case). Maybe you can raise an issue about why it defaults to float instead of int. I am looking into it and will be happy to help. Let me know if I have cleared up the issue. Thanks.
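To illustrate the case-sensitivity point (this snippet is an addition, not part of the original comment): the lower-case and capitalised spellings select two different dtype systems in pandas.

```python
import pandas as pd

# 'int64' (lower case) is the plain NumPy dtype; 'Int64' (capitalised)
# is the pandas nullable extension dtype. The spelling selects the type:
a = pd.Series([1, 2], dtype='int64')
b = pd.Series([1, 2], dtype='Int64')

assert str(a.dtype) == 'int64'   # NumPy integer dtype
assert str(b.dtype) == 'Int64'   # pandas nullable extension dtype
```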
@vampypandya Yes, dtype is case-sensitive (all lower-case). dtype=object is used for mixed data types that pandas does not understand internally.

```python
pd.concat([
    pd.Series([1, 2], dtype='UInt64'),
    pd.Series([3, 4], dtype='Int64')
])
# Output
# 0    1
# 1    2
# 0    3
# 1    4
# dtype: object
# ^ Notice that the dtype is object here
```
@bergkvist Let's see for the input:

The code snippet which handles the final dtype is the following:

Let's check the various cases.

I have not understood 'ExtensionDtype' well. I would be grateful if anyone could explain it. Also, do let me know in case of any mistakes.
@bergkvist this is the expected behaviour, according to numpy casting rules:

This is because you cannot represent all integers from int64 or uint64 as either one of those dtypes (uint64 cannot hold the negative values from int64, and int64 cannot hold the largest integers from uint64). So the only dtype that can store all the possible values, at least approximately, is a float (since there is no integer dtype larger than 64 bits available). For the extension dtypes (with upper case), it's the same story, except that in the released version of pandas you still get "object" dtype as the result of concatenating different dtypes. This has been fixed in the meantime:
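The casting rule described above can be checked directly with NumPy's `promote_types` (this sketch is an illustration added here, not part of the original comment):

```python
import numpy as np

# No 64-bit integer dtype can hold both int64's negative values and
# uint64's largest values, so NumPy promotes the pair to float64:
assert np.promote_types(np.int64, np.uint64) == np.float64

# For narrower widths a wider integer dtype exists, so the promotion
# stays integral:
assert np.promote_types(np.int8, np.uint8) == np.int16
assert np.promote_types(np.int32, np.uint32) == np.int64
```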
@jorisvandenbossche
@jorisvandenbossche - If you are using integers between 2^63 and 2^64 as well as negative integers, it would probably be better to throw an error (or a warning) than to convert to float64 implicitly. A reason for using integers is for the precision it gives, and the ability to set it as an index. Imagine if your computer started compressing your images (lossy) when running out of space instead of throwing an error that you are out of disk space.
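A small sketch (added for illustration) of the silent precision loss being described: float64 has a 53-bit mantissa, so integers above 2^53 need not round-trip.

```python
import pandas as pd

big = 2**63 + 1  # representable as uint64, but not as int64 or exactly as float64

s = pd.concat([
    pd.Series([big], dtype='uint64'),
    pd.Series([-1], dtype='int64'),
])

# The result is float64, and the large value has silently lost precision:
assert s.dtype == 'float64'
assert int(s.iloc[0]) != big
```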
In general, both numpy and pandas try to avoid "value-based" casting rules, meaning rules that depend on the actual values in the array and not just the dtype. So in general, the casting gets decided on the dtypes: what should the combination of "int64" and "uint64" give? (regardless of whether negative or big integers are present). There are certainly arguments for having it raise (or return object dtype) instead of resulting in float64. But it's also a change that would have quite some implications (e.g. arithmetic and comparison operations also use those casting rules).
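The dtype-based (as opposed to value-based) behaviour can be demonstrated as follows (an added illustration):

```python
import numpy as np

# The common dtype is a function of the input dtypes alone; the actual
# values never enter into it:
assert np.result_type(np.dtype('int64'), np.dtype('uint64')) == np.float64

# Arithmetic between the two families follows the same casting rule,
# even for values that would fit comfortably in either dtype:
out = np.array([1], dtype='int64') + np.array([1], dtype='uint64')
assert out.dtype == np.float64
```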
An alternative to converting to float could be to drop support for the uint64 type (who needs integers between 2^63 and 2^64 anyway?). Allowing only uint8, int8, uint16, int16, uint32, int32, and int64 would allow for well-behaved dtype-merging for all integer types. The reason I personally ended up with a uint64 is that pandas does not support uint32 indices (nor int32 indices, for that matter), and converts a row of uint32 values to uint64 when it is set as the index. Here, an alternative could be for the index to be converted from uint32 to int64 instead. Essentially, due to the accumulating dtype inflation, int32 and uint32 values inflated to float64 when concatenating after reindexing. Is there really any practical reason to use uint64 instead of int64?
I think we should then rather try to allow for lower-bit dtypes as indices, than prohibit uint64 because of that. Feel free to open an issue specifically about that (if there is not yet one).
I suppose because it preserves the "unsigned" aspect of it?
@bergkvist I am going to close this because the original issue (concatenating int64 and uint64 giving float64) is not something we are going to change (that discussion needs to happen in numpy, where there has already been some). But we can certainly further discuss the usability problems regarding unsigned integers in pandas, like the indexing you mentioned.
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample
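The original code sample did not survive extraction; based on the issue title and the discussion above, a minimal reproduction is presumably:

```python
import pandas as pd

result = pd.concat([
    pd.Series([1, 2], dtype='int64'),
    pd.Series([3, 4], dtype='uint64'),
])

print(result.dtype)  # float64, even though every value is a small integer
```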
Problem description
The output datatype becomes float when the input datatypes are different integer types.
Since there are no NaN-values or decimals involved, it can be confusing that you suddenly get float output.
Expected Output
union(int64, uint64) ⊆ int65, which is the same as int64 unless you have numbers close to "the edge".
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.125-linuxkit
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.0
numpy : 1.17.0
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.1
setuptools : 41.0.1
Cython : 0.29.13
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.6
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None