Column dtype change on write of improper value #26049

Closed
jp68138743541 opened this issue Apr 11, 2019 · 5 comments · Fixed by #45273
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@jp68138743541

Code Sample

import pandas as pd
df = pd.DataFrame(columns=['col1'], data=[1,2,3,4], dtype='uint8')
print('original dtypes:')
print(df.dtypes)
print()
print('original data frame:')
print(df)
print()
df.loc[2,'col1']=300
print('dtypes after write operation:')
print(df.dtypes)
print()
print('data frameafter write:')
print(df)

output:

original dtypes:
col1    uint8
dtype: object

original data frame:
   col1
0     1
1     2
2     3
3     4

dtypes after write operation:
col1    int64
dtype: object

data frameafter write:
   col1
0     1
1     2
2    44
3     4

Problem description

When writing a value that is out of range for the column, e.g. an integer too large for an 8-bit unsigned integer column, the written value is cast to uint8 (300 wraps to 44) and the data type of the column is also changed to int64.

Expected Output

I would expect one of two consistent outcomes: either the value is cast and the data type is retained, or the data type is changed and the value is retained.

original dtypes:
col1    uint8
dtype: object

original data frame:
   col1
0     1
1     2
2     3
3     4

dtypes after write operation:
col1    uint8
dtype: object

data frameafter write:
   col1
0     1
1     2
2    44
3     4

or

original dtypes:
col1    uint8
dtype: object

original data frame:
   col1
0     1
1     2
2     3
3     4

dtypes after write operation:
col1    int64 [or even better uint16]
dtype: object

data frameafter write:
   col1
0     1
1     2
2   300
3     4
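
A minimal sketch of how the "or even better uint16" target dtype could be worked out with plain NumPy. It only illustrates the arithmetic behind the second expected output and is not a description of what pandas does internally; the variable names are made up for the example.

```python
import numpy as np

column_dtype = np.dtype('uint8')   # dtype of the existing column
new_value = 300                    # value that does not fit into uint8

# Smallest dtype that can represent the new value on its own.
value_dtype = np.min_scalar_type(new_value)

# Smallest dtype that can hold both the existing column and the new value.
target_dtype = np.promote_types(column_dtype, value_dtype)
print(value_dtype, target_dtype)

# A negative value forces a signed result wide enough for all of uint8.
signed_target = np.promote_types(column_dtype, np.min_scalar_type(-1))
print(signed_target)
```

With this rule the column from the report would become uint16 rather than int64, and a negative value would move it to int16.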

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.125-linuxkit
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: 0.11.1
xarray: 0.11.3
IPython: 7.1.1
sphinx: 2.0.0
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: 1.1.5
lxml.etree: 4.3.0
bs4: 4.7.1
html5lib: None
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd
Member

WillAyd commented Apr 11, 2019

That is strange. Option 2 is what we would want here - investigation and PRs to fix would certainly be welcome

@WillAyd WillAyd added Bug Indexing Related to indexing on series/frames, not to indexes themselves labels Apr 11, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Apr 11, 2019
@WillAyd WillAyd added the Dtype Conversions Unexpected or buggy dtype conversions label Apr 11, 2019
@bpieper26

bpieper26 commented Apr 25, 2019

Behavior is consistent with numpy.
The value appears to overflow the uint8 range and wrap around past zero (300 - 2**8 = 44).

import numpy as np

value = np.array([300], dtype=np.uint8)[0]

print(f'value = {value}, value should = 300')

print(f'{300 - 2**8} = 300 - 2**8')

Output

value = 44, value should = 300
44 = 300 - 2**8
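
The same wraparound can be reproduced with an explicit cast, which may be the more portable way to show it: newer NumPy releases may reject an out-of-range Python integer when it is passed directly to the array constructor. This is a sketch with example values chosen for illustration.

```python
import numpy as np

# Start from a wide integer array, then cast down explicitly.
wide = np.array([300, 256, 255, 44], dtype=np.int64)
narrow = wide.astype(np.uint8)  # C-style cast: only the low 8 bits survive

print(narrow.tolist())          # values after the cast
print((wide % 2**8).tolist())   # the same values computed as a remainder
```

Both lines print the same list, which shows the cast is equivalent to reducing each value modulo 2**8.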

@bpieper26

My apologies, I failed to see the dtype change in the original example. NumPy's behavior is not exactly consistent with pandas', as NumPy doesn't change the dtype to int64.

array = np.array([300], dtype=np.uint8)
print(f'value = {array[0]}, value should = 300')
print(f'array.dtype = {array.dtype}')

Output

value = 44, value should = 300
array.dtype = uint8

Will continue investigating on the pandas side.

@phofl
Member

phofl commented Nov 7, 2020

@WillAyd The current behavior on master is

original dtypes:
col1    uint8
dtype: object

original data frame:
   col1
0     1
1     2
2     3
3     4

dtypes after write operation:
col1    uint8
dtype: object

data frameafter write:
   col1
0     1
1     2
2    44
3     4

Is int64 and 300 still the expected behavior?

@WillAyd
Member

WillAyd commented Nov 10, 2020

Hmm OK. This is a pretty ambiguous set of actions so I think matching numpy is the best we can offer. Even there, this probably just falls back to the C standard for wrap around of unsigned integers
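
If silent wraparound is the behavior on offer, callers who want neither a wrapped value nor a changed dtype can check the range themselves before writing. A rough sketch, assuming an integer column; `check_fits` is a hypothetical helper written for this example, not a pandas or NumPy function.

```python
import numpy as np
import pandas as pd


def check_fits(value, dtype):
    """Raise OverflowError if value is outside the range of an integer dtype (hypothetical helper)."""
    info = np.iinfo(dtype)
    if not (info.min <= value <= info.max):
        raise OverflowError(f'{value} does not fit into {dtype} [{info.min}, {info.max}]')
    return value


df = pd.DataFrame(columns=['col1'], data=[1, 2, 3, 4], dtype='uint8')

# An in-range value passes the check and is written as usual.
df.loc[1, 'col1'] = check_fits(200, df['col1'].dtype)

# An out-of-range value is rejected before it reaches the frame.
try:
    df.loc[2, 'col1'] = check_fits(300, df['col1'].dtype)
except OverflowError as exc:
    print(f'rejected: {exc}')

print(df.dtypes)
print(df)
```

The failed write leaves both the column values and the uint8 dtype untouched, so the caller decides explicitly whether to widen the column or drop the value.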

@jreback jreback modified the milestones: Contributions Welcome, 1.5 Jan 10, 2022
5 participants