BUG: uint16 inserted as int16 when assigning row with dict #47294


Closed · 2 of 3 tasks
bluss opened this issue Jun 9, 2022 · 8 comments · Fixed by #47475
Labels: Bug · Indexing (related to indexing on series/frames, not to indexes themselves) · Regression (functionality that used to work in a prior pandas version)

Comments


bluss commented Jun 9, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=["actual", "reference"])
df.loc[0] = {'actual': np.uint16(40_000), 'reference': "nope"}
df

#    actual reference
# 0  -25536      nope

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   actual     1 non-null      int16 
 1   reference  1 non-null      object
dtypes: int16(1), object(1)

Issue Description

When inserting a row with a dict, uint16 values are converted to int16, and the conversion does not preserve the value. This also happens when assigning into an existing object-dtype column (the conversion sequence appears to be uint16 -> int16 -> int in that case).
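The corrupted value is consistent with a bit-level reinterpretation: 40_000 does not fit in int16, and 40_000 - 2**16 == -25_536. A minimal numpy check (independent of pandas) illustrates this:

```python
import numpy as np

# 40_000 fits in uint16 but not int16; viewing the same 16 bits as a
# signed integer wraps around: 40_000 - 2**16 == -25_536.
val = np.uint16(40_000)
print(val.view(np.int16))  # -25536
```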

Expected Behavior

The dtype is expected to be preserved: uint16 if possible, or an integer dtype large enough to represent the value.

Installed Versions

python           : 3.8.10.final.0
python-bits      : 64
OS               : Linux
machine          : x86_64

pandas           : 1.4.2
numpy            : 1.22.4
pytz             : 2022.1
dateutil         : 2.8.2
pip              : 20.0.2
setuptools       : 44.0.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
markupsafe       : 2.1.1
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
@bluss bluss added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 9, 2022
@bluss bluss changed the title BUG: BUG: uint16 inserted as int16 when assigning row with dict Jun 9, 2022
@simonjayhawkins
Member

Thanks @bluss for the report

Note: this was working in pandas 1.2.5, albeit retaining object dtype from the empty DataFrame. In pandas 1.3.5 the value overflows but the column keeps object dtype, and in pandas 1.4.2/main the value overflows and the dtype is changed (incorrectly) to int16.

print(pd.__version__)
df = pd.DataFrame(columns=["actual", "reference"])
df.loc[0] = {"actual": np.uint16(40_000), "reference": "nope"}
print(df)
print(df.dtypes)

# 1.2.5
#   actual reference
# 0  40000      nope
# actual       object
# reference    object
# dtype: object

# 1.3.5
#    actual reference
# 0  -25536      nope
# actual       object
# reference    object
# dtype: object

# 1.4.2
#    actual reference
# 0  -25536      nope
# actual        int16
# reference    object
# dtype: object

@simonjayhawkins added the Indexing and Regression labels and removed the Needs Triage label Jun 15, 2022
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 15, 2022
@simonjayhawkins
Member

> note working in pandas-1.2.5, albeit retaining object dtype from empty DataFrame and then in pandas-1.3.5 overflow but still with object dtype

first bad commit: [549e39b] ENH: Make maybe_convert_object respect dtype itemsize (#40908) cc @rhshadrach

> and then in pandas-1.4.2/main overflow but changed (incorrect) dtype.

I suspect this is a result of https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#ignoring-dtypes-in-concat-with-empty-or-all-na-columns and that addressing the above regression resolves the dtype issue.

@rhshadrach
Member

It looks like we never set seen.uint_ to True in maybe_convert_objects; I believe fixing that will resolve this. Will be working on it.

@rhshadrach
Member

rhshadrach commented Jun 19, 2022

maybe_convert_objects currently does not check for signed vs unsigned numpy types.

ser = pd.Series([np.uint16(40000)])
print(ser)

# 0   -25536
# dtype: int16

Whether maybe_convert_objects returns signed or unsigned is currently determined by the Boolean flags:

  • seen.uint_: whether a value greater than or equal to 2 ** 63 is seen
  • seen.sint_: whether a value less than 0 is seen

and follows the logic (assuming only integral types are seen):

  • If seen.uint_ and seen.sint_ are both True, result is object dtype
  • If seen.uint_ is True, result is unsigned integer dtype
  • Otherwise result is signed integer dtype
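These thresholds can be observed with plain Python ints, where 64-bit dtypes are assumed (behavior as of recent pandas versions):

```python
import pandas as pd

# A value >= 2**63 sets seen.uint_; a negative value sets seen.sint_.
print(pd.Series([2**63]).dtype)      # uint64 (only uint_ seen)
print(pd.Series([-1]).dtype)         # int64  (only sint_ seen)
print(pd.Series([-1, 2**63]).dtype)  # object (both seen)
```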

Now that maybe_convert_objects respects itemsize when all inputs are numpy scalars (#40908), the use of the value 2 ** 63 above is no longer sufficient. However, when modifying the logic we only need to consider the case where the inputs are all numpy scalars, as otherwise 64-bit values are assumed.

I propose the following logic to determine seen.uint_ / sint_:

  • seen.uint_: whether a value greater than or equal to 2 ** 63 or an unsigned numpy scalar is seen
  • seen.sint_: whether a value less than 0 or a signed numpy scalar is seen

In the case where all values are numpy scalars, this is effectively:

  • If both signed and unsigned scalars are seen, result is object dtype
  • If only unsigned scalars are seen, result is unsigned
  • Otherwise result is signed.
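As a rough sketch, the proposed rules could look like the following (a hypothetical standalone helper, not the actual cython implementation in pandas):

```python
import numpy as np

def proposed_integer_result(values):
    """Sketch of the proposed seen.uint_ / seen.sint_ rules,
    assuming all inputs are integral numpy scalars or Python ints."""
    seen_uint = any(
        int(v) >= 2**63 or isinstance(v, np.unsignedinteger) for v in values
    )
    seen_sint = any(
        int(v) < 0 or isinstance(v, np.signedinteger) for v in values
    )
    if seen_uint and seen_sint:
        return "object"
    return "unsigned" if seen_uint else "signed"

print(proposed_integer_result([np.uint16(40_000)]))          # unsigned
print(proposed_integer_result([np.int16(-5)]))               # signed
print(proposed_integer_result([np.uint16(1), np.int16(1)]))  # object
```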

I plan to put up a PR in the next few days.

@simonjayhawkins
Member

> In the case where all values are numpy scalars, this is effectively:
>
>   • If both signed and unsigned scalars are seen, result is object dtype

Out of curiosity, why do we have our own logic and not just use the numpy array constructor when all values are numpy scalars/numeric?

np.array([np.uint16(np.iinfo(np.uint16).max), np.int16(np.iinfo(np.int16).min)]).dtype
# dtype('int32')

np.array([np.uint32(np.iinfo(np.uint32).max), np.int32(np.iinfo(np.int32).min)]).dtype
# dtype('int64')

np.array([np.uint64(np.iinfo(np.uint64).max), np.int64(np.iinfo(np.int64).min)]).dtype
# dtype('float64')

This would probably be an api-breaking change, but reducing the cases that return object dtype is probably a win for most users.

I have also spoken with some users who find pandas slow, so after using pandas to explore/develop, they use numpy for speed where possible in production. So although I sometimes see comments that users learn pandas first and hence don't know numpy, and that we don't need to match numpy's behavior, I generally prefer to match numpy where possible.

@jreback
Contributor

jreback commented Jul 11, 2022

-1 on changing

pandas does the right things - numpy sometimes does things that tbh are not great but have forever been there

slowness is almost 100% incorrect usage and writing non-idiomatic code

if someone says 'pandas is slow' then show a specific example

@simonjayhawkins
Member

> pandas does the right things - numpy sometimes does things that tbh are not great but have forever been there

yes, converting integers (mix of uint64 and int64) to floats (dtype('float64')) could be viewed as "corrupting" input data through loss of precision.
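The precision loss is easy to demonstrate: mixing uint64 and int64 in a plain numpy array promotes to float64, whose 53-bit mantissa cannot represent every 64-bit integer exactly:

```python
import numpy as np

big = np.uint64(2**64 - 1)
arr = np.array([big, np.int64(-1)])
print(arr.dtype)                # float64: numpy's common type for uint64/int64
print(int(arr[0]) == int(big))  # False: the value was rounded up to 2**64
```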

> slowness is almost 100% incorrect usage and writing non-idiomatic code
>
> if someone says 'pandas is slow' then show a specific example

If they somehow end up with object dtype, then this is a given? To be fair, the users that I have heard this from are doing machine learning so tend to move their data into numpy arrays anyway, it's just a question of which stage of their pipeline.

@rhshadrach
Member

rhshadrach commented Jul 11, 2022

I've opened a new issue and copied the discussion above there - #47673
