BUG: uint16 inserted as int16 when assigning row with dict #47294


Closed · 2 of 3 tasks
bluss opened this issue Jun 9, 2022 · 8 comments · Fixed by #47475
Labels: Bug · Indexing (related to indexing on series/frames, not to indexes themselves) · Regression (functionality that used to work in a prior pandas version)

Comments


bluss commented Jun 9, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=["actual", "reference"])
df.loc[0] = {'actual': np.uint16(40_000), 'reference': "nope"}
df

#    actual reference
# 0  -25536      nope

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   actual     1 non-null      int16 
 1   reference  1 non-null      object
dtypes: int16(1), object(1)

Issue Description

When inserting a row with a dict, uint16 values are converted to int16, and the conversion does not preserve the value. This also happens when assigning into an existing object-dtype column (the conversion sequence appears to be uint16 -> int16 -> int in that case).
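The corrupted value is consistent with a bit-level reinterpretation: 40_000 does not fit in int16, and 40_000 - 2**16 == -25_536. A minimal numpy check (independent of pandas) illustrates this:

```python
import numpy as np

# 40_000 fits in uint16 but not int16; viewing the same 16 bits as a
# signed integer wraps around: 40_000 - 2**16 == -25_536.
val = np.uint16(40_000)
print(val.view(np.int16))  # -25536
```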

Expected Behavior

The dtype is expected to be preserved: uint16 if possible, or an integer dtype large enough to represent the value.

Installed Versions

python           : 3.8.10.final.0
python-bits      : 64
OS               : Linux
machine          : x86_64

pandas           : 1.4.2
numpy            : 1.22.4
pytz             : 2022.1
dateutil         : 2.8.2
pip              : 20.0.2
setuptools       : 44.0.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
markupsafe       : 2.1.1
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
@bluss bluss added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 9, 2022
@bluss bluss changed the title BUG: BUG: uint16 inserted as int16 when assigning row with dict Jun 9, 2022
@simonjayhawkins
Member

Thanks @bluss for the report

Note: this was working in pandas 1.2.5, albeit retaining object dtype from the empty DataFrame. In pandas 1.3.5 the value overflows but the column keeps object dtype, and in pandas 1.4.2/main the value overflows and the dtype is changed (incorrectly) to int16.

print(pd.__version__)
df = pd.DataFrame(columns=["actual", "reference"])
df.loc[0] = {"actual": np.uint16(40_000), "reference": "nope"}
print(df)
print(df.dtypes)

# 1.2.5
#   actual reference
# 0  40000      nope
# actual       object
# reference    object
# dtype: object

# 1.3.5
#    actual reference
# 0  -25536      nope
# actual       object
# reference    object
# dtype: object

# 1.4.2
#    actual reference
# 0  -25536      nope
# actual        int16
# reference    object
# dtype: object

@simonjayhawkins added the Indexing and Regression labels and removed the Needs Triage label Jun 15, 2022
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 15, 2022
@simonjayhawkins
Member

> note working in pandas-1.2.5, albeit retaining object dtype from empty DataFrame and then in pandas-1.3.5 overflow but still with object dtype

first bad commit: [549e39b] ENH: Make maybe_convert_object respect dtype itemsize (#40908) cc @rhshadrach

> and then in pandas-1.4.2/main overflow but changed (incorrect) dtype.

I suspect this is a result of https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#ignoring-dtypes-in-concat-with-empty-or-all-na-columns and that addressing the above regression resolves the dtype issue.

@rhshadrach
Member

It looks like we never set seen.uint_ to True in maybe_convert_objects; I believe fixing that will resolve this. Will be working on it.

@rhshadrach
Member

rhshadrach commented Jun 19, 2022

maybe_convert_objects currently does not check for signed vs unsigned numpy types.

ser = pd.Series([np.uint16(40000)])
print(ser)

# 0   -25536
# dtype: int16

Whether maybe_convert_objects returns signed or unsigned is currently determined by the Boolean flags:

  • seen.uint_: whether a value greater than or equal to 2 ** 63 is seen
  • seen.sint_: whether a value less than 0 is seen

and follows the logic (assuming only integral types are seen):

  • If seen.uint_ and seen.sint_ are both True, result is object dtype
  • If seen.uint_ is True, result is unsigned integer dtype
  • Otherwise result is signed integer dtype
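These thresholds can be observed with plain Python ints, where 64-bit dtypes are assumed (behavior as of recent pandas versions):

```python
import pandas as pd

# A value >= 2**63 sets seen.uint_; a negative value sets seen.sint_.
print(pd.Series([2**63]).dtype)      # uint64 (only uint_ seen)
print(pd.Series([-1]).dtype)         # int64  (only sint_ seen)
print(pd.Series([-1, 2**63]).dtype)  # object (both seen)
```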

Now that maybe_convert_objects respects itemsize when all inputs are numpy scalars (#40908), the use of the value 2 ** 63 above is no longer sufficient. However, when modifying the logic we only need to consider the case where the inputs are all numpy scalars, as otherwise 64-bit values are assumed.

I propose the following logic to determine seen.uint_ / sint_:

  • seen.uint_: whether a value greater than or equal to 2 ** 63 or an unsigned numpy scalar is seen
  • seen.sint_: whether a value less than 0 or a signed numpy scalar is seen

In the case where all values are numpy scalars, this is effectively:

  • If both signed and unsigned scalars are seen, result is object dtype
  • If only unsigned scalars are seen, result is unsigned
  • Otherwise result is signed.
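As a rough sketch, the proposed rules could look like the following (a hypothetical standalone helper, not the actual cython implementation in pandas):

```python
import numpy as np

def proposed_integer_result(values):
    """Sketch of the proposed seen.uint_ / seen.sint_ rules,
    assuming all inputs are integral numpy scalars or Python ints."""
    seen_uint = any(
        int(v) >= 2**63 or isinstance(v, np.unsignedinteger) for v in values
    )
    seen_sint = any(
        int(v) < 0 or isinstance(v, np.signedinteger) for v in values
    )
    if seen_uint and seen_sint:
        return "object"
    return "unsigned" if seen_uint else "signed"

print(proposed_integer_result([np.uint16(40_000)]))          # unsigned
print(proposed_integer_result([np.int16(-5)]))               # signed
print(proposed_integer_result([np.uint16(1), np.int16(1)]))  # object
```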

I plan to put up a PR in the next few days.

@simonjayhawkins
Member

> In the case where all values are numpy scalars, this is effectively:
>
>   • If both signed and unsigned scalars are seen, result is object dtype

Out of curiosity, why do we have our own logic and not just use the numpy array constructor when all values are numpy scalars/numeric?

np.array([np.uint16(np.iinfo(np.uint16).max), np.int16(np.iinfo(np.int16).min)]).dtype
# dtype('int32')

np.array([np.uint32(np.iinfo(np.uint32).max), np.int32(np.iinfo(np.int32).min)]).dtype
# dtype('int64')

np.array([np.uint64(np.iinfo(np.uint64).max), np.int64(np.iinfo(np.int64).min)]).dtype
# dtype('float64')

This would probably be an api-breaking change, but reducing the cases that return object dtype is probably a win for most users.

I have also spoken with some users who find pandas slow, so after using pandas to explore/develop, they use numpy for speed where possible in production. So although I sometimes see comments that users learn pandas first and hence don't know numpy, and that we don't need to match numpy's behavior, I generally prefer to match numpy where possible.

@jreback
Contributor

jreback commented Jul 11, 2022

-1 on changing

pandas does the right things - numpy sometimes does things that tbh are not great but have forever been there

slowness is almost 100% incorrect usage and writing non-idiomatic code

if someone says 'pandas is slow' then show a specific example

@simonjayhawkins
Member

> pandas does the right things - numpy sometimes does things that tbh are not great but have forever been there

yes, converting integers (mix of uint64 and int64) to floats (dtype('float64')) could be viewed as "corrupting" input data through loss of precision.
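The precision loss is easy to demonstrate: mixing uint64 and int64 in a plain numpy array promotes to float64, whose 53-bit mantissa cannot represent every 64-bit integer exactly:

```python
import numpy as np

big = np.uint64(2**64 - 1)
arr = np.array([big, np.int64(-1)])
print(arr.dtype)                # float64: numpy's common type for uint64/int64
print(int(arr[0]) == int(big))  # False: the value was rounded up to 2**64
```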

> slowness is almost 100% incorrect usage and writing non-idiomatic code
>
> if someone says 'pandas is slow' then show a specific example

If they somehow end up with object dtype, then this is a given? To be fair, the users that I have heard this from are doing machine learning so tend to move their data into numpy arrays anyway, it's just a question of which stage of their pipeline.

@rhshadrach
Member

rhshadrach commented Jul 11, 2022

I've opened a new issue and copied the discussion above there - #47673
