BUG: uint16 inserted as int16 when assigning row with dict #47294
Thanks @bluss for the report. Note: this works in pandas-1.2.5, albeit retaining object dtype from the empty DataFrame; in pandas-1.3.5 the value overflows but the dtype is still object; and in pandas-1.4.2/main the value overflows and the dtype is changed (incorrectly) to int16.

```python
print(pd.__version__)
df = pd.DataFrame(columns=["actual", "reference"])
df.loc[0] = {"actual": np.uint16(40_000), "reference": "nope"}
print(df)
print(df.dtypes)

# 1.2.5
#   actual reference
# 0  40000      nope
# actual       object
# reference    object
# dtype: object

# 1.3.5
#   actual reference
# 0 -25536      nope
# actual       object
# reference    object
# dtype: object

# 1.4.2
#   actual reference
# 0 -25536      nope
# actual        int16
# reference    object
# dtype: object
```
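For reference, the -25536 in the 1.3.5/1.4.2 output is exactly what a two's-complement reinterpretation of 40000 as int16 produces (40000 - 65536 = -25536), as this small illustration shows:

```python
import numpy as np

# Casting the uint16 value to int16 wraps it modulo 2**16.
print(np.uint16(40_000).astype(np.int16))  # -25536
print(40_000 - 2**16)                      # -25536
```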
first bad commit: [549e39b] ENH: Make maybe_convert_object respect dtype itemsize (#40908) cc @rhshadrach
I suspect this is a result of https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#ignoring-dtypes-in-concat-with-empty-or-all-na-columns, and that addressing the above regression also resolves the dtype issue.
It looks like we never set
maybe_convert_objects currently does not check for signed vs unsigned numpy types.
The current check for whether maybe_convert_objects returns signed or unsigned uses Boolean flags
and follows this logic (assuming only integral types are seen):
Now that maybe_convert_objects respects itemsize when all inputs are numpy scalars (#40908), I propose the following logic for determining seen.uint_ / seen.sint_:
In the case where all values are numpy scalars, this is effectively:
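As a rough illustration of the desired sign-aware outcome (a hypothetical sketch using np.result_type, not the actual maybe_convert_objects internals):

```python
import numpy as np

# np.result_type applies numpy's promotion rules: a lone uint16 stays
# uint16, while mixing uint16 with a signed int16 widens to int32 so
# both ranges fit.
print(np.result_type(np.uint16(40_000)))                # uint16
print(np.result_type(np.uint16(40_000), np.int16(-1)))  # int32
```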
I plan to put up a PR in the next few days.
Out of curiosity, why do we have our own logic and not just use the numpy array constructor when all values are numpy scalars/numeric?

```python
np.array([np.uint16(np.iinfo(np.uint16).max), np.int16(np.iinfo(np.int16).min)]).dtype
# dtype('int32')
np.array([np.uint32(np.iinfo(np.uint32).max), np.int32(np.iinfo(np.int32).min)]).dtype
# dtype('int64')
np.array([np.uint64(np.iinfo(np.uint64).max), np.int64(np.iinfo(np.int64).min)]).dtype
# dtype('float64')
```

This would probably be an API-breaking change, but reducing the cases that return object dtype is probably a win for most users. I have also spoken with some users who find pandas slow, so after using pandas to explore/develop, they use numpy for speed where possible in production. So although I sometimes see comments that users learn pandas first and hence don't know numpy, and that we don't need to match numpy's behavior, I generally prefer to match numpy where possible.
-1 on changing; pandas does the right thing here. numpy sometimes does things that, to be honest, are not great but have been there forever. Slowness is almost always the result of incorrect usage and non-idiomatic code; if someone says "pandas is slow", then show a specific example.
Yes, converting integers (a mix of uint64 and int64) to floats (dtype('float64')) could be viewed as "corrupting" input data through loss of precision.
If they somehow end up with object dtype, though, isn't this a given? To be fair, the users I have heard this from are doing machine learning, so they tend to move their data into numpy arrays anyway; it's just a question of which stage of their pipeline.
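The precision loss is easy to demonstrate: float64 has only 53 bits of mantissa, so the uint64 maximum cannot round-trip (illustrative example):

```python
import numpy as np

v = np.iinfo(np.uint64).max            # 18446744073709551615
arr = np.array([np.uint64(v), np.int64(-1)])
print(arr.dtype)                       # float64
print(int(arr[0]) == v)                # False: the value was rounded
```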
I've opened a new issue and copied the discussion above there - #47673
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
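The reproducible example, as quoted in the discussion above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["actual", "reference"])
df.loc[0] = {"actual": np.uint16(40_000), "reference": "nope"}
print(df)
print(df.dtypes)  # on affected versions, "actual" comes back as int16
```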
Issue Description
When inserting a row with a dict, uint16 values are converted to int16 and the conversion does not preserve the correct value. This also happens when assigning into an existing object-typed column (the conversion sequence appears to be uint16 -> int16 -> int in that case).
Expected Behavior
It's expected the dtype is preserved - uint16 if possible, or an int which is large enough to represent the value.
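One way to get the expected result today is to construct the column with an explicit dtype (a workaround sketch, not the proposed fix):

```python
import numpy as np
import pandas as pd

# Building the frame with an explicit uint16 array sidesteps the
# dict-row inference path, so both dtype and value are preserved.
df = pd.DataFrame({
    "actual": np.array([40_000], dtype=np.uint16),
    "reference": ["nope"],
})
print(df.dtypes["actual"])   # uint16
print(df.loc[0, "actual"])   # 40000
```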
Installed Versions