Skip to content

BUG: Dtype was not changed when float was assignt to int column #37680

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 21 commits into from

Conversation

phofl
Copy link
Member

@phofl phofl commented Nov 7, 2020

We have to keep the get_loc statement to check, if the series._set_value function does the trick. I added the dtype conversion there, to avoid implementing the same thing twice

@phofl
Copy link
Member Author

phofl commented Nov 7, 2020

Had to check for numeric dtype, we could do this in validate_numeric_casting?

@@ -1110,6 +1113,10 @@ def _set_value(self, label, value, takeable: bool = False):
else:
loc = self.index.get_loc(label)
validate_numeric_casting(self.dtype, value)
dtype, _ = infer_dtype_from_scalar(value, pandas_dtype=True)
if not is_dtype_equal(self.dtype, dtype) and is_numeric_dtype(dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to check if is numeric here? isn't just not dtype equal enough?

Copy link
Member Author

@phofl phofl Nov 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a test calling this function where scalar is a string value, which is expected to raise an ValueError, hence this check there. If the test is somehow erroneous, we could delete it.

https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=47080&view=logs&j=ba13898e-1dfb-5ace-9966-8b7af3677790
The failing test is in this build

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show the test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 8, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -39,3 +39,19 @@ def test_at_with_duplicate_axes_requires_scalar_lookup(self):
df.at[1, ["A"]] = 1
with pytest.raises(ValueError, match=msg):
df.at[:, "A"] = 1


def test_at_assign_float_to_int_frame():
Copy link
Contributor

@jreback jreback Nov 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this on .loc for the same examples

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also testing iat/iloc for the same (obviously using the position). if its becomes messy then can do that in a followup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also we would want to test this for other dtypes as well. so that is for sure a followon, pls open an issue for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all this said, we might have some tests for this already (likely we do), prob have to hunt a bit for them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parametrized the tests for loc, iloc and iat would be an ugly parametrization. I will open an issue for dtypes and iloc/iat

@phofl phofl changed the title BUG: Dtype was not changed when float was assignt do int column BUG: Dtype was not changed when float was assignt to int column Nov 8, 2020
@phofl
Copy link
Member Author

phofl commented Nov 10, 2020

Found #20643 which was not fixed completly. Adjusted the code accordingly to fix this too.

@phofl
Copy link
Member Author

phofl commented Nov 10, 2020

cc @jreback

Two open points here:

  • If integer/float is assigned to boolean series, which dtype is expected? Currently it is converted to object
  • Through dispatching to iloc, when it is takeble but dtypes are not equal in iat, we get a different error. Is this a problem?

Fixed a few tests, which were xfailing previously.

series._values[loc] = value
# Note: trying to use series._set_value breaks tests in
# tests.frame.indexing.test_indexing and tests.indexing.test_partial
self.index._engine.get_loc(index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you discarding loc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If get_loc raises a KeyError we have to catch this here insted of in series._set_value but we do not need loc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why bother calling get_loc here at all if its going to be called again in series._set_value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to arrive in the except clause here, when index does not exist instead of the except clause in series._set_value

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to arrive in the except clause here, when index does not exist instead of the except clause in series._set_value

can you elaborate on this? is it bc the except clause here does a takeable check whereas the one in _set_value does not? if so, could just add that check there?

why is the deleted comment about "tests.frame.indexing.test_indexing and tests.indexing.test_partial" no longer relevant?

if is_dtype_equal(self.dtype, dtype) or isna(value):
self._values[loc] = value
else:
self.loc[key] = value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldnt validate_numeric_casting just raise here?

@@ -1043,7 +1046,11 @@ def _set_with_engine(self, key, value):
# fails with AttributeError for IntervalIndex
loc = self.index._engine.get_loc(key)
validate_numeric_casting(self.dtype, value)
self._values[loc] = value
dtype, _ = infer_dtype_from_scalar(value)
if is_dtype_equal(self.dtype, dtype) or isna(value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably want is_valid_nat_for_dtype instead of isna

@jbrockmendel
Copy link
Member

My suggestion is to start with adding in validate_numeric_casting:

    if issubclass(dtype.type, (np.integer, np.bool_)):
        if is_float(value) and np.isnan(value):
            raise ValueError("Cannot assign nan to integer series")
+        if is_float(value) and not value.is_integer():
+            raise ValueError("Cannot assign non-integer float to integer series")

That won't fix all of this, but i think is the right place to collect the relevant logic.

@phofl
Copy link
Member Author

phofl commented Nov 12, 2020

The issue was actually different. Did not recognize this a few days ago. This is only a problem, if the Index contains tuples as values, loc does not support this. So we get a KeyError. Changed this now a bit, which should make it clearer.

@jbrockmendel
Copy link
Member

@phofl can you run the indexing asvs on this. we've optimized this portion of the code pretty hard and im not confident this is the right place to catch these cases

@jreback pls hold off on merging for a couple days so i can hopefully get a handle on this

@jreback
Copy link
Contributor

jreback commented Nov 14, 2020

yep agreed

@phofl
Copy link
Member Author

phofl commented Nov 15, 2020

Ran indexing asvs

      before           after         ratio
     [50ae0bfc]       [566d7692]
     <master>         <26395>   
+      8.45±0.3ms       10.2±0.4ms     1.20  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
+      54.3±0.8μs         63.9±8μs     1.18  indexing.NonNumericSeriesIndexing.time_getitem_pos_slice('period', 'non_monotonic')
-         105±3μs       95.1±0.3μs     0.91  indexing.DataFrameNumericIndexing.time_iloc_dups
-         256±4ms          223±7ms     0.87  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-         259±9ms          219±6ms     0.85  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-         253±8ms          214±8ms     0.85  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-        261±10ms          217±7ms     0.83  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-         211±1μs          171±2μs     0.81  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int8Engine'>, <class 'numpy.int8'>), 'monotonic_incr')
-         188±7ms          152±4ms     0.81  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        270±10ms          216±7ms     0.80  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-         184±2ms          146±4ms     0.79  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      42.5±0.9μs         26.4±1μs     0.62  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-     1.89±0.05ms      1.09±0.04ms     0.58  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-     1.62±0.04ms         910±40μs     0.56  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        310±10μs          168±5μs     0.54  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-         307±8μs          164±6μs     0.54  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        90.1±2μs         47.7±1μs     0.53  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')

If I understood you correctly, you would have expected the opposite?

exp = pd.Series([1, val, 3, 4])
self._assert_setitem_series_conversion(obj, val, exp, exp_dtype)

@pytest.mark.parametrize(
"val,exp_dtype", [(np.int32(1), np.int8), (np.int16(2 ** 9), np.int16)]
"val,exp_dtype", [(np.int32(1), np.int32), (np.int16(2 ** 9), np.int16)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was downcasting before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

(1.1, np.float64),
(1 + 1j, np.complex128),
(1, "object"),
(3, "object"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so these were doing an inferred cast before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy inferred cast, yes

@@ -47,7 +47,6 @@ def test_partial_setting(self):
with pytest.raises(IndexError, match=msg):
s.iloc[3] = 5.0

msg = "index 3 is out of bounds for axis 0 with size 3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the message now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Messages for iat and iloc are now the same, hence the deletion of the redefinition

else:
loc = self.index.get_loc(label)
validate_numeric_casting(self.dtype, value)
if not is_dtype_equal(self.dtype, dtype) and is_numeric_dtype(dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you integrate this with the higher level if/else, these nested if/else are harder to read, even if it means duplicating some code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

else:
# This only raises when index contains tuples
try:
self.loc[key] = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a very strange change, why is it needed?

Copy link
Member Author

@phofl phofl Nov 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is pretty ugly, but a loc limitation unfortunately. If the index is a regular index where values are tuples, loc does not work.

Example:

obj = Series([1, 2], index=[(0,1), (1,2)])
obj[(0,1)] = "test"

This runs in the loc case and loc raises

KeyError: "None of [Int64Index([0, 1], dtype='int64')] are in the [index]"

Tuples are interpreted as components of MultiIndex

@@ -584,6 +584,7 @@ Indexing
- Bug in :meth:`DataFrame.loc` returned requested key plus missing values when ``loc`` was applied to single level from :class:`MultiIndex` (:issue:`27104`)
- Bug in indexing on a :class:`Series` or :class:`DataFrame` with a :class:`CategoricalIndex` using a listlike indexer containing NA values (:issue:`37722`)
- Bug in :meth:`DataFrame.xs` ignored ``droplevel=False`` for columns (:issue:`19056`)
- Bug in :meth:`DataFrame.at` and :meth:`Series.at` did not adjust dtype when float was assigned to integer column (:issue:`26395`, :issue:`20643`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did not adjust -> not adjusting

else:
# This only raises when index contains tuples
try:
self.loc[key] = value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_set_with_engine is a fastpath, really shouldn't be going through self.loc

the whole infer_dtype_from_scalar stuff seems like it should be handled as part of validate_numeric_casting

the issue this PR is addressing is only for when we are indexing an entire column, right?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 20, 2020
@jbrockmendel
Copy link
Member

@phofl can you merge master (even if this is on the back-burner, lets prevent it from drifting too far)

phofl added 2 commits January 3, 2021 19:51
� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
�	pandas/core/frame.py
@phofl
Copy link
Member Author

phofl commented Jan 3, 2021

Yeah have to get back to this...

@jbrockmendel
Copy link
Member

@phofl #39478 does something similar, same effect on test_setitem_series_bool. Can that approach be used here too?

@simonjayhawkins
Copy link
Member

@phofl what's the status here?

@simonjayhawkins
Copy link
Member

@phofl closing as stale. reopen when ready.

@phofl phofl deleted the 26395 branch April 27, 2023 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves Stale
Projects
None yet
4 participants