BUG GH#40498 Add mask to fillna to only modify indexes specified in v… #42859

ghost · 2021-08-03T07:00:19Z

…alues.

closes BUG: Series.fillna is altering None values of unspecified columns #40498
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

ghost · 2021-08-03T17:38:08Z

Going to investigate the failing tests today. It looks like tests outside of the fillna methods are failing because I'm touching more of the blocks code and isna(value) is not always valid. I need to see what data types 'value' can be and whether there's a way to make this mask work for all of them.

ghost · 2021-08-04T05:59:23Z

Looks like the failing tests were related to read_json(json, dtype=False) and missing data / missing values https://github.com/pandas-dev/pandas/issues/28501. The function has been using fillna to replace None with np.nan but this can no longer be done with my changes. I see a few possible solutions:

1. fillna can no longer be called as fillna(np.nan) [Current Submission] - This is because we need np.nan as our way to know which indexes to avoid, meaning it cannot be both part of the mask and a fill value. There is currently no specific test indicating that this is a required behavior, but it does seem to be a common expectation / practice. My first Google search for "pandas replace none with np.nan" shows https://stackoverflow.com/questions/23743460/replace-none-with-nan-in-pandas-dataframe
2. fillna internal functions need to pass values and indexes separately - This means we could use the indexes for the mask, and the values could be anything, including np.nan. This wouldn't break existing uses but would require us to touch the code a lot more.
3. Modify values to be an array of the original series with replaced values [Original Submission] - Instead of an array of np.nan plus the values at the indexes we need to modify, we would pass an array of the original series with replaced values at the indexes we need to modify. This would be okay because we still have an isna mask for the original series at the end.
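For illustration only, the second option (keeping the target indexes separate from the fill values) might look roughly like the sketch below. It uses public pandas calls rather than the internals touched by this PR, and `fill_selected` is a hypothetical helper name, not an existing pandas function.

```python
import numpy as np
import pandas as pd


def fill_selected(ser: pd.Series, fills: dict) -> pd.Series:
    """Hypothetical helper: fill missing entries only at the labels in `fills`."""
    out = ser.copy()
    # The mask is built from the requested labels, not from NaN sentinels
    # in the values, so np.nan stays usable as an ordinary fill value.
    targets = out.index.isin(list(fills)) & out.isna().to_numpy()
    for label in out.index[targets]:
        out.loc[label] = fills[label]
    return out


ser = pd.Series([None, None, "c"], dtype=object)
result = fill_selected(ser, {0: np.nan})
```

Here only label 0 is touched: it becomes np.nan, while the None at label 1 is left alone because it was never requested.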

Would love to hear if anyone has feedback or suggestions.

ghost · 2021-08-04T06:03:00Z

As I'm working through this, I'm wondering if this ticket is still a bug. I'm seeing more and more places where we infer None as np.nan even when inference is not explicit or even when it's explicitly off (e.g. dtype=False).

ghost · 2021-08-04T15:51:51Z

One problem is that we're currently coercing dtypes in pandas/core/internals/blocks.py L445/448 when we call nb._maybe_downcast([nb], downcast) and L510/513 blk.convert(datetime=True, numeric=False).

When we're converting, this can replace None with np.nan if we're using a Series of integers (numeric) -
df = DataFrame({"A": [1, None, None, 4], "B": [2, None, None, 8]})

However it leaves the None when using a Series with integers and strings (object) -
df = DataFrame({"A": [1, None, None, "four"], "B": [2, None, None, "eight"]})

This means convert has a similar behavior which would need to be fixed if this was to work for the first scenario, where None is being changed to np.nan. However, this behavior can also be seen with read_json, pd.DataFrame and pd.Series so I suspect this is intentional.
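The constructor behavior described above can be checked with a short sketch. The variable names are illustrative; only the two DataFrame literals come from the comment.

```python
import pandas as pd

# All-numeric columns: None is coerced to NaN and the column becomes float.
numeric = pd.DataFrame({"A": [1, None, None, 4], "B": [2, None, None, 8]})

# Mixed integers and strings: the column stays object and None is preserved.
mixed = pd.DataFrame({"A": [1, None, None, "four"], "B": [2, None, None, "eight"]})

numeric_dtype = numeric["A"].dtype
mixed_dtype = mixed["A"].dtype
mixed_value = mixed["A"].iloc[1]
```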

jreback

cc @mzeitlin11 if comments

jreback · 2021-08-04T22:26:06Z

pandas/io/json/_json.py

@@ -924,7 +924,10 @@ def _try_convert_data(self, name, data, use_dtypes=True, convert_dates=True):
            if not self.dtype:
                if all(notna(data)):
                    return data, False
-                return data.fillna(np.nan), True
+                data = data.where(notna(data), np.nan)
+                if all(isna(data)):


use isna(data).all()
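A small sketch of why the method form is usually preferred over the builtin. The example data is made up, and the DataFrame case is included only to show a common pitfall; it is an assumption that it is relevant here, since `data` in this function may always be a Series.

```python
import pandas as pd

ser = pd.Series([None, None, None], dtype=object)

# The builtin iterates the boolean mask element by element in Python.
builtin_result = all(pd.isna(ser))
# The method form is a single vectorized reduction over the mask.
method_result = bool(pd.isna(ser).all())

df = pd.DataFrame({"A": [1.0, None], "B": [2.0, 3.0]})
# Iterating a DataFrame yields its column labels, so the builtin tests the
# truthiness of "A" and "B" instead of the mask values.
misleading = all(pd.isna(df))
correct = bool(pd.isna(df).all().all())
```

For a Series the two forms agree, but the method form avoids the Python-level loop and does not silently change meaning if a DataFrame is passed.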

mzeitlin11 · 2021-08-05T01:29:24Z

@mikephung122 have you run ASVs for this? In a previous attempt (#40509) I tried to stay away from internals in fixing this because we'd ideally not degrade performance on usual use cases to fix this pretty niche issue.

IMO the behavior of the function you've modified is not necessarily wrong - its job is to fill missing values with value. It's just that in this specific case the value argument has a somewhat different meaning - np.nan is being used to indicate "don't fill this index", which is different from what that function expects.

mzeitlin11 · 2021-08-05T01:34:21Z

This actually causes a behavior change:

Master:

In [10]: ser = pd.Series([None, None, None])

In [11]: ser.fillna(np.nan)
Out[11]:
0   NaN
1   NaN
2   NaN
dtype: float64

This PR:

In [3]: ser = pd.Series([None, None, None])

In [4]: ser.fillna(np.nan)
Out[4]:
0    None
1    None
2    None
dtype: object

ghost · 2021-08-05T06:05:35Z

@mzeitlin11 I found some instructions for the ASV (airspeed velocity) performance regressions in doc/source/development/contributing_codebase.rst. The tests have started but looks like this might take awhile.

I was also reluctant to touch internals but found this possible solution while trying to better understand how it worked. It seemed simple and didn't break any functionality except for using fillna(np.nan) to replace na values (e.g. None) with np.nan. Is this an expected use of fillna? Do you think this behavior change is acceptable?

mzeitlin11 · 2021-08-05T15:03:13Z

> The tests have started but looks like this might take awhile.

Yeah, the whole suite will be slow, but just running ones specific to fillna should be an ok starting point that would be faster.

> I was also reluctant to touch internals but found this possible solution while trying to better understand how it worked. It seemed simple and didn't break any functionality except for using fillna(np.nan) to replace na values (e.g. None) with np.nan. Is this an expected use of fillna? Do you think this behavior change is acceptable?

I think (but maybe others disagree) that the existing behavior is better than with this change - eg it makes more sense to support replacing one missing value with another, even if it comes at the cost of the specific bug in the issue (which I'd hope is a less common use case). Especially since this change would add additional logic (and maybe slowdown) to a potentially hot path.

Thanks for looking at this btw, it's a tricky issue! Sorry to bring up concerns but not solutions :)

ghost · 2021-08-05T16:22:07Z

@mzeitlin11 I finished the entire ASV test suite and got "BENCHMARKS NOT SIGNIFICANTLY CHANGED." However, I do agree with you. I have seen people use fillna to replace one null with another pretty often, and I'm not sure it's worth losing this functionality to fix a much rarer use case. Do you think this is worth adding a test or clarifying in the documentation? The only reason I submitted this as a potential solution was that no test failed and there are other ways to replace nulls.

No problem, I plan to investigate this further this week and see if there are other possible solutions. If anyone else has any ideas or opinions, just let me know.

mzeitlin11 · 2021-08-05T18:30:03Z

> Do you think this is worth adding a test or clarifying in the documentation?

Definitely worth a test to avoid an undocumented regression.
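One possible shape for such a test, as a rough sketch only. The test name is hypothetical and not taken from the pandas test suite, and the resulting dtype is deliberately not asserted, since it may differ across pandas versions.

```python
import numpy as np
import pandas as pd


def test_fillna_nan_replaces_none():
    # Hypothetical regression test: fillna(np.nan) should swap one missing
    # value (None) for another (np.nan) rather than leave None in place.
    ser = pd.Series([None, None, None], dtype=object)
    result = ser.fillna(np.nan)

    # Every entry is still missing, but none is the None object any more.
    assert result.isna().all()
    assert not any(value is None for value in result)


test_fillna_nan_replaces_none()
```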

> If anyone else has any ideas or opinions just let me know.

My guess would be that the fix will require changing logic earlier on, basically handling indices that should be ignored differently than just using NaNs. But it might be tricky to get right without a bunch of special casing or a performance slowdown.

ghost · 2021-08-17T01:02:25Z

Going to continue working on this on #42981

BUG GH#40498 Add mask to fillna to only modify indexes specified in v…

61b780c

…alues.

BUG GH#40498 Refactor how None is replaced with np.nan in _json.

13105a6

BUG GH#40498 Cleanup fillna tests.

44a5ee2

jreback added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Aug 4, 2021

jreback added this to the 1.4 milestone Aug 4, 2021

jreback requested changes Aug 4, 2021

BUG GH#40498 Change how json _try_conver_data uses all to check for n…

0a4331a

…on-nulls.

mzeitlin11 mentioned this pull request Aug 14, 2021

BUG GH#40498 Fillna other missing values not modified #42981

Closed

4 tasks

ghost closed this Aug 17, 2021

This pull request was closed.
