Skip to content

BUG GH#40498 Add mask to fillna to only modify indexes specified in v… #42859

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

Conversation

ghost
Copy link

@ghost ghost commented Aug 3, 2021

…alues.

@ghost
Copy link
Author

ghost commented Aug 3, 2021

Going to investigate the failing tests today. Looks like tests outside of the fillna methods are failing since I'm touching more the blocks code and isna(value) is not always valid. Need to see what data types 'value' can be and if there's a way to make this mask work for all of them.

@ghost
Copy link
Author

ghost commented Aug 4, 2021

Looks like the failing tests were related to read_json(json, dtype=False) and missing data / missing values https://github.com/pandas-dev/pandas/issues/28501. The function has been using fillna to replace None with np.nan but this can no longer be done with my changes. I see a few possible solutions:

  1. fillna can no longer be used to fillna(np.nan) [Current Submission] - This is because we need to use np.nan as our way to know which indexes to avoid, meaning it cannot be both part of the mask and a value. There is currently no specific test indicating that this is a required behavior but it does seem to be a common expectation / practice. My first google search for "pandas replace none with np.nan" shows https://stackoverflow.com/questions/23743460/replace-none-with-nan-in-pandas-dataframe
  2. fillna internal functions need to pass values and indexes separately - This means we could use indexes for the mask and values can be anything including np.nan. This wouldn't break existing uses but would require us to touch the code a lot more.
  3. Modify values to an array of the original series with replaced values [Original Submission] - Instead of an array of np.nan and the values at the indexes we need to modify, we would pass an array of the original series with replaced values at the indexes we need to modify. This would be okay because we still have a isna mask for the original series at the end.

Would love to hear if anyone has feedback or suggestions.

@ghost
Copy link
Author

ghost commented Aug 4, 2021

As I'm working through this, I'm wondering if this ticket is still a bug. I'm seeing more and more places where we infer None as np.nan even when inference is not explicit or even when it's explicitly off (e.g. dtype=False).

@ghost
Copy link
Author

ghost commented Aug 4, 2021

One problem is that we're currently coercing dtypes in pandas/core/internals/blocks.py L445/448 when we call nb._maybe_downcast([nb], downcast) and L510/513 blk.convert(datetime=True, numeric=False).

When we're converting this can replace None with np.nan if we're using a Series of integers (numeric) -
df = DataFrame({"A": [1, None, None, 4], "B": [2, None, None, 8]})

However it leaves the None when using a Series with integers and strings (object) -
df = DataFrame({"A": [1, None, None, "four"], "B": [2, None, None, "eight"]})

This means convert has a similar behavior which would need to be fixed if this was to work for the first scenario, where None is being changed to np.nan. However, this behavior can also be seen with read_json, pd.DataFrame and pd.Series so I suspect this is intentional.

@jreback jreback added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Aug 4, 2021
@jreback jreback added this to the 1.4 milestone Aug 4, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mzeitlin11 if comments

@@ -924,7 +924,10 @@ def _try_convert_data(self, name, data, use_dtypes=True, convert_dates=True):
if not self.dtype:
if all(notna(data)):
return data, False
return data.fillna(np.nan), True
data = data.where(notna(data), np.nan)
if all(isna(data)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use isna(data).all()

@mzeitlin11
Copy link
Member

@mikephung122 have you run ASV's for this? In a previous attempt (#40509) I tried to stay away from internals in fixing this because we'd ideally not degrade performance on usual use cases to fix this pretty niche issue.

IMO the behavior of the function you've modified is not necessarily wrong - its job is to fill missing values with value. It's just that in this specific case the value argument has something of a different meaning - np.nan is being used to indicate "don't fill with this index" which is different from what that function expects.

@mzeitlin11
Copy link
Member

This actually causes a behavior change:

Master:

In [10]: ser = pd.Series([None, None, None])

In [11]: ser.fillna(np.nan)
Out[11]:
0   NaN
1   NaN
2   NaN
dtype: float64

This PR:

In [3]: ser = pd.Series([None, None, None])

In [4]: ser.fillna(np.nan)
Out[4]:
0    None
1    None
2    None
dtype: object

@ghost
Copy link
Author

ghost commented Aug 5, 2021

@mzeitlin11 I found some instructions for the ASV (airspeed velocity) performance regressions in doc/source/development/contributing_codebase.rst. The tests have started but looks like this might take awhile.

I was also reluctant to touch internals but found this possible solution while trying to better understand how it worked. It seemed simple and didn't break any functionality except for using fillna(np.nan) to replace na values (e.g. None) with np.nan. Is this an expected use of fillna? Do you think this behavior change is acceptable?

@mzeitlin11
Copy link
Member

The tests have started but looks like this might take awhile.

Yeah, the whole suite will be slow, but just running ones specific to fillna should be an ok starting point that would be faster.

I was also reluctant to touch internals but found this possible solution while trying to better understand how it worked. It seemed simple and didn't break any functionality except for using fillna(np.nan) to replace na values (e.g. None) with np.nan. Is this an expected use of fillna? Do you think this behavior change is acceptable?

I think (but maybe others disagree) that the existing behavior is better than with this change - eg it makes more sense to support replacing one missing value with another, even if it comes at the cost of the specific bug in the issue (which I'd hope is a less common use case). Especially since this change would add additional logic (and maybe slowdown) to a potentially hot path.

Thanks for looking at this btw, it's a tricky issue! Sorry to bring up concerns but not solutions :)

@ghost
Copy link
Author

ghost commented Aug 5, 2021

@mzeitlin11 I finished the entire ASV test suite and got "BENCHMARKS NOT SIGNIFICANTLY CHANGED." However, I do agree with you. I have seen people use fillna to replace one null with another pretty often and I'm not sure if its worth losing this functionality to fix a much rarer use case. Do you think this is worth adding a test or clarifying in the documentation? The only reason I submitted this as a potential solution was because no test failed and there are other ways to replace nulls.

No problem, I plan to look into / investigate this further and see if there's other possible solutions this week. If anyone else has any ideas or opinions just let me know.

@mzeitlin11
Copy link
Member

Do you think this is worth adding a test or clarifying in the documentation?

Definitely worth a test to avoid an undocumented regression.

If anyone else has any ideas or opinions just let me know.

My guess would be that the fix will require changing logic earlier on, basically handling indices that should be ignored differently than just using NaNs. But might be tricky to get right without a bunch of special casing or performance slowdown

@ghost
Copy link
Author

ghost commented Aug 17, 2021

Going to continue working on this on #42981

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Series.fillna is altering None values of unspecified columns
2 participants