-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG GH#40498 Add mask to fillna to only modify indexes specified in v… #42859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Going to investigate the failing tests today. Looks like tests outside of the fillna methods are failing since I'm touching more the blocks code and isna(value) is not always valid. Need to see what data types 'value' can be and if there's a way to make this mask work for all of them. |
Looks like the failing tests were related to read_json(json, dtype=False) and missing data / missing values https://github.com/pandas-dev/pandas/issues/28501. The function has been using fillna to replace None with np.nan but this can no longer be done with my changes. I see a few possible solutions:
Would love to hear if anyone has feedback or suggestions. |
As I'm working through this, I'm wondering if this ticket is still a bug. I'm seeing more and more places where we infer None as np.nan even when inference is not explicit or even when it's explicitly off (e.g. dtype=False). |
One problem is that we're currently coercing dtypes in pandas/core/internals/blocks.py L445/448 when we call nb._maybe_downcast([nb], downcast) and L510/513 blk.convert(datetime=True, numeric=False). When we're converting this can replace None with np.nan if we're using a Series of integers (numeric) - However it leaves the None when using a Series with integers and strings (object) - This means convert has a similar behavior which would need to be fixed if this was to work for the first scenario, where None is being changed to np.nan. However, this behavior can also be seen with read_json, pd.DataFrame and pd.Series so I suspect this is intentional. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @mzeitlin11 if comments
pandas/io/json/_json.py
Outdated
@@ -924,7 +924,10 @@ def _try_convert_data(self, name, data, use_dtypes=True, convert_dates=True): | |||
if not self.dtype: | |||
if all(notna(data)): | |||
return data, False | |||
return data.fillna(np.nan), True | |||
data = data.where(notna(data), np.nan) | |||
if all(isna(data)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use isna(data).all()
@mikephung122 have you run ASV's for this? In a previous attempt (#40509) I tried to stay away from internals in fixing this because we'd ideally not degrade performance on usual use cases to fix this pretty niche issue. IMO the behavior of the function you've modified is not necessarily wrong - its job is to fill missing values with |
This actually causes a behavior change: Master:
This PR:
|
@mzeitlin11 I found some instructions for the ASV (airspeed velocity) performance regressions in doc/source/development/contributing_codebase.rst. The tests have started but looks like this might take awhile. I was also reluctant to touch internals but found this possible solution while trying to better understand how it worked. It seemed simple and didn't break any functionality except for using fillna(np.nan) to replace na values (e.g. None) with np.nan. Is this an expected use of fillna? Do you think this behavior change is acceptable? |
Yeah, the whole suite will be slow, but just running ones specific to fillna should be an ok starting point that would be faster.
I think (but maybe others disagree) that the existing behavior is better than with this change - eg it makes more sense to support replacing one missing value with another, even if it comes at the cost of the specific bug in the issue (which I'd hope is a less common use case). Especially since this change would add additional logic (and maybe slowdown) to a potentially hot path. Thanks for looking at this btw, it's a tricky issue! Sorry to bring up concerns but not solutions :) |
@mzeitlin11 I finished the entire ASV test suite and got "BENCHMARKS NOT SIGNIFICANTLY CHANGED." However, I do agree with you. I have seen people use fillna to replace one null with another pretty often and I'm not sure if its worth losing this functionality to fix a much rarer use case. Do you think this is worth adding a test or clarifying in the documentation? The only reason I submitted this as a potential solution was because no test failed and there are other ways to replace nulls. No problem, I plan to look into / investigate this further and see if there's other possible solutions this week. If anyone else has any ideas or opinions just let me know. |
Definitely worth a test to avoid an undocumented regression.
My guess would be that the fix will require changing logic earlier on, basically handling indices that should be ignored differently than just using NaNs. But might be tricky to get right without a bunch of special casing or performance slowdown |
Going to continue working on this on #42981 |
…alues.