-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Group-by numeric type-coercion with datetime #15680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
GH Bug pandas-dev#14423 During a group-by/apply on a DataFrame, in the presence of one or more DateTime-like columns, Pandas would incorrectly coerce the type of all other columns to numeric. E.g. a String column would be coerced to numeric, producing NaNs. Fix the issue, and add a test.
Codecov Report
@@ Coverage Diff @@
## master #15680 +/- ##
==========================================
- Coverage 91.03% 91% -0.03%
==========================================
Files 143 143
Lines 49392 49410 +18
==========================================
+ Hits 44962 44967 +5
- Misses 4430 4443 +13
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm couple of comments
pls add a whatsnew note in bug fixes
ping when green or need review
pandas/core/groupby.py
Outdated
@@ -3566,7 +3568,8 @@ def first_non_None_value(values): | |||
# as we are stacking can easily have object dtypes here | |||
so = self._selected_obj | |||
if (so.ndim == 2 and so.dtypes.apply(is_datetimelike).any()): | |||
result = result._convert(numeric=True) | |||
result = result.apply( | |||
lambda x: pd.to_numeric(x, errors='ignore')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't import pd
import to_numeric from pandas.util.misc (inside the function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or u can import pd inside the function is ok too
pandas/tests/groupby/test_groupby.py
Outdated
@@ -4314,6 +4314,16 @@ def test_cummin_cummax(self): | |||
expected = pd.Series([1, 2, 1], name='b') | |||
tm.assert_series_equal(result, expected) | |||
|
|||
def test_numeric_coercion(self): | |||
# GH 14423 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are a couple of issues marked duplicate can u add there tests here as well (inside this test is ok)
I tried adding the test for #14873 - my fix does not resolve this issue. It's similar, but unrelated. It seems to be this line in _python_agg_general:
that favours converting the result of the aggregate() back to the same type as the original series. |
@gwpdt ok, we can treat that issue independently. |
I also tried adding the test for #14849 - my fix does not resolve this issue either. The issue with that test is that it is testing for preservation of dtype="object" for a column whose values are all numeric (and therefore the "soft coercion" to numeric type succeeds). In this instance (in the absence of any non-numeric values in the series), even with my fix, it will still convert the dtype of the column from "object" to numeric. I tried removing the conversion altogether. That fixes #14849, but breaks another test (test_time_field_bug / #11324), so I will not commit that change. I still think it is worth contributing my fix, since it improves upon the current situation. A series with string values will not be converted to NaNs when doing groupby/apply on a DF with one or more datetime columns. I will address other review feedback (re: "pd" import) and update the PR. |
AFAICT this only resolved #15421 (w/o breaking anything else). so thats good. |
#14423 I think should be fixable as well. want to give that a shot? (the example is not very good though). |
pandas/tests/groupby/test_groupby.py
Outdated
@@ -4314,6 +4314,16 @@ def test_cummin_cummax(self): | |||
expected = pd.Series([1, 2, 1], name='b') | |||
tm.assert_series_equal(result, expected) | |||
|
|||
def test_numeric_coercion(self): | |||
# GH 14423 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a 1-liner explaining this (maybe also make the test name a bit more descriptive)
Rename test_numeric_coercion to test_apply_numeric_coercion_when_datetime, and add tests for GH pandas-dev#15421 and pandas-dev#14423
lgtm. merging shortly. |
thanks! |
closes pandas-dev#14423 closes pandas-dev#15421 closes pandas-dev#15670 During a group-by/apply on a DataFrame, in the presence of one or more DateTime-like columns, Pandas would incorrectly coerce the type of all other columns to numeric. E.g. a String column would be coerced to numeric, producing NaNs. Author: Greg Williams <[email protected]> Closes pandas-dev#15680 from gwpdt/bugfix14423 and squashes the following commits: e1ed104 [Greg Williams] TST: Rename and expand test_numeric_coercion 0a15674 [Greg Williams] CLN: move import, add whatsnew entry c8844e0 [Greg Williams] CLN: PEP8 (whitespace fixes) 46d12c2 [Greg Williams] BUG: Group-by numeric type-coericion with datetime
closes pandas-dev#14423 closes pandas-dev#15421 closes pandas-dev#15670 During a group-by/apply on a DataFrame, in the presence of one or more DateTime-like columns, Pandas would incorrectly coerce the type of all other columns to numeric. E.g. a String column would be coerced to numeric, producing NaNs. Author: Greg Williams <[email protected]> Closes pandas-dev#15680 from gwpdt/bugfix14423 and squashes the following commits: e1ed104 [Greg Williams] TST: Rename and expand test_numeric_coercion 0a15674 [Greg Williams] CLN: move import, add whatsnew entry c8844e0 [Greg Williams] CLN: PEP8 (whitespace fixes) 46d12c2 [Greg Williams] BUG: Group-by numeric type-coericion with datetime
closes #14423
closes #15421
closes #15670
During a group-by/apply on a DataFrame, in the presence of one or more
DateTime-like columns, Pandas would incorrectly coerce the type of all
other columns to numeric. E.g. a String column would be coerced to
numeric, producing NaNs.