Skip to content

BUG: Group-by numeric type-coercion with datetime #15680

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

gwpdt
Copy link
Contributor

@gwpdt gwpdt commented Mar 14, 2017

closes #14423
closes #15421
closes #15670

During a group-by/apply on a DataFrame, in the presence of one or more
DateTime-like columns, Pandas would incorrectly coerce the type of all
other columns to numeric. E.g. a String column would be coerced to
numeric, producing NaNs.

gwpdt added 2 commits March 14, 2017 06:27
GH Bug pandas-dev#14423

During a group-by/apply on a DataFrame, in the presence of one or more
DateTime-like columns, Pandas would incorrectly coerce the type of all
other columns to numeric.  E.g. a String column would be coerced to
numeric, producing NaNs.

Fix the issue, and add a test.
@codecov
Copy link

codecov bot commented Mar 14, 2017

Codecov Report

Merging #15680 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15680      +/-   ##
==========================================
- Coverage   91.03%      91%   -0.03%     
==========================================
  Files         143      143              
  Lines       49392    49410      +18     
==========================================
+ Hits        44962    44967       +5     
- Misses       4430     4443      +13
Impacted Files Coverage Δ
pandas/core/groupby.py 94.98% <100%> (ø)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/tools/merge.py 93.68% <0%> (-0.27%)
pandas/core/frame.py 97.87% <0%> (-0.1%)
pandas/core/internals.py 93.59% <0%> (-0.06%)
pandas/core/reshape.py 99.27% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05d70f4...e1ed104. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm couple of comments
pls add a whatsnew note in bug fixes
ping when green or need review

@@ -3566,7 +3568,8 @@ def first_non_None_value(values):
# as we are stacking can easily have object dtypes here
so = self._selected_obj
if (so.ndim == 2 and so.dtypes.apply(is_datetimelike).any()):
result = result._convert(numeric=True)
result = result.apply(
lambda x: pd.to_numeric(x, errors='ignore'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't import pd
import to_numeric from pandas.util.misc (inside the function)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or u can import pd inside the function is ok too

@@ -4314,6 +4314,16 @@ def test_cummin_cummax(self):
expected = pd.Series([1, 2, 1], name='b')
tm.assert_series_equal(result, expected)

def test_numeric_coercion(self):
# GH 14423
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are a couple of issues marked duplicate can u add there tests here as well (inside this test is ok)

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby labels Mar 14, 2017
@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

pls add tests from these two (you can list all 3 issues in the whatsnew)

xref #14873 (boolean casts)
xref #14849 (datetime)

if this change doesn't fix these, pls lmk.

@gwpdt
Copy link
Contributor Author

gwpdt commented Mar 14, 2017

I tried adding the test for #14873 - my fix does not resolve this issue. It's similar, but unrelated. It seems to be this line in _python_agg_general:

output[name] = self._try_cast(result, obj, numeric_only=True)

that favours converting the result of the aggregate() back to the same type as the original series.

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

@gwpdt ok, we can treat that issue independently.

@gwpdt
Copy link
Contributor Author

gwpdt commented Mar 14, 2017

I also tried adding the test for #14849 - my fix does not resolve this issue either.

The issue with that test is that it is testing for preservation of dtype="object" for a column whose values are all numeric (and therefore the "soft coercion" to numeric type succeeds).

In this instance (in the absence of any non-numeric values in the series), even with my fix, it will still convert the dtype of the column from "object" to numeric.

I tried removing the conversion altogether. That fixes #14849, but breaks another test (test_time_field_bug / #11324), so I will not commit that change.

I still think it is worth contributing my fix, since it improves upon the current situation. A series with string values will not be converted to NaNs when doing groupby/apply on a DF with one or more datetime columns.

I will address other review feedback (re: "pd" import) and update the PR.

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

AFAICT this only resolved #15421 (w/o breaking anything else). so thats good.

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

#14423 I think should be fixable as well. want to give that a shot? (the example is not very good though).

@@ -4314,6 +4314,16 @@ def test_cummin_cummax(self):
expected = pd.Series([1, 2, 1], name='b')
tm.assert_series_equal(result, expected)

def test_numeric_coercion(self):
# GH 14423
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a 1-liner explaining this (maybe also make the test name a bit more descriptive)

Rename test_numeric_coercion to
test_apply_numeric_coercion_when_datetime, and add tests for GH pandas-dev#15421
and pandas-dev#14423
@gwpdt
Copy link
Contributor Author

gwpdt commented Mar 16, 2017

Agreed - #15670 #15421 and #14423 are all fixed.

@jreback jreback changed the title BUG: Group-by numeric type-coericion with datetime BUG: Group-by numeric type-coercion with datetime Mar 16, 2017
@jreback jreback added this to the 0.20.0 milestone Mar 16, 2017
@jreback
Copy link
Contributor

jreback commented Mar 16, 2017

lgtm. merging shortly.

@jreback
Copy link
Contributor

jreback commented Mar 16, 2017

thanks!

@jreback jreback closed this in 37e5f78 Mar 16, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#14423
closes pandas-dev#15421
closes pandas-dev#15670

During a group-by/apply
on a DataFrame, in the presence of one or more  DateTime-like columns,
Pandas would incorrectly coerce the type of all  other columns to
numeric.  E.g. a String column would be coerced to  numeric, producing
NaNs.

Author: Greg Williams <[email protected]>

Closes pandas-dev#15680 from gwpdt/bugfix14423 and squashes the following commits:

e1ed104 [Greg Williams] TST: Rename and expand test_numeric_coercion
0a15674 [Greg Williams] CLN: move import, add whatsnew entry
c8844e0 [Greg Williams] CLN: PEP8 (whitespace fixes)
46d12c2 [Greg Williams] BUG: Group-by numeric type-coericion with datetime
mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
closes pandas-dev#14423
closes pandas-dev#15421
closes pandas-dev#15670

During a group-by/apply
on a DataFrame, in the presence of one or more  DateTime-like columns,
Pandas would incorrectly coerce the type of all  other columns to
numeric.  E.g. a String column would be coerced to  numeric, producing
NaNs.

Author: Greg Williams <[email protected]>

Closes pandas-dev#15680 from gwpdt/bugfix14423 and squashes the following commits:

e1ed104 [Greg Williams] TST: Rename and expand test_numeric_coercion
0a15674 [Greg Williams] CLN: move import, add whatsnew entry
c8844e0 [Greg Williams] CLN: PEP8 (whitespace fixes)
46d12c2 [Greg Williams] BUG: Group-by numeric type-coericion with datetime
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby
Projects
None yet
2 participants