Skip to content

BUG: Series.median() treats NaT as INT_MIN #8617

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ischwabacher opened this issue Oct 23, 2014 · 13 comments · Fixed by #14117
Closed

BUG: Series.median() treats NaT as INT_MIN #8617

ischwabacher opened this issue Oct 23, 2014 · 13 comments · Fixed by #14117
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Testing pandas testing functions or related to the test suite Timedelta Timedelta data type
Milestone

Comments

@ischwabacher
Copy link
Contributor

Fixing this will be easier once numpy/numpy#5222 is fixed.

In [1]: import pandas as pd

In [2]: pd.Series([0, pd.tslib.iNaT], dtype='m8[ns]')
Out[2]: 
0   0 days
1      NaT
dtype: timedelta64[ns]

In [3]: _.median()
Out[3]: 
0   -53375 days, 23:53:38.427388
dtype: timedelta64[ns]
@jreback
Copy link
Contributor

jreback commented Oct 23, 2014

nothing to do with numpy, this is an error here: https://github.com/pydata/pandas/blob/master/pandas/core/nanops.py#L281(just take out the mask and it should work), the mask is getting overwritten

@jreback jreback added Bug Timedelta Timedelta data type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Oct 23, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 23, 2014
@ischwabacher
Copy link
Contributor Author

Yeah, it's not the fix that's the problem (though you do have to carefully propagate the reshaping of the input to the mask; you can't just close it into get_median and expect things to work). The reason I care about the upstream problems is that when I added a bunch of all- and partial NaT arrays to test_nanops.py and extended check_funs to use them appropriately, I got a pile of tests that failed because the numpy ufuncs we're testing against mishandle NaT.

@jreback
Copy link
Contributor

jreback commented Oct 23, 2014

@ischwabacher I fixed all of this for 0.15.0 (aside from this one), which I guess was just not completely tested. You can fix upstream if you want but tests don't rely on numpy very much for these tests.

@ischwabacher
Copy link
Contributor Author

It looks like max and min with skipna=False are still treating NaT as INT_MIN as well, and argmax and argmin are returning nan. Numpy behavior with floats (which appears to be intended) is to return the index of the first nan, so that arr[argmax(arr)].equals(max(arr)). Or would if np.ndarray had an equals method.

In [81]: pd.Series([0, pd.tslib.iNaT], dtype='M8[ns]').max(skipna=False)
Out[81]: Timestamp('1970-01-01 00:00:00')

In [82]: pd.Series([0, pd.tslib.iNaT], dtype='m8[ns]').max(skipna=False)
Out[82]: Timedelta('0 days 00:00:00')

In [83]: pd.Series([0, pd.tslib.iNaT], dtype='M8[ns]').min(skipna=False)
Out[83]: NaT

In [84]: pd.Series([0, pd.tslib.iNaT], dtype='m8[ns]').min(skipna=False)
Out[84]: NaT

@jreback
Copy link
Contributor

jreback commented Oct 24, 2014

for these dtypes skipna=False is not well defined

by definition you are not using a mask so the values are the values
the above look right 2 me (the min/max) for skipna=False

@ischwabacher
Copy link
Contributor Author

Really? Why aren't we treating NaT the same as NaN in that case?

Also, if NDFrame behaves differently from ndarray it makes things harder to test. But I'll look at the tests for floats to see how they're handled, since currently pandas treats NaN differently from the way numpy treats it.

@jreback
Copy link
Contributor

jreback commented Oct 24, 2014

so the min/max for skipna=False are wrong above (81,82).

pandas is NOT numpy. so treatment can be different, though not w/o a good reason. (like numpy doesn't handle anything NaT related correctly). usually pretty good with NaN.

Don't test directly with numpy. Write a function that returns the correct result.

@ischwabacher
Copy link
Contributor Author

Don't test directly with numpy. Write a function that returns the correct result.

And then merge it into numpy. :)

I really think that getting the NaT handling working correctly in numpy is the first step to implementing NA_integer_, since that's effectively what NaT is.

@jreback
Copy link
Contributor

jreback commented Oct 24, 2014

well you can touch numpy if you would like :).

We are going to bypass numpy to support integer NA, using libdynd, essentially a much better numpy (with the same exact API).

@ischwabacher
Copy link
Contributor Author

Welp, I guess that's a thing.

@facaiy
Copy link
Contributor

facaiy commented Aug 13, 2016

@jreback It seems that the bug had been fixed.

In [9]: s = pd.Series([0, pd.tslib.iNaT], dtype='m8[ns]')

In [10]: s.median()
Out[10]: Timedelta('0 days 00:00:00')

In [11]: s.min()
Out[11]: Timedelta('0 days 00:00:00')

In [12]: s.max()
Out[12]: Timedelta('0 days 00:00:00')

In [13]: s = pd.Series([0, pd.NaT], dtype='m8[ns]')

In [14]: s.median()
Out[14]: Timedelta('0 days 00:00:00')

In [15]: s.min()
Out[15]: Timedelta('0 days 00:00:00')

In [16]: s.max()
Out[16]: Timedelta('0 days 00:00:00')

@facaiy
Copy link
Contributor

facaiy commented Aug 13, 2016

More tests may be needed.

@jreback
Copy link
Contributor

jreback commented Aug 15, 2016

xref #12992 can you do a pr for some tests?

@jorisvandenbossche jorisvandenbossche added the Testing pandas testing functions or related to the test suite label Aug 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Testing pandas testing functions or related to the test suite Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants