BUG: Series.asof fails for all NaN Series (GH15713) #15758

Closed
wants to merge 15 commits into from
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -930,3 +930,5 @@ Bug Fixes
 - Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
 - Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
 - Bug in ``pd.read_msgpack`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`)
+
Contributor

FYI, in the future, if you put the whatsnew notes in one of the blank spaces in Bug Fixes (these are there on purpose), you won't get merge conflicts.

+- Bug in ``Series.asof`` which raised an error if the series contained all ``nans`` (:issue:`15713`)
6 changes: 6 additions & 0 deletions pandas/core/generic.py
@@ -3972,6 +3972,12 @@ def asof(self, where, subset=None):
         where = Index(where) if is_list else Index([where])

         nulls = self.isnull() if is_series else self[subset].isnull().any(1)
+        if nulls.values.all():
Member

I don't think the .values is still needed.

Author

done

Member

The values is still here?

Author

Hi @jorisvandenbossche ... I removed it, then put it back because I thought it caused a backward-compatibility error. Currently the build breaks for Python 2.7.9. Now I see in the Travis CI log that it has nothing to do with that: it's "ci/lint.sh" exiting with 1.
I will remove it again and see where the code is misformatted. Thanks

Author

Removed the .values
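
For context on the `.values` thread above: `isnull()` on a Series returns a boolean Series, and `Series.all()` already reduces it to a single truth value, so the detour through the NumPy array should not change the outcome. A minimal sketch with made-up data (the variable names are illustrative and not taken from the PR):

```python
import numpy as np
import pandas as pd

# Illustrative all-NaN series; the index values are arbitrary.
s = pd.Series([np.nan, np.nan, np.nan], index=[1, 2, 3])
nulls = s.isnull()

# Reduce through the underlying NumPy array ...
all_null_via_values = bool(nulls.values.all())
# ... or directly on the boolean Series.
all_null_direct = bool(nulls.all())

print(all_null_via_values, all_null_direct)
```

Both reductions are expected to agree, which is why dropping `.values` in the check should be safe.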

+            if is_series:
+                return pd.Series(np.nan, index=where)
Contributor

this is not correct; should have name=self.name

Author

I thought about that, @jreback , but when I experimented with a non-null series, I saw that it has no name. I.e.:

result = Series(np.random.randn(4), index=[1, 2, 3, 4]).asof([4, 5])
print(result)

returns

4   -0.558532
5   -0.558532
dtype: float64
......

Contributor

And that is not correct. We always want to propagate the names.

Author

ok, let me write the test case and fix for nan and non-nan inputs

Author

@jreback done here.. working on the request below, on simplifying the code
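
The name-propagation request in this thread can be illustrated in isolation. The sketch below is not the code from the PR: `all_nan_series_result` is a hypothetical helper that only shows what "should have name=self.name" means for the all-NaN early return.

```python
import numpy as np
import pandas as pd


def all_nan_series_result(s, where):
    """Hypothetical helper: build the all-NaN result for a Series,
    carrying over the name of the input as requested in review."""
    return pd.Series(np.nan, index=pd.Index(where), name=s.name)


# Made-up input: an all-NaN series that has a name.
s = pd.Series([np.nan, np.nan], index=[1, 2], name="price")
result = all_nan_series_result(s, [4, 5])
print(result.name)
```

Without `name=s.name` the constructed result would come back unnamed, which is the behaviour the reviewer flags as incorrect.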

+            else:
+                return pd.DataFrame(np.nan, index=where, columns=self.columns)
+
Contributor

see if you can simplify this logic a bit (maybe set the name where is_list is used before)

Author

hey @jreback , I made a small simplification, pls check if that's ok... if it's ok, now I think everything is good to go

         locs = self.index.asof_locs(where, ~(nulls.values))

         # mask the missing
40 changes: 32 additions & 8 deletions pandas/tests/frame/test_asof.py
@@ -4,7 +4,6 @@
 from pandas import (DataFrame, date_range, Timestamp, Series,
                     to_datetime)

-from pandas.util.testing import assert_frame_equal, assert_series_equal
 import pandas.util.testing as tm

 from .common import TestData
@@ -14,9 +13,9 @@ class TestFrameAsof(TestData, tm.TestCase):

     def setUp(self):
         self.N = N = 50
-        rng = date_range('1/1/1990', periods=N, freq='53s')
+        self.rng = date_range('1/1/1990', periods=N, freq='53s')
         self.df = DataFrame({'A': np.arange(N), 'B': np.arange(N)},
-                            index=rng)
+                            index=self.rng)

     def test_basic(self):

@@ -51,19 +50,19 @@ def test_subset(self):
         # with a subset of A should be the same
         result = df.asof(dates, subset='A')
         expected = df.asof(dates)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         # same with A/B
         result = df.asof(dates, subset=['A', 'B'])
         expected = df.asof(dates)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         # B gives self.df.asof
         result = df.asof(dates, subset='B')
         expected = df.resample('25s', closed='right').ffill().reindex(dates)
         expected.iloc[20:] = 9

-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

     def test_missing(self):
         # GH 15118
@@ -75,9 +74,34 @@ def test_missing(self):
         result = df.asof('1989-12-31')

         expected = Series(index=['A', 'B'], name=Timestamp('1989-12-31'))
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

         result = df.asof(to_datetime(['1989-12-31']))
         expected = DataFrame(index=to_datetime(['1989-12-31']),
                              columns=['A', 'B'], dtype='float64')
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)
+
+    def test_all_nans(self):
+        # GH 15713
+        # DataFrame is all nans
+        result = DataFrame([np.nan]).asof([0])
Contributor

Try these with non-default indexes and see what happens (your test will break).

Member

Indeed, and also, when you have a DataFrame with multiple columns, those columns should be preserved in the result

Author

done
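
To make the two review points above concrete (non-default index, columns preserved), here is a small sketch of the expected all-NaN frame for a list-like `where`. `all_nan_frame_result` is a hypothetical helper written for illustration only, and the sizes are smaller than in the test file.

```python
import numpy as np
import pandas as pd


def all_nan_frame_result(df, where):
    """Hypothetical helper mirroring the review request: keep the
    original columns and use `where` as the new index."""
    return pd.DataFrame(np.nan, index=pd.Index(where), columns=df.columns)


# Non-default (datetime) index and several columns, all NaN.
rng = pd.date_range("1/1/1990", periods=5, freq="53s")
dates = pd.date_range("1/1/1990", periods=15, freq="25s")
df = pd.DataFrame(np.nan, index=rng, columns=["A", "B", "C"])

expected = all_nan_frame_result(df, dates)
print(expected.shape)
```

A test built this way breaks as soon as the implementation drops a column or falls back to a default integer index.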

+        expected = DataFrame([np.nan])
+        tm.assert_frame_equal(result, expected)
+
+        # testing non-default indexes, multiple inputs
+        dates = date_range('1/1/1990', periods=self.N * 3, freq='25s')
+        result = DataFrame(np.nan, index=self.rng, columns=['A']).asof(dates)
+        expected = DataFrame(np.nan, index=dates, columns=['A'])
+        tm.assert_frame_equal(result, expected)
+
+        # testing multiple columns
+        dates = date_range('1/1/1990', periods=self.N * 3, freq='25s')
+        result = DataFrame(np.nan, index=self.rng, columns=['A', 'B', 'C']).asof(dates)
+        expected = DataFrame(np.nan, index=dates, columns=['A', 'B', 'C'])
+        tm.assert_frame_equal(result, expected)
+
+        # testing scalar input
+        date = date_range('1/1/1990', periods=self.N * 3, freq='25s')[0]
+        result = DataFrame(np.nan, index=self.rng, columns=['A']).asof(date)
+        expected = DataFrame(np.nan, index=[date], columns=['A'])
+        tm.assert_frame_equal(result, expected)
Member

I think a scalar input should result in a Series. That is at least the current behaviour for the working non-NaN case:

In [37]: df = pd.DataFrame(np.random.randn(2,2), index=[1,2], columns=['A', 'B'])

In [38]: df
Out[38]: 
          A         B
1 -0.643872  1.375342
2 -0.223192  0.231439

In [39]: df.asof([3])
Out[39]: 
          A         B
3 -0.223192  0.231439

In [40]: df.asof(3)
Out[40]: 
A   -0.223192
B    0.231439
Name: 3, dtype: float64

Author

done

Member

What you added is not the correct result, I think. It is the original columns that form the index of the resulting Series, not [where]. The where value becomes the name of the Series (i.e. as if you accessed a row of the DataFrame).

Can you add the example above (but with NaNs instead of the random data) as a test case? Then it is really clear what the expected behaviour is.

Author

Done, added the tests
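
The shape the reviewer describes for a scalar `where` can be sketched with fixed numbers instead of random data. The frame below is made up for illustration; the second half only constructs the all-NaN analogue by hand and does not call the code under review.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the reviewer's random example.
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=[1, 2], columns=["A", "B"])

# Scalar `where`: the last row at or before 3, returned as a Series.
row = df.asof(3)

# The all-NaN analogue the reviewer describes: the columns become the
# index of the Series and `where` becomes its name.
expected_all_nan = pd.Series(np.nan, index=df.columns, name=3)

print(row.name, list(row.index))
```

Comparing `row` with `expected_all_nan` shows the two should differ only in their values, not in index or name.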

21 changes: 21 additions & 0 deletions pandas/tests/series/test_asof.py
@@ -148,3 +148,24 @@ def test_errors(self):
         s = Series(np.random.randn(N), index=rng)
         with self.assertRaises(ValueError):
             s.asof(s.index[0], subset='foo')
+
+    def test_all_nans(self):
+        # GH 15713
+        # series is all nans
Contributor

can you add the issue number as a comment

Author

done

+        result = Series([np.nan]).asof([0])
Member

Can you make this a separate test? (as it is not related to errors). Eg test_all_nans

Author

done

Member

Can you also add a case not using zero as the argument?
And can you also add the case of a scalar, and of multiple values (e.g. s.asof(10) and s.asof([10, 11]))?

Author

done
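
The two extra cases requested above (a non-zero scalar and multiple values) might look like the sketch below. It assumes a pandas release that already contains a fix for GH15713; on a release without it, the list-like call is the one expected to fail. The data is made up.

```python
import numpy as np
import pandas as pd

# All-NaN series with a small integer index.
s = pd.Series([np.nan, np.nan], index=[1, 2])

# Scalar lookup: assumed to give a single NaN.
scalar_result = s.asof(10)

# List-like lookup: assumed to give one NaN per requested label.
list_result = s.asof([10, 11])

print(scalar_result)
print(list_result)
```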

+        expected = Series([np.nan])
+        tm.assert_series_equal(result, expected)
+
+        # testing non-default indexes
+        N = 50
+        rng = date_range('1/1/1990', periods=N, freq='53s')
+
+        dates = date_range('1/1/1990', periods=N * 3, freq='25s')
+        result = Series(np.nan, index=rng).asof(dates)
+        expected = Series(np.nan, index=dates)
+        tm.assert_series_equal(result, expected)
+
+        # testing scalar input
+        date = date_range('1/1/1990', periods=N * 3, freq='25s')[0]
+        result = Series(np.nan, index=rng).asof(date)
+        assert isnull(result)