Skip to content

BUG: Series.asof fails for all NaN Series (GH15713) #15758

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from

Conversation

ucals
Copy link

@ucals ucals commented Mar 21, 2017

Added the test if the series is all nans
Added the code that check if that's the case: if yes, return the expected output
As this is my first contribution, please comment if I did it right :)
Thanks!

@codecov
Copy link

codecov bot commented Mar 21, 2017

Codecov Report

Merging #15758 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15758      +/-   ##
==========================================
- Coverage   91.01%   91.01%   -0.01%     
==========================================
  Files         143      143              
  Lines       49395    49379      -16     
==========================================
- Hits        44959    44940      -19     
- Misses       4436     4439       +3
Impacted Files Coverage Δ
pandas/core/generic.py 96.25% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/tseries/common.py 88.09% <0%> (-1.07%) ⬇️
pandas/core/frame.py 97.86% <0%> (-0.1%) ⬇️
pandas/core/algorithms.py 94.41% <0%> (-0.1%) ⬇️
pandas/io/stata.py 93.47% <0%> (-0.05%) ⬇️
pandas/core/strings.py 98.48% <0%> (-0.02%) ⬇️
pandas/indexes/numeric.py 97.1% <0%> (-0.02%) ⬇️
pandas/indexes/multi.py 96.59% <0%> (ø) ⬆️
pandas/core/categorical.py 96.89% <0%> (ø) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 92239f5...0765108. Read the comment docs.

@jorisvandenbossche jorisvandenbossche changed the title Bug fix 15713 BUG: Series.asof fails for all NaN Series (GH15713) Mar 21, 2017
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add tests for DataFrame? (both using a subset and not)?

@@ -4,6 +4,8 @@
from pandas import (offsets, Series, notnull,
isnull, date_range, Timestamp)

from pandas.util.testing import assert_series_equal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to import this, you can use it as tm.assert_series_equal(..) as is done in the other tests

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -148,3 +150,8 @@ def test_errors(self):
s = Series(np.random.randn(N), index=rng)
with self.assertRaises(ValueError):
s.asof(s.index[0], subset='foo')

# series is all nans
result = Series([np.nan]).asof([0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this a separate test? (as it is not related to errors). Eg test_all_nans

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -3971,6 +3971,9 @@ def asof(self, where, subset=None):
if not isinstance(where, Index):
where = Index(where) if is_list else Index([where])

if self.isnull().values.all():
return pd.Series([np.nan])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work for a DataFrame (I mean: it will not be the correct return value, see the docstring)

Further, you can put this after the next line where nulls is defined and reuse that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done! thanks for the comments, feels great to learn how to contribute :)

@jreback jreback added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Mar 21, 2017
@jreback
Copy link
Contributor

jreback commented Mar 22, 2017

somewhat separate. but do we have a guarantee on a not-found indexer?

@chris-b1

In [28]: Series([np.nan, 1, 2], index=[2, 3, 4]).asof([0])
Out[28]: 
0   NaN
dtype: float64

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatsnew note (for 0.20 / bug fix section)


def test_all_nans(self):
# series is all nans
result = DataFrame([np.nan]).asof([0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try these with non-defualt indexes and see what happens (your test will break)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, and also, when you have a DataFrame with multiple columns, those columns should be preserved in the result

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -3972,6 +3972,12 @@ def asof(self, where, subset=None):
where = Index(where) if is_list else Index([where])

nulls = self.isnull() if is_series else self[subset].isnull().any(1)
if nulls.values.all():
if is_series:
return pd.Series([np.nan])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to set the indexes here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -3972,6 +3972,12 @@ def asof(self, where, subset=None):
where = Index(where) if is_list else Index([where])

nulls = self.isnull() if is_series else self[subset].isnull().any(1)
if nulls.values.all():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the .values is still needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values is still here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jorisvandenbossche ... I removed then put it back because I thought it generated a backward compatibility error. Currently the build breaks for Python 2.7.9. Now I saw it has nothing to do with it in Travis CI log: it's a "ci/lint.sh" exiting 1.
I will remove it again and see where the code is unformatted. Thanks

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the .values


def test_all_nans(self):
# series is all nans
result = Series([np.nan]).asof([0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a case not using zero as the argument?
And can you also add the case of a scalar, and of multiple values? (eg s.asof(10) and s.asof([10, 11])

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


def test_all_nans(self):
# series is all nans
result = DataFrame([np.nan]).asof([0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, and also, when you have a DataFrame with multiple columns, those columns should be preserved in the result

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some minor comments. ping when all green.

@@ -930,3 +930,5 @@ Bug Fixes
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
- Bug in ``pd.read_msgpack`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI in the future, if you put the whatnew notes in a blank space in Bug Fixes (these are on purpose), you wont' get merge conflicts

# testing scalar input
date = date_range('1/1/1990', periods=N * 3, freq='25s')[0]
result = Series(np.nan, index=rng).asof(date)
self.assertTrue(result != result)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert isnull(result)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -148,3 +148,23 @@ def test_errors(self):
s = Series(np.random.randn(N), index=rng)
with self.assertRaises(ValueError):
s.asof(s.index[0], subset='foo')

def test_all_nans(self):
# series is all nans
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

tm.assert_frame_equal(result, expected)

def test_all_nans(self):
# series is all nans
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

tm.assert_frame_equal(result, expected)

def test_all_nans(self):
# series is all nans
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment needs updating

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done... thanks @jreback !!

@jreback jreback added this to the 0.20.0 milestone Mar 23, 2017
@jreback
Copy link
Contributor

jreback commented Mar 23, 2017

lgtm.

@chris-b1 if you can give a quick look (don't merge yet, going to merge a couple of things at once later)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small change needed for the DataFrame case I think

date = date_range('1/1/1990', periods=self.N * 3, freq='25s')[0]
result = DataFrame(np.nan, index=self.rng, columns=['A']).asof(date)
expected = DataFrame(np.nan, index=[date], columns=['A'])
tm.assert_frame_equal(result, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a scalar input should result in a Series. That is at least the current behaviour for the working non-NaN case:

In [37]: df = pd.DataFrame(np.random.randn(2,2), index=[1,2], columns=['A', 'B'])

In [38]: df
Out[38]: 
          A         B
1 -0.643872  1.375342
2 -0.223192  0.231439

In [39]: df.asof([3])
Out[39]: 
          A         B
3 -0.223192  0.231439

In [40]: df.asof(3)
Out[40]: 
A   -0.223192
B    0.231439
Name: 3, dtype: float64

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you added is not the correct result I think. It are the original columns that are the index of the resulting series, not [where]. The where does become the name of the Series (i.e. as if you access a row from the dataframe)

Can you add the example above (but then with NaNs instead of the random data) as a test case? The it is really clear what the expected behaviour is.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added the tests

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the last change I think to get it merged!

return pd.DataFrame(np.nan, index=where,
columns=self.columns)
else:
return pd.Series(np.nan, index=[where])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this should just be index=self.columns, name=where I think.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.. The test passes without name - no need to set it - index=self.columns is enough. Thanks @jorisvandenbossche

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is because you wrote your test without it, then of course it will pass it without it.
The name is essential for a correct result, and should be added both to the code here as to the test.

As I showed before with this simple example of the current behaviour

In [1]: df = pd.DataFrame(np.random.randn(2,2), index=[1,2], columns=['A', 'B'])

In [2]: df
Out[2]: 
          A         B
1  0.387517 -0.571258
2 -0.376436  0.604668

In [3]: df.asof(3)
Out[3]: 
A   -0.376436
B    0.604668
Name: 3, dtype: float64

you can clearly see that the name is the actual value you passed to asof

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, sorry... just fixed it, it should work now... thanks again @jorisvandenbossche

@@ -3972,6 +3972,16 @@ def asof(self, where, subset=None):
where = Index(where) if is_list else Index([where])

nulls = self.isnull() if is_series else self[subset].isnull().any(1)
if nulls.all():
if is_series:
return pd.Series(np.nan, index=where)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not correct; should have name=self.name

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that, @jreback , but when I experimented with a non-null series, I saw that it has no name. I.e.:

result = Series(np.random.randn(4), index=[1, 2, 3, 4]).asof([4, 5])
print result

returns

4   -0.558532
5   -0.558532
dtype: float64
......

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and that not correct. we always want to propogate the names.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let me write the test case and fix for nan and non-nan inputs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback done here.. working on the request below, on simplifying the code

columns=self.columns)
else:
return pd.Series(np.nan, index=self.columns, name=where[0])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see if you can simplify this logic a bit (maybe set the name where is_list is used before)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @jreback , I made a small simplification, pls check if that's ok... if it's ok, now I think everything is good to go


# test name is propagated
result = Series(np.nan, index=[1, 2, 3, 4], name='test').asof([4, 5])
self.assertEqual(result.name, 'test')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs a tm.assert_series_equal with the expected result

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I'll do on merge

@jreback
Copy link
Contributor

jreback commented Mar 25, 2017

thanks @ucals ping on green.

@ucals
Copy link
Author

ucals commented Mar 25, 2017

All green. @jreback , @jorisvandenbossche , thanks a lot! You guys are great. This was my first ever contribution to an open source project, and you guys gave me the right direction - feels very good to contribute and learn! I'll try to tackle an intermediate bug now

@jreback jreback closed this in d2f32a0 Mar 26, 2017
@jreback
Copy link
Contributor

jreback commented Mar 26, 2017

thanks!

@ucals ucals deleted the bug-fix-15713 branch March 26, 2017 04:01
mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
closes bug pandas-dev#15713

Added the test if the series is all nans  Added the code that check if
that's the case: if yes, return the expected output

Author: Carlos Souza <[email protected]>

Closes pandas-dev#15758 from ucals/bug-fix-15713 and squashes the following commits:

0765108 [Carlos Souza] First simplification, code-block in the same place
bb63964 [Carlos Souza] Propagating Series name
af9a29b [Carlos Souza] Setting name of asof result when scalar input and all nan
b8f078a [Carlos Souza] Small code standard change
7448b96 [Carlos Souza] Fixing scalar input
a080b9b [Carlos Souza] Making scalar input return in a Series
04b7306 [Carlos Souza] Removing .values and formating code PEP8
3f9c7fd [Carlos Souza] Minor comments
70c958f [Carlos Souza] Added tests for non-default indexes, scalar and multiple inputs, and results preserve columns
6b745af [Carlos Souza] Adding DataFrame tests & support, and optimizing the code
89fb6cf [Carlos Souza] BUG pandas-dev#15713 fixing failing tests
17d1d77 [Carlos Souza] BUG pandas-dev#15713 Series.asof return nan when series is all nans!
4e26ab8 [Carlos Souza] BUG pandas-dev#15713 Series.asof return nan when series is all nans.
c78d687 [Carlos Souza] BUG pandas-dev#15713 Series.asof return nan when series is all nans
676a4e5 [Carlos Souza] Test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants