Skip to content

BUG/TST: Fixes isnull behavior on NaT in array. Closes #5443 #5524

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 6, 2014

Conversation

commonlisp
Copy link
Contributor

I added a test case test_isnull_nat() to test_common.py and a check for NaT in lib.isnullobj. pd.isnull(np.array([pd.NaT])) now yields the correct results ([True]).

closes #5443

@jreback
Copy link
Contributor

jreback commented Nov 15, 2013

this needs to be profiled via test_perf.sh

@commonlisp
Copy link
Contributor Author

I ran test_perf.sh on the patch. Here is the vb_log: https://gist.github.com/commonlisp/7488695

@@ -213,7 +213,7 @@ def isnullobj(ndarray[object] arr):
n = len(arr)
result = np.zeros(n, dtype=np.uint8)
for i from 0 <= i < n:
result[i] = util._checknull(arr[i])
result[i] = util._checknull(arr[i]) or arr[i] is NaT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be in util._checknull itself

@commonlisp
Copy link
Contributor Author

Ah, yes, I did have the check in util._checknull at first, but this creates a circular dependency between util and tslib where NaT is defined. Is that ok? Or perhaps NaT should be factored out.

@jreback
Copy link
Contributor

jreback commented Nov 15, 2013

actually...this whole change is trivial....just change _checknull to checknull in isnullobj; this already checks for NaT. and then profile again (it prob will be fine because isnullobj is only called if its object type in the first place)

@commonlisp
Copy link
Contributor Author

Great call. Using checknull now and added the new explicit test case. Here is a fresh test_perf.sh: https://gist.github.com/commonlisp/7492647. Going to run it again just to make sure there aren't any aberrations.

@commonlisp
Copy link
Contributor Author

I added a gist https://gist.github.com/commonlisp/7547232 with another run of test_perf.sh. It appears to agree with the previous run.

@commonlisp
Copy link
Contributor Author

The test_perf.sh runs for modify isnullobj directly and using checknull do differ somewhat (1.22 vs. 1.87). The question is whether this is within test_perf.sh's margin of error.

@jreback
Copy link
Contributor

jreback commented Nov 21, 2013

that's definitly out of bounds, pls see if its something simple..

@commonlisp
Copy link
Contributor Author

I've run a good number of performance tests. It seems that util.is_float_object, util.is_complex_object, util.is_datetime64_object, and util.is_timedelta64_object, though declared inline functions in numpy_helper.h, are contributing considerably to the runtime (~1.26-1.5x individually). We could factor out yet another version of lib.checknull that skips those checks to support the lib.isnullobj case.

@jreback
Copy link
Contributor

jreback commented Dec 15, 2013

ok I guess having a special version of checknull_object might be useful for object arrays where u can skip certain checks (but still need some that _checknull does not do)

@cancan101
Copy link
Contributor

Any reason not to get this merged in for v0.13?

@jreback
Copy link
Contributor

jreback commented Dec 16, 2013

@cancan101 have to nail down the perf issue, this code is used everywhere

@cancan101
Copy link
Contributor

That makes sense. I wasn't sure if a perf fix was worth waiting for.

@commonlisp
Copy link
Contributor Author

New test_perf.sh for this factoring of lib.checknull: https://gist.github.com/anonymous/8017235

@@ -170,20 +170,23 @@ cdef inline int64_t get_timedelta64_value(val):
cdef double INF = <double> np.inf
cdef double NEGINF = -INF

cpdef checknull_NaT(object val):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't need to be cpdef (only cdef)

make it _checknull_NaT as well
make it inline and I don't think you need util. when calling _checknull

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw - sometimes it can actually be faster to manually inline the function
(via copy/paste) vs. using the inline keyword. It's strange but if you
can't pull out more perf, might be helpful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtratner is right...just put it in 2 places is ok (but need to avoid the arr[i] twice...

@commonlisp I know this is tedious..but this type of low-level coding affects so many things....

@commonlisp
Copy link
Contributor Author

I think this is ready for review again. Please let me know if anyone has feedback.

@jreback
Copy link
Contributor

jreback commented Jan 2, 2014

can you squash the commits: https://github.com/pydata/pandas/wiki/Using-Git

and post a perf run...just to check

thanks

@jreback
Copy link
Contributor

jreback commented Jan 3, 2014

checkout this: https://github.com/pydata/pandas/wiki/Using-Git

you don't want to merge in other commits (nor merges)

so git rebase -i origin/master...then delete everything but your commit

then

git push myfork thisbranchname -f

@commonlisp
Copy link
Contributor Author

Commits squashed, not merging any other commits
Recent test_perf.sh https://gist.github.com/commonlisp/8290153

@jreback
Copy link
Contributor

jreback commented Jan 6, 2014

you need to take out all but your commit

@jreback
Copy link
Contributor

jreback commented Jan 6, 2014

do this:

git commit -C HEAD --amend

then force push

this will make it build again

then'll I merge it....

common._isnull_ndarraylike(...) uses lib.isnullobj to check
nulls/NaN/NaT in ndarray, which in turn relies on util._checknull.
_checknull did not know about NaT, but now lib.isnullobj does, while
still maintaining performance by doing arr[i] only once.

Added a test case test_isnull_nat() to test_common.py and
check for NaT in lib.isnullobj. pd.isnull(np.array([pd.NaT]))
now yields the correct results ([True]).
@jreback jreback merged commit f216c74 into pandas-dev:master Jan 6, 2014
@jreback
Copy link
Contributor

jreback commented Jan 6, 2014

@commonlisp
thanks!

first time working thru the process is painful......next time should be easier!

@commonlisp
Copy link
Contributor Author

@jreback Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: isnull of an object array failing when NaT
4 participants