Skip to content

BUG: GH11206 where pd.isnull did not consider numpy NaT null #11212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 20, 2015

Conversation

kawochen
Copy link
Contributor

closes #11206

@kawochen kawochen force-pushed the BUG-FIX-11206 branch 3 times, most recently from d4a2acc to e937ebc Compare October 1, 2015 04:00
@jreback
Copy link
Contributor

jreback commented Oct 1, 2015

lgtm.

can you run a perf check on the null benchmarks. just confirm nothing has substantially changed.

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Oct 1, 2015
@jreback jreback added this to the 0.17.0 milestone Oct 1, 2015
@kawochen kawochen force-pushed the BUG-FIX-11206 branch 2 times, most recently from 597d9d4 to d6b80ce Compare October 2, 2015 04:07
@kawochen
Copy link
Contributor Author

kawochen commented Oct 2, 2015

Updated. Timing for frame_methods.frame_isnull.time_frame_isnull (where all data is non-nan floats) increased by less than 5%.
Added a benchmark where lots of the data is various types of null.

BENCHMARK                                    BEFORE      AFTER     FACTOR
...ame_isnull_has_null.time_frame_isnull    38.18ms   138.88ms   3.63779915x

I also changed the code slightly so that the logic is further short-circuited short-circuited.

@jreback
Copy link
Contributor

jreback commented Oct 2, 2015

the benchmark u are showing is 3x increase?

@kawochen
Copy link
Contributor Author

kawochen commented Oct 2, 2015

Yes.That's the with lots of nulls. I will dig a little to see if that can be avoided.

@jreback
Copy link
Contributor

jreback commented Oct 2, 2015

these routines are hit everywhere, so i am for special casing certain things. IOW, that is why isnullobj exists. But it may make sense to have 2 versions of _checknull_with_nat, because they are called on different array kinds. E.g if you have an int array then it obviously doesn't need to check for None,NaN,NaT at all. most of these cases are already handled, but IIRC, this is hotspotted. So just want to try to nail that down.

@jreback jreback modified the milestones: 0.17.0, 0.17.1 Oct 2, 2015
@kawochen
Copy link
Contributor Author

kawochen commented Oct 3, 2015

Updated.

result[i] = val is NaT or _checknull(val)
return result.view(np.bool_)
result[i] = _check_all_nulls(val)
return result.view(dtype=np.bool_)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to remove this. I will remove in the next update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you fixing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean the _check_all_nulls? Only object arrays hit this path so I thought it'd be safe to have the most generic version of checknull here?

For the dtype thing, I forgot to remove it after some experiment, but will remove it in the next update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep that is fine

@kawochen
Copy link
Contributor Author

kawochen commented Oct 3, 2015

udpated

@@ -1146,7 +1146,7 @@ Bug Fixes
- Bug in ``Categorical.__iter__`` may not returning correct ``datetime`` and ``Period`` (:issue:`10713`)
- Bug in indexing with a ``PeriodIndex`` on an object with a ``PeriodIndex`` (:issue:`4125`)
- Bug in ``read_csv`` with ``engine='c'``: EOF preceded by a comment, blank line, etc. was not handled correctly (:issue:`10728`, :issue:`10548`)

- Bug in ``isnull`` where ``numpy.datetime64('NaT')`` in a ``numpy.array`` was not determined to be null(:issue:`11206`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's put this in 0.17.1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move to 0.17.1

@kawochen
Copy link
Contributor Author

kawochen commented Oct 5, 2015

ok. I'm traveling and will get to this next week.

@jreback
Copy link
Contributor

jreback commented Oct 9, 2015

@kawochen can you update (move whatsnew to 0.17.1)

@kawochen
Copy link
Contributor Author

Will do (still on vacation)

On Oct 9, 2015, at 8:35 AM, Jeff Reback [email protected] wrote:

@kawochen can you update (move whatsnew to 0.17.1)


Reply to this email directly or view it on GitHub.

@kawochen
Copy link
Contributor Author

updated

@jreback
Copy link
Contributor

jreback commented Oct 18, 2015

can u rebase and run a perf check

@kawochen
Copy link
Contributor Author

Saw some horrifying perf hit (30X). I will look at the compiled code later.

@jreback
Copy link
Contributor

jreback commented Oct 21, 2015

ok gr8!

@kawochen
Copy link
Contributor Author

kawochen commented Nov 7, 2015

so using timeit and this code

from pandas import *
import pandas as pd
import numpy as np

np.random.seed(1234)
sample = np.array([NaT, np.nan, None, np.datetime64('NaT'),
    np.timedelta64('NaT'), 0, 1, 2.0, '', 'abcd'])
data = np.random.choice(sample, (1000, 1000))
df = DataFrame(data)

%timeit isnull(df)

the perf hit is 4X. But if I that put into an asv benchmark, it shows 24X. I seem to be having problems with asv lately -- not sure what's wrong.

@jreback
Copy link
Contributor

jreback commented Nov 7, 2015

are you using profile=True in the cython modules?

@pv
Copy link
Contributor

pv commented Nov 8, 2015

The perf hit is different for numpy 1.10 vs 1.9 (and asv probably uses the newest version)

@jreback
Copy link
Contributor

jreback commented Nov 13, 2015

@kawochen pushing, but if you finish in the next few days, pls ping

@jreback jreback modified the milestones: Next Major Release, 0.17.1 Nov 13, 2015
@kawochen
Copy link
Contributor Author

@pv thanks! 😄
@jreback
branch
[ 32.58%] ··· Running ...s.frame_isnull_obj.time_frame_isnull 46.07ms
master
[ 82.58%] ··· Running ...s.frame_isnull_obj.time_frame_isnull 38.91ms

`

@@ -707,12 +708,28 @@ NaT = NaTType()

iNaT = util.get_nat()


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe blow away _checknull_with_nat(and replace with _check_all_nulls and see if any timings are affected?

@kawochen
Copy link
Contributor Author

Replacing wouldn't work in all cases, because sometimes you don't want timedelta64 to be accepted, e.g. datetime_to_datetime64(), so it would become some sort of manual inlining for _checknull_with_nat to deal with those cases (which might not be a bad thing, depending on how often this is needed).

But I see that tslib.pyx needs some cleaning, as code like if val is np_NaT or val.view('i8') == iNaT is everywhere (np_NaT is not a singleton, so the first half of the expression is almost certainly False, and the val.view('i8') is slower than get_datetime64_val (especially given the regression in numpy 1.10, but that will be fixed soon).

So maybe when others or I clean up this module, we can see how to organize or deal with all the lower level scalar check null functions

@jreback jreback modified the milestones: 0.17.1, Next Major Release Nov 20, 2015
jreback added a commit that referenced this pull request Nov 20, 2015
BUG: GH11206 where pd.isnull did not consider numpy NaT null
@jreback jreback merged commit b60b559 into pandas-dev:master Nov 20, 2015
@jreback
Copy link
Contributor

jreback commented Nov 20, 2015

@kawochen thanks!

always nice PR's from you! keep it up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: accept np.datetime64('NaT') in an object dtyped as a null
3 participants