Skip to content

Remove unnecessary iNaT checks from _Period properties #17421

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Sep 15, 2017

Conversation

jbrockmendel
Copy link
Member

In the status quo, pd.Period.year goes through several layers of redirection, one of which checks for self.ordinal == iNaT.

cdef class _Period(object):
   [...]
    cdef _field(self, alias):
        base, mult = frequencies.get_freq_code(self.freq)
        return get_period_field(alias, self.ordinal, base)

    property year:
        def __get__(self):
            return self._field(0)

[...]
def get_period_field(int code, int64_t value, int freq):
    cdef accessor f = _get_accessor_func(code)
    if f is NULL:
        raise ValueError('Unrecognized period code: %d' % code)
    if value == iNaT:
        return np.nan
    return f(value, freq)

But Period.ordinal will never be iNaT, because Period.__new__ returns NaT in that case instead of a Period object. So we can skip the value == iNaT check in get_period_field. With that out of the way, we can skip _get_accessor_func and _field and and just write in pyear(ordinal, freq).

This PR changes this property lookup for year, month, day, ...

Speedups ~10% across the board. Before:

In [2]: per = pd.Timestamp.now().to_period('M')
In [4]: %timeit per.month
The slowest run took 12.73 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.4 µs per loop

In [5]: %timeit per.second
The slowest run took 15.01 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.4 µs per loop

In [6]: %timeit per.year
The slowest run took 14.73 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.36 µs per loop

In [7]: %timeit per.is_leap_year
The slowest run took 16.94 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.37 µs per loop



In [8]: per = pd.Timestamp.now().to_period('ms')

In [9]: %timeit per.is_leap_year
The slowest run took 16.74 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.5 µs per loop

In [10]: %timeit per.minute
The slowest run took 13.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.37 µs per loop

After:

In [2]: per = pd.Timestamp.now().to_period('M')

In [3]: %timeit per.month
The slowest run took 19.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.21 µs per loop

In [4]: %timeit per.second
The slowest run took 15.26 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.25 µs per loop

In [5]: %timeit per.year
The slowest run took 16.55 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.21 µs per loop

In [6]: %timeit per.is_leap_year
The slowest run took 15.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.24 µs per loop

In [7]: per = pd.Timestamp.now().to_period('ms')

In [8]: %timeit per.is_leap_year
The slowest run took 15.72 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.33 µs per loop

In [9]: %timeit per.minute
The slowest run took 17.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.23 µs per loop

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@gfyoung gfyoung added Clean Internals Related to non-user accessible pandas implementation labels Sep 2, 2017
@codecov
Copy link

codecov bot commented Sep 3, 2017

Codecov Report

Merging #17421 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17421      +/-   ##
==========================================
- Coverage   91.15%   91.14%   -0.02%     
==========================================
  Files         163      163              
  Lines       49581    49581              
==========================================
- Hits        45198    45189       -9     
- Misses       4383     4392       +9
Flag Coverage Δ
#multiple 88.92% <ø> (ø) ⬆️
#single 40.25% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1981b67...df81546. Read the comment docs.

@codecov
Copy link

codecov bot commented Sep 3, 2017

Codecov Report

Merging #17421 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17421      +/-   ##
==========================================
- Coverage   91.22%    91.2%   -0.02%     
==========================================
  Files         163      163              
  Lines       49586    49586              
==========================================
- Hits        45233    45225       -8     
- Misses       4353     4361       +8
Flag Coverage Δ
#multiple 88.99% <ø> (ø) ⬆️
#single 40.2% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.53% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0097cb7...8e09c56. Read the comment docs.

@sinhrks sinhrks added the Period Period data type label Sep 4, 2017
@sinhrks
Copy link
Member

sinhrks commented Sep 4, 2017

can we remove get_period_field completely?

@jbrockmendel
Copy link
Member Author

can we remove get_period_field completely?

AFAICT the only remaining usages are in tests. It is exposed so technically could be used by someone somewhere. I'm trying to err on the side of doing-too-little in PRs to keep the reviewer(s) happy.

@jreback
Copy link
Contributor

jreback commented Sep 6, 2017

can you add asv's for these Period properites (and then run the comparions).

@jbrockmendel
Copy link
Member Author

See additions to the asv period file.

asv continuous -f 1.1 -E virtualenv master HEAD -b period.Properties
[...]
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

For once I actually believe the results!

@jreback
Copy link
Contributor

jreback commented Sep 7, 2017

this looks like it overlaps with #17422 somewhat. ok with merging either first; but I suspect you will want to rework this one to directly call the cython routine (get_freq_codes) . lmk.

@jbrockmendel
Copy link
Member Author

but I suspect you will want to rework this one to directly call the cython routine (get_freq_codes)

My preference would be to merge them separately-- they should not conflict-- and then in a follow-up to #17422 I'll make a pass through all of _libs to replace any calls to frequencies.get_freq_codes with the cython version.

@jreback
Copy link
Contributor

jreback commented Sep 7, 2017

ok that's fine, let's fixup / merge #17422 first

@jbrockmendel
Copy link
Member Author

The Travis error is a greyed-out circle. Not sure what to make of it. Does this require action on my part?

@jreback
Copy link
Contributor

jreback commented Sep 14, 2017

travis was hicupping. why don't you rebase and push again.

@jreback jreback added this to the 0.21.0 milestone Sep 15, 2017
@jreback jreback merged commit 9b21c54 into pandas-dev:master Sep 15, 2017
@jreback
Copy link
Contributor

jreback commented Sep 15, 2017

thanks!

@jbrockmendel jbrockmendel deleted the period_nat branch October 30, 2017 16:24
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants