Skip to content

Implement numpy_helper functions directly in cython #18059

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 6, 2017

Conversation

jbrockmendel
Copy link
Member

cleanup unused bits of src/datetime.pxd

TL;DR this PR implements get_datetime64_value, get_timedelta64_value, and get_datetime64_unit directly in cython instead of getting them via C. There is no performance penalty. Weaning off of the src directory will make the build less of a hassle.

AFAICT the main reason why get_datetime64_unit, get_datetime64_value, and get_timedelta64_value (as well as is_float_object, is_datetime64_object`...) are implemented in numpy_helper.h and np_datetime.c is because cython/numpy does not make it easy to cimport the relevant symbols: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/numpy.pxd#L481

    # Cannot be supported due to ## ## in macro:
    # bint PyArray_IsScalar(object, verbatim work)

After some wrangling and stackoverflowing, I found a way to implement these directly in cython. Importantly, the C code generated is functionally identical to the existing versions, so there shouldn't be a performance hit.

A big part of the diff comes from moving some declarations from np_datetime's pyx file to the pxd file.

@codecov
Copy link

codecov bot commented Nov 1, 2017

Codecov Report

Merging #18059 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18059      +/-   ##
==========================================
- Coverage   91.25%   91.23%   -0.02%     
==========================================
  Files         163      163              
  Lines       50115    50115              
==========================================
- Hits        45730    45721       -9     
- Misses       4385     4394       +9
Flag Coverage Δ
#multiple 89.04% <ø> (ø) ⬆️
#single 40.23% <ø> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 881ee30...ea4bef0. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 1, 2017

Codecov Report

Merging #18059 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18059      +/-   ##
==========================================
+ Coverage   91.25%   91.26%   +<.01%     
==========================================
  Files         163      163              
  Lines       50124    50124              
==========================================
+ Hits        45742    45745       +3     
+ Misses       4382     4379       -3
Flag Coverage Δ
#multiple 89.07% <ø> (+0.02%) ⬆️
#single 40.32% <ø> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bc69dc6...13b85d8. Read the comment docs.

@gfyoung gfyoung added Clean Internals Related to non-user accessible pandas implementation labels Nov 1, 2017
@gfyoung
Copy link
Member

gfyoung commented Nov 1, 2017

so there shouldn't be a performance hit.

Seem reasonable, but let's back that claim up with asv metrics to be sure.

@jreback
Copy link
Contributor

jreback commented Nov 1, 2017

instead of this, is there a reason you simply don't move datetime.pxd into np_datetime.pxd? (certainly can do some of the code cleanup, but after)

@jbrockmendel
Copy link
Member Author

Seem reasonable, but let's back that claim up with asv metrics to be sure.

Will do. Even with affinity my asv results probably merit outside confirmation.

@jbrockmendel
Copy link
Member Author

instead of this, is there a reason you simply don't move datetime.pxd into np_datetime.pxd? (certainly can do some of the code cleanup, but after)

That may be worth doing, but I see it as orthogonal to this. Cutting numpy_helper out of the equation simplifies things quite a bit (among other things I'm thinking about setup.py where 'include' and 'pxd' specifications seem pretty fragile)

@jbrockmendel
Copy link
Member Author

With appropriate skepticism:

taskset 5 asv continuous -E virtualenv -f 1.1 master HEAD -b timestamp
[...]
      before           after         ratio
     [d80a4b81]       [ea4bef00]
+     5.95±0.02μs      6.79±0.05μs     1.14  timestamp.TimestampProperties.time_is_leap_year
+     5.71±0.03μs         6.28±1μs     1.10  timestamp.TimestampProperties.time_is_quarter_end
-     7.09±0.02μs      6.17±0.02μs     0.87  timestamp.TimestampProperties.time_is_year_start
-        195±50ns        109±0.2ns     0.56  timestamp.TimestampProperties.time_microsecond
-        862±40ms         159±10ms     0.18  binary_ops.Timeseries.time_timestamp_ops_diff2
-        443±10ms         77.5±4ms     0.18  binary_ops.TimeseriesTZ.time_timestamp_ops_diff2
-        469±30ms         63.1±7ms     0.13  binary_ops.TimeseriesTZ.time_timestamp_ops_diff1
-         155±8ms       19.1±0.2ms     0.12  binary_ops.Timeseries.time_timestamp_ops_diff1

taskset 5 asv continuous -E virtualenv -f 1.1 master HEAD -b timeseries
[...]
       before           after         ratio
     [d80a4b81]       [ea4bef00]
+      17.3±0.9ms         21.6±1ms     1.25  timeseries.TimeSeries.time_add_irregular
+     46.3±0.09μs       54.3±0.2μs     1.17  timeseries.SemiMonthOffset.time_begin_incr
+        132±10ms            150ms     1.13  timeseries.DatetimeIndex.time_dti_tz_factorize
-      20.3±0.1μs      18.4±0.09μs     0.90  timeseries.Offsets.time_custom_bday_incr
-      20.6±0.1μs       17.8±0.1μs     0.86  timeseries.Offsets.time_timeseries_day_apply

Digging out the C code produced by cython for the implementation of get_datetime64_value here (and stripping out the traceback comments)

static CYTHON_INLINE npy_datetime __pyx_f_6pandas_5_libs_6tslibs_11np_datetime_get_datetime64_value(PyObject *__pyx_v_obj) {
  npy_datetime __pyx_r;

  __pyx_r = ((PyDatetimeScalarObject *)__pyx_v_obj)->obval;
  goto __pyx_L0;

  /* function exit code */
  __pyx_L0:;
  return __pyx_r;
}

Whereas the version being replaced is:

PANDAS_INLINE npy_datetime get_datetime64_value(PyObject* obj) {
        return ((PyDatetimeScalarObject*)obj)->obval;
    }

cython is really verbose, but I pinged their mailing list and was told that the verbosity should be stripped out by the compiler.

Same story for get_datetime64_unit and get_timedelta64_value.

@jreback
Copy link
Contributor

jreback commented Nov 2, 2017

@chris-b1 can you review

Copy link
Contributor

@chris-b1 chris-b1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actual changes look fine to me

# int64_t year
# int32_t month, day, hour, min, sec, us, ps, as

ctypedef enum NPY_DATETIMEUNIT:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this enum needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's leftover from an earlier revision. Will remove.

# numpy object inspection

cdef inline npy_datetime get_datetime64_value(object obj) nogil:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make the docstrings more descriptive? Something like "returns int64 value underlying numpy datetime scalar Python object"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, will change.

@jreback
Copy link
Contributor

jreback commented Nov 3, 2017

rebase as always on re-push

…libs-npy_dtime5

update docstring per reviewer suggestion
@jbrockmendel
Copy link
Member Author

Thoughts on this? I'd like to follow-up by moving a few more pieces of src/datetime.pxd over to np_datetime.pxd

@gfyoung
Copy link
Member

gfyoung commented Nov 6, 2017

ping @jreback


from tslibs.np_datetime cimport get_timedelta64_value, get_datetime64_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this now makes lib depend on np_datetime.pxd, which is ok as tseries_depends now lists this.

@jreback jreback added this to the 0.22.0 milestone Nov 6, 2017
@jreback jreback merged commit f17aa26 into pandas-dev:master Nov 6, 2017
@jreback
Copy link
Contributor

jreback commented Nov 6, 2017

thanks! you will prob need to rebase your other PR's on this just to make sure.

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
@jbrockmendel jbrockmendel deleted the tslibs-npy_dtime5 branch December 8, 2017 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants