ENH: support for nanosecond time in offset and period #3060

Merged: 1 commit into pandas-dev:master on Sep 30, 2013

Conversation

@wuan (Contributor) commented Mar 15, 2013

closes #1812

This pull request fixes the handling of nanosecond times in periods and offsets. In addition, it simplifies the internal processing of intraday periods.

In []: pd.date_range('2013-01-01', periods=5, freq=pd.offsets.Nano(5))
Out[]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:00.000000020]
Length: 5, Freq: 5N, Timezone: None

It requires numpy 1.7 to work; with numpy 1.6 you get the original behaviour from before this PR.

This is the continuation of #2555, which was merged without actually fixing issue #1812.

The biggest part is a refactoring of pandas/src/period.c to use a conversion-factor matrix for intraday conversion. This minimizes the impact of adding three more intraday time units (ms, us and ns): the number of frequency-conversion functions is reduced to one per intraday source or target frequency, and the intraday conversion factor is looked up in the matrix inside that function.
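
A small Python sketch of the matrix idea (the real matrix lives as C code in pandas/src/period.c; the table below is illustrative, using pandas' intraday frequency codes):

    UNITS = ['D', 'H', 'T', 'S', 'L', 'U', 'N']   # day .. nanosecond
    STEP = [24, 60, 60, 1000, 1000, 1000]         # factor between adjacent units

    # FACTORS[i][j] converts from unit i to the finer unit j
    FACTORS = [[1] * len(UNITS) for _ in UNITS]
    for i in range(len(UNITS)):
        for j in range(i + 1, len(UNITS)):
            FACTORS[i][j] = FACTORS[i][j - 1] * STEP[j - 1]

    # one generic lookup replaces a dedicated function per frequency pair
    assert FACTORS[UNITS.index('S')][UNITS.index('N')] == 10 ** 9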

Nanosecond support is only available with numpy >= 1.7. With numpy < 1.7, everything except nanosecond support should work as expected.

@jreback (Contributor) commented Mar 20, 2013

you have some failing tests..... I think you need to skip by raise nose.SkipTest, not self.SkipTest
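
A minimal sketch of the suggested skip pattern (the version check and test name are illustrative, not from the PR):

    from distutils.version import LooseVersion

    import nose
    import numpy as np

    def test_nano_offset():
        # skip cleanly when numpy is too old for nanosecond support
        if LooseVersion(np.__version__) < LooseVersion('1.7'):
            raise nose.SkipTest("nanosecond support requires numpy >= 1.7")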

@wuan (Contributor, Author) commented Mar 20, 2013

thanks for the hint, the tests are green again

@jreback (Contributor) commented Mar 21, 2013

@wesm take a look?

@wesm (Member) commented Apr 1, 2013

I'll have a look as soon as I can and we can merge for 0.11

@ghost commented Apr 6, 2013

fwiw, I suggest putting this off to 0.12; it's already seen some subtle breakage and it should get a period of beta testing in master before shipping in a release, IMO.
No reflection on the code, just prudence.

@wuan (Contributor, Author) commented Apr 7, 2013

I agree, it may be better to have a longer period of testing.

@wesm (Member) commented Apr 8, 2013

OK let's merge right after 0.11. We'll want to have a 0.11.x maintenance branch in case we need to fix any critical bugs in 0.11.1/2/...

@ghost commented Apr 8, 2013

There's a slew of stuff queued for "right after 0.11"; it looks like we're naturally moving towards a "merge window" model for releases.

@ghost commented Apr 23, 2013

@wuan, any last minute tweaks before we go live in master?

@wuan (Contributor, Author) commented Apr 23, 2013

No, not at the moment. I will have a look at #3375 now.

@jreback (Contributor) commented May 7, 2013

you rebased on top of master?

can u also post a perf check?

eg

test_perf -b before_commit -t last_commit

anything with, say, a > 15% diff

@wuan (Contributor, Author) commented May 18, 2013

Here is a current performance check; there is quite a spread of times:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
dataframe_reindex                            |   0.3353 |   0.4524 |   0.7412 |
panel_from_dict_equiv_indexes                |  22.9453 |  29.8474 |   0.7688 |
reindex_daterange_backfill                   |   0.1794 |   0.2103 |   0.8530 |
series_drop_duplicates_string                |   0.3813 |   0.4307 |   0.8852 |
append_frame_single_homogenous               |   0.1747 |   0.1964 |   0.8895 |

 ...

timeseries_asof_single                       |   0.0357 |   0.0246 |   1.4484 |
frame_xs_col                                 |   0.0226 |   0.0153 |   1.4767 |
reindex_fillna_backfill_float32              |   0.0796 |   0.0521 |   1.5298 |
timeseries_period_downsample_mean            |  17.5680 |  11.3323 |   1.5503 |
dti_reset_index                              |   0.1311 |   0.0533 |   2.4575 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

As times are more often wrapped with Timestamp, there is a performance impact. I can have a further look into this.

@jreback (Contributor) commented May 18, 2013

performance is tricky - sometimes multiple runs will give different results

so obviously we care about the biggest ratios

do a %prun -l func()

in ipython to profile - I usually do it on master and then on my PR to see where the hotspots are
oftentimes it's something simple to bring it down
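
A sketch of that workflow, using the downsampling benchmark that comes up later in this thread (illustrative; the how= keyword reflects the resample API of that era):

    import numpy as np
    import pandas as pd

    rng = pd.period_range('1/1/2000', '1/1/2001', freq='T')
    ts = pd.Series(np.random.randn(len(rng)), index=rng)

    # in IPython, run once on master and once on the PR branch, then compare:
    # %prun -l 10 ts.resample('D', how='mean')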

@cpcloud (Member) commented Aug 1, 2013

it would be great to have this in 0.13 ... i've just pulled down and i'm going to do a little smoke testing

@wuan any thoughts? it's been a while... just want to bump this

@wuan (Contributor, Author) commented Aug 1, 2013

Thanks for restarting activity ...

At the moment I'm investigating some broken tests due to features introduced in the meantime. I will have some more time mid-August to help bring things forward.

Benchmark results are very strange at the moment: some of the tests show speed differences but do not involve new code at all. Still looking into this as well.

@wuan (Contributor, Author) commented Aug 11, 2013

Pulled master and rebased.

There are still problems with stata._datetime_to_stata_elapsed raising a ValueError when it receives a np.datetime64 instead of a datetime.datetime object. To me, this seems to be a bug in the stata code. Any thoughts?

@jreback (Contributor) commented Aug 13, 2013

do you have an example where I can reproduce this?

It appears (brief glance) that the stata code uses iterrows over the frame; if the data is not homogeneous (e.g. all dates), then I think the np.datetime64 will be coerced to datetime.datetime.

Agreed that this is a 'bug', or at least a bad assumption.

@wuan (Contributor, Author) commented Aug 13, 2013

Oops, the real problem was caused by a type-cast bug introduced with this pull request, which is finally gone now. Tests are green again.

At least there is now an open issue, #4558, regarding export of np.datetime64-based types in an index.

if isinstance(key, datetime):
# needed to localize naive datetimes
stamp = Timestamp(key, tz=self.tz)
elif isinstance(key, Timestamp):

Review comment (Contributor) on the snippet above:

this is not necessary; a Timestamp IS an instance of datetime (it's a subclass)
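
A quick check confirming this (illustrative):

    from datetime import datetime
    from pandas import Timestamp

    assert isinstance(Timestamp('2013-01-01'), datetime)   # True: it's a subclass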

@jreback (Contributor) commented Aug 13, 2013

What happens on numpy < 1.7? Is Nano just not available, or does it raise if I try to index by a nano on an index? (either is ok, just checking)....

There are two travis builds (the 2.6 and 3.2 ones) that run with numpy 1.6.1, so you can put in a specific test for this, see the first item here: https://github.com/pydata/pandas/wiki/Tips-&-Tricks (you could even do something in the code too); there will very shortly be a variable in core/common.py which indicates this, see here: #4534 (as timedelta64 is very buggy in < 1.7)
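
A sketch of such a guarded test, assuming the flag from #4534 lands under the name _np_version_under1p7 (hypothetical at the time of this comment):

    import nose
    from pandas.core.common import _np_version_under1p7

    def test_nanosecond_period():
        if _np_version_under1p7:
            raise nose.SkipTest("nanosecond periods need numpy >= 1.7")
        # ... exercise Nano offsets / the 'N' frequency here ...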

@wuan (Contributor, Author) commented Aug 13, 2013

Nano existed well before this pull request and issue #1812; it just did not work, as the effective period was zero.
Should we really introduce code that raises, or only some kind of warning message saying that numpy >= 1.7 is required to use Nano et al.?

@jreback (Contributor) commented Aug 13, 2013

@wuan

this PR adds the ability to index by nanos, correct? (e.g. more than just representing them)....

can you add an example to the top of the PR?

(this is with current master)

In [2]: s = Series([Timestamp('20130101 09:01:02')])

In [6]: s+pd.offsets.Milli(5)+pd.offsets.Micro(6)+pd.offsets.Nano(7)
Out[6]: 
0   2013-01-01 09:01:02.005006007
dtype: datetime64[ns]

@jreback (Contributor) commented Aug 13, 2013

What does it do in < 1.7? (can you give an example?)

@jreback (Contributor) commented Aug 23, 2013

@wuan can you rerun the perf impact?

@wuan (Contributor, Author) commented Aug 23, 2013

Here is the current performance comparison. I will have a look at the slow operations in the next days.

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
join_dataframe_index_single_key_small        |   8.0736 |   9.6220 |   0.8391 |
match_strings                                |   0.4493 |   0.4973 |   0.9035 |
mask_bools                                   |  57.3673 |  62.3620 |   0.9199 |
frame_repr_tall                              |   3.8280 |   4.1060 |   0.9323 |
groupby_frame_apply_overhead                 |  10.1500 |  10.8813 |   0.9328 |
read_csv_comment2                            |  29.1837 |  31.2207 |   0.9348 |
groupby_sum_booleans                         |   1.4513 |   1.5481 |   0.9375 |
datetimeindex_unique                         |   0.1737 |   0.1846 |   0.9410 |
indexing_dataframe_boolean_no_ne             |  78.3010 |  83.1940 |   0.9412 |
frame_iteritems_cached                       |   0.8077 |   0.8574 |   0.9421 |
dti_reset_index_tz                           |  16.7147 |  17.7270 |   0.9429 |
reindex_fillna_backfill                      |   0.9066 |   0.9583 |   0.9461 |
groupby_frame_apply                          |  52.6273 |  55.5476 |   0.9474 |
lib_fast_zip_fillna                          |  18.3074 |  19.2664 |   0.9502 |
timeseries_asof_single                       |   0.0447 |   0.0470 |   0.9509 |

...

stat_ops_level_frame_sum                     |   5.1893 |   4.9037 |   1.0582 |
indexing_dataframe_boolean_rows_object       |   0.7664 |   0.7223 |   1.0610 |
frame_reindex_upcast                         |  20.6413 |  19.4256 |   1.0626 |
frame_fancy_lookup                           |   5.1223 |   4.8033 |   1.0664 |
frame_to_csv2                                | 159.7477 | 149.5946 |   1.0679 |
groupby_last                                 |   6.4550 |   6.0356 |   1.0695 |
frame_loc_dups                               |   1.1450 |   1.0684 |   1.0718 |
frame_mult                                   |   8.0183 |   7.4713 |   1.0732 |
indexing_dataframe_boolean_st                | 106.0576 |  95.9617 |   1.1052 |
frame_constructor_ndarray                    |   0.0697 |   0.0630 |   1.1059 |
timeseries_period_downsample_mean            |  22.1117 |   9.4171 |   2.3480 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

@jreback (Contributor) commented Sep 22, 2013

@wuan any remaining issues in this? quick perf check?

pls rebase as well

I know this has been a long time ...but it can get merged soon

@jreback (Contributor) commented Sep 30, 2013

@wuan give me one more rebase on this pls

@jreback (Contributor) commented Sep 30, 2013

any final comments @cpcloud @jtratner @wesm
this has been a long-running PR, thanks for the patience @wuan

@wuan (Contributor, Author) commented Sep 30, 2013

no prob, I hope that the code is now mature enough
thanks for the good work

@wesm (Member) commented Sep 30, 2013

This is great work. Happy to see the legacy scikits.timeseries code worked on to add features like this. 👍 from me

@@ -1664,7 +1664,7 @@ def test_timedelta64(self):
from pandas import date_range
from datetime import datetime, timedelta

Series(np.array([1100, 20], dtype='timedelta64[s]')).to_string()
Series(np.array([1100, 20], dtype='timedelta64[ns]')).to_string()

Review comment (Member) on the diff above:

this is a smoke test i presume? the result is not being used

Reply from @wuan (Contributor, Author):

This line can be deleted; it really has no effect.

@jreback (Contributor) commented Sep 30, 2013

bombs away

jreback added a commit that referenced this pull request Sep 30, 2013
ENH: support for nanosecond time in offset and period
@jreback merged commit ec77315 into pandas-dev:master on Sep 30, 2013
@jreback (Contributor) commented Sep 30, 2013

@wuan thanks for this!

pls review docs and examples (they are not currently built with 1.7 so prob will fail, but make sure that they look ok locally)....

@cpcloud (Member) commented Sep 30, 2013

@wuan just recompiled with ur new PR (which is great) but i'm getting an unreachable code warning from Cython

warning: pandas/tslib.pyx:582:23: Unreachable code

offending lines:

        if is_integer_object(other):
            if self.offset is None:
                return Timestamp(self.value + other, tz=self.tzinfo)
                msg = ("Cannot add integral value to Timestamp "
                       "without offset.")
                raise ValueError(msg)
            else:
                return Timestamp((self.offset.__mul__(other)).apply(self))

msg and the raise statement will never be reached because you're returning before those lines are ever executed.

@wuan (Contributor, Author) commented Sep 30, 2013

The first line, returning the new Timestamp object, can be removed to make the unreachable code reachable again. That line only fixes

timestamp + Nano(10)

with numpy < 1.7, and removing it simply restores the behaviour from before this PR.

I added a branch wuan/post_nanosecond_time with the related fixes.

@jtratner (Contributor) commented:
Also, the second Timestamp line should be:

return Timestamp((self.offset * other).apply(self))

In case the multiplication needs to happen on the other side (otherwise, you need to check for NotImplemented and raise an error if not).
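
Putting the two suggestions together, the corrected branch would look roughly like this (a sketch against pandas/tslib.pyx, not the final committed code):

        if is_integer_object(other):
            if self.offset is None:
                # reachable again once the early return above is removed
                msg = ("Cannot add integral value to Timestamp "
                       "without offset.")
                raise ValueError(msg)
            else:
                # plain * lets Python fall back to __rmul__ if needed
                return Timestamp((self.offset * other).apply(self))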

@jreback (Contributor) commented Oct 8, 2013

@wuan

can you take a look at the vbench for timeseries_period_downsample_mean in vb_suite/timeseries.py?

In [3]: rng = period_range('1/1/2000', '1/1/2001', freq='T')

In [4]: ts = Series(np.random.randn(len(rng)), index=rng)

In [5]: %timeit ts.resample('D', how='mean')
100 loops, best of 3: 15.8 ms per loop

this used to take about 1/3 of the time in v0.12.0...

I think the period conversion was inlined back then...

pls let me know

@jreback mentioned this pull request Oct 8, 2013
@wuan (Contributor, Author) commented Oct 8, 2013

you mean the inlined methods in period.c?

I removed them to be compatible with the Microsoft C compiler. If it helps, there should be a workaround using preprocessor logic to enable inlining again.

Should I give it a try?

@jreback (Contributor) commented Oct 8, 2013

a brief look at the prior code suggests that convert_daytime was effectively inlined (< 0.12).....that seems to be the bottleneck....pls give it a try and lmk

@jreback (Contributor) commented Oct 11, 2013

any luck with this?

@wuan (Contributor, Author) commented Oct 11, 2013

sorry, had a very packed week. will do it this weekend

@wuan (Contributor, Author) commented Oct 12, 2013

I added inlining again to the functions in period.c (please see branch wuan/pandas/c_inline).

Unfortunately I do not see a huge change; the above benchmark results are
0.12: 5.69 ms per loop
master: 13.8 ms per loop
c_inline: 13.4 ms per loop

I will have a closer look at the resample method

@wuan (Contributor, Author) commented Oct 12, 2013

I think I found the problem.

Series.resample(...) uses tslib.period_asfreq_array(...) to convert all timestamps to the target period. The conversion function is determined and called within a loop over all samples.

When there is an intraday conversion (like in our case), the column and row indices and the desired conversion factor are determined for each sample, although they never change. Since the only other per-sample work is the multiplication / division by that factor, this unnecessary lookup easily increases the runtime by a factor of two or more.

The fix would be to determine and store the conversion factor in the get_asfreq_info method, which is called only once per operation, and to use that stored factor in convert_daytime without further lookups.
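
A Python sketch of the idea (the real change is C code in pandas/src/period.c; get_asfreq_info and convert_daytime are the names discussed above, everything else is illustrative):

    FACTORS = {('T', 'D'): 24 * 60}   # stand-in for the conversion factor matrix

    def get_asfreq_info(src, dst):
        # called once per asfreq operation: resolve the factor up front
        return {'intraday_factor': FACTORS[(src, dst)]}

    def convert_daytime(ordinal, info):
        # called once per sample: a single division, no repeated lookups
        return ordinal // info['intraday_factor']

    info = get_asfreq_info('T', 'D')
    assert convert_daytime(1440, info) == 1   # minute 1440 lies in day 1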

How much time do I have to implement the improvement?

@jreback (Contributor) commented Oct 12, 2013

5 min!

hahh...

still have a week or so.....see what you can do

@jtratner (Contributor) commented:
@wuan heads-up: if it's at the Python level you might be able to use cache_readonly for it, and it'll handle most of that transparently (costs an attribute check + dict lookup at the Cython level instead of recalculation).
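
For reference, a sketch of that pattern (cache_readonly lived in pandas.util.decorators at the time; the class here is hypothetical):

    from pandas.util.decorators import cache_readonly

    class FreqConverter(object):
        def __init__(self, src_per_day, dst_per_day):
            self.src_per_day = src_per_day
            self.dst_per_day = dst_per_day

        @cache_readonly
        def factor(self):
            # computed on first access, then cached on the instance
            return self.dst_per_day // self.src_per_day

    conv = FreqConverter(24, 86400)   # hours -> seconds
    assert conv.factor == 3600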

@wuan (Contributor, Author) commented Oct 12, 2013

Thanks, unfortunately it's at the C level. But the good thing is that fixing it seems to make the code a lot cleaner.

@wuan (Contributor, Author) commented Oct 13, 2013

I'm now at 11.4 ms benchmark time. For now I think this is about the maximum improvement one can get: the old code was so fast because the function in use just multiplied the current ordinal by a constant, whereas now at least some conditional checks are required.

@jreback (Contributor) commented Oct 13, 2013

gr8
put it up in a new PR pls

@wuan (Contributor, Author) commented Oct 13, 2013

The code available with #5204 now has a loop time of 10.5 ms (vs. 13.8 ms on current master).

@wuan deleted the nanosecond_time branch on January 12, 2014 07:37
Labels: Datetime (Datetime data dtype), Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement

Successfully merging this pull request may close: availability of nanosecond periods in DatetimeIndex (#1812)