ENH: support for nanosecond time in offset and period #3060

Merged: 1 commit into pandas-dev:master on Sep 30, 2013

Conversation

@wuan (Contributor) commented Mar 15, 2013

closes #1812

This pull request fixes the handling of nanosecond times in periods and offsets. In addition, it simplifies the internal processing of intraday periods.

In []: pd.date_range('2013-01-01', periods=5, freq=pd.offsets.Nano(5))
Out[]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:00.000000020]
Length: 5, Freq: 5N, Timezone: None

It requires numpy 1.7 to work; with numpy 1.6 you get the original behaviour from before this PR.

This is the continuation of #2555, which was merged without actually fixing issue #1812.

The biggest part is a refactoring of pandas/src/period.c to use a conversion-factor matrix for intraday conversion. This minimizes the impact of adding three more intraday time units (ms, us and ns): the number of frequency-conversion functions is reduced to one per intraday source or target frequency, and the intraday conversion factor is looked up in the matrix inside that function.
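
A small Python sketch of the matrix idea (the real matrix lives as C code in pandas/src/period.c; the table below is illustrative, using pandas' intraday frequency codes):

    UNITS = ['D', 'H', 'T', 'S', 'L', 'U', 'N']   # day .. nanosecond
    STEP = [24, 60, 60, 1000, 1000, 1000]         # factor between adjacent units

    # FACTORS[i][j] converts from unit i to the finer unit j
    FACTORS = [[1] * len(UNITS) for _ in UNITS]
    for i in range(len(UNITS)):
        for j in range(i + 1, len(UNITS)):
            FACTORS[i][j] = FACTORS[i][j - 1] * STEP[j - 1]

    # one generic lookup replaces a dedicated function per frequency pair
    assert FACTORS[UNITS.index('S')][UNITS.index('N')] == 10 ** 9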

Nanosecond support is only available with numpy >= 1.7. With numpy < 1.7, everything except nanosecond support should work as expected.

@jreback (Contributor) commented Mar 20, 2013

you have some failing tests..... I think you need to skip by raise nose.SkipTest, not self.SkipTest
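
A minimal sketch of the suggested skip pattern (the version check and test name are illustrative, not from the PR):

    from distutils.version import LooseVersion

    import nose
    import numpy as np

    def test_nano_offset():
        # skip cleanly when numpy is too old for nanosecond support
        if LooseVersion(np.__version__) < LooseVersion('1.7'):
            raise nose.SkipTest("nanosecond support requires numpy >= 1.7")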

@wuan (Contributor, Author) commented Mar 20, 2013

thanks for the hint, the tests are green again

@jreback (Contributor) commented Mar 21, 2013

@wesm take a look?

@wesm (Member) commented Apr 1, 2013

I'll have a look as soon as I can and we can merge for 0.11

@ghost commented Apr 6, 2013

fwiw, I suggest putting this off to 0.12; it's already seen some subtle breakage and it should get a period of beta testing in master before shipping in a release, IMO.
No reflection on the code, just prudence.

@wuan (Contributor, Author) commented Apr 7, 2013

I agree, it may be better to have a longer period of testing.

@wesm (Member) commented Apr 8, 2013

OK let's merge right after 0.11. We'll want to have a 0.11.x maintenance branch in case we need to fix any critical bugs in 0.11.1/2/...

@ghost commented Apr 8, 2013

There's a slew of stuff queued for "right after 0.11"; it looks like we're naturally moving towards a "merge window" model for releases.

@ghost commented Apr 23, 2013

@wuan, any last minute tweaks before we go live in master?

@wuan (Contributor, Author) commented Apr 23, 2013

No, not at the moment. I will have a look at #3375 now.

@jreback (Contributor) commented May 7, 2013

you rebased on top of master?

can u also post a perf check?

eg

test_perf -b before_commit -t last_commit

anything with, say, a > 15% diff

@wuan (Contributor, Author) commented May 18, 2013

Here is a current performance check; there is quite a spread of times:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
dataframe_reindex                            |   0.3353 |   0.4524 |   0.7412 |
panel_from_dict_equiv_indexes                |  22.9453 |  29.8474 |   0.7688 |
reindex_daterange_backfill                   |   0.1794 |   0.2103 |   0.8530 |
series_drop_duplicates_string                |   0.3813 |   0.4307 |   0.8852 |
append_frame_single_homogenous               |   0.1747 |   0.1964 |   0.8895 |

 ...

timeseries_asof_single                       |   0.0357 |   0.0246 |   1.4484 |
frame_xs_col                                 |   0.0226 |   0.0153 |   1.4767 |
reindex_fillna_backfill_float32              |   0.0796 |   0.0521 |   1.5298 |
timeseries_period_downsample_mean            |  17.5680 |  11.3323 |   1.5503 |
dti_reset_index                              |   0.1311 |   0.0533 |   2.4575 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

As times are more often wrapped with Timestamp, there is a performance impact. I can have a further look into this.

@jreback (Contributor) commented May 18, 2013

performance is tricky - sometimes multiple runs will give different results

so obviously we care about the biggest ratios

do a %prun -l func()

in ipython to profile - I usually do it on master and then on my PR to see where the hotspots are
oftentimes it's something simple to bring it down
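
A sketch of that workflow, using the downsampling benchmark that comes up later in this thread (illustrative; the how= keyword reflects the resample API of that era):

    import numpy as np
    import pandas as pd

    rng = pd.period_range('1/1/2000', '1/1/2001', freq='T')
    ts = pd.Series(np.random.randn(len(rng)), index=rng)

    # in IPython, run once on master and once on the PR branch, then compare:
    # %prun -l 10 ts.resample('D', how='mean')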

@cpcloud (Member) commented Aug 1, 2013

it would be great to have this in 0.13 ... i've just pulled down and i'm going to do a little smoke testing

@wuan any thoughts? it's been a while... just want to bump this

@wuan (Contributor, Author) commented Aug 1, 2013

Thanks for restarting activity ...

At the moment I'm investigating some broken tests due to features introduced in the meantime. I will have some more time mid-August to help bring things forward.

Benchmark results are very strange at the moment: some of the tests show speed differences but do not involve new code at all. Still looking into this as well.

@wuan (Contributor, Author) commented Aug 11, 2013

Pulled master and rebased.

There are still problems with stata._datetime_to_stata_elapsed raising a ValueError when it receives a np.datetime64 instead of a datetime.datetime object. To me, this seems to be a bug in the stata code. Any thoughts?

@jreback (Contributor) commented Aug 13, 2013

do you have an example where I can reproduce this?

It appears (brief glance) that the stata code uses iterrows over the frame; if the data is not homogeneous (e.g. all dates), then I think the np.datetime64 will be coerced to datetime.datetime.

Agreed that this is a 'bug', or at least a bad assumption.

@wuan (Contributor, Author) commented Aug 13, 2013

Oops, the real problem was caused by a type-cast bug introduced with this pull request, which is finally gone now. Tests are green again.

At least there is now an open issue, #4558, regarding export of np.datetime64-based types in an index.

if isinstance(key, datetime):
# needed to localize naive datetimes
stamp = Timestamp(key, tz=self.tz)
elif isinstance(key, Timestamp):

Review comment (Contributor) on the snippet above:

this is not necessary; a Timestamp IS an instance of datetime (it's a subclass)
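
A quick check confirming this (illustrative):

    from datetime import datetime
    from pandas import Timestamp

    assert isinstance(Timestamp('2013-01-01'), datetime)   # True: it's a subclass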

@jreback (Contributor) commented Aug 13, 2013

What happens on numpy < 1.7? Is Nano just not available, or does it raise if I try to index by a nano on an index? (either is ok, just checking)....

There are two travis builds (the 2.6 and 3.2 ones) that run with numpy 1.6.1, so you can put in a specific test for this, see the first item here: https://github.com/pydata/pandas/wiki/Tips-&-Tricks (you could even do something in the code too); there will very shortly be a variable in core/common.py which indicates this, see here: #4534 (as timedelta64 is very buggy in < 1.7)
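
A sketch of such a guarded test, assuming the flag from #4534 lands under the name _np_version_under1p7 (hypothetical at the time of this comment):

    import nose
    from pandas.core.common import _np_version_under1p7

    def test_nanosecond_period():
        if _np_version_under1p7:
            raise nose.SkipTest("nanosecond periods need numpy >= 1.7")
        # ... exercise Nano offsets / the 'N' frequency here ...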

@wuan (Contributor, Author) commented Aug 13, 2013

Nano existed well before this pull request and issue #1812; it just did not work, as the effective period was zero.
Should we really introduce code that raises, or only some kind of warning message saying that numpy >= 1.7 is required to use Nano et al.?

@jreback (Contributor) commented Aug 13, 2013

@wuan

this PR adds the ability to index by nanos, correct? (e.g. more than just representing them)....

can you add an example to the top of the PR?

(this is with current master)

In [2]: s = Series([Timestamp('20130101 09:01:02')])

In [6]: s+pd.offsets.Milli(5)+pd.offsets.Micro(6)+pd.offsets.Nano(7)
Out[6]: 
0   2013-01-01 09:01:02.005006007
dtype: datetime64[ns]

@jreback (Contributor) commented Aug 13, 2013

What does it do in < 1.7? (can you give an example?)

@jreback (Contributor) commented Aug 23, 2013

@wuan can you rerun the perf impact?

@wuan (Contributor, Author) commented Aug 23, 2013

Here is the current performance comparison. I will have a look at the slow operations in the next days.

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
join_dataframe_index_single_key_small        |   8.0736 |   9.6220 |   0.8391 |
match_strings                                |   0.4493 |   0.4973 |   0.9035 |
mask_bools                                   |  57.3673 |  62.3620 |   0.9199 |
frame_repr_tall                              |   3.8280 |   4.1060 |   0.9323 |
groupby_frame_apply_overhead                 |  10.1500 |  10.8813 |   0.9328 |
read_csv_comment2                            |  29.1837 |  31.2207 |   0.9348 |
groupby_sum_booleans                         |   1.4513 |   1.5481 |   0.9375 |
datetimeindex_unique                         |   0.1737 |   0.1846 |   0.9410 |
indexing_dataframe_boolean_no_ne             |  78.3010 |  83.1940 |   0.9412 |
frame_iteritems_cached                       |   0.8077 |   0.8574 |   0.9421 |
dti_reset_index_tz                           |  16.7147 |  17.7270 |   0.9429 |
reindex_fillna_backfill                      |   0.9066 |   0.9583 |   0.9461 |
groupby_frame_apply                          |  52.6273 |  55.5476 |   0.9474 |
lib_fast_zip_fillna                          |  18.3074 |  19.2664 |   0.9502 |
timeseries_asof_single                       |   0.0447 |   0.0470 |   0.9509 |

...

stat_ops_level_frame_sum                     |   5.1893 |   4.9037 |   1.0582 |
indexing_dataframe_boolean_rows_object       |   0.7664 |   0.7223 |   1.0610 |
frame_reindex_upcast                         |  20.6413 |  19.4256 |   1.0626 |
frame_fancy_lookup                           |   5.1223 |   4.8033 |   1.0664 |
frame_to_csv2                                | 159.7477 | 149.5946 |   1.0679 |
groupby_last                                 |   6.4550 |   6.0356 |   1.0695 |
frame_loc_dups                               |   1.1450 |   1.0684 |   1.0718 |
frame_mult                                   |   8.0183 |   7.4713 |   1.0732 |
indexing_dataframe_boolean_st                | 106.0576 |  95.9617 |   1.1052 |
frame_constructor_ndarray                    |   0.0697 |   0.0630 |   1.1059 |
timeseries_period_downsample_mean            |  22.1117 |   9.4171 |   2.3480 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

@jreback (Contributor) commented Sep 22, 2013

@wuan any remaining issues in this? quick perf check?

pls rebase as well

I know this has been a long time ...but it can get merged soon

@jreback (Contributor) commented Sep 30, 2013

@wuan give me one more rebase on this pls

@jreback (Contributor) commented Sep 30, 2013

any final comments @cpcloud @jtratner @wesm
this has been a long-running PR, thanks for the patience @wuan

@wuan (Contributor, Author) commented Sep 30, 2013

no prob, I hope that the code is now mature enough
thanks for the good work

@wesm (Member) commented Sep 30, 2013

This is great work. Happy to see the legacy scikits.timeseries code worked on to add features like this. 👍 from me

@@ -1664,7 +1664,7 @@ def test_timedelta64(self):
from pandas import date_range
from datetime import datetime, timedelta

Series(np.array([1100, 20], dtype='timedelta64[s]')).to_string()
Series(np.array([1100, 20], dtype='timedelta64[ns]')).to_string()

Review comment (Member) on the diff above:

this is a smoke test i presume? the result is not being used

Reply from @wuan (Contributor, Author):

This line can be deleted; it really has no effect.

@jreback (Contributor) commented Sep 30, 2013

bombs away

jreback added a commit that referenced this pull request Sep 30, 2013
ENH: support for nanosecond time in offset and period
@jreback merged commit ec77315 into pandas-dev:master on Sep 30, 2013
@jreback (Contributor) commented Sep 30, 2013

@wuan thanks for this!

pls review docs and examples (they are not currently built with 1.7 so prob will fail, but make sure that they look ok locally)....

@cpcloud (Member) commented Sep 30, 2013

@wuan just recompiled with ur new PR (which is great) but i'm getting an unreachable code warning from Cython

warning: pandas/tslib.pyx:582:23: Unreachable code

offending lines:

        if is_integer_object(other):
            if self.offset is None:
                return Timestamp(self.value + other, tz=self.tzinfo)
                msg = ("Cannot add integral value to Timestamp "
                       "without offset.")
                raise ValueError(msg)
            else:
                return Timestamp((self.offset.__mul__(other)).apply(self))

msg and the raise statement will never be reached because you're returning before those lines are ever executed.

@wuan (Contributor, Author) commented Sep 30, 2013

The first line, returning the new Timestamp object, can be removed to make the unreachable code reachable again. That line only fixes

timestamp + Nano(10)

with numpy < 1.7, and removing it simply restores the behaviour from before this PR.

I added a branch wuan/post_nanosecond_time with the related fixes.

@jtratner (Contributor) commented:
Also, the second Timestamp line should be:

return Timestamp((self.offset * other).apply(self))

In case the multiplication needs to happen on the other side (otherwise, you need to check for NotImplemented and raise an error if not).
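
Putting the two suggestions together, the corrected branch would look roughly like this (a sketch against pandas/tslib.pyx, not the final committed code):

        if is_integer_object(other):
            if self.offset is None:
                # reachable again once the early return above is removed
                msg = ("Cannot add integral value to Timestamp "
                       "without offset.")
                raise ValueError(msg)
            else:
                # plain * lets Python fall back to __rmul__ if needed
                return Timestamp((self.offset * other).apply(self))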

@jreback (Contributor) commented Oct 8, 2013

@wuan

can you take a look at the vbench for timeseries_period_downsample_mean in vb_suite/timeseries.py?

In [3]: rng = period_range('1/1/2000', '1/1/2001', freq='T')

In [4]: ts = Series(np.random.randn(len(rng)), index=rng)

In [5]: %timeit ts.resample('D', how='mean')
100 loops, best of 3: 15.8 ms per loop

this used to take about 1/3 of the time in v0.12.0...

I think the period conversion was inlined back then...

pls let me know

@jreback mentioned this pull request Oct 8, 2013
@wuan (Contributor, Author) commented Oct 8, 2013

you mean the inlined methods in period.c?

I removed them to be compatible with the Microsoft C compiler. If it helps, there should be a workaround using preprocessor logic to enable inlining again.

Should I give it a try?

@jreback (Contributor) commented Oct 8, 2013

a brief look at the prior code suggests that convert_daytime was effectively inlined (< 0.12).....that seems to be the bottleneck....pls give it a try and lmk

@jreback (Contributor) commented Oct 11, 2013

any luck with this?

@wuan (Contributor, Author) commented Oct 11, 2013

sorry, had a very packed week. will do it this weekend

@wuan (Contributor, Author) commented Oct 12, 2013

I added inlining again to the functions in period.c (please see branch wuan/pandas/c_inline).

Unfortunately I do not see a huge change; the above benchmark results are
0.12: 5.69 ms per loop
master: 13.8 ms per loop
c_inline: 13.4 ms per loop

I will have a closer look at the resample method

@wuan (Contributor, Author) commented Oct 12, 2013

I think I found the problem.

Series.resample(...) uses tslib.period_asfreq_array(...) to convert all timestamps to the target period. The conversion function is determined and called within a loop over all samples.

When there is an intraday conversion (like in our case), the column and row indices and the desired conversion factor are determined for each sample, although they never change. Since the only other per-sample work is the multiplication / division by that factor, this unnecessary lookup easily increases the runtime by a factor of two or more.

The fix would be to determine and store the conversion factor in the get_asfreq_info method, which is called only once per operation, and to use that stored factor in convert_daytime without further lookups.
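
A Python sketch of the idea (the real change is C code in pandas/src/period.c; get_asfreq_info and convert_daytime are the names discussed above, everything else is illustrative):

    FACTORS = {('T', 'D'): 24 * 60}   # stand-in for the conversion factor matrix

    def get_asfreq_info(src, dst):
        # called once per asfreq operation: resolve the factor up front
        return {'intraday_factor': FACTORS[(src, dst)]}

    def convert_daytime(ordinal, info):
        # called once per sample: a single division, no repeated lookups
        return ordinal // info['intraday_factor']

    info = get_asfreq_info('T', 'D')
    assert convert_daytime(1440, info) == 1   # minute 1440 lies in day 1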

How much time do I have to implement the improvement?

@jreback (Contributor) commented Oct 12, 2013

5 min!

hahh...

still have a week or so.....see what you can do

@jtratner (Contributor) commented:
@wuan heads-up: if it's at the Python level you might be able to use cache_readonly for it, and it'll handle most of that transparently (costs an attribute check + dict lookup at the Cython level instead of recalculation).
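
For reference, a sketch of that pattern (cache_readonly lived in pandas.util.decorators at the time; the class here is hypothetical):

    from pandas.util.decorators import cache_readonly

    class FreqConverter(object):
        def __init__(self, src_per_day, dst_per_day):
            self.src_per_day = src_per_day
            self.dst_per_day = dst_per_day

        @cache_readonly
        def factor(self):
            # computed on first access, then cached on the instance
            return self.dst_per_day // self.src_per_day

    conv = FreqConverter(24, 86400)   # hours -> seconds
    assert conv.factor == 3600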

@wuan (Contributor, Author) commented Oct 12, 2013

Thanks, unfortunately it's at the C level. But the good thing is that fixing it seems to make the code a lot cleaner.

@wuan (Contributor, Author) commented Oct 13, 2013

I'm now at 11.4 ms benchmark time. For now I think this is about the maximum improvement one can get: the old code was so fast because the function in use just multiplied the current ordinal by a constant, whereas now at least some conditional checks are required.

@jreback (Contributor) commented Oct 13, 2013

gr8
put it up in a new PR pls

@wuan (Contributor, Author) commented Oct 13, 2013

The code available with #5204 now has a loop time of 10.5 ms (vs. 13.8 ms on current master).

@wuan deleted the nanosecond_time branch on January 12, 2014 07:37
Labels: Datetime (Datetime data dtype), Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement

Successfully merging this pull request may close: availability of nanosecond periods in DatetimeIndex (#1812)