
BUG: DatetimeIndex.__iter__ creates a temp array of Timestamp (GH7683) #7709


Closed
wants to merge 1 commit

Conversation

yrlihuan

@yrlihuan yrlihuan commented Jul 9, 2014

closes #7683

Move __iter__ from DatetimeIndex/PeriodIndex to DatetimeIndexOpsMixin
Add perf tests for the change

-------------------------------------------------------------------------------
Test name                             |  head[ms] |  base[ms] |  ratio |
-------------------------------------------------------------------------------
timeseries_iter_datetimeindex_preexit |   35.6683 | 2546.9604 | 0.0140 |
timeseries_iter_periodindex           | 5051.6427 | 4353.0200 | 1.1605 |
timeseries_iter_periodindex_preexit   |   50.9930 |   43.8580 | 1.1627 |
timeseries_iter_datetimeindex         | 3463.0601 | 2596.2270 | 1.3339 |
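To illustrate why the preexit ratio drops so sharply, here is a minimal pure-Python sketch (not pandas code; `Box`, `eager_iter`, and `lazy_iter` are hypothetical names): an eager iterator boxes every element before yielding anything, while a lazy one boxes on demand, so an early exit skips almost all of the boxing work.

```python
import itertools

class Box:
    """Stand-in for an expensive per-element wrapper such as Timestamp."""
    def __init__(self, value):
        self.value = value

def eager_iter(values):
    # Boxes the whole array up front before yielding (old behaviour).
    return iter([Box(v) for v in values])

def lazy_iter(values):
    # Boxes one element at a time, so early exit skips the rest.
    return (Box(v) for v in values)

# Taking only the first 100 elements from the lazy iterator performs
# 100 Box constructions; the eager one would still construct all of them.
first = list(itertools.islice(lazy_iter(range(1_000_000)), 100))
```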

@jreback jreback added this to the 0.15.0 milestone Jul 9, 2014
@jreback

jreback commented Jul 9, 2014

Great! Please also post the result of the vbench in the top section.

idx1 = date_range(start='20140101', freq='T', periods=N)
idx2 = period_range(start='20140101', freq='T', periods=N)

def iter_n(iterable, n=None):
    # consume up to n elements (all of them when n is None)
    i = 0
    for _ in iterable:
        i += 1
        if n is not None and i >= n:
            break
you don't need this, the vbench will run on different versions

@jreback

jreback commented Jul 9, 2014

hmm, so time increases for the iteration? I get the early exit issue.

@yrlihuan

yrlihuan commented Jul 9, 2014

From the vbench results, iteration over a PeriodIndex gets about 16% slower. The slowdown is consistent for full and partial iteration, so I guess it comes from the change to using asi8.

Full iteration over a DatetimeIndex gets about 33% slower, and that's probably because vectorization is no longer used.

@jreback

jreback commented Jul 9, 2014

ok, I think that for DatetimeIndex this could be improved quite substantially by doing the following:

replace core/base/DatetimeIndexOpsMixin/_box_values, which calls lib.map_infer with Timestamp, with a Cython routine that does the same thing (e.g. create the list in Cython, fill it with Timestamps, and return it).

There is quite a bit of overhead in going back and forth between Python and Cython. This won't work for Period (as that is defined in Python).

E.g. take an approach like to_pydatetime, which almost does this.
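A pure-Python sketch of what the suggested boxing routine would do (the real proposal is a Cython loop over the raw i8 values; `box_values` and the `box` callable here are illustrative stand-ins for the Timestamp construction):

```python
def box_values(values, box):
    # Single pass over the raw values, returning a list of boxed objects.
    # In the real proposal this loop would live in Cython, so the
    # Python/C boundary is crossed once per array, not once per element.
    out = [None] * len(values)
    for i, v in enumerate(values):
        out[i] = box(v)
    return out

# Usage with an arbitrary boxing callable (str stands in for Timestamp):
boxed = box_values([1, 2, 3], box=str)
```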

@jreback

jreback commented Jul 9, 2014

then, the call lib.map_infer(values, lib.Timestamp) occurs in a number of places; those can be fixed afterwards,

e.g.
list(DatetimeIndex(values))

@jreback

jreback commented Jul 9, 2014

to_pydatetime is MUCH faster (and effectively equivalent; we just need to preserve the offset, which is no biggie). It might be possible to create Timestamps/datetimes in the same routine (with some small adjustments).

In [1]: idx = date_range('20130101',periods=1000000,freq='s')

In [4]: %timeit idx.to_pydatetime()
1 loops, best of 3: 366 ms per loop

In [5]: %timeit [ Timestamp(x) for x in idx.to_pydatetime() ]
1 loops, best of 3: 1.41 s per loop

In [6]: %timeit list(iter(idx))
1 loops, best of 3: 3.64 s per loop

In [7]: x = [ Timestamp(x) for x in idx.to_pydatetime() ]

In [8]: y = list(iter(idx))
In [9]: x[0]
Out[9]: Timestamp('2013-01-01 00:00:00')

In [10]: y[0]
Out[10]: Timestamp('2013-01-01 00:00:00', offset='S')

@yrlihuan

yrlihuan commented Jul 9, 2014

That's a real improvement. I'd love to see that happen, because Timestamp creation can be a real pain.

But I'm not familiar with Cython, and that's beyond me.

@jreback

jreback commented Jul 9, 2014

This is from tslib.Timestamp.__new__; you can almost directly do this:

        # make datetime happy
        ts_base = _Timestamp.__new__(cls, ts.dts.year, ts.dts.month,
                                     ts.dts.day, ts.dts.hour, ts.dts.min,
                                     ts.dts.sec, ts.dts.us, ts.tzinfo)

        # fill out rest of data
        ts_base.value = ts.value
        ts_base.offset = offset
        ts_base.nanosecond = ts.dts.ps / 1000

        return ts_base

@jreback

jreback commented Jul 9, 2014

@yrlihuan ok, no problem... take a stab if you'd like (Cython is actually pretty much like Python code, just with types, and a bit harder to debug).

If not, I don't think it's hard to do. It's always a learning experience!

@jreback

jreback commented Jul 10, 2014

see #7720. I think it's possible to do some sort of chunked conversion, e.g. convert 10000 elements at a time, hand them to the iterator, and so on. This would allow early exit to work and at the same time make the whole process quite a bit faster.

want to take a stab at fixing that? (it's all in Python space)
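A minimal sketch of the chunked-conversion idea, in plain Python with hypothetical names (`chunked_iter`, `boxer`): convert a block of elements at a time and yield from it, so an early exit only pays for the chunks it actually consumed, while each chunk can still be converted with the fast batch routine.

```python
import itertools

def chunked_iter(values, boxer=list, chunksize=10000):
    # Convert `chunksize` elements at a time with the (fast, batch)
    # boxer, then yield them one by one. An early exit after k elements
    # converts only the chunks containing those k elements instead of
    # the whole array.
    for start in range(0, len(values), chunksize):
        for item in boxer(values[start:start + chunksize]):
            yield item

data = list(range(25_000))

# Full iteration still sees every element, in order.
full = list(chunked_iter(data, chunksize=10_000))

# Early exit after 5 elements converts only the first chunk.
head = list(itertools.islice(chunked_iter(data, chunksize=10_000), 5))
```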

@jreback

jreback commented Jul 10, 2014

You can do almost exactly this for a chunked iterator : https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L1316

@jreback

jreback commented Jul 10, 2014

closing in favor of #7720 (I put in the iterator, will update the perf in a minute)

@jreback jreback closed this Jul 10, 2014
Labels
Datetime (Datetime data dtype), Performance (Memory or execution speed performance)

Successfully merging this pull request may close these issues:

.iterrows takes too long and generate large memory footprint

2 participants