
BUG: DatetimeIndex.__iter__ creates a temp array of Timestamp (GH7683) #7709


Closed
wants to merge 1 commit

Conversation

yrlihuan

@yrlihuan yrlihuan commented Jul 9, 2014

closes #7683

Move __iter__ from DatetimeIndex/PeriodIndex to DatetimeIndexOpsMixin
Add perf tests for the change

-------------------------------------------------------------------------------
Test name                             |  head[ms] |  base[ms] |  ratio |
-------------------------------------------------------------------------------
timeseries_iter_datetimeindex_preexit |   35.6683 | 2546.9604 | 0.0140 |
timeseries_iter_periodindex           | 5051.6427 | 4353.0200 | 1.1605 |
timeseries_iter_periodindex_preexit   |   50.9930 |   43.8580 | 1.1627 |
timeseries_iter_datetimeindex         | 3463.0601 | 2596.2270 | 1.3339 |
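To illustrate why the preexit ratio drops so sharply, here is a minimal pure-Python sketch (not pandas code; `Box`, `eager_iter`, and `lazy_iter` are hypothetical names): an eager iterator boxes every element before yielding anything, while a lazy one boxes on demand, so an early exit skips almost all of the boxing work.

```python
import itertools

class Box:
    """Stand-in for an expensive per-element wrapper such as Timestamp."""
    def __init__(self, value):
        self.value = value

def eager_iter(values):
    # Boxes the whole array up front before yielding (old behaviour).
    return iter([Box(v) for v in values])

def lazy_iter(values):
    # Boxes one element at a time, so early exit skips the rest.
    return (Box(v) for v in values)

# Taking only the first 100 elements from the lazy iterator performs
# 100 Box constructions; the eager one would still construct all of them.
first = list(itertools.islice(lazy_iter(range(1_000_000)), 100))
```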

@jreback jreback added this to the 0.15.0 milestone Jul 9, 2014
@jreback

jreback commented Jul 9, 2014

Great! Please also post the result of the vbench in the top section.

idx1 = date_range(start='20140101', freq='T', periods=N)
idx2 = period_range(start='20140101', freq='T', periods=N)

def iter_n(iterable, n=None):
    # consume up to n elements (all of them when n is None)
    i = 0
    for _ in iterable:
        i += 1
        if n is not None and i >= n:
            break
you don't need this, the vbench will run on different versions

@jreback

jreback commented Jul 9, 2014

hmm, so time increases for the iteration? I get the early exit issue.

@yrlihuan

yrlihuan commented Jul 9, 2014

From the vbench results, iteration over a PeriodIndex gets about 16% slower. The slowdown is consistent for full and partial iteration, so I guess it comes from the change to using asi8.

Full iteration over a DatetimeIndex gets about 33% slower, and that's probably because vectorization is no longer used.

@jreback

jreback commented Jul 9, 2014

ok, I think that for DatetimeIndex this could be improved quite substantially by doing the following:

replace core/base/DatetimeIndexOpsMixin/_box_values, which calls lib.map_infer with Timestamp, with a Cython routine that does the same thing (e.g. create the list in Cython, fill it with Timestamps, and return it).

There is quite a bit of overhead in going back and forth between Python and Cython. This won't work for Period (as that is defined in Python).

E.g. take an approach like to_pydatetime, which almost does this.
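A pure-Python sketch of what the suggested boxing routine would do (the real proposal is a Cython loop over the raw i8 values; `box_values` and the `box` callable here are illustrative stand-ins for the Timestamp construction):

```python
def box_values(values, box):
    # Single pass over the raw values, returning a list of boxed objects.
    # In the real proposal this loop would live in Cython, so the
    # Python/C boundary is crossed once per array, not once per element.
    out = [None] * len(values)
    for i, v in enumerate(values):
        out[i] = box(v)
    return out

# Usage with an arbitrary boxing callable (str stands in for Timestamp):
boxed = box_values([1, 2, 3], box=str)
```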

@jreback

jreback commented Jul 9, 2014

then, the call lib.map_infer(values, lib.Timestamp) occurs in a number of places; those can be fixed afterwards,

e.g.
list(DatetimeIndex(values))

@jreback

jreback commented Jul 9, 2014

to_pydatetime is MUCH faster (and effectively equivalent; we just need to preserve the offset, which is no biggie). It might be possible to create Timestamps/datetimes in the same routine (with some small adjustments).

In [1]: idx = date_range('20130101',periods=1000000,freq='s')

In [4]: %timeit idx.to_pydatetime()
1 loops, best of 3: 366 ms per loop

In [5]: %timeit [ Timestamp(x) for x in idx.to_pydatetime() ]
1 loops, best of 3: 1.41 s per loop

In [6]: %timeit list(iter(idx))
1 loops, best of 3: 3.64 s per loop

In [7]: x = [ Timestamp(x) for x in idx.to_pydatetime() ]

In [8]: y = list(iter(idx))
In [9]: x[0]
Out[9]: Timestamp('2013-01-01 00:00:00')

In [10]: y[0]
Out[10]: Timestamp('2013-01-01 00:00:00', offset='S')

@yrlihuan

yrlihuan commented Jul 9, 2014

That's a real improvement. I'd love to see that happen, because Timestamp creation can be a real pain.

But I'm not familiar with Cython, and that's beyond me.

@jreback

jreback commented Jul 9, 2014

This is from tslib.Timestamp.__new__; you can almost directly do this:

        # make datetime happy
        ts_base = _Timestamp.__new__(cls, ts.dts.year, ts.dts.month,
                                     ts.dts.day, ts.dts.hour, ts.dts.min,
                                     ts.dts.sec, ts.dts.us, ts.tzinfo)

        # fill out rest of data
        ts_base.value = ts.value
        ts_base.offset = offset
        ts_base.nanosecond = ts.dts.ps / 1000

        return ts_base

@jreback

jreback commented Jul 9, 2014

@yrlihuan ok, no problem... take a stab if you'd like (Cython is actually pretty much like Python code, just with types, and a bit harder to debug).

If not, I don't think it's hard to do. It's always a learning experience!

@jreback

jreback commented Jul 10, 2014

see #7720. I think it's possible to do some sort of chunked conversion, e.g. convert 10000 elements at a time, hand them to the iterator, and so on. This would allow early exit to work and at the same time make the whole process quite a bit faster.

want to take a stab at fixing that? (it's all in Python space)
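A minimal sketch of the chunked-conversion idea, in plain Python with hypothetical names (`chunked_iter`, `boxer`): convert a block of elements at a time and yield from it, so an early exit only pays for the chunks it actually consumed, while each chunk can still be converted with the fast batch routine.

```python
import itertools

def chunked_iter(values, boxer=list, chunksize=10000):
    # Convert `chunksize` elements at a time with the (fast, batch)
    # boxer, then yield them one by one. An early exit after k elements
    # converts only the chunks containing those k elements instead of
    # the whole array.
    for start in range(0, len(values), chunksize):
        for item in boxer(values[start:start + chunksize]):
            yield item

data = list(range(25_000))

# Full iteration still sees every element, in order.
full = list(chunked_iter(data, chunksize=10_000))

# Early exit after 5 elements converts only the first chunk.
head = list(itertools.islice(chunked_iter(data, chunksize=10_000), 5))
```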

@jreback

jreback commented Jul 10, 2014

You can do almost exactly this for a chunked iterator : https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L1316

@jreback

jreback commented Jul 10, 2014

closing in favor of #7720 (I put in the iterator, will update the perf in a minute)

@jreback jreback closed this Jul 10, 2014
Labels
Datetime (Datetime data dtype), Performance (Memory or execution speed performance)

Successfully merging this pull request may close these issues:

.iterrows takes too long and generate large memory footprint

2 participants