Iteration over DatetimeIndex stops at 10000 #21012

kittoku · 2018-05-11T09:18:40Z

Code Sample

import pandas as pd

num_index = pd.Index(range(20000))
hour_index = pd.date_range('2000-01-01', periods=20000, freq='H')
minute_index = pd.date_range('2000-01-01', periods=20000, freq='min')

a = 0
for i in num_index: # OK
    a += 1

b = 0
for i in hour_index: # unexpectedly stops after 10000 items
    b += 1

c = 0
for i in minute_index: # unexpectedly stops after 10000 items
    c += 1

a, b, c # -> (20000, 10000, 10000)

Problem description

Iteration over a DatetimeIndex stops unexpectedly after 10000 items, even when the index is longer.

NOTE: this happens in 0.23.0rc2, but not in 0.22.0.

Expected Output

(20000, 20000, 20000)

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: ja_JP.UTF-8
LOCALE: ja_JP.UTF-8

pandas: 0.23.0rc2
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


jorisvandenbossche · 2018-05-11T10:23:07Z

Thanks for the report, can confirm this regression.

jorisvandenbossche · 2018-05-11T10:33:57Z

I suppose this is the cause: #18848

kittoku · 2018-05-12T06:01:43Z

Thanks for the comments, @jorisvandenbossche.

I'm trying to make a PR that adds a __next__ method, as shown below, but it still fails some tests.

class DatetimeIndex(DatelikeOps, TimelikeOps, DatetimeIndexOpsMixin,
    ...
    def __iter__(self):
        """
        Return an iterator over the boxed values

        Returns
        -------
        Timestamps : ndarray
        """

        # convert in chunks of 10k for efficiency
        chunksize = 10000
        q, r = divmod(len(self), chunksize)
        if r:
            self._chunk_lengths = (chunksize,) * q + (r, 0)
            self._stop_index = q + 1
        else:
            self._chunk_lengths = (chunksize,) * q + (0,)
            self._stop_index = q

        self._chunk_index = 0
        self._iter_index = 0

        return self

    def __next__(self):
        if self._chunk_index == self._stop_index:
            raise StopIteration

        if self._iter_index == 0:
            start_i = self._chunk_index * self._chunk_lengths[0]
            end_i = start_i + self._chunk_lengths[self._chunk_index]
            self._chunk = libts.ints_to_pydatetime(self.asi8[start_i:end_i],
                                                   tz=self.tz, freq=self.freq,
                                                   box="timestamp")

        val = self._chunk[self._iter_index]
        self._iter_index += 1

        if self._iter_index == self._chunk_lengths[self._chunk_index]:
            self._iter_index = 0
            self._chunk_index += 1

        return val

It is complicated, so any improvements or an alternative PR are welcome.

cbertinato · 2018-05-12T14:18:42Z

Not that it's likely to be a huge performance hit, but you could save some space by only tracking the current chunk length instead of keeping a list of lengths:

class DatetimeIndex(DatelikeOps, TimelikeOps, DatetimeIndexOpsMixin,
    ...
    def __iter__(self):
        """
        Return an iterator over the boxed values

        Returns
        -------
        Timestamps : ndarray
        """

        # convert in chunks of 10k for efficiency
        self._iter_chunksize = 10000
        self._stop_index = int(len(self) / self._iter_chunksize)
        self._chunk_index = 0
        self._iter_index = 0
        return self

    def __next__(self):
        if self._chunk_index == self._stop_index:
            raise StopIteration

        if self._iter_index == 0:
            start_i = self._chunk_index * self._iter_chunksize
            end_i = min((self._chunk_index + 1) * self._iter_chunksize, len(self))
            self._chunk_len = end_i - start_i
            self._chunk = libts.ints_to_pydatetime(self.asi8[start_i:end_i],
                                                   tz=self.tz, freq=self.freq,
                                                   box="timestamp")

        val = self._chunk[self._iter_index]
        self._iter_index += 1

        if self._iter_index == self._chunk_len:
            self._iter_index = 0
            self._chunk_index += 1

        return val

kittoku · 2018-05-12T22:10:32Z

Thank you, @cbertinato. This is simpler. But the divmod should stay, because otherwise iteration does not cover the whole index when r != 0.

# add this part
q, r = divmod(len(self), self._iter_chunksize)
self._stop_index = q + (1 if r else 0)
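
To see why the truncating division falls short, here is a small standalone sketch in plain Python (no pandas; the two helper names are made up for illustration) that compares both ways of counting chunks:

```python
def chunks_truncating(n, chunksize=10000):
    # int(n / chunksize) silently drops a trailing partial chunk
    return int(n / chunksize)


def chunks_divmod(n, chunksize=10000):
    # count one extra chunk whenever there is a remainder
    q, r = divmod(n, chunksize)
    return q + (1 if r else 0)


for n in (0, 5000, 20000, 25000):
    print(n, chunks_truncating(n), chunks_divmod(n))
```

For a length of 25000 the truncating version counts 2 chunks and would never reach the last 5000 items, while the divmod version counts 3. The two agree only when the length is an exact multiple of the chunk size.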

BTW, my change fails TestPanel.test_setitem, TestCategoricalRepr.test_categorical_index_repr_datetime and TestCategoricalRepr.test_categorical_index_repr_datetime_ordered. Adding __next__ seems to cause the latter two, because of the following check.

# pandas/io/formats/printing.py line 153
def pprint_thing(thing, _nest_lvl=0, escape_chars=None, default_escapes=False,
    ...
    if (compat.PY3 and hasattr(thing, '__next__')) or hasattr(thing, 'next'): # -> True
        return compat.text_type(thing)
    ...

I'm not sure, but I think this failure is a feature, not a bug.
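
A standalone sketch of what that attribute test reacts to. The function name below is made up for illustration and only mirrors the Python 3 branch of the quoted condition; it is not the pandas implementation.

```python
def treated_as_iterator(thing):
    # same attribute test as the quoted line, Python 3 branch only
    return hasattr(thing, '__next__')


values = [1, 2, 3]
list_result = treated_as_iterator(values)            # a list is iterable, not an iterator
iterator_result = treated_as_iterator(iter(values))  # its iterator does define __next__
print(list_result, iterator_result)
```

Any object that gains a __next__ method starts taking this branch, which would explain why the repr of the categories changed once DatetimeIndex defined one.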

And now I have no idea about the former one, TestPanel.test_setitem. Sorry.

pytest results:

___________________________ TestPanel.test_setitem ____________________________

self = <pandas.tests.test_panel.TestPanel object at 0x0000000008AED8D0>

    def test_setitem(self):
        with catch_warnings(record=True):

            # LongPanel with one item
            lp = self.panel.filter(['ItemA', 'ItemB']).to_frame()
            with pytest.raises(ValueError):
>               self.panel['ItemE'] = lp

pandas\tests\test_panel.py:484:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\panel.py:603: in __setitem__
    **self._construct_axes_dict_for_slice(self._AXIS_ORDERS[1:]))
pandas\util\_decorators.py:186: in wrapper
    return func(*args, **kwargs)
pandas\core\frame.py:3563: in reindex
    return super(DataFrame, self).reindex(**kwargs)
pandas\core\generic.py:3685: in reindex
    fill_value, copy).__finalize__(self)
pandas\core\frame.py:3498: in _reindex_axes
    fill_value, limit, tolerance)
pandas\core\frame.py:3506: in _reindex_index
    tolerance=tolerance)
pandas\core\indexes\multi.py:2104: in reindex
    target = MultiIndex.from_tuples(target)
pandas\core\indexes\multi.py:1352: in from_tuples
    arrays = list(lib.to_object_array_tuples(tuples).T)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   tmp = len(rows[i])
E   TypeError: object of type 'Timestamp' has no len()

pandas\_libs\src\inference.pyx:1559: TypeError

__________ TestCategoricalRepr.test_categorical_index_repr_datetime ___________

self = <pandas.tests.categorical.test_repr.TestCategoricalRepr object at 0x0000000008401358>

    def test_categorical_index_repr_datetime(self):
        idx = date_range('2011-01-01 09:00', freq='H', periods=5)
        i = CategoricalIndex(Categorical(idx))
        exp = """CategoricalIndex(['2011-01-01 09:00:00', '2011-01-01 10:00:00',
                      '2011-01-01 11:00:00', '2011-01-01 12:00:00',
                      '2011-01-01 13:00:00'],
                     categories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=False, dtype='category')"""  # noqa

>       assert repr(i) == exp
E       assert "CategoricalI...e='category')" == "CategoricalIn...e='category')"
E         Skipping 188 identical leading characters in diff, use -v to show
E         + ategories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=False, dtype='category')
E         - ategories=DatetimeIndex(['2011-01-01 09:00:00', '2011-01-01 10:00:00',
E         -                '2011-01-01 11:00:00', '2011-01-01 12:00:00',
E         -                '2011-01-01 13:00:00'],
E         -               dtype='datetime64[ns]', freq='H'), ordered=False, dtype='category')

pandas\tests\categorical\test_repr.py:392: AssertionError

______ TestCategoricalRepr.test_categorical_index_repr_datetime_ordered _______

self = <pandas.tests.categorical.test_repr.TestCategoricalRepr object at 0x0000000008401208>

    def test_categorical_index_repr_datetime_ordered(self):
        idx = date_range('2011-01-01 09:00', freq='H', periods=5)
        i = CategoricalIndex(Categorical(idx, ordered=True))
        exp = """CategoricalIndex(['2011-01-01 09:00:00', '2011-01-01 10:00:00',
                      '2011-01-01 11:00:00', '2011-01-01 12:00:00',
                      '2011-01-01 13:00:00'],
                     categories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=True, dtype='category')"""  # noqa

>       assert repr(i) == exp
E       assert "CategoricalI...e='category')" == "CategoricalIn...e='category')"
E         Skipping 188 identical leading characters in diff, use -v to show
E         + ategories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=True, dtype='category')
E         - ategories=DatetimeIndex(['2011-01-01 09:00:00', '2011-01-01 10:00:00',
E         -                '2011-01-01 11:00:00', '2011-01-01 12:00:00',
E         -                '2011-01-01 13:00:00'],
E         -               dtype='datetime64[ns]', freq='H'), ordered=True, dtype='category')
pandas\tests\categorical\test_repr.py:412: AssertionError

kittoku · 2018-05-13T11:04:34Z

This looks like the cause of the TestPanel.test_setitem failure.

# pandas\core\indexes\multi.py
def from_tuples(cls, tuples, sortorder=None, names=None):
    ...
    if not is_list_like(tuples):
        raise TypeError('Input must be a list / sequence of tuple-likes.')
    elif is_iterator(tuples): # -> unexpected True
        tuples = list(tuples)

It's a naive approach, but adding `and not isinstance(tuples, DatetimeIndex)` to the elif works.
I guess there should be a better way to tell 1-dim iterators from 2-dim ones.

cbertinato · 2018-05-13T17:49:27Z

The issue is not so much about dimensionality as it is about the identity of the index as an iterator. This leads to a somewhat deeper question: does it matter whether the index itself is an iterator? Perhaps the answer is: if the tests pass, it doesn't matter.
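
One concrete way it can matter, shown with a hypothetical stand-in class rather than pandas code: when a container is its own iterator, every loop over it shares a single cursor, so nested loops interfere with each other.

```python
from collections.abc import Iterator


class SelfIterating:
    """Hypothetical container that returns itself from __iter__."""

    def __init__(self, values):
        self._values = list(values)
        self._pos = 0

    def __iter__(self):
        # resets the one shared cursor on every call
        self._pos = 0
        return self

    def __next__(self):
        if self._pos >= len(self._values):
            raise StopIteration
        val = self._values[self._pos]
        self._pos += 1
        return val


idx = SelfIterating([1, 2, 3])
# the inner loop exhausts the shared cursor, so the outer loop ends early
shared_pairs = [(a, b) for a in idx for b in idx]

plain = [1, 2, 3]
# a list hands out a fresh iterator per loop, giving the full product
list_pairs = [(a, b) for a in plain for b in plain]

print(len(shared_pairs), len(list_pairs))
print(isinstance(idx, Iterator), isinstance(plain, Iterator))
```

Here the self-iterating container yields only 3 pairs where a list yields 9, and it also passes an isinstance check against collections.abc.Iterator, which is the kind of identity the failing tests tripped over.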

If it does matter, then another solution is to make a separate iterator for indexes, instead of returning the index itself from __iter__.

Anyhow, something along the lines of what you suggest would fix it, though a more general check, such as `not isinstance(tuples, Index)`, might be better in case something like this crops up for other indexes.
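
A minimal sketch of the separate-iterator idea, assuming a generator is an acceptable way to build it. ChunkedIndex and _convert are hypothetical names used only for illustration; _convert stands in for the per-chunk conversion step and this is not the pandas implementation.

```python
class ChunkedIndex:
    """Hypothetical stand-in for an index that converts values in chunks."""

    def __init__(self, values, chunksize=10000):
        self._values = list(values)
        self._chunksize = chunksize

    def __len__(self):
        return len(self._values)

    def _convert(self, chunk):
        # placeholder for the expensive per-chunk boxing step
        return list(chunk)

    def __iter__(self):
        # generator function: each call returns a new, independent iterator,
        # so the index itself never needs a __next__ method
        for start in range(0, len(self), self._chunksize):
            converted = self._convert(self._values[start:start + self._chunksize])
            for val in converted:
                yield val


idx = ChunkedIndex(range(25000))
count = sum(1 for _ in idx)
print(count, hasattr(idx, '__next__'))
```

Because range(0, len(self), chunksize) also covers a trailing partial chunk, no separate remainder handling is needed, and the object stays a plain iterable that can be looped over repeatedly or in nested loops.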

kittoku · 2018-05-13T20:47:58Z

A separate iterator sounds good.
All three tests pass, because DatetimeIndex itself is then no longer an iterator.
I'm going to post a PR later.

jorisvandenbossche added the Datetime (Datetime data dtype) and Regression (Functionality that used to work in a prior pandas version) labels May 11, 2018

jorisvandenbossche added this to the 0.23.0 milestone May 11, 2018

kittoku closed this as completed May 12, 2018

kittoku reopened this May 12, 2018

kittoku mentioned this issue May 14, 2018

BUG: Iteration over DatetimeIndex stops at chunksize (GH21012) #21027

Merged

4 tasks

jreback modified the milestones: 0.23.0, 0.23.1 May 14, 2018

jreback closed this as completed in #21027 May 15, 2018
