Iteration over DatetimeIndex stops at 10000 #21012

Closed
kittoku opened this issue May 11, 2018 · 8 comments · Fixed by #21027
Labels
Datetime (Datetime data dtype), Regression (Functionality that used to work in a prior pandas version)
Milestone

Comments

@kittoku
Contributor

kittoku commented May 11, 2018

Code Sample

import pandas as pd

num_index = pd.Index(range(20000))
hour_index = pd.date_range('2000-01-01', periods=20000, freq='H')
minute_index = pd.date_range('2000-01-01', periods=20000, freq='min')

a = 0
for i in num_index: # OK
    a += 1

b = 0
for i in hour_index:  # unexpectedly stops after 10000 iterations
    b += 1

c = 0
for i in minute_index:  # unexpectedly stops after 10000 iterations
    c += 1

a, b, c # -> (20000, 10000, 10000)

Problem description

Iteration over a DatetimeIndex stops unexpectedly after 10000 items, even when the index is longer.

NOTE: this happens in 0.23.0rc2, but not in 0.22.0.

Expected Output

(20000, 20000, 20000)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: ja_JP.UTF-8
LOCALE: ja_JP.UTF-8

pandas: 0.23.0rc2
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche
Member

Thanks for the report, can confirm this regression.

@jorisvandenbossche added the Datetime and Regression labels May 11, 2018
@jorisvandenbossche added this to the 0.23.0 milestone May 11, 2018
@jorisvandenbossche
Member

I suppose this is the cause: #18848

@kittoku
Contributor Author

kittoku commented May 12, 2018

Thanks for the comments, @jorisvandenbossche.

I'm trying to make a PR that adds a __next__ method, as shown below, but it still fails some tests.

class DatetimeIndex(DatelikeOps, TimelikeOps, DatetimeIndexOpsMixin,
    ...
    def __iter__(self):
        """
        Return an iterator over the boxed values

        Returns
        -------
        Timestamps : ndarray
        """

        # convert in chunks of 10k for efficiency
        chunksize = 10000
        q, r = divmod(len(self), chunksize)
        if r:
            self._chunk_lengths = (chunksize,) * q + (r, 0)
            self._stop_index = q + 1
        else:
            self._chunk_lengths = (chunksize,) * q + (0,)
            self._stop_index = q

        self._chunk_index = 0
        self._iter_index = 0

        return self

    def __next__(self):
        if self._chunk_index == self._stop_index:
            raise StopIteration

        if self._iter_index == 0:
            start_i = self._chunk_index * self._chunk_lengths[0]
            end_i = start_i + self._chunk_lengths[self._chunk_index]
            self._chunk = libts.ints_to_pydatetime(self.asi8[start_i:end_i],
                                                   tz=self.tz, freq=self.freq,
                                                   box="timestamp")

        val = self._chunk[self._iter_index]
        self._iter_index += 1

        if self._iter_index == self._chunk_lengths[self._chunk_index]:
            self._iter_index = 0
            self._chunk_index += 1

        return val

It is complicated, so any improvements or an alternative PR are welcome.
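One general consequence of returning self from __iter__ can be sketched with a pure-Python stand-in. SelfIteratingIndex below is a hypothetical class, not pandas code; it only illustrates that when the position lives on the object, two loops over the same object share it.

```python
class SelfIteratingIndex:
    """Hypothetical stand-in: keeps its iteration state on the object itself."""

    def __init__(self, values):
        self._values = list(values)
        self._pos = 0

    def __iter__(self):
        # Every call resets the shared position and hands back the same object.
        self._pos = 0
        return self

    def __next__(self):
        if self._pos >= len(self._values):
            raise StopIteration
        val = self._values[self._pos]
        self._pos += 1
        return val


idx = SelfIteratingIndex([1, 2, 3])

# The inner loop resets and then exhausts the shared position,
# so the outer loop ends after its first item.
shared_pairs = [(a, b) for a in idx for b in idx]

# A plain list hands out an independent iterator to each loop.
independent_pairs = [(a, b) for a in [1, 2, 3] for b in [1, 2, 3]]

print(len(shared_pairs), len(independent_pairs))  # prints: 3 9
```

Whether this matters for DatetimeIndex depends on how the index is used, but it is one reason containers usually hand out a fresh iterator per loop.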

@cbertinato
Contributor

Not that it's likely to be a huge performance hit, but you could save some space by only tracking the current chunk length instead of keeping a list of lengths:

class DatetimeIndex(DatelikeOps, TimelikeOps, DatetimeIndexOpsMixin,
    ...
    def __iter__(self):
        """
        Return an iterator over the boxed values

        Returns
        -------
        Timestamps : ndarray
        """

        # convert in chunks of 10k for efficiency
        self._iter_chunksize = 10000
        self._stop_index = int(len(self) / self._iter_chunksize)
        self._chunk_index = 0
        self._iter_index = 0
        return self

    def __next__(self):
        if self._chunk_index == self._stop_index:
            raise StopIteration

        if self._iter_index == 0:
            start_i = self._chunk_index * self._iter_chunksize
            end_i = min((self._chunk_index + 1) * self._iter_chunksize, len(self))
            self._chunk_len = end_i - start_i
            self._chunk = libts.ints_to_pydatetime(self.asi8[start_i:end_i],
                                                   tz=self.tz, freq=self.freq,
                                                   box="timestamp")

        val = self._chunk[self._iter_index]
        self._iter_index += 1

        if self._iter_index == self._chunk_len:
            self._iter_index = 0
            self._chunk_index += 1

        return val

@kittoku
Contributor Author

kittoku commented May 12, 2018

Thank you, @cbertinato. This is simpler. But let's keep divmod, because without it the loop cannot iterate over the whole index when r != 0.

# add this part
q, r = divmod(len(self), self._iter_chunksize)
self._stop_index = q + (1 if r else 0)
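The difference between the two chunk counts can be checked with plain arithmetic. This is a minimal sketch; the function names are made up for illustration and only the two formulas come from the comments above.

```python
def chunk_count_truncating(n, chunksize=10000):
    # Formula from the simplified proposal: drops a trailing partial chunk.
    return int(n / chunksize)


def chunk_count_with_remainder(n, chunksize=10000):
    # Formula using divmod: counts a trailing partial chunk as one more chunk.
    q, r = divmod(n, chunksize)
    return q + (1 if r else 0)


for n in (0, 9999, 10000, 10001, 25000):
    print(n, chunk_count_truncating(n), chunk_count_with_remainder(n))
```

For a length of 25000 the truncating version gives 2 chunks, so the last 5000 items would never be visited, while the divmod version gives 3.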

BTW, my change fails TestPanel.test_setitem, TestCategoricalRepr.test_categorical_index_repr_datetime, and TestCategoricalRepr.test_categorical_index_repr_datetime_ordered. It seems that adding __next__ causes the latter two failures, because of the following check.

# pandas/io/formats/printing.py, line 153
def pprint_thing(thing, _nest_lvl=0, escape_chars=None, default_escapes=False,
    ...
    if (compat.PY3 and hasattr(thing, '__next__')) or hasattr(thing, 'next'): # -> True
        return compat.text_type(thing)
    ...

I'm not sure, but I think this failure is a feature, not a bug.

As for the first one, TestPanel.test_setitem, I have no idea yet. Sorry.
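The effect of that hasattr check can be reproduced without pandas. In this sketch PlainIterable and SelfIterator are hypothetical classes, and looks_like_iterator only mirrors the Python 3 half of the condition quoted above; it is not the pandas function itself.

```python
from collections.abc import Iterable, Iterator


class PlainIterable:
    """Hypothetical container: iterable, but not itself an iterator."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        return iter(self._values)


class SelfIterator(PlainIterable):
    """Hypothetical container that also defines __next__."""

    def __iter__(self):
        self._pos = 0
        return self

    def __next__(self):
        if self._pos >= len(self._values):
            raise StopIteration
        val = self._values[self._pos]
        self._pos += 1
        return val


def looks_like_iterator(thing):
    # Mirrors the hasattr(thing, '__next__') test from the quoted snippet.
    return hasattr(thing, "__next__")


plain = PlainIterable([1, 2, 3])
selfish = SelfIterator([1, 2, 3])

print(looks_like_iterator(plain), looks_like_iterator(selfish))
print(isinstance(plain, Iterator), isinstance(selfish, Iterator))
```

Both objects are still Iterable; only the one with __next__ is treated as an iterator, which is why a code path that special-cases iterators starts to behave differently.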

pytest results:

___________________________ TestPanel.test_setitem ____________________________

self = <pandas.tests.test_panel.TestPanel object at 0x0000000008AED8D0>

    def test_setitem(self):
        with catch_warnings(record=True):

            # LongPanel with one item
            lp = self.panel.filter(['ItemA', 'ItemB']).to_frame()
            with pytest.raises(ValueError):
>               self.panel['ItemE'] = lp

pandas\tests\test_panel.py:484:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\panel.py:603: in __setitem__
    **self._construct_axes_dict_for_slice(self._AXIS_ORDERS[1:]))
pandas\util\_decorators.py:186: in wrapper
    return func(*args, **kwargs)
pandas\core\frame.py:3563: in reindex
    return super(DataFrame, self).reindex(**kwargs)
pandas\core\generic.py:3685: in reindex
    fill_value, copy).__finalize__(self)
pandas\core\frame.py:3498: in _reindex_axes
    fill_value, limit, tolerance)
pandas\core\frame.py:3506: in _reindex_index
    tolerance=tolerance)
pandas\core\indexes\multi.py:2104: in reindex
    target = MultiIndex.from_tuples(target)
pandas\core\indexes\multi.py:1352: in from_tuples
    arrays = list(lib.to_object_array_tuples(tuples).T)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   tmp = len(rows[i])
E   TypeError: object of type 'Timestamp' has no len()

pandas\_libs\src\inference.pyx:1559: TypeError

__________ TestCategoricalRepr.test_categorical_index_repr_datetime ___________

self = <pandas.tests.categorical.test_repr.TestCategoricalRepr object at 0x0000000008401358>

    def test_categorical_index_repr_datetime(self):
        idx = date_range('2011-01-01 09:00', freq='H', periods=5)
        i = CategoricalIndex(Categorical(idx))
        exp = """CategoricalIndex(['2011-01-01 09:00:00', '2011-01-01 10:00:00',
                      '2011-01-01 11:00:00', '2011-01-01 12:00:00',
                      '2011-01-01 13:00:00'],
                     categories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=False, dtype='category')"""  # noqa

>       assert repr(i) == exp
E       assert "CategoricalI...e='category')" == "CategoricalIn...e='category')"
E         Skipping 188 identical leading characters in diff, use -v to show
E         + ategories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=False, dtype='category')
E         - ategories=DatetimeIndex(['2011-01-01 09:00:00', '2011-01-01 10:00:00',
E         -                '2011-01-01 11:00:00', '2011-01-01 12:00:00',
E         -                '2011-01-01 13:00:00'],
E         -               dtype='datetime64[ns]', freq='H'), ordered=False, dtype='category')

pandas\tests\categorical\test_repr.py:392: AssertionError

______ TestCategoricalRepr.test_categorical_index_repr_datetime_ordered _______

self = <pandas.tests.categorical.test_repr.TestCategoricalRepr object at 0x0000000008401208>

    def test_categorical_index_repr_datetime_ordered(self):
        idx = date_range('2011-01-01 09:00', freq='H', periods=5)
        i = CategoricalIndex(Categorical(idx, ordered=True))
        exp = """CategoricalIndex(['2011-01-01 09:00:00', '2011-01-01 10:00:00',
                      '2011-01-01 11:00:00', '2011-01-01 12:00:00',
                      '2011-01-01 13:00:00'],
                     categories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=True, dtype='category')"""  # noqa

>       assert repr(i) == exp
E       assert "CategoricalI...e='category')" == "CategoricalIn...e='category')"
E         Skipping 188 identical leading characters in diff, use -v to show
E         + ategories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=True, dtype='category')
E         - ategories=DatetimeIndex(['2011-01-01 09:00:00', '2011-01-01 10:00:00',
E         -                '2011-01-01 11:00:00', '2011-01-01 12:00:00',
E         -                '2011-01-01 13:00:00'],
E         -               dtype='datetime64[ns]', freq='H'), ordered=True, dtype='category')
pandas\tests\categorical\test_repr.py:412: AssertionError

@kittoku closed this as completed May 12, 2018
@kittoku reopened this May 12, 2018
@kittoku
Contributor Author

kittoku commented May 13, 2018

This looks like the cause of the TestPanel.test_setitem failure.

# pandas\core\indexes\multi.py
def from_tuples(cls, tuples, sortorder=None, names=None):
    ...
    if not is_list_like(tuples):
        raise TypeError('Input must be a list / sequence of tuple-likes.')
    elif is_iterator(tuples): # -> unexpected True
        tuples = list(tuples)

It's a naive approach, but adding "and not isinstance(tuples, DatetimeIndex)" to the elif condition works.
I guess there should be a better way to tell one-dimensional iterators from two-dimensional ones.

@cbertinato
Contributor

cbertinato commented May 13, 2018

The issue is not so much about dimensionality as it is about the identity of the index as an iterator. This leads to a somewhat deeper question: does it matter whether the index itself is an iterator? Perhaps the answer is: if the tests pass, it doesn't matter.

If it does matter, then another solution is to make a separate iterator for indexes, instead of returning the index itself from __iter__.

Anyhow, something along the lines of what you suggest would fix it, though something a bit more general, such as not isinstance(tuples, Index) might be better in case something like this crops up for other indexes.
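The separate-iterator idea can be sketched as follows. ChunkedIndex and its _box placeholder are hypothetical and stand in for the index and its conversion step; this is an illustration of the design under those assumptions, not the code of any actual PR.

```python
from collections.abc import Iterator


class ChunkedIndex:
    """Hypothetical stand-in for an index that converts values chunk by chunk."""

    def __init__(self, values, chunksize=10000):
        self._values = list(values)
        self._chunksize = chunksize

    def __len__(self):
        return len(self._values)

    def _box(self, chunk):
        # Placeholder for the real conversion step (assumption for this sketch).
        return [("boxed", v) for v in chunk]

    def __iter__(self):
        # A generator function: each call returns a fresh, independent iterator,
        # so the index itself never needs a __next__ method.
        for start in range(0, len(self._values), self._chunksize):
            for item in self._box(self._values[start:start + self._chunksize]):
                yield item


idx = ChunkedIndex(range(25000))
count = sum(1 for _ in idx)

small = ChunkedIndex([1, 2, 3], chunksize=2)
pairs = [(a, b) for a in small for b in small]

print(count, isinstance(idx, Iterator), len(pairs))
```

Because the iteration state lives in the generator rather than on the index, checks that look for __next__ on the object itself see an ordinary iterable, and nested loops over the same object do not interfere with each other.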

@kittoku
Contributor Author

kittoku commented May 13, 2018

A separate iterator sounds good.
All three tests passed with it, because DatetimeIndex itself is then not an iterator.
I'm going to post a PR later.


4 participants