BUG: Iteration over DatetimeIndex stops at chunksize (GH21012) #21027

kittoku · 2018-05-14T01:02:31Z

closes Iteration over DatetimeIndex stops at 10000 #21012
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

avoided using yield and changing DatetimeIndex itself into a iterator. @cbertinato 's advice was helpful.

codecov · 2018-05-14T02:24:15Z

Codecov Report

Merging #21027 into master will decrease coverage by <.01%.
The diff coverage is 96.66%.

@@            Coverage Diff             @@
##           master   #21027      +/-   ##
==========================================
- Coverage   91.82%   91.82%   -0.01%     
==========================================
  Files         153      153              
  Lines       49502    49523      +21     
==========================================
+ Hits        45457    45475      +18     
- Misses       4045     4048       +3

Flag	Coverage Δ
#multiple	`90.22% <96.66%> (-0.01%)`	⬇️
#single	`41.9% <96.66%> (+0.02%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/indexes/datetimes.py	`95.76% <96.66%> (-0.02%)`	⬇️
pandas/util/testing.py	`84.38% <0%> (-0.21%)`	⬇️
pandas/io/parsers.py	`95.46% <0%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d007c2d...ec651d7. Read the comment docs.

cbertinato · 2018-05-14T12:12:56Z

Couple of comments:

Think about changing the name of the iterator class to something a bit less general than Iterator.
Reviewers will eventually ask, but you will need a whatsnew entry.

jreback · 2018-05-14T12:28:15Z

pandas/core/indexes/datetimes.py

@@ -172,6 +172,50 @@ def _new_DatetimeIndex(cls, d):
    return result


+class Iterator:
+    def __init__(self, index):
+        self.data = index.asi8


we already have a BaseIterator class in pandas.io.common, it would need to be moved to pandas.core.common to use it here

but why are you creating a class for this anyhow?

Yes, and I'd prefer to keep this in DatetimeIndex.__iter__ unless it's going to be used elsewhere.

If DatetimeIndex class has __next__ method and __iter__ returns itself, some other codes see DatetimeIndex as a iterator rather than a sequence. This cause some failures (#21012). So I wanted to hide __next__ somehow.

I found BaseIterator. But with it, it would become complicated because we have to make a subclass or override methods of a instance of BaseIterator (is this correct?).

I think that most of the state tracking can be done inside DatetimeIndex.__iter__. Unless I'm misunderstanding, there's no need or desire to add DatetimeIndex.__next__.

hmm... sorry, but I can't come up with the way using only __iter__ and avoiding yield (#18848).

TomAugspurger

This will also need a release note (for 0.23.0)

TomAugspurger · 2018-05-14T13:31:26Z

pandas/tests/indexes/datetimes/test_datetime.py

+        index = date_range('2000-01-01 00:00:00', periods=periods, freq='min')
+        num = 0
+        for i, d in enumerate(index):
+            assert index[num] == d


Use i instead of num?

I wanted to check boundary values including periods == 0.
Without num, it would be a bit complicated, isn't it ?

That's fine then, though you can remove the enumerate since i isn't used anywhere I think.

TomAugspurger · 2018-05-14T13:33:23Z

pandas/core/indexes/datetimes.py

@@ -172,6 +172,50 @@ def _new_DatetimeIndex(cls, d):
    return result


+class Iterator:
+    def __init__(self, index):
+        self.data = index.asi8


Yes, and I'd prefer to keep this in DatetimeIndex.__iter__ unless it's going to be used elsewhere.

kittoku · 2018-05-14T19:50:13Z

I'm not sure where to write a whatsnew (this is not a bug in 0.22.0).
Please tell me.

cbertinato · 2018-05-14T20:01:18Z

Good point on the whatsnew. I suppose it's not necessary in this case.

jorisvandenbossche · 2018-05-14T20:20:17Z

Yes, whatsnew is not necessary as it was not a bug before.

IMO, this PR adds a lot of complexity, but probably necessary if we want to keep both:

the optimization to iterate over chunks (PERF: improve perf of index iteration (GH7683) #7720)
the removal of yield to please dill (Make DatetimeIndex iterator pickleable by dill #18848)

I don't know how important either of those are (eg couldn't this be fixed in dill itself? Other pickle libraries have no problem with it?), so difficult to assess for me.
If one of the above restrictions could be removed, we wouldn't need this added complexity of returning a custom iterator object

cbertinato · 2018-05-14T20:30:43Z

An alternative is the original fix proposed by @kittoku which entails making DateTimeIndex into an iterator and adding a condition in from_tuples to avoid the code breaking the test. It is a kind of hack, but it does preclude the need to use a custom iterator.

kittoku · 2018-05-14T23:00:07Z

So there are 4 options from what I see now.

1 : DatetimeIndex.__iter__ returns iter(self)
good : the simplest , dill can pickle bad : low performance

2 : DatetimeIndex.__iter__ returns generator using yield (like below).
good : simpler bad : dill can't pickle

    # from GH7720
    def __iter__(self):
        """
        Return an iterator over the boxed values
        Returns
        -------
        Timestamps : ndarray
        """

        # convert in chunks of 10k for efficiency
        data = self.asi8
        l = len(self)
        chunksize = 10000
        chunks = int(l / chunksize) + 1
        for i in range(chunks):
            start_i = i*chunksize
            end_i = min((i+1)*chunksize,l)
            converted = tslib.ints_to_pydatetime(data[start_i:end_i], tz=self.tz, offset=self.offset, box=True)
            for v in converted:
            yield v

3 : DatetimeIndex.__iter__ returns a separate iterator (like this PR).
good : not need to edit other codes , dill can pickle bad : complex, need to make a class

4 : DatetimeIndex.__iter__ returns itself as a iterator and edit other codes (like below).
good : dill can pickle bad : complex
suggestions in #21012 and

#pandas\core\indexes\multi.py
def from_tuples(cls, tuples, sortorder=None, names=None):
    ...
    if not is_list_like(tuples):
        raise TypeError('Input must be a list / sequence of tuple-likes.')
    elif is_iterator(tuples): # add not isinstance(tuples, Index) here
        tuples = list(tuples)

#pandas/io/formats/printing.py
# exchange `if` and `elif` to catch DatetimeIndex as sequence before assumed as iterator
def pprint_thing(thing, _nest_lvl=0, escape_chars=None, default_escapes=False,
    ...
    if (compat.PY3 and hasattr(thing, '__next__')) or hasattr(thing, 'next'):
        return compat.text_type(thing)
    elif (isinstance(thing, dict) and
          _nest_lvl < get_option("display.pprint_nest_depth")):
        result = _pprint_dict(thing, _nest_lvl, quote_strings=True,
                              max_seq_items=max_seq_items)
    elif (is_sequence(thing) and
          _nest_lvl < get_option("display.pprint_nest_depth")):
        result = _pprint_seq(thing, _nest_lvl, escape_chars=escape_chars,
                             quote_strings=quote_strings,
                             max_seq_items=max_seq_items)

I'm not in need to pickle by dill, so option 2 is best to me. I thought option 3 was the second. But now option 4 looks not bad because an idea of DatetimeIndex as iterator seems not so strange.

jreback · 2018-05-14T23:39:28Z

sure this is a bug, but not a blocker for 0.23.0

kittoku · 2018-05-15T01:14:59Z

It is odd of me to say that, but could we discard my PR and cancel a commit of #18848 (take option 2)?

I can't find a reason to pickle DatetimeIndex.__iter__() while we can pickle DatetimeIndex.
I want to hear others' opinions.

jreback · 2018-05-15T01:23:08Z

@kittoku that's a good point. We don't even have a test for that , and to be honest most folks use cloudpickle anyhow.

kittoku · 2018-05-15T01:49:51Z

May I leave the test?

jreback · 2018-05-15T02:28:22Z

yes test is ok

jreback · 2018-05-15T10:03:10Z

thanks @kittoku

yeah the original (not-released) patched wasn't doc-ced / tested, and unclear how this actually solves a problem, while breaking this.

jreback requested changes May 14, 2018

View reviewed changes

TomAugspurger reviewed May 14, 2018

View reviewed changes

TomAugspurger added this to the 0.23.0 milestone May 14, 2018

TomAugspurger added Blocker Blocking issue or pull request for an upcoming release Bug labels May 14, 2018

jreback removed this from the 0.23.0 milestone May 14, 2018

jreback removed the Blocker Blocking issue or pull request for an upcoming release label May 14, 2018

jreback added this to the 0.23.0 milestone May 15, 2018

BUG: Iteration over DatetimeIndex stops at chunksize (GH21012)

ec651d7

jreback approved these changes May 15, 2018

View reviewed changes

jreback merged commit 186ce4e into pandas-dev:master May 15, 2018

jorisvandenbossche mentioned this pull request May 15, 2018

Make DatetimeIndex iterator pickleable by dill #18848

Merged

4 tasks

kittoku deleted the add_iterator branch June 25, 2018 20:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Iteration over DatetimeIndex stops at chunksize (GH21012) #21027

BUG: Iteration over DatetimeIndex stops at chunksize (GH21012) #21027

kittoku commented May 14, 2018

codecov bot commented May 14, 2018 •

edited

Loading

cbertinato commented May 14, 2018

jreback May 14, 2018

jreback May 14, 2018

TomAugspurger May 14, 2018

kittoku May 14, 2018

TomAugspurger May 14, 2018

kittoku May 14, 2018

TomAugspurger left a comment

TomAugspurger May 14, 2018

kittoku May 14, 2018

TomAugspurger May 14, 2018

TomAugspurger May 14, 2018

kittoku commented May 14, 2018

cbertinato commented May 14, 2018

jorisvandenbossche commented May 14, 2018

cbertinato commented May 14, 2018

kittoku commented May 14, 2018

jreback commented May 14, 2018

kittoku commented May 15, 2018

jreback commented May 15, 2018

kittoku commented May 15, 2018

jreback commented May 15, 2018

jreback commented May 15, 2018

BUG: Iteration over DatetimeIndex stops at chunksize (GH21012) #21027

BUG: Iteration over DatetimeIndex stops at chunksize (GH21012) #21027

Conversation

kittoku commented May 14, 2018

codecov bot commented May 14, 2018 • edited Loading

Codecov Report

cbertinato commented May 14, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kittoku commented May 14, 2018

cbertinato commented May 14, 2018

jorisvandenbossche commented May 14, 2018

cbertinato commented May 14, 2018

kittoku commented May 14, 2018

jreback commented May 14, 2018

kittoku commented May 15, 2018

jreback commented May 15, 2018

kittoku commented May 15, 2018

jreback commented May 15, 2018

jreback commented May 15, 2018

codecov bot commented May 14, 2018 •

edited

Loading