Skip to content

BUG: Iteration over DatetimeIndex stops at chunksize (GH21012) #21027

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 15, 2018
Merged

BUG: Iteration over DatetimeIndex stops at chunksize (GH21012) #21027

merged 1 commit into from
May 15, 2018

Conversation

kittoku
Copy link
Contributor

@kittoku kittoku commented May 14, 2018

avoided using yield and changing DatetimeIndex itself into a iterator. @cbertinato 's advice was helpful.

@codecov
Copy link

codecov bot commented May 14, 2018

Codecov Report

Merging #21027 into master will decrease coverage by <.01%.
The diff coverage is 96.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21027      +/-   ##
==========================================
- Coverage   91.82%   91.82%   -0.01%     
==========================================
  Files         153      153              
  Lines       49502    49523      +21     
==========================================
+ Hits        45457    45475      +18     
- Misses       4045     4048       +3
Flag Coverage Δ
#multiple 90.22% <96.66%> (-0.01%) ⬇️
#single 41.9% <96.66%> (+0.02%) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/datetimes.py 95.76% <96.66%> (-0.02%) ⬇️
pandas/util/testing.py 84.38% <0%> (-0.21%) ⬇️
pandas/io/parsers.py 95.46% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d007c2d...ec651d7. Read the comment docs.

@cbertinato
Copy link
Contributor

Couple of comments:

  • Think about changing the name of the iterator class to something a bit less general than Iterator.
  • Reviewers will eventually ask, but you will need a whatsnew entry.

@@ -172,6 +172,50 @@ def _new_DatetimeIndex(cls, d):
return result


class Iterator:
def __init__(self, index):
self.data = index.asi8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have a BaseIterator class in pandas.io.common, it would need to be moved to pandas.core.common to use it here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but why are you creating a class for this anyhow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I'd prefer to keep this in DatetimeIndex.__iter__ unless it's going to be used elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If DatetimeIndex class has __next__ method and __iter__ returns itself, some other codes see DatetimeIndex as a iterator rather than a sequence. This cause some failures (#21012). So I wanted to hide __next__ somehow.

I found BaseIterator. But with it, it would become complicated because we have to make a subclass or override methods of a instance of BaseIterator (is this correct?).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that most of the state tracking can be done inside DatetimeIndex.__iter__. Unless I'm misunderstanding, there's no need or desire to add DatetimeIndex.__next__.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... sorry, but I can't come up with the way using only __iter__ and avoiding yield (#18848).

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also need a release note (for 0.23.0)

index = date_range('2000-01-01 00:00:00', periods=periods, freq='min')
num = 0
for i, d in enumerate(index):
assert index[num] == d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use i instead of num?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to check boundary values including periods == 0.
Without num, it would be a bit complicated, isn't it ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine then, though you can remove the enumerate since i isn't used anywhere I think.

@@ -172,6 +172,50 @@ def _new_DatetimeIndex(cls, d):
return result


class Iterator:
def __init__(self, index):
self.data = index.asi8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I'd prefer to keep this in DatetimeIndex.__iter__ unless it's going to be used elsewhere.

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone May 14, 2018
@TomAugspurger TomAugspurger added Blocker Blocking issue or pull request for an upcoming release Bug labels May 14, 2018
@kittoku
Copy link
Contributor Author

kittoku commented May 14, 2018

I'm not sure where to write a whatsnew (this is not a bug in 0.22.0).
Please tell me.

@cbertinato
Copy link
Contributor

Good point on the whatsnew. I suppose it's not necessary in this case.

@jorisvandenbossche
Copy link
Member

Yes, whatsnew is not necessary as it was not a bug before.

IMO, this PR adds a lot of complexity, but probably necessary if we want to keep both:

I don't know how important either of those are (eg couldn't this be fixed in dill itself? Other pickle libraries have no problem with it?), so difficult to assess for me.
If one of the above restrictions could be removed, we wouldn't need this added complexity of returning a custom iterator object

@cbertinato
Copy link
Contributor

An alternative is the original fix proposed by @kittoku which entails making DateTimeIndex into an iterator and adding a condition in from_tuples to avoid the code breaking the test. It is a kind of hack, but it does preclude the need to use a custom iterator.

@kittoku
Copy link
Contributor Author

kittoku commented May 14, 2018

So there are 4 options from what I see now.

1 : DatetimeIndex.__iter__ returns iter(self)
good : the simplest , dill can pickle bad : low performance

2 : DatetimeIndex.__iter__ returns generator using yield (like below).
good : simpler bad : dill can't pickle

    # from GH7720
    def __iter__(self):
        """
        Return an iterator over the boxed values
        Returns
        -------
        Timestamps : ndarray
        """

        # convert in chunks of 10k for efficiency
        data = self.asi8
        l = len(self)
        chunksize = 10000
        chunks = int(l / chunksize) + 1
        for i in range(chunks):
            start_i = i*chunksize
            end_i = min((i+1)*chunksize,l)
            converted = tslib.ints_to_pydatetime(data[start_i:end_i], tz=self.tz, offset=self.offset, box=True)
            for v in converted:
            yield v

3 : DatetimeIndex.__iter__ returns a separate iterator (like this PR).
good : not need to edit other codes , dill can pickle bad : complex, need to make a class

4 : DatetimeIndex.__iter__ returns itself as a iterator and edit other codes (like below).
good : dill can pickle bad : complex
suggestions in #21012 and

#pandas\core\indexes\multi.py
def from_tuples(cls, tuples, sortorder=None, names=None):
    ...
    if not is_list_like(tuples):
        raise TypeError('Input must be a list / sequence of tuple-likes.')
    elif is_iterator(tuples): # add not isinstance(tuples, Index) here
        tuples = list(tuples)
#pandas/io/formats/printing.py
# exchange `if` and `elif` to catch DatetimeIndex as sequence before assumed as iterator
def pprint_thing(thing, _nest_lvl=0, escape_chars=None, default_escapes=False,
    ...
    if (compat.PY3 and hasattr(thing, '__next__')) or hasattr(thing, 'next'):
        return compat.text_type(thing)
    elif (isinstance(thing, dict) and
          _nest_lvl < get_option("display.pprint_nest_depth")):
        result = _pprint_dict(thing, _nest_lvl, quote_strings=True,
                              max_seq_items=max_seq_items)
    elif (is_sequence(thing) and
          _nest_lvl < get_option("display.pprint_nest_depth")):
        result = _pprint_seq(thing, _nest_lvl, escape_chars=escape_chars,
                             quote_strings=quote_strings,
                             max_seq_items=max_seq_items)

I'm not in need to pickle by dill, so option 2 is best to me. I thought option 3 was the second. But now option 4 looks not bad because an idea of DatetimeIndex as iterator seems not so strange.

@jreback jreback removed this from the 0.23.0 milestone May 14, 2018
@jreback jreback removed the Blocker Blocking issue or pull request for an upcoming release label May 14, 2018
@jreback
Copy link
Contributor

jreback commented May 14, 2018

sure this is a bug, but not a blocker for 0.23.0

@kittoku
Copy link
Contributor Author

kittoku commented May 15, 2018

It is odd of me to say that, but could we discard my PR and cancel a commit of #18848 (take option 2)?

I can't find a reason to pickle DatetimeIndex.__iter__() while we can pickle DatetimeIndex.
I want to hear others' opinions.

@jreback
Copy link
Contributor

jreback commented May 15, 2018

@kittoku that's a good point. We don't even have a test for that , and to be honest most folks use cloudpickle anyhow.

@jreback jreback added this to the 0.23.0 milestone May 15, 2018
@kittoku
Copy link
Contributor Author

kittoku commented May 15, 2018

May I leave the test?

@jreback
Copy link
Contributor

jreback commented May 15, 2018

yes test is ok

@jreback jreback merged commit 186ce4e into pandas-dev:master May 15, 2018
@jreback
Copy link
Contributor

jreback commented May 15, 2018

thanks @kittoku

yeah the original (not-released) patched wasn't doc-ced / tested, and unclear how this actually solves a problem, while breaking this.

@kittoku kittoku deleted the add_iterator branch June 25, 2018 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Iteration over DatetimeIndex stops at 10000
5 participants