-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Iteration over DatetimeIndex stops at chunksize (GH21012) #21027
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21027 +/- ##
==========================================
- Coverage 91.82% 91.82% -0.01%
==========================================
Files 153 153
Lines 49502 49523 +21
==========================================
+ Hits 45457 45475 +18
- Misses 4045 4048 +3
Continue to review full report at Codecov.
|
Couple of comments:
|
pandas/core/indexes/datetimes.py
Outdated
@@ -172,6 +172,50 @@ def _new_DatetimeIndex(cls, d): | |||
return result | |||
|
|||
|
|||
class Iterator: | |||
def __init__(self, index): | |||
self.data = index.asi8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have a BaseIterator
class in pandas.io.common
, it would need to be moved to pandas.core.common
to use it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but why are you creating a class for this anyhow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and I'd prefer to keep this in DatetimeIndex.__iter__
unless it's going to be used elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If DatetimeIndex
class has __next__
method and __iter__
returns itself, some other codes see DatetimeIndex
as a iterator rather than a sequence. This cause some failures (#21012). So I wanted to hide __next__
somehow.
I found BaseIterator
. But with it, it would become complicated because we have to make a subclass or override methods of a instance of BaseIterator
(is this correct?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that most of the state tracking can be done inside DatetimeIndex.__iter__
. Unless I'm misunderstanding, there's no need or desire to add DatetimeIndex.__next__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... sorry, but I can't come up with the way using only __iter__
and avoiding yield
(#18848).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also need a release note (for 0.23.0)
index = date_range('2000-01-01 00:00:00', periods=periods, freq='min') | ||
num = 0 | ||
for i, d in enumerate(index): | ||
assert index[num] == d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use i
instead of num
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to check boundary values including periods == 0
.
Without num
, it would be a bit complicated, isn't it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine then, though you can remove the enumerate
since i
isn't used anywhere I think.
pandas/core/indexes/datetimes.py
Outdated
@@ -172,6 +172,50 @@ def _new_DatetimeIndex(cls, d): | |||
return result | |||
|
|||
|
|||
class Iterator: | |||
def __init__(self, index): | |||
self.data = index.asi8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and I'd prefer to keep this in DatetimeIndex.__iter__
unless it's going to be used elsewhere.
I'm not sure where to write a whatsnew (this is not a bug in 0.22.0). |
Good point on the whatsnew. I suppose it's not necessary in this case. |
Yes, whatsnew is not necessary as it was not a bug before. IMO, this PR adds a lot of complexity, but probably necessary if we want to keep both:
I don't know how important either of those are (eg couldn't this be fixed in dill itself? Other pickle libraries have no problem with it?), so difficult to assess for me. |
An alternative is the original fix proposed by @kittoku which entails making |
So there are 4 options from what I see now. 1 : 2 :
3 : 4 :
I'm not in need to pickle by dill, so option 2 is best to me. I thought option 3 was the second. But now option 4 looks not bad because an idea of DatetimeIndex as iterator seems not so strange. |
sure this is a bug, but not a blocker for 0.23.0 |
It is odd of me to say that, but could we discard my PR and cancel a commit of #18848 (take option 2)? I can't find a reason to pickle |
@kittoku that's a good point. We don't even have a test for that , and to be honest most folks use |
May I leave the test? |
yes test is ok |
thanks @kittoku yeah the original (not-released) patched wasn't doc-ced / tested, and unclear how this actually solves a problem, while breaking this. |
git diff upstream/master -u -- "*.py" | flake8 --diff
avoided using
yield
and changing DatetimeIndex itself into a iterator. @cbertinato 's advice was helpful.