BUG: Timestamp.replace chaining not compat with datetime.replace #15683 #16110
doc/source/whatsnew/v0.20.0.txt
@@ -22,6 +22,7 @@ Highlights include:
- Support for S3 handling now uses ``s3fs``, see :ref:`here <whatsnew_0200.api_breaking.s3>`
- Google BigQuery support now uses the ``pandas-gbq`` library, see :ref:`here <whatsnew_0200.api_breaking.gbq>`
- Switched the test framework to use `pytest <http://doc.pytest.org/en/latest>`__ (:issue:`13097`)
- Fixed an issue with ``Timestamp.replace(tzinfo)`` around DST changes (:issue:`15683`)
You can move this down to the Bug Fixes section (the Conversion subsection).
@@ -1280,6 +1280,25 @@ def test_ambiguous_compat(self):
        self.assertEqual(result_pytz.to_pydatetime().tzname(),
                         result_dateutil.to_pydatetime().tzname())

    def test_tzreplace_issue_15683(self):
        """Regression test for issue 15683."""
Usually we just leave a comment on the first line of the function like # GH 15683
        result_dt = dt.replace(tzinfo=tzinfo)
        result_pd = Timestamp(dt).replace(tzinfo=tzinfo)

        self.assertEqual(result_dt.timestamp(), result_pd.timestamp())
We've recently switched to pytest for our test-runner, so you can write these as
assert result_dt.timestamp() == result_pd.timestamp()
...
Oh, nice. :)
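For context on what "compat with datetime.replace" means here: the stdlib contract is that replace(tzinfo=...) only swaps the tzinfo attribute and leaves the wall-clock fields untouched, so chaining .replace(tzinfo=tz).replace(tzinfo=None) must round-trip. A minimal stdlib sketch (the fixed +01:00 zone is an illustrative stand-in for a real CET zone):

```python
from datetime import datetime, timedelta, timezone

dt = datetime(2016, 3, 27, 1)              # naive wall time just before the CET DST jump
cet = timezone(timedelta(hours=1), "CET")  # fixed +01:00 offset, stand-in for a real CET zone
aware = dt.replace(tzinfo=cet)             # attaches tzinfo only; the wall clock is untouched
assert aware.hour == 1
assert aware.utcoffset() == timedelta(hours=1)
naive_again = aware.replace(tzinfo=None)   # chaining back to naive restores the original
assert naive_again == dt
```

This is the behaviour the regression test above compares Timestamp against.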
Codecov Report
@@ Coverage Diff @@
## master #16110 +/- ##
==========================================
+ Coverage 90.84% 90.86% +0.01%
==========================================
Files 159 159
Lines 50771 50771
==========================================
+ Hits 46122 46132 +10
+ Misses 4649 4639 -10
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #16110 +/- ##
==========================================
- Coverage 91.04% 91.01% -0.03%
==========================================
Files 162 162
Lines 49555 49567 +12
==========================================
- Hits 45115 45114 -1
- Misses 4440 4453 +13
Continue to review full report at Codecov.
what happened to the original change? this just needs a localization I think
        _TSObject ts

    # set to naive if needed
    _tzinfo = self.tzinfo
    value = self.value
    if _tzinfo is not None:
        value = tz_convert_single(value, 'UTC', _tzinfo)
        value_tz = tz_convert_single(value, _tzinfo, 'UTC')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't make any sense
change the definition of tz_convert_single to cpdef int64_t, and here make value_tz and offset int64_t definitions
@@ -726,16 +728,14 @@ class Timestamp(_Timestamp):
            _tzinfo = tzinfo

        # reconstruct & check bounds
        value = pandas_datetimestruct_to_datetime(PANDAS_FR_ns, &dts)
        ts_input = datetime(dts.year, dts.month, dts.day, dts.hour, dts.min,
what exactly are you doing here?
When you changed the tzinfo from old_tz to new_tz in the old implementation, the code would first do a conversion "from UTC to old_tz" and later "from new_tz to UTC" in order to calculate a correct offset for the value.
However, this was the reason for the problems I discussed in the issue (e.g., off-by-1h offsets were calculated and invalid "date does not exist in timezone" errors were raised when converting "from new_tz to UTC"). These issues only occurred around DST changes when the new offset resulted in a DST switch, so they were not covered before.
What I did now was simply construct a new Timestamp instance and let the pandas internals do the correct calculations for value. Unfortunately, the low-level (C) stuff is not very well documented, so my solution for creating a correct result Timestamp might not be the most optimal one.
However, correct code is better than optimal code, so I hope this PR is good enough for now.
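The DST-edge failure mode described here can be reproduced with the stdlib alone. A sketch assuming Python 3.9+ (zoneinfo) and an available system tz database, with Europe/Berlin standing in for the CET zone from the issue: a wall time inside the spring-forward gap silently shifts by the DST offset when round-tripped through UTC:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9; needs a system tz database

berlin = ZoneInfo("Europe/Berlin")
# 02:30 on 2016-03-27 does not exist locally: clocks jump from 02:00 to 03:00
missing = datetime(2016, 3, 27, 2, 30, tzinfo=berlin)
# Converting through UTC and back shifts the nonexistent wall time into the valid range
roundtrip = missing.astimezone(timezone.utc).astimezone(berlin)
assert (roundtrip.hour, roundtrip.minute) == (3, 30)
```

This is the same class of silent one-hour shift that the old UTC/old_tz/new_tz conversion chain produced.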
pls run the asv suite and see if this regresses anything. perf is a big deal and this goes thru a few more conversions; this is why I am harping on the impl.
Is there a fast low-level function that takes a year, month, day, hour, minute, second, nano and a tzinfo and calculates value so that I can avoid creating a datetime object? Something like convert_to_tsobject() but with a different signature? Or can I call this function in a different way?
you can trace what these are doing and just replicate it. there is some overhead to creating it in this way.
    def test_tzreplace_issue_15683(self):
        """Regression test for issue 15683."""
        dt = datetime(2016, 3, 27, 1)
        tzinfo = pytz.timezone('CET').localize(dt, is_dst=False).tzinfo
this needs testing with multiple tz's that hit DST
Looks like the tests are failing b/c I used
@jreback It's not only a compatibility issue.
@sscherfke this is like adjusting, then readjusting things again. Start with the code I put up before.
@jreback The proposed
Since all my attempts to fix localizing the timestamp failed, I just thought "why not just calculate a new value for the given time-struct and tzinfo?". I know that this code is not the very best solution, but it was the best that I could come up with based on the documentation available:

    # What I actually want to do is:
    #
    #     value = value_for(dts, tzinfo)
    #
    # The easiest thing I found that does this is creating a tsobject
    # from a datetime() and taking its "value":
    ts_input = datetime(dts.year, dts.month, dts.day, dts.hour, dts.min,
                        dts.sec, dts.us, tzinfo=_tzinfo)
    ts = convert_to_tsobject(ts_input, _tzinfo, None, 0, 0)
    # Since datetime() has no nanoseconds, add them manually:
    value = ts.value + (dts.ps // 1000)

The good thing about this approach is that we don't calculate any wrong offsets and that there also won't be any false "ambiguous time" errors. I'd gladly improve the above code with anything that correctly calculates
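The idea in the snippet above — derive the epoch value from the broken-out fields plus the tzinfo, then add the sub-microsecond part separately — can be sketched with stdlib types only. Everything below is a hypothetical stand-in for the dts struct and the pandas internals, not the actual implementation:

```python
from datetime import datetime, timezone

# Hypothetical broken-out fields, standing in for the pandas datetimestruct
year, month, day, hour, minute, sec, us, ns = 2016, 3, 27, 1, 0, 0, 0, 500

dt = datetime(year, month, day, hour, minute, sec, us, tzinfo=timezone.utc)
# Epoch nanoseconds: whole seconds from the aware datetime, plus the sub-second
# fields; datetime resolves only microseconds, so ns must be added manually
value = int(dt.timestamp()) * 10**9 + us * 1000 + ns
assert value == 1459040400 * 10**9 + 500
```

The cost being debated in this thread is exactly the construction of that intermediate datetime object.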
How do we proceed now? My solution is not optimal but unfortunately, I have no idea how to improve it. Nonetheless, it fixes an obvious bug and the tests are passing now, so I’d be really happy if this gets merged.
doc/source/whatsnew/v0.20.0.txt
@@ -1576,6 +1576,7 @@ Conversion

- Bug in ``Timestamp.replace`` now raises ``TypeError`` when incorrect argument names are given; previously this raised ``ValueError`` (:issue:`15240`)
- Bug in ``Timestamp.replace`` with compat for passing long integers (:issue:`15030`)
- Bug in ``Timestamp.replace`` when replacing ``tzinfo`` around DST changes (:issue:`15683`)
- Bug in ``Timestamp`` returning UTC based time/date attributes when a timezone was provided (:issue:`13303`)
this will need to move to 0.20.1
Okay, np
I moved it to 0.20.2 since there was no file for 0.20.1.
@@ -726,16 +728,14 @@ class Timestamp(_Timestamp):
            _tzinfo = tzinfo

        # reconstruct & check bounds
        value = pandas_datetimestruct_to_datetime(PANDAS_FR_ns, &dts)
        ts_input = datetime(dts.year, dts.month, dts.day, dts.hour, dts.min,
pls run the asv suite and see if this regresses anything. perf is a big deal and this goes thru a few more conversions; this is why I am harping on the impl.
@sscherfke can you run the asv's and post any discrepancies.
@jreback I ran
        result_dt = dt.replace(tzinfo=tzinfo)
        result_pd = Timestamp(dt).replace(tzinfo=tzinfo)

        if hasattr(result_dt, 'timestamp'):  # New method in Py 3.3
what is this? 'New method in Py 3.3'?
The datetime.datetime.timestamp() method is new in 3.3. I developed the patch on 3.6 and then the GitHub CI failed because of this method.
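For reference (a quick stdlib check, nothing pandas-specific): datetime.timestamp() was indeed added in Python 3.3 and returns POSIX seconds as a float; on an aware datetime the result is unambiguous:

```python
from datetime import datetime, timezone

dt = datetime(2016, 3, 27, 1, tzinfo=timezone.utc)
# POSIX seconds for 2016-03-27 01:00 UTC
assert dt.timestamp() == 1459040400.0
```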
and why are you doing this? what does it do
I am just doing various tests to make sure pandas timestamps behave like datetimes, which they did not always do before.
and what exactly does this do? pls show an example of calling timestamp on a datetime.datetime.
I am just checking if both pandas and datetime dates are equal in direct comparison and if converted to timestamps. The latter check only works on Python 3, so I put a check around it. I could have written a "timestamp" function for Python 2, but that would have been just too much non-trivial work for very little benefit.
My question would be if that timestamp check is needed? If the timestamp values (which give the POSIX time) differ, wouldn't the actual datetime.datetime objects also evaluate to not equal?
In the issue, I had some cases where the datetimes differed but the timestamps were equal. Since the timestamps were always equal even if the datetimes differed, this check might not be necessary but I wanted to do the extra check just to be sure.
The issue that this patch closes was not detected by the existing (very extensive) test suite, so I thought more checks won't hurt. :)
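A stdlib sketch of why the two checks usually agree: equality between aware datetimes already compares the underlying UTC instant, which is the same quantity timestamp() exposes as POSIX seconds:

```python
from datetime import datetime, timedelta, timezone

a = datetime(2016, 3, 27, 1, tzinfo=timezone.utc)
b = datetime(2016, 3, 27, 2, tzinfo=timezone(timedelta(hours=1)))  # same instant
assert a == b                          # aware datetimes compare as UTC instants
assert a.timestamp() == b.timestamp()  # identical POSIX values
```

So a disagreement between the two checks would point at one of the objects not honouring the aware-datetime comparison contract.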
pls run ALL the benchmarks. The issue is that this could affect a LOT of things (prob not, but checking).
I've run the complete asv suite three times now, but I got strange/random-looking results. E.g.:
And here are the bad results of the third run (I failed to copy the results of the second run before they vanished beyond my bash history):
As you can see, there's very little overlap between the two datasets. Anyway, I still think that correct code is more important than a few percent of performance.
@sscherfke there is often quite some variation in the timings from one build to another (which sometimes makes it difficult to interpret a single run). So ratios around 1.3 can most of the time be considered 'normal', unless they show up consistently. In the first run the first two are related to datetimes, so maybe you could repeat those to see whether the slowdown is consistent.
Yep, but running the asv suite is also a way to at least get an idea (and to indicate where a possible problem could be), and if there is a performance problem, to look into ways to improve it while keeping the correctness.
        result_dt = dt.replace(tzinfo=tzinfo).replace(tzinfo=None)
        result_pd = Timestamp(dt).replace(tzinfo=tzinfo).replace(tzinfo=None)

        if hasattr(result_dt, 'timestamp'):  # New method in Py 3.3
the correct 3rd comparison is Timestamp(result_dt) == Timestamp(result_pd); in fact, to make this more clear I would do:

    for left, right in [(Timestamp, Timestamp),
                        (lambda x: x, lambda x: x),
                        (lambda x: x, lambda x: x.to_pydatetime())]:
        assert left(result_dt) == right(result_pd)
Imho, this is a lot harder to read and understand.
well, you need to make one of the changes
I agree the loop is in this case harder to read
BTW, @jreback Timestamp(dt) and dt.timestamp() are not equivalent. dt.timestamp() is more like Timestamp.value (although I am not sure how dt.timestamp() handles timezones)
we don't use dt.timestamp() anywhere, so a test of this is irrelevant. Timestamp(result_dt) == Timestamp(result_pd) is the actual test that needs to pass.
I’m really getting tired of this. I literally spent days trying to understand pandas and to find a solution that works correctly. I had very little constructive help in the process. When I finally created the PR, I hoped for a "Thanks for your efforts, merged" but all I’m getting is pointless nit-picking on my solution. If that's the way you treat new contributors, good luck for the future. Do whatever you like with this PR. I'm out.
@sscherfke I am very sorry you feel that way. And thanks for your feedback on the process, as it is certainly important to us how new (or old) contributors experience this. But you also have to understand that a PR will almost never be just "merged, thanks". On the one hand, we want to keep/obtain a high-quality code base where new changes meet certain requirements, do not introduce other bugs, and do not cause large performance regressions (not saying your PR would not meet that! just pointing out in general). We also want to keep a certain consistency in the codebase (just merging all PRs as we get them would result in too much variety). But at the same time, we certainly have to be aware that such a process can become nitpicky, and try to avoid that! For this PR, I hope you would like to reconsider trying to finish it, as this is a regression that would be nice to see fixed.
I know. I exaggerated a bit to make my point. I’ve been contributing to and maintaining my own OSS projects for many years now. The main drawback of my patch is the creation of a
the general approach is ok as long as there are no perf regressions. This is a very, very narrow code change, so yes, certainly correctness is first, BUT if this causes perf issues with the vast majority of code it would need another soln. I don't think perf is actually a problem, but it needs some checking.
While looking into the PR / this issue, I wondered the following:
So the
Any chance that this gets eventually merged? None of the core devs has come up with a better or more performant solution, and the bug is still a bug …
I haven't gone through the whole discussion, but IIRC we needed a new ASV benchmark in addition to the existing ones. I added a simple one to https://github.com/TomAugspurger/pandas/tree/sscherfke-master

    class TimeTimestamp(object):
        goal_time = 0.5

        def setup(self):
            self.ts = pd.Timestamp("2016-03-28")  # after DST
            self.tzinfo = pytz.timezone("CET")

        def time_timestamp_replace(self):
            self.ts.replace(tzinfo=self.tzinfo).replace(tzinfo=None)

Output

So there's some work to be done there. I can take a look after Scipy next week.
doc/source/whatsnew/v0.20.2.txt
@@ -37,6 +37,7 @@ Bug Fixes

Conversion
^^^^^^^^^^

- Bug in ``Timestamp.replace`` when replacing ``tzinfo`` around DST changes (:issue:`15683`)
needs to go in 0.21.0
This question is still unanswered.
if you can rebase and move docs to 0.21.0
I think this problem (#16110 (comment)) is still a problem for updating this? (unless the current one using stdlib
Yes, it would still be a problem, but none of the core devs has come up with a good solution yet. I don’t know enough about pandas to improve my solution.
Maybe you just gotta decide whether in this particular case you want good performance or correct results (at least for now). If you should decide that correct results are more important, I’d gladly update my PR for the next release.
try the change about typing and see how perf is
        _TSObject ts

    # set to naive if needed
    _tzinfo = self.tzinfo
    value = self.value
    if _tzinfo is not None:
        value = tz_convert_single(value, 'UTC', _tzinfo)
        value_tz = tz_convert_single(value, _tzinfo, 'UTC')
change the definition of tz_convert_single to cpdef int64_t, and here make value_tz and offset int64_t definitions
#17356 is a clean version of this PR.
git diff upstream/master --name-only -- '*.py' | flake8 --diff