
BUG: Timestamp.replace chaining not compat with datetime.replace #15683 #16110


Closed · wants to merge 8 commits

Conversation

sscherfke

@@ -22,6 +22,7 @@ Highlights include:
- Support for S3 handling now uses ``s3fs``, see :ref:`here <whatsnew_0200.api_breaking.s3>`
- Google BigQuery support now uses the ``pandas-gbq`` library, see :ref:`here <whatsnew_0200.api_breaking.gbq>`
- Switched the test framework to use `pytest <http://doc.pytest.org/en/latest>`__ (:issue:`13097`)
- Fixed an issue with ``Timestamp.replace(tzinfo)`` around DST changes (:issue:`15683`)
Contributor:

You can move this down to the Bug Fixes section (the Conversion subsection).

@@ -1280,6 +1280,25 @@ def test_ambiguous_compat(self):
self.assertEqual(result_pytz.to_pydatetime().tzname(),
result_dateutil.to_pydatetime().tzname())

def test_tzreplace_issue_15683(self):
"""Regression test for issue 15683."""
Contributor:

Usually we just leave a comment on the first line of the function like # GH 15683

result_dt = dt.replace(tzinfo=tzinfo)
result_pd = Timestamp(dt).replace(tzinfo=tzinfo)

self.assertEqual(result_dt.timestamp(), result_pd.timestamp())
Contributor:

We've recently switched to pytest for our test-runner, so you can write these as

assert result_dt.timestamp() == result_pd.timestamp()
...

Author:

Oh, nice. :)

@codecov

codecov bot commented Apr 24, 2017

Codecov Report

Merging #16110 into master will increase coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16110      +/-   ##
==========================================
+ Coverage   90.84%   90.86%   +0.01%     
==========================================
  Files         159      159              
  Lines       50771    50771              
==========================================
+ Hits        46122    46132      +10     
+ Misses       4649     4639      -10
Flag Coverage Δ
#multiple 88.64% <ø> (+0.01%) ⬆️
#single 40.34% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️
pandas/core/indexes/datetimes.py 95.33% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.35% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 844013b...b7c2e86.

@codecov

codecov bot commented Apr 24, 2017

Codecov Report

Merging #16110 into master will decrease coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16110      +/-   ##
==========================================
- Coverage   91.04%   91.01%   -0.03%     
==========================================
  Files         162      162              
  Lines       49555    49567      +12     
==========================================
- Hits        45115    45114       -1     
- Misses       4440     4453      +13
Flag Coverage Δ
#multiple 88.79% <ø> (-0.01%) ⬇️
#single 40.24% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/dtypes/concat.py 98.07% <0%> (-0.21%) ⬇️
pandas/core/groupby.py 92.07% <0%> (-0.14%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️
pandas/core/internals.py 94.21% <0%> (-0.1%) ⬇️
pandas/core/indexes/category.py 98.52% <0%> (-0.01%) ⬇️
pandas/core/algorithms.py 94.15% <0%> (ø) ⬆️
pandas/core/generic.py 92.03% <0%> (+0.04%) ⬆️
pandas/core/indexes/base.py 95.93% <0%> (+0.05%) ⬆️
pandas/compat/numpy/__init__.py 94.11% <0%> (+0.17%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 870b6a6...dd0e33f.

@jreback (Contributor) left a comment:

what happened to the original change? this just needs a localization I think

_TSObject ts

# set to naive if needed
_tzinfo = self.tzinfo
value = self.value
if _tzinfo is not None:
value = tz_convert_single(value, 'UTC', _tzinfo)
value_tz = tz_convert_single(value, _tzinfo, 'UTC')
Contributor:

this doesn't make any sense

Contributor:

change the definition of tz_convert_single to cpdef int64_t and here make value_tz and offset int64_t definitions

@@ -726,16 +728,14 @@ class Timestamp(_Timestamp):
_tzinfo = tzinfo

# reconstruct & check bounds
value = pandas_datetimestruct_to_datetime(PANDAS_FR_ns, &dts)
ts_input = datetime(dts.year, dts.month, dts.day, dts.hour, dts.min,
Contributor:

what exactly are you doing here?

Author:

When you changed the tzinfo from old_tz to new_tz in the old implementation, the code would first do a conversion "from UTC to old_tz" and later "from new_tz to UTC" in order to calculate a correct offset for the value.

However, this was the reason for the problems I discussed in the issue (e.g., off-by-one-hour offsets were calculated and invalid "date does not exist in timezone" errors were raised when converting "from new_tz to UTC"). These issues only occurred around DST changes when the new offset resulted in a DST switch, so they were not covered before.

What I did now was simply to construct a new Timestamp instance and let the pandas internals do the correct calculation of value. Unfortunately, the low-level (C) code is not very well documented, so my solution for creating a correct result Timestamp might not be the optimal one.

However, correct code is better than optimal code, so I hope this PR is good enough for now.
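
At the Python level, the idea boils down to roughly the following (an illustrative sketch only; the actual change lives in the Cython Timestamp.replace code shown further down, and the variable values here are just for illustration):

from datetime import datetime
import pytz
import pandas as pd

dt = datetime(2016, 3, 27, 1)  # wall time close to a CET DST switch
tzinfo = pytz.timezone('CET').localize(dt, is_dst=False).tzinfo

# Instead of converting the stored UTC value back and forth between zones
# (which can hit non-existent/ambiguous wall times around DST), rebuild the
# timestamp from its wall-clock fields plus the new tzinfo and let the normal
# construction path compute the internal value:
result = pd.Timestamp(datetime(dt.year, dt.month, dt.day, dt.hour,
                               dt.minute, dt.second, dt.microsecond,
                               tzinfo=tzinfo))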

Contributor:

pls run the asv suite and see if this regresses anything. perf is a big deal and this goes thru a few more conversions; this is why I am harping on the impl.

Author:

Is there a fast low-level function that takes a year, month, day, hour, minute, second, nano and a tzinfo and calculates value so that I can avoid creating a datetime object? Something like convert_to_tsobject() but with a different signature? Or can I call this function in a different way?

Contributor:

you can trace what these are doing and just replicate it. there is some overhead to creating it in this way.

def test_tzreplace_issue_15683(self):
"""Regression test for issue 15683."""
dt = datetime(2016, 3, 27, 1)
tzinfo = pytz.timezone('CET').localize(dt, is_dst=False).tzinfo
Contributor:

this needs testing with multiple tz's that hit DST
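
For illustration, a parametrized version covering several DST-affected zones could look roughly like this (a sketch; the zone names and the final assertion are illustrative rather than part of this PR):

import pytest
import pytz
from datetime import datetime

from pandas import Timestamp


@pytest.mark.parametrize('tz_name', ['CET', 'US/Eastern', 'Australia/Sydney'])
def test_replace_tzinfo(tz_name):
    # GH 15683
    dt = datetime(2016, 3, 27, 1)
    tzinfo = pytz.timezone(tz_name).localize(dt, is_dst=False).tzinfo

    result_dt = dt.replace(tzinfo=tzinfo)
    result_pd = Timestamp(dt).replace(tzinfo=tzinfo)

    # both libraries should agree on the resulting aware timestamp
    assert result_dt == result_pd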

@sscherfke (Author):

Looks like the tests are failing because I used datetime.datetime.timestamp(), which is new in Python 3.3.

@jreback changed the title from "Fix issue #15683" to "BUG: Timestamp.replace chaining not compat with datetime.replace #15683" on Apr 25, 2017
@jreback added the Bug, Compat (pandas objects compatibility with Numpy or Python functions) and Datetime (Datetime data dtype) labels on Apr 25, 2017
@sscherfke (Author):

@jreback It's not only a compatibility issue. Timestamp.replace() computes wrong results under certain circumstances that would still be wrong if there was no datetime.replace() to be compatible with.

@jreback (Contributor) commented Apr 25, 2017:

@sscherfke this is like adjusting, then readjusting things again. Start with the code I put up before.

@sscherfke (Author) commented Apr 25, 2017:

@jreback The proposed tz_localize_to_utc approach leads to this error.

Since all my attempts to fix localizing the timestamp failed, I just thought "why not just calculate a new value for the given time-struct and tzinfo?". I know that this code is not the very best solution but this was the best that I could come up with based on the documentation available:

# What I actually want to do is:
#
#     value = value_for(dts, tzinfo)
#
# The easiest thing I found that does this is creating a tsobject from a datetime() and taking its "value":
ts_input = datetime(dts.year, dts.month, dts.day, dts.hour, dts.min, dts.sec, dts.us, tzinfo=_tzinfo)
ts = convert_to_tsobject(ts_input, _tzinfo, None, 0, 0)
# Since datetime() has no nanoseconds, add them manually:
value = ts.value + (dts.ps // 1000)

The good thing about this approach is that we don't calculate any wrong offsets and that there also won't be any false "ambiguous time" errors.

I'd gladly improve the above code with anything that correctly calculates value = value_for(dts, tzinfo)

@sscherfke (Author):

How do we proceed now?

My solution is not optimal, but unfortunately I have no idea how to improve it.

Nonetheless, it fixes an obvious bug and the tests are passing now, so I’d be really happy if this gets merged.

@@ -1576,6 +1576,7 @@ Conversion

- Bug in ``Timestamp.replace`` now raises ``TypeError`` when incorrect argument names are given; previously this raised ``ValueError`` (:issue:`15240`)
- Bug in ``Timestamp.replace`` with compat for passing long integers (:issue:`15030`)
- Bug in ``Timestamp.replace`` when replacing ``tzinfo`` around DST changes (:issue:`15683`)
- Bug in ``Timestamp`` returning UTC based time/date attributes when a timezone was provided (:issue:`13303`)
Contributor:

this will need to move to 0.20.1

Author:

Okay, np

Author:

I moved it to 0.20.2 since there was no file for 0.20.1.

@@ -726,16 +728,14 @@ class Timestamp(_Timestamp):
_tzinfo = tzinfo

# reconstruct & check bounds
value = pandas_datetimestruct_to_datetime(PANDAS_FR_ns, &dts)
ts_input = datetime(dts.year, dts.month, dts.day, dts.hour, dts.min,
Contributor:

pls run the asv suite and see if this regresses anything. perf is a big deal and this goes thru a few more conversions; this is why I am harping on the impl.

@jreback (Contributor) commented May 12, 2017:

@sscherfke can you run the asv's and post any discrepancies.

@sscherfke (Author) commented May 15, 2017:

@jreback I ran asv continuous -f 1.1 -b replace origin/master HEAD (the replace bench. seems to be the only one affected by this patch) and it said everything is okay.

result_dt = dt.replace(tzinfo=tzinfo)
result_pd = Timestamp(dt).replace(tzinfo=tzinfo)

if hasattr(result_dt, 'timestamp'): # New method in Py 3.3
Contributor:

what is this? 'New method in Py 3.3'?

Author:

The datetime.datetime.timestamp() method is new in 3.3.
I developed the patch on 3.6, and then the CI on GitHub failed because of this method.

Contributor:

and why are you doing this? what does it do

Author:

I am just doing various tests to make sure pandas timestamps behave like datetimes, which they did not always do before.

Contributor:

and what exactly does this do? pls show an example of calling timestamp on a datetime.datetime.

Author:

I am just checking whether both the pandas and the datetime dates are equal in direct comparison and when converted to timestamps. The latter check only works on Python 3, so I put a check around it. I could have written a "timestamp" function for Python 2, but that would have been too much non-trivial work for very little benefit.
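
For reference, a Python-2-compatible equivalent of datetime.timestamp() for aware datetimes would amount to something like the following (a sketch, not part of the PR; the helper name is made up):

from datetime import datetime
import pytz

def to_posix(dt):
    # seconds since the Unix epoch for a timezone-aware datetime
    epoch = datetime(1970, 1, 1, tzinfo=pytz.utc)
    return (dt - epoch).total_seconds()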

Member:

My question would be whether that timestamp check is needed.
If the timestamp values (which give the POSIX time) differ, wouldn't the actual datetime.datetime objects also evaluate as not equal?

Author:

In the issue, I had some cases where the datetimes differed but the timestamps were equal. Since the timestamps were always equal even if the datetimes differed, this check might not be necessary but I wanted to do the extra check just to be sure.

The issue that this patch closes was not detected by the existing (very extensive) test suite, so I thought more checks won't hurt. :)

@jreback (Contributor) commented May 16, 2017:

@jreback I ran asv continuous -f 1.1 -b replace origin/master HEAD (the replace bench. seems to be the only one affected by this patch) and it said everything is okay.

pls run ALL the benchmarks. The issue is that this could affect a LOT of things (prob not, but checking). .replace is called by the dateutil code itself.

@sscherfke (Author):

I've run the complete asv suite three times now, but I got strange, random-looking results. E.g.:
"Bad" results from the first run:

+     1.08±0.01ms      1.62±0.01ms     1.50  io_bench.read_parse_dates_iso8601.time_read_parse_dates_iso8601
+      5.42±0.2ms       7.78±0.2ms     1.44  io_bench.read_csv_infer_datetime_format_custom.time_read_csv_infer_datetime_format_custom
+     9.89±0.02ms       13.9±0.1ms     1.41  inference.to_numeric_downcast.time_downcast('int32', 'signed')
+      25.0±0.5ms       30.2±0.8ms     1.21  groupby.groupby_agg_builtins.time_groupby_agg_builtins2
+        1.50±0ms       1.82±0.2ms     1.21  io_bench.read_csv_infer_datetime_format_ymd.time_read_csv_infer_datetime_format_ymd
+     7.72±0.01ms       9.30±0.7ms     1.20  io_bench.frame_to_csv_date_formatting.time_frame_to_csv_date_formatting
+     1.57±0.03ms       1.85±0.1ms     1.18  inference.to_numeric_downcast.time_downcast('datetime64', 'float')
+       419±0.5ms         485±30ms     1.16  groupby.Groups.time_groupby_groups('object_large')
+       157±0.9ms          179±2ms     1.14  frame_methods.Dropna.time_dropna_axis1_all_mixed_dtypes
+       150±0.4ms         172±10ms     1.14  io_bench.frame_to_csv_mixed.time_frame_to_csv_mixed
+     2.66±0.03ms       2.99±0.1ms     1.12  binary_ops.Ops.time_frame_add(True, 1)
+      53.7±0.2μs       59.4±0.1μs     1.11  timeseries.SemiMonthOffset.time_end_incr_n
+     2.61±0.03ms      2.88±0.04ms     1.10  binary_ops.Ops.time_frame_mult(True, 1)
+      53.3±0.1μs       58.8±0.1μs     1.10  timeseries.SemiMonthOffset.time_begin_incr_n

And here are the bad results of the third run (I failed to copy the results of the second run before they vanished beyond my bash history):

+     1.38±0.01ms      1.82±0.09ms     1.32  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Micro', 1)
+      17.1±0.2ms         20.9±1ms     1.22  frame_methods.frame_fancy_lookup.time_frame_fancy_lookup_all
+     1.39±0.01ms      1.67±0.07ms     1.20  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Easter', 2)
+      76.8±0.4ms         91.7±4ms     1.19  gil.nogil_rolling_algos_fast.time_nogil_rolling_skew
+     1.34±0.01ms       1.59±0.1ms     1.19  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Micro', 2)
+          15.2ms           18.0ms     1.18  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('FY5253Quarter_1', 2)
+     1.34±0.02ms      1.58±0.07ms     1.18  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Milli', 2)
+      78.2±0.4ms         91.9±1ms     1.17  gil.nogil_rolling_algos_fast.time_nogil_rolling_std
+      69.3±0.6ms       81.0±0.6ms     1.17  gil.nogil_rolling_algos_fast.time_nogil_rolling_var
+      9.16±0.1ms       10.6±0.3ms     1.16  algorithms.Algorithms.time_add_overflow_second_arg_nan
+      8.66±0.1ms       9.99±0.2ms     1.15  frame_methods.frame_mask_bools.time_frame_mask_floats
+         131±2ms          150±2ms     1.15  frame_methods.series_string_vector_slice.time_series_string_vector_slice
+        1.01±0ms      1.16±0.02ms     1.15  indexing.DataFrameIndexing.time_loc_dups
+     7.27±0.05ms       8.28±0.4ms     1.14  groupby.groupby_size.time_groupby_dt_size
+     9.98±0.09ms       11.4±0.3ms     1.14  inference.to_numeric_downcast.time_downcast('int32', 'integer')
+      76.4±0.5ms       86.7±0.5ms     1.14  gil.nogil_rolling_algos_fast.time_nogil_rolling_min
+      12.3±0.1ms       13.8±0.6ms     1.13  io_sql.ReadSQLTypes.time_datetime_read_and_parse_sqlalchemy
+         108±1ms          122±2ms     1.12  frame_methods.frame_duplicated.time_frame_duplicated
+          9.76ms           10.9ms     1.12  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('FY5253Quarter_1', 1)
+     1.45±0.01ms      1.62±0.04ms     1.12  frame_methods.frame_assign_timeseries_index.time_frame_assign_timeseries_index
+      38.7±0.2ms       43.2±0.9ms     1.11  gil.NoGilGroupby.time_min_2
+     1.96±0.02ms      2.18±0.01ms     1.11  frame_methods.frame_fancy_lookup.time_frame_fancy_lookup
+           311ms            346ms     1.11  gil.nogil_read_csv.time_read_csv
+     1.53±0.01ms       1.70±0.1ms     1.11  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('BQuarterEnd', 1)
+      50.9±0.3μs       56.3±0.2μs     1.11  timeseries.SemiMonthOffset.time_end_incr
+        68.1±1ms       75.3±0.3ms     1.11  gil.nogil_rolling_algos_fast.time_nogil_rolling_mean
+     13.0±0.06ms      14.3±0.04ms     1.10  parser_vb.read_csv2.time_comment
+     1.95±0.01ms      2.15±0.02ms     1.10  frame_methods.Interpolate.time_interpolate_some_good_infer

As you can see, there's very little overlap between the two datasets (only time_downcast('int32', 'signed') is in both datasets), so I don't know how much I would trust the results.

Anyway, I still think that correct code is more important than a few percent of performance.

@jorisvandenbossche (Member):

@sscherfke there is often quite some variation in the timings from one build to another (which sometimes makes it difficult to interpret a single run). So ratios around 1.3 can most of the time be considered 'normal', unless they show up consistently.

In the first run the first two are related to datetimes, so maybe you could repeat those to see whether the slowdown is consistent.

Anyway, I still think that correct code is more important than a few percent of performance.

Yep, but running the asv suite is also a way to at least have an idea (and to indicate where a possible problem could be), and if there is a performance problem, to look into ways to improve that while keeping the correctness.

result_dt = dt.replace(tzinfo=tzinfo).replace(tzinfo=None)
result_pd = Timestamp(dt).replace(tzinfo=tzinfo).replace(tzinfo=None)

if hasattr(result_dt, 'timestamp'): # New method in Py 3.3
Contributor:

the correct 3rd comparison is Timestamp(result_dt) == Timestamp(result_pd), in fact to make this more clear I would do:

for left, right in [(Timestamp, Timestamp), (lambda x: x, lambda x: x), (lambda x: x, lambda x: x.to_pydatetime())]:
     assert left(result_dt) == right(result_pd)

Author:

Imho, this is a lot harder to read and understand.

Contributor:

well, you need to make one of the changes

Member:

I agree the loop is in this case harder to read

BTW, @jreback Timestamp(dt) and dt.timestamp() are not equivalent. dt.timestamp() is more like Timestamp.value (although I am not sure how dt.timestamp() handles timezones)

Contributor:

we don't use dt.timestamp() anywhere, so a test of this is irrelevant.

Timestamp(result_dt) == Timestamp(result_pd) is the actual test that needs to pass.

@sscherfke (Author):

I’m really getting tired of this. I literally spent days trying to understand Pandas and to find a solution that works correctly. I had very little constructive help in the process.

When I finally created the PR, I hoped for a "Thanks for your efforts, merged" but all I'm getting is pointless nit-picking on my solution.

If that's the way you treat new contributors, good luck for the future. Do whatever you like with this PR. I'm out.

@jorisvandenbossche (Member):

@sscherfke I am very sorry you feel that way. And thanks for your feedback on the process, as it is certainly important to us how new (or old) contributors experience this.

But, you also have to understand that a PR will almost never be just "merged, thanks". On the one hand we want to keep / obtain a high-quality code base where new changes meet certain requirements, do not introduce other bugs or large performance regressions (not saying your PR would not meet that! just pointing it out in general). And we also want to keep a certain consistency in the codebase (so just merging all PRs as we get them would result in too much variety).
What I am trying to say is that PRs will almost always get comments and reviews, often in an iterative process (which can be tiring due to the non-direct, lagged interaction through GitHub).

But at the same time, we certainly have to be aware of that such a process can become nitpicky, and try to avoid that!

For this PR, I hope you would like to reconsider trying to finish it, as this is a regression that would be nice to see fixed.
In order to do that, let me ask the following questions:

@sscherfke (Author):

But, you also have to understand that a PR will almost never be just "merged, thanks".

I know. I exaggerated a bit to make my point. I've been contributing to and maintaining my own OSS projects for many years now.

The main drawback of my patch is the creation of a datetime instance, but as I said multiple times before, I don't know any low-level pandas function that does exactly what I want. The previous implementations/suggestions by @jreback calculated wrong results.

@jreback (Contributor) commented May 18, 2017:

the general approach is ok as long as there are no perf regressions. This is a very, very narrow code change, so yes, certainly correctness is first, BUT if this causes perf issues with the vast majority of code, we would need another solution. I don't think perf is actually a problem, but it needs some checking.

.replace is actually implicitly called a bunch by pytz so this is non-trivial and prob not covered by our asv's.

@jorisvandenbossche (Member):

While looking in the PR / this issue, I wondered the following:

In [97]: pd.Timestamp("2016-03-27 03:00", tz='CET')
Out[97]: Timestamp('2016-03-27 03:00:00+0200', tz='CET')

In [98]: pd.Timestamp("2016-03-27 03:00", tz='CET').replace(hour=1)
Out[98]: Timestamp('2016-03-27 01:00:00+0200', tz='CET')

In [99]: dt = datetime.datetime(2016, 3, 27, 3)

In [101]: tzinfo = pytz.timezone('CET').localize(dt, is_dst=False).tzinfo

In [102]: datetime.datetime(2013, 10, 27, 3, 0, tzinfo=tzinfo)
Out[102]: datetime.datetime(2013, 10, 27, 3, 0, tzinfo=<DstTzInfo 'CET' CEST+2:00:00 DST>)

In [103]: datetime.datetime(2013, 10, 27, 3, 0, tzinfo=tzinfo).replace(hour=1)
Out[103]: datetime.datetime(2013, 10, 27, 1, 0, tzinfo=<DstTzInfo 'CET' CEST+2:00:00 DST>)

So the replace is actually producing invalid times (since, due to the hour change, the offset should change from +2 to +1)? This happens for both datetime.datetime and pd.Timestamp, so it is not necessarily related to this PR. But it is the reason, I think, that the (simpler) patch of de-localizing to naive at the beginning and re-localizing at the end of the replace method does not work, as errors can get raised in the re-localization when you get invalid dates.
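
A small illustration of that point (a sketch; pytz.normalize() is the usual way to repair the fixed offset after the wall time has been changed):

from datetime import datetime
import pytz

tz = pytz.timezone('CET')
aware = tz.localize(datetime(2016, 3, 27, 3, 0))  # CEST, UTC+02:00
moved = aware.replace(hour=1)                     # still +02:00, but 01:00 should be +01:00
repaired = tz.normalize(moved)                    # same instant, with a consistent offset again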

@sscherfke (Author):

Any chance that this gets eventually merged? None of the core devs have come up with a better or more performant solution and the bug is still a bug …

@TomAugspurger (Contributor):

I haven't gone through the whole discussion, but IIRC we needed a new ASV benchmark in addition to the existing ones. I added a simple one to https://github.com/TomAugspurger/pandas/tree/sscherfke-master

import pandas as pd
import pytz


class TimeTimestamp(object):

    goal_time = 0.5

    def setup(self):
        self.ts = pd.Timestamp("2016-03-28")  # after DST
        self.tzinfo = pytz.timezone("CET")

    def time_timestamp_replace(self):
        self.ts.replace(tzinfo=self.tzinfo).replace(tzinfo=None)

Output

       before           after         ratio
     [15db50bb]       [ab7ea875]
+      10.8±0.2μs       18.3±0.8μs     1.70  timestamp.TimeTimestamp.time_timestamp_replace

So there's some work to be done there. I can take a look after Scipy next week.

@@ -37,6 +37,7 @@ Bug Fixes
Conversion
^^^^^^^^^^

- Bug in ``Timestamp.replace`` when replacing ``tzinfo`` around DST changes (:issue:`15683`)
Contributor:

needs to go in 0.21.0

@sscherfke (Author) commented Jul 10, 2017:

For the specific question of @sscherfke about the construction of a Timestamp in cython from all its parts (see https://github.com/pandas-dev/pandas/pull/16110/files#r114604812), is there such a function? (@jreback?)

This question is still unanswered.

@jreback (Contributor) commented Aug 17, 2017:

if you can rebase and move docs to 0.21.0

@jorisvandenbossche (Member):

I think this problem (#16110 (comment)) is still a problem for updating this? (unless the current one using the stdlib datetime.datetime is not a performance bottleneck)

@sscherfke (Author):

Yes, it would still be a problem but none of the core devs has come up with a good solution yet. I don’t know enough about Pandas to improve my solution.

@sscherfke (Author):

Maybe you just have to decide whether, in this particular case, you want good performance or correct results (at least for now).

If you should decide that correct results are more important, I'd gladly update my PR for the next release.

@jreback (Contributor) left a comment:

try the change about typing and see how perf is

_TSObject ts

# set to naive if needed
_tzinfo = self.tzinfo
value = self.value
if _tzinfo is not None:
value = tz_convert_single(value, 'UTC', _tzinfo)
value_tz = tz_convert_single(value, _tzinfo, 'UTC')
Contributor:

change the definition of tz_convert_single to cpdef int64_t and here make value_tz and offset int64_t definitions

@sscherfke (Author):

#17356 is a clean version of this PR.

@sscherfke closed this on Aug 28, 2017
@gfyoung added this to the No action milestone on Aug 28, 2017
Labels
Bug · Compat (pandas objects compatibility with Numpy or Python functions) · Datetime (Datetime data dtype)

Successfully merging this pull request may close these issues.

BUG: Timestamp.replace chaining not compat with datetime.replace
5 participants