Skip to content

BUG: GH29461 strftime() with nanoseconds for Timestamp #34317

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 30 commits into from
Closed

BUG: GH29461 strftime() with nanoseconds for Timestamp #34317

wants to merge 30 commits into from

Conversation

matteosantama
Copy link
Contributor

@matteosantama matteosantama commented May 22, 2020

The new method displays %f to the full precision, including trailing zeroes. I could add a check that only displays nanoseconds if they are non-zero.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls also add a note in 1.1. bug fix for datetime section

@jreback jreback added Output-Formatting __repr__ of pandas objects, to_string Datetime Datetime data dtype labels May 22, 2020
@matteosantama matteosantama changed the title strftime() with nanoseconds for Timestamp BUG: GH 29461 strftime() with nanoseconds for Timestamp May 22, 2020
@matteosantama matteosantama changed the title BUG: GH 29461 strftime() with nanoseconds for Timestamp BUG: GH29461 strftime() with nanoseconds for Timestamp May 22, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you see if we have any benchmarks for these and run them (or add some and run them).

I am afraid this is going to be quite slow.

@jreback
Copy link
Contributor

jreback commented May 22, 2020

pls also merge master

@matteosantama
Copy link
Contributor Author

Added benchmarks. Looks like it does have a bit of a performance impact.

[100.00%] ··· ====================== =============
                       fmt                        
              ---------------------- -------------
                %Y-%m-%d %H:%M:%S     3.18±0.02μs 
               %Y-%m-%d %H:%M:%S.%f   3.58±0.04μs 
              ====================== =============

       before           after         ratio
     [cb35d8a9]       [2ac690c3]
     <clipboard^2>       <strftime>
+     3.18±0.02μs      4.07±0.02μs     1.28  tslibs.timestamp.TimestampMethods.time_strftime('%Y-%m-%d %H:%M:%S')
+     3.58±0.04μs      4.13±0.02μs     1.15  tslibs.timestamp.TimestampMethods.time_strftime('%Y-%m-%d %H:%M:%S.%f')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Weirdly, the performance hit was greater when we DON'T have %f and we call super() (at least in terms of percentage). We shouldn't really be seeing any change in that case. I can't quite figure that one out. I see a few options

  1. Performance impact is not worth the benefit and we scrap the idea
  2. Performance impact is acceptable and we proceed
  3. Can we move this function to Cython? Not totally sure what this would entail

But I think if we can figure out why immediately calling super() is so much slower we'll have a better idea of where we need to make improvements.

@jreback
Copy link
Contributor

jreback commented May 25, 2020

@matteosantama I am more interested in series.dt.strftime(..) impact here. a tiny hit on Timestamp (likely because its an additional function call) is immaterial. We may or may not have this benchmark.

@matteosantama
Copy link
Contributor Author

Ah, of course. I wrote a benchmark, and it seems that performance has actually increased. I actually don't find this too surprising because strftime() in datetime.py does some unnecessary processing before calling time.strftime(). My implementation calls it directly. I think datetime.strftime() was written before time.strftime() could handle microseconds and timezones.

 before           after         ratio
     [c29e3950]       [f6ba1c9b]
     <master^2>       <strftime>
-       1.86±0.1s       1.67±0.01s     0.90  timeseries.DatetimeAccessor.time_dt_accessor_strftime('US/Eastern', '%Y-%m-%d %H:%M:%S.%f%z', 'NS')
-        769±60ms         679±20ms     0.88  timeseries.DatetimeAccessor.time_dt_accessor_strftime(None, '%Y-%m-%d %H:%M:%S%z', 'NS')
-       1.79±0.1s       1.56±0.03s     0.87  timeseries.DatetimeAccessor.time_dt_accessor_strftime('US/Eastern', '%Y-%m-%d %H:%M:%S.%f%z', 'T')
-       1.74±0.2s       1.49±0.03s     0.86  timeseries.DatetimeAccessor.time_dt_accessor_strftime('US/Eastern', '%Y-%m-%d %H:%M:%S%z', 'S')
-       1.79±0.1s       1.53±0.03s     0.86  timeseries.DatetimeAccessor.time_dt_accessor_strftime('US/Eastern', '%Y-%m-%d %H:%M:%S.%f%z', 'S')
-      1.79±0.08s       1.52±0.02s     0.85  timeseries.DatetimeAccessor.time_dt_accessor_strftime('US/Eastern', '%Y-%m-%d %H:%M:%S%z', 'NS')
-        808±50ms         675±20ms     0.83  timeseries.DatetimeAccessor.time_dt_accessor_strftime(None, '%Y-%m-%d %H:%M:%S.%f%z', 'S')
-        801±30ms         663±10ms     0.83  timeseries.DatetimeAccessor.time_dt_accessor_strftime(None, '%Y-%m-%d %H:%M:%S.%f%z', 'T')
-       1.83±0.1s       1.49±0.03s     0.82  timeseries.DatetimeAccessor.time_dt_accessor_strftime('US/Eastern', '%Y-%m-%d %H:%M:%S%z', 'T')
-        931±10ms         748±10ms     0.80  timeseries.DatetimeAccessor.time_dt_accessor_strftime('UTC', '%Y-%m-%d %H:%M:%S%z', 'NS')
-        946±10ms          759±5ms     0.80  timeseries.DatetimeAccessor.time_dt_accessor_strftime(tzutc(), '%Y-%m-%d %H:%M:%S%z', 'NS')
-        924±20ms          735±3ms     0.79  timeseries.DatetimeAccessor.time_dt_accessor_strftime('UTC', '%Y-%m-%d %H:%M:%S%z', 'T')
-        948±20ms          753±8ms     0.79  timeseries.DatetimeAccessor.time_dt_accessor_strftime(tzutc(), '%Y-%m-%d %H:%M:%S%z', 'S')
-        940±20ms         730±10ms     0.78  timeseries.DatetimeAccessor.time_dt_accessor_strftime('UTC', '%Y-%m-%d %H:%M:%S%z', 'S')
-        964±20ms          744±7ms     0.77  timeseries.DatetimeAccessor.time_dt_accessor_strftime(tzutc(), '%Y-%m-%d %H:%M:%S%z', 'T')
-        979±10ms         754±10ms     0.77  timeseries.DatetimeAccessor.time_dt_accessor_strftime(tzutc(), '%Y-%m-%d %H:%M:%S.%f%z', 'T')
-        989±20ms          758±4ms     0.77  timeseries.DatetimeAccessor.time_dt_accessor_strftime(tzutc(), '%Y-%m-%d %H:%M:%S.%f%z', 'S')
-        971±30ms         739±10ms     0.76  timeseries.DatetimeAccessor.time_dt_accessor_strftime('UTC', '%Y-%m-%d %H:%M:%S.%f%z', 'S')
-        980±30ms          737±7ms     0.75  timeseries.DatetimeAccessor.time_dt_accessor_strftime('UTC', '%Y-%m-%d %H:%M:%S.%f%z', 'T')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@matteosantama
Copy link
Contributor Author

@jreback CI is failing because of some doc string issues. The current documentation for strftime is pretty useless.

If you think this branch is worth merging, I can update it.


def time_dt_accessor(self, tz):
def time_dt_accessor(self, *args):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use explicit parameters here instead of args?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll fix that.

# only do additional processing if necessary
if self.nanosecond and '%f' in format:
newformat = []
for ch in format:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you not just use a regex here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, OK that would probably be cleaner

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timestamp.strftime(): missing support for nanoseconds
3 participants