Skip to content

[READY] Improved performance of Period's default formatter (period_format) #51459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

smarie
Copy link
Contributor

@smarie smarie commented Feb 17, 2023

Period's default formatter (period_format) is now significantly (~twice) faster.

  • This improves performance of str(Period), repr(Period), and Period.strftime(fmt=None),
  • as well as PeriodArray.strftime(fmt=None), PeriodIndex.strftime(fmt=None) and PeriodIndex.format(fmt=None).
  • Finally, to_csv operations involving PeriodArray or PeriodIndex with default date_format are also significantly accelerated.

ASVs:

> asv compare d82f9ddb c1bbfd6c
       before           after         ratio
     [d82f9ddb]       [c1bbfd6c]
-     3.12±0.05μs         998±40ns     0.32  tslibs.period.PeriodUnaryMethods.time_repr('M')
-     3.31±0.05μs       1.58±0.1μs     0.48  tslibs.period.PeriodUnaryMethods.time_repr('min')
-     3.15±0.09μs      1.05±0.07μs     0.33  tslibs.period.PeriodUnaryMethods.time_str('M')
-     3.20±0.05μs       1.75±0.1μs     0.55  tslibs.period.PeriodUnaryMethods.time_str('min')
-      2.90±0.1μs         850±40ns     0.29  tslibs.period.PeriodUnaryMethods.time_strftime_default('M')
-     3.01±0.01μs       1.48±0.1μs     0.49  tslibs.period.PeriodUnaryMethods.time_strftime_default('min')
-     4.16±0.05ms      2.37±0.08ms     0.57  strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'D')
-     4.35±0.06ms       2.63±0.1ms     0.60  strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'H')
-      40.3±0.6ms       21.8±0.2ms     0.54  strftime.PeriodStrftime.time_frame_period_formatting_default(10000, 'D')
-      43.5±0.2ms         26.3±1ms     0.60  strftime.PeriodStrftime.time_frame_period_formatting_default(10000, 'H')
      4.58±0.08ms      2.88±0.07ms    ~0.63  strftime.PeriodStrftime.time_frame_period_formatting_index_default(1000, 'D')
       4.73±0.2ms      3.02±0.05ms    ~0.64  strftime.PeriodStrftime.time_frame_period_formatting_index_default(1000, 'H')
-      42.8±0.3ms       24.0±0.5ms     0.56  strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'D')
-      43.2±0.3ms       27.3±0.3ms     0.63  strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'H')
-     4.16±0.05ms       2.41±0.3ms     0.58  strftime.PeriodStrftime.time_frame_period_to_str(1000, 'D')
-     4.28±0.03ms       2.71±0.1ms     0.63  strftime.PeriodStrftime.time_frame_period_to_str(1000, 'H')
-      41.6±0.7ms       22.7±0.2ms     0.55  strftime.PeriodStrftime.time_frame_period_to_str(10000, 'D')
-      42.9±0.6ms       26.0±0.5ms     0.61  strftime.PeriodStrftime.time_frame_period_to_str(10000, 'H')

EDIT: additional for to_csv

-     5.94±0.05ms       3.94±0.3ms     0.66  io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'D') 
-      6.39±0.1ms      4.37±0.04ms     0.68  io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'H') 
-      51.8±0.2ms       30.2±0.3ms     0.58  io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'D')
-       59.8±20ms       35.0±0.2ms     0.59  io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'H')
-      7.09±0.2ms       3.94±0.2ms     0.56  io.csv.ToCSVPeriod.time_frame_period_no_format(1000, 'D')  
-      6.72±0.1ms      4.30±0.03ms     0.64  io.csv.ToCSVPeriod.time_frame_period_no_format(1000, 'H')  
-        54.0±2ms       30.5±0.3ms     0.56  io.csv.ToCSVPeriod.time_frame_period_no_format(10000, 'D') 
-        54.4±1ms       35.3±0.5ms     0.65  io.csv.ToCSVPeriod.time_frame_period_no_format(10000, 'H')
       9.94±0.3ms       9.70±0.1ms     0.98  io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'D')
      9.69±0.07ms       9.84±0.7ms     1.02  io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'H')
       85.4±0.9ms       84.9±0.9ms     0.99  io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'D')
         89.3±2ms         87.3±3ms     0.98  io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'H')
-      6.32±0.3ms      3.78±0.03ms     0.60  io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(1000, 'D')
-      6.30±0.2ms       4.60±0.4ms     0.73  io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(1000, 'H')
         50.9±2ms         33.9±8ms    ~0.66  io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(10000, 'D')
         52.6±1ms         35.6±6ms    ~0.68  io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(10000, 'H')

@smarie
Copy link
Contributor Author

smarie commented Feb 17, 2023

It seems that the failing tests are not related to this PR's contents, do you confirm ?

@smarie
Copy link
Contributor Author

smarie commented Feb 18, 2023

I added a few asvs for the to_csv operations and updated the what's new. This is now ready for review @MarcoGorelli

Note that I do not understand why ToCSVPeriod.time_frame_period_formatting is accelerated too, since it has a custom date_format. Maybe there is some default period formatting operation happening as an intermediate step ? To investigate... in another PR maybe

I found the reason and added a docstring on that bench function. The date_format argument is not used by to_csv when the PeriodArray is not the index of the dataframe, so the formatting uses the default format, and therefore it is impacted by the acceleration.

…t was improved: the date_format parameter is not taken into account today.
@smarie
Copy link
Contributor Author

smarie commented Feb 20, 2023

Note: the failing test seeems to be related to a network issue, so completely unrelated.

FAILED pandas/tests/io/parser/common/test_file_buffer_url.py::test_file_descriptor_leak[c_high] - AssertionError: ([], [pconn(fd=14, family=<AddressFamily.AF_INET6: 10>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='::1', port=35970), raddr=addr(ip='::1', port=3306), status='ESTABLISHED')])

@mroeschke since you have reviewed my similar PRs for strftime in the past, do not hesitate to jump in too.

Note that once this PR will be merged, #51298 will be finally small enough to be revewed

@smarie smarie changed the title Improved performance of Period's default formatter (period_format) [READY] Improved performance of Period's default formatter (period_format) Feb 21, 2023
@simonjayhawkins simonjayhawkins added Performance Memory or execution speed performance Output-Formatting __repr__ of pandas objects, to_string labels Feb 22, 2023
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one, the changes themselves look good to me! Just got a question

Regarding date_format only taking effect for Period if it's in the index - that looks like a bug to me, well done for noticing! I don't see any issue open about this, do you want to open one? (else I can, no worries)

Comment on lines 78 to 86
def time_frame_period_formatting_custom(self, obs, fq):
self.data["p"].dt.strftime(date_format="%Y-%m-%d --- %H:%M:%S")

def time_frame_period_formatting_iso8601_strftime_Z(self, obs, fq):
self.data["p"].dt.strftime(date_format="%Y-%m-%dT%H:%M:%SZ")

def time_frame_period_formatting_iso8601_strftime_offset(self, obs, fq):
"""Not optimized yet as %z is not supported by `convert_strftime_format`"""
self.data["p"].dt.strftime(date_format="%Y-%m-%dT%H:%M:%S%z")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do these not show up in the asv results you posted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, it seems that I forgot them in the copy paste. I'll try to rerun asv partially to get them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: Actually I had removed all asvs that were expected not to show improvements (custom date formats). Here are all ASVs for PeriodStrftime:

       15.4±0.2ms         17.1±2ms    ~1.11  strftime.PeriodStrftime.time_frame_period_formatting_custom(1000, 'D')
       15.0±0.7ms       14.1±0.2ms     0.94  strftime.PeriodStrftime.time_frame_period_formatting_custom(1000, 'H')
          142±6ms          153±5ms     1.08  strftime.PeriodStrftime.time_frame_period_formatting_custom(10000, 'D')
          143±5ms          157±7ms     1.10  strftime.PeriodStrftime.time_frame_period_formatting_custom(10000, 'H')
-     4.37±0.07ms      2.24±0.06ms     0.51  strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'D')
-     4.72±0.06ms       2.90±0.1ms     0.61  strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'H')
-        42.8±1ms       24.1±0.7ms     0.56  strftime.PeriodStrftime.time_frame_period_formatting_default(10000, 'D')
-      48.4±0.5ms       24.4±0.2ms     0.50  strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'D')
-        50.9±3ms       27.1±0.7ms     0.53  strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'H')
       7.52±0.7ms       7.22±0.3ms     0.96  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(1000, 'D')
       7.12±0.1ms       7.00±0.3ms     0.98  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(1000, 'H')
         75.2±4ms         68.1±1ms    ~0.91  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(10000, 'D')
         71.9±4ms         77.3±8ms     1.08  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(10000, 'H')
       13.5±0.5ms       14.0±0.4ms     1.03  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(1000, 'D')
         15.9±3ms         14.4±1ms    ~0.91  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(1000, 'H')
          135±3ms          148±7ms     1.10  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(10000, 'D')
          140±3ms          136±2ms     0.97  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(10000, 'H')

@smarie
Copy link
Contributor Author

smarie commented Feb 24, 2023

Thanks @MarcoGorelli !

Regarding date_format only taking effect for Period if it's in the index - that looks like a bug to me, well done for noticing! I don't see any issue open about this, do you want to open one? (else I can, no worries)

Done #51621

Sylvain MARIE added 2 commits February 28, 2023 13:43
…ure/44764_perf_issue_new_period

# Conflicts:
#	doc/source/whatsnew/v1.5.4.rst
@smarie smarie requested a review from MarcoGorelli February 28, 2023 12:46
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @smarie !

Leaving open for a bit in case others have comments

@MarcoGorelli MarcoGorelli added this to the 2.1 milestone Feb 28, 2023
@smarie
Copy link
Contributor Author

smarie commented Mar 13, 2023

@jbrockmendel @MarcoGorelli would you have a few mins this week to check this PR so that I can move to the next one ?
or maybe @mroeschke ?

@MarcoGorelli
Copy link
Member

yup taking a look now (was off last week)

return f"{dts.year}"

elif freq_group == FR_QTR and (is_fmt_none or fmt == "%FQ%q"):
# get quarter and modify dts.year to be the fiscal year (?)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the changes themselves look good, I'm just confused by this (?). are you asking for confirmation?

Copy link
Contributor Author

@smarie smarie Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes @MarcoGorelli , "fiscal year" is the wording to confirm in this comment, I wasn't sure that this was the right definition.

get_yq docstring is not very explicit about this, apart from stating Sets dts.year in-place.
Maybe this docstring could be improved according to the actual answer.

cdef int get_yq(int64_t ordinal, int freq, npy_datetimestruct* dts):

Sylvain MARIE added 4 commits March 14, 2023 22:24
@MarcoGorelli
Copy link
Member

ok should be good - I'll give it a final look and merge this week

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had another look, and to be honest I'm slightly scared about merging something this user-facing

What I'm most concerned about is that not all these reprs seems to be explicitly tested for. Some of them are, e.g.

def test_millisecond_repr(self):
p = Period("2000-01-01 12:15:02.123")
assert repr(p) == "Period('2000-01-01 12:15:02.123', 'L')"
def test_microsecond_repr(self):
p = Period("2000-01-01 12:15:02.123567")
assert repr(p) == "Period('2000-01-01 12:15:02.123567', 'U')"

but then FR_SEC isn't (as far as I can tell).

Reckon we should add such unit tests for each period group (perhaps in a separate PR, which we could get in very quickly)? That would give me much more confidence about merging this

Really sorry this is taking so long

@smarie
Copy link
Contributor Author

smarie commented Mar 18, 2023

Reckon we should add such unit tests for each period group (perhaps in a separate PR, which we could get in very quickly)? That would give me much more confidence about merging this

I did not review the available tests extensively so maybe this is indeed needed. I'll have a look one of these days

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 26, 2023
@smarie
Copy link
Contributor Author

smarie commented Apr 26, 2023

Yes this is still active and even ready for merge, but it would require that another PR pushes a few more tests first so that @MarcoGorelli is more confident about the non-regressive impact :)

…ure/44764_perf_issue_new_period

# Conflicts:
#	doc/source/whatsnew/v2.1.0.rst
@smarie smarie mentioned this pull request May 6, 2023
3 tasks
@smarie smarie requested a review from MarcoGorelli May 6, 2023 13:18
@smarie
Copy link
Contributor Author

smarie commented May 6, 2023

Now that #53003 is merged this should be ok - all new tests seem to pass.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, let's get this in, thanks @smarie !

@MarcoGorelli
Copy link
Member

let's get this in, thanks @smarie

@MarcoGorelli MarcoGorelli merged commit 2cdca01 into pandas-dev:main May 7, 2023
@smarie
Copy link
Contributor Author

smarie commented May 8, 2023

Great news @MarcoGorelli thanks !
I'll now finally move back to the main PR of this thread: #51298
Stay tuned ;)

Thanks again !

Rylie-W pushed a commit to Rylie-W/pandas that referenced this pull request May 19, 2023
…_format`) (pandas-dev#51459)

* Improved performance of default period formatting (`period_format`). Added corresponding ASVs

* Improved ASV for period frames and datetimes

* What's new

* Update asv_bench/benchmarks/strftime.py

* Fixed whats new backticks

* Completed whatsnew

* Added ASVs for to_csv for period

* Aligned the namings

* Completed Whats new

* Added a docstring explaining why the ASV bench with custom date format was improved: the date_format parameter is not taken into account today.

* Moved whatsnew to 2.0.0

* Moved whatsnew to 2.1

* Improved docstring as per code review

* Renamed asv params as per code review

* Fixed ASV comment as per code review

* ASV: renamed parameters as per code review

* Improved `period_format`: now the performance is the same when no format is provided and if an explicit `fmt` is provided matching the actual default format for this `freq`

* Code review: Improved strftime ASV: set_index is now in the setup

* Removed useless main

* Removed wrong code

* Improved ASVs for period formatting: now there is a "default explicit" everywhere

* Update pandas/_libs/tslibs/period.pyx

* Update pandas/_libs/tslibs/period.pyx

* Update pandas/_libs/tslibs/period.pyx

* Minor refactoring to avoid retesting for none several time

* Fixed issue: bool does not exist, using bint

* Added missing quarter variable as cdef

* Fixed asv bug

* Code review: fixed docstring

* Update doc/source/whatsnew/v2.1.0.rst

* fixup

---------

Co-authored-by: Sylvain MARIE <[email protected]>
Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: MarcoGorelli <>
@smarie smarie deleted the feature/44764_perf_issue_new_period branch May 22, 2023 16:17
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
…_format`) (pandas-dev#51459)

* Improved performance of default period formatting (`period_format`). Added corresponding ASVs

* Improved ASV for period frames and datetimes

* What's new

* Update asv_bench/benchmarks/strftime.py

* Fixed whats new backticks

* Completed whatsnew

* Added ASVs for to_csv for period

* Aligned the namings

* Completed Whats new

* Added a docstring explaining why the ASV bench with custom date format was improved: the date_format parameter is not taken into account today.

* Moved whatsnew to 2.0.0

* Moved whatsnew to 2.1

* Improved docstring as per code review

* Renamed asv params as per code review

* Fixed ASV comment as per code review

* ASV: renamed parameters as per code review

* Improved `period_format`: now the performance is the same when no format is provided and if an explicit `fmt` is provided matching the actual default format for this `freq`

* Code review: Improved strftime ASV: set_index is now in the setup

* Removed useless main

* Removed wrong code

* Improved ASVs for period formatting: now there is a "default explicit" everywhere

* Update pandas/_libs/tslibs/period.pyx

* Update pandas/_libs/tslibs/period.pyx

* Update pandas/_libs/tslibs/period.pyx

* Minor refactoring to avoid retesting for none several time

* Fixed issue: bool does not exist, using bint

* Added missing quarter variable as cdef

* Fixed asv bug

* Code review: fixed docstring

* Update doc/source/whatsnew/v2.1.0.rst

* fixup

---------

Co-authored-by: Sylvain MARIE <[email protected]>
Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: MarcoGorelli <>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string Performance Memory or execution speed performance Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants