[READY] Improved performance of Period's default formatter (period_format) #51459

Merged
53 commits
2504a90
Improved performance of default period formatting (`period_format`). …
Feb 17, 2023
c1bbfd6
Improved ASV for period frames and datetimes
Feb 17, 2023
3693b30
What's new
Feb 17, 2023
3c4efea
Update asv_bench/benchmarks/strftime.py
smarie Feb 17, 2023
7ccb423
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Feb 17, 2023
e4a58e7
Fixed whats new backticks
Feb 17, 2023
714d518
Merge branch 'feature/44764_perf_issue_new_period' of https://github.…
Feb 17, 2023
04fbf6b
Completed whatsnew
Feb 17, 2023
279114d
Added ASVs for to_csv for period
Feb 18, 2023
35ec182
Aligned the namings
Feb 18, 2023
28d8b5d
Completed Whats new
Feb 18, 2023
a69aeca
Added a docstring explaining why the ASV bench with custom date forma…
Feb 19, 2023
705ff81
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Feb 24, 2023
08d8d5e
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Feb 28, 2023
c1ff5fb
Moved whatsnew to 2.0.0
Feb 28, 2023
5b0acc8
Moved whatsnew to 2.1
Mar 1, 2023
f4cef3a
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 1, 2023
4cefded
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 2, 2023
99b82f9
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 3, 2023
3df0666
Merge branch 'main' into feature/44764_perf_issue_new_period
smarie Mar 3, 2023
576e475
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 6, 2023
d55e47f
Improved docstring as per code review
Mar 6, 2023
316d2b7
Merge remote-tracking branch 'origin/feature/44764_perf_issue_new_per…
Mar 6, 2023
e62867e
Merge branch 'main' into feature/44764_perf_issue_new_period
smarie Mar 7, 2023
2cd568c
Renamed asv params as per code review
Mar 8, 2023
a07d89e
Fixed ASV comment as per code review
Mar 8, 2023
8ad4827
ASV: renamed parameters as per code review
Mar 8, 2023
03d9778
Improved `period_format`: now the performance is the same when no for…
Mar 8, 2023
8d21053
Code review: Improved strftime ASV: set_index is now in the setup
Mar 8, 2023
13acb29
Merge remote-tracking branch 'origin/feature/44764_perf_issue_new_per…
Mar 8, 2023
4fd20ad
Removed useless main
Mar 8, 2023
5d4bfc1
Removed wrong code
Mar 8, 2023
77b8b7a
Improved ASVs for period formatting: now there is a "default explicit…
Mar 8, 2023
24a6b63
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 8, 2023
a309388
Update pandas/_libs/tslibs/period.pyx
smarie Mar 8, 2023
17eedeb
Update pandas/_libs/tslibs/period.pyx
smarie Mar 8, 2023
14b7489
Update pandas/_libs/tslibs/period.pyx
smarie Mar 8, 2023
e1a81ef
Minor refactoring to avoid retesting for none several time
Mar 8, 2023
faf97eb
Merge branch 'feature/44764_perf_issue_new_period' of https://github.…
Mar 8, 2023
f6ced94
Fixed issue: bool does not exist, using bint
Mar 8, 2023
3da1e4b
Added missing quarter variable as cdef
Mar 8, 2023
55d180e
Fixed asv bug
Mar 8, 2023
ea9fc47
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 9, 2023
4becd71
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 10, 2023
74cbabf
Merge branch 'main' into feature/44764_perf_issue_new_period
smarie Mar 13, 2023
b3d5963
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 14, 2023
1301795
Merge remote-tracking branch 'origin/feature/44764_perf_issue_new_per…
Mar 14, 2023
61ce96e
Code review: fixed docstring
Mar 15, 2023
6890776
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 15, 2023
e14c383
Merge branch 'main' into feature/44764_perf_issue_new_period
smarie Mar 16, 2023
ac68ab9
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
May 6, 2023
1a05c22
Update doc/source/whatsnew/v2.1.0.rst
MarcoGorelli May 6, 2023
d3c19d0
fixup
May 6, 2023
68 changes: 66 additions & 2 deletions asv_bench/benchmarks/strftime.py
@@ -32,10 +32,13 @@ def time_frame_date_formatting_custom(self, obs):
def time_frame_datetime_to_str(self, obs):
self.data["dt"].astype(str)

def time_frame_datetime_formatting_default_date_only(self, obs):
def time_frame_datetime_formatting_default(self, obs):
self.data["dt"].dt.strftime(date_format=None)

def time_frame_datetime_formatting_default_explicit_date_only(self, obs):
self.data["dt"].dt.strftime(date_format="%Y-%m-%d")

def time_frame_datetime_formatting_default(self, obs):
def time_frame_datetime_formatting_default_explicit(self, obs):
self.data["dt"].dt.strftime(date_format="%Y-%m-%d %H:%M:%S")

def time_frame_datetime_formatting_default_with_float(self, obs):
@@ -45,6 +48,44 @@ def time_frame_datetime_formatting_custom(self, obs):
self.data["dt"].dt.strftime(date_format="%Y-%m-%d --- %H:%M:%S")


class PeriodStrftime:
    timeout = 1500
    params = ([1000, 10000], ["D", "H"])
    param_names = ["obs", "fq"]

    def setup(self, obs, fq):
        self.data = pd.DataFrame(
            {
                "p": pd.period_range(start="2000-01-01", periods=obs, freq=fq),
                "r": [np.random.uniform()] * obs,
            }
        )

    def time_frame_period_to_str(self, obs, fq):
        self.data["p"].astype(str)

    def time_frame_period_formatting_default(self, obs, fq):
        """Note that, unlike datetimes, periods have many default formats that
        depend on the period's characteristics, so we have almost no chance
        to reach the same level of performance if a 'default' format string is
        explicitly provided by the user. See
        time_frame_datetime_formatting_default_explicit above."""
        self.data["p"].dt.strftime(date_format=None)

    def time_frame_period_formatting_index_default(self, obs, fq):
        self.data.set_index("p").index.format()

    def time_frame_period_formatting_custom(self, obs, fq):
        self.data["p"].dt.strftime(date_format="%Y-%m-%d --- %H:%M:%S")

    def time_frame_period_formatting_iso8601_strftime_Z(self, obs, fq):
        self.data["p"].dt.strftime(date_format="%Y-%m-%dT%H:%M:%SZ")

    def time_frame_period_formatting_iso8601_strftime_offset(self, obs, fq):
        """Not optimized yet as %z is not supported by `convert_strftime_format`"""
        self.data["p"].dt.strftime(date_format="%Y-%m-%dT%H:%M:%S%z")
Member

Why do these not show up in the ASV results you posted?

Contributor Author

Good point, it seems I left them out when copy-pasting. I'll try to rerun asv partially to get them.

Contributor Author

EDIT: Actually, I had removed all ASVs that were not expected to show improvements (custom date formats). Here are all ASV results for PeriodStrftime:

       15.4±0.2ms         17.1±2ms    ~1.11  strftime.PeriodStrftime.time_frame_period_formatting_custom(1000, 'D')
       15.0±0.7ms       14.1±0.2ms     0.94  strftime.PeriodStrftime.time_frame_period_formatting_custom(1000, 'H')
          142±6ms          153±5ms     1.08  strftime.PeriodStrftime.time_frame_period_formatting_custom(10000, 'D')
          143±5ms          157±7ms     1.10  strftime.PeriodStrftime.time_frame_period_formatting_custom(10000, 'H')
-     4.37±0.07ms      2.24±0.06ms     0.51  strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'D')
-     4.72±0.06ms       2.90±0.1ms     0.61  strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'H')
-        42.8±1ms       24.1±0.7ms     0.56  strftime.PeriodStrftime.time_frame_period_formatting_default(10000, 'D')
-      48.4±0.5ms       24.4±0.2ms     0.50  strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'D')
-        50.9±3ms       27.1±0.7ms     0.53  strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'H')
       7.52±0.7ms       7.22±0.3ms     0.96  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(1000, 'D')
       7.12±0.1ms       7.00±0.3ms     0.98  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(1000, 'H')
         75.2±4ms         68.1±1ms    ~0.91  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(10000, 'D')
         71.9±4ms         77.3±8ms     1.08  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(10000, 'H')
       13.5±0.5ms       14.0±0.4ms     1.03  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(1000, 'D')
         15.9±3ms         14.4±1ms    ~0.91  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(1000, 'H')
          135±3ms          148±7ms     1.10  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(10000, 'D')
          140±3ms          136±2ms     0.97  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(10000, 'H')
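
For readers less familiar with this output, here is a minimal sketch of how a row can be read, assuming the usual `asv` comparison layout of "before, after, ratio, benchmark name", where a leading `-` flags an improvement and `~` marks a change that asv does not treat as significant. The `parse_row` helper and its regular expression are hypothetical and only meant for the rows pasted above; they are not part of asv.

```python
import re

# One row of the comparison output pasted above, assumed to be laid out as:
#   [flag] <before>±<err>ms  <after>±<err>ms  [~]<ratio>  <benchmark name>
ROW = re.compile(
    r"^\s*(?P<flag>[-+])?\s*"
    r"(?P<before>[\d.]+)±[\d.]+ms\s+"
    r"(?P<after>[\d.]+)±[\d.]+ms\s+"
    r"~?(?P<ratio>[\d.]+)\s+"
    r"(?P<name>\S.*)$"
)


def parse_row(line):
    """Parse one result row and recompute the ratio and the speedup."""
    m = ROW.match(line)
    if m is None:
        raise ValueError(f"Unrecognized row: {line!r}")
    before = float(m["before"])
    after = float(m["after"])
    return {
        "name": m["name"],
        "before_ms": before,
        "after_ms": after,
        "ratio": after / before,    # below 1.0 means the new code is faster
        "speedup": before / after,  # e.g. about 2.0 for "twice as fast"
        "flagged": m["flag"] == "-",
    }


if __name__ == "__main__":
    row = parse_row(
        "-     4.37±0.07ms      2.24±0.06ms     0.51  "
        "strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'D')"
    )
    print(f"{row['name']}: ratio={row['ratio']:.2f}, speedup={row['speedup']:.2f}x")
```

Read this way, the flagged `formatting_default` and `formatting_index_default` rows are the ones that back the "about twice as fast" claim, while the custom and ISO 8601 rows stay within noise.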



class BusinessHourStrftime:
timeout = 1500
params = [1000, 10000]
@@ -62,3 +103,26 @@ def time_frame_offset_str(self, obs):

def time_frame_offset_repr(self, obs):
self.data["off"].apply(repr)


if __name__ == "__main__":
    # A __main__ to easily debug this script
    for cls in (DatetimeStrftime, PeriodStrftime, BusinessHourStrftime):
        all_params = dict()
        all_p_values = cls.params
        if len(cls.param_names) == 1:
            all_p_values = (all_p_values,)
        for p_name, p_values in zip(cls.param_names, all_p_values):
            all_params[p_name] = p_values

        from itertools import product

        for case in product(*all_params.values()):
            p_dict = {p_name: p_val for p_name, p_val in zip(all_params.keys(), case)}
            print(f"{cls.__name__} - {p_dict}")
            o = cls()
            o.setup(**p_dict)
            for m_name, m in cls.__dict__.items():
                if callable(m):
                    print(m_name)
                    m(o, **p_dict)
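
The debug loop above expands each benchmark class's `params` into a grid and calls every method once per combination. A self-contained sketch of the same idea follows, assuming a stand-in class: `DummyBench` and `run_all` are hypothetical names used for illustration only, and, unlike the loop in the diff, this version only calls methods whose name starts with `time_`.

```python
from itertools import product


class DummyBench:
    """Hypothetical stand-in for an asv benchmark class (not from the PR)."""

    params = ([1000, 10000], ["D", "H"])
    param_names = ["obs", "fq"]

    def setup(self, obs, fq):
        self.data = [fq] * obs

    def time_len(self, obs, fq):
        return len(self.data)


def run_all(cls):
    """Call every ``time_*`` method once per parameter combination."""
    all_p_values = cls.params
    if len(cls.param_names) == 1:
        # A single parameter is declared as a flat list: wrap it so that
        # product() sees one axis instead of one axis per value.
        all_p_values = (all_p_values,)

    results = []
    for case in product(*all_p_values):
        p_dict = dict(zip(cls.param_names, case))
        bench = cls()
        bench.setup(**p_dict)
        for name, method in vars(cls).items():
            if callable(method) and name.startswith("time_"):
                results.append((name, p_dict, method(bench, **p_dict)))
    return results


if __name__ == "__main__":
    for name, p_dict, value in run_all(DummyBench):
        print(name, p_dict, value)
```

With two values for `obs` and two for `fq`, the grid has four cases, which matches the four parameter combinations visible in the PeriodStrftime results.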
12 changes: 12 additions & 0 deletions asv_bench/benchmarks/tslibs/period.py
@@ -70,6 +70,18 @@ def time_now(self, freq):
def time_asfreq(self, freq):
self.per.asfreq("A")

    def time_str(self, freq):
        str(self.per)

    def time_repr(self, freq):
        repr(self.per)

    def time_strftime_default(self, freq):
        self.per.strftime(None)

    def time_strftime_custom(self, freq):
        self.per.strftime("%b. %d, %Y was a %A")
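
Outside of asv, the same kind of comparison can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the PR's benchmark: it uses `datetime` instead of `Period`, and `via_strftime`, `via_fstring` and `measure` are hypothetical helpers. It only shows how a per-call timing of a `strftime` path versus a direct string-formatting path can be taken with `timeit`.

```python
import timeit
from datetime import datetime

# A fixed timestamp, used as a stand-in for a Period (assumption).
dt = datetime(2000, 1, 2, 13, 5, 7)


def via_strftime():
    """Format through the generic strftime machinery."""
    return dt.strftime("%Y-%m-%d")


def via_fstring():
    """Format directly from the already-decoded fields."""
    return f"{dt.year}-{dt.month:02d}-{dt.day:02d}"


def measure(func, number=10_000, repeat=3):
    """Return the best observed time per call, in seconds."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


if __name__ == "__main__":
    # Both paths must agree before their timings are worth comparing.
    assert via_strftime() == via_fstring()
    print(f"strftime: {measure(via_strftime) * 1e9:.0f} ns per call")
    print(f"f-string: {measure(via_fstring) * 1e9:.0f} ns per call")
```

Absolute numbers depend on the machine and Python version, so no expected output is given here; the asv results posted in this PR remain the reference measurements.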


class PeriodConstructor:
params = [["D"], [True, False]]
9 changes: 9 additions & 0 deletions doc/source/whatsnew/v1.5.4.rst
@@ -22,6 +22,15 @@ Bug fixes
~~~~~~~~~
-

.. ---------------------------------------------------------------------------
.. _whatsnew_154.perf:

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~
- :class:`Period`'s default formatter (``period_format``) is now significantly
(~twice) faster. This improves performance of ``str(Period)``, ``repr(Period)``, and
:meth:`Period.strftime(fmt=None)`.

.. ---------------------------------------------------------------------------
.. _whatsnew_154.other:

113 changes: 78 additions & 35 deletions pandas/_libs/tslibs/period.pyx
@@ -1152,46 +1152,89 @@ cdef int64_t period_ordinal_to_dt64(int64_t ordinal, int freq) except? -1:


cdef str period_format(int64_t value, int freq, object fmt=None):
cdef:
int freq_group

if value == NPY_NAT:
return "NaT"

if isinstance(fmt, str):
# Encode using current locale, in case fmt contains non-utf8 chars
fmt = <bytes>util.string_encode_locale(fmt)

if fmt is None:
freq_group = get_freq_group(freq)
if freq_group == FR_ANN:
fmt = b"%Y"
elif freq_group == FR_QTR:
fmt = b"%FQ%q"
elif freq_group == FR_MTH:
fmt = b"%Y-%m"
elif freq_group == FR_WK:
left = period_asfreq(value, freq, FR_DAY, 0)
right = period_asfreq(value, freq, FR_DAY, 1)
return f"{period_format(left, FR_DAY)}/{period_format(right, FR_DAY)}"
elif freq_group == FR_BUS or freq_group == FR_DAY:
fmt = b"%Y-%m-%d"
elif freq_group == FR_HR:
fmt = b"%Y-%m-%d %H:00"
elif freq_group == FR_MIN:
fmt = b"%Y-%m-%d %H:%M"
elif freq_group == FR_SEC:
fmt = b"%Y-%m-%d %H:%M:%S"
elif freq_group == FR_MS:
fmt = b"%Y-%m-%d %H:%M:%S.%l"
elif freq_group == FR_US:
fmt = b"%Y-%m-%d %H:%M:%S.%u"
elif freq_group == FR_NS:
fmt = b"%Y-%m-%d %H:%M:%S.%n"
else:
raise ValueError(f"Unknown freq: {freq}")
return _period_default_format(value, freq)
else:
if isinstance(fmt, str):
# Encode using current locale, in case fmt contains non-utf8 chars
fmt = <bytes>util.string_encode_locale(fmt)

return _period_strftime(value, freq, fmt)

return _period_strftime(value, freq, fmt)

cdef str _period_default_format(int64_t value, int freq):
    """A faster default formatting function leveraging string formatting."""

    cdef:
        int freq_group, quarter
        npy_datetimestruct dts

    # fill dts
    get_date_info(value, freq, &dts)

    # get the appropriate format depending on frequency group
    freq_group = get_freq_group(freq)
    if freq_group == FR_ANN:
        # fmt = b"%Y"
        return f"{dts.year}"

    elif freq_group == FR_QTR:
        # fmt = b"%FQ%q"
        # get quarter and modify dts.year to be the fiscal year (?)
Member

The changes themselves look good, I'm just confused by this (?). Are you asking for confirmation?

Contributor Author

@smarie, Mar 13, 2023

Yes @MarcoGorelli, "fiscal year" is the wording to confirm in this comment; I wasn't sure that this was the right definition.

The `get_yq` docstring is not very explicit about this, apart from stating "Sets dts.year in-place".
Maybe this docstring could be improved according to the actual answer.

cdef int get_yq(int64_t ordinal, int freq, npy_datetimestruct* dts):

        quarter = get_yq(value, freq, &dts)
        return f"{dts.year}Q{quarter}"

    elif freq_group == FR_MTH:
        # fmt = b"%Y-%m"
        return f"{dts.year}-{dts.month:02d}"

    elif freq_group == FR_WK:
        # special: start_date/end_date. Recurse
        left = period_asfreq(value, freq, FR_DAY, 0)
        right = period_asfreq(value, freq, FR_DAY, 1)
        return f"{period_format(left, FR_DAY)}/{period_format(right, FR_DAY)}"

    elif freq_group == FR_BUS or freq_group == FR_DAY:
        # fmt = b"%Y-%m-%d"
        return f"{dts.year}-{dts.month:02d}-{dts.day:02d}"

    elif freq_group == FR_HR:
        # fmt = b"%Y-%m-%d %H:00"
        return f"{dts.year}-{dts.month:02d}-{dts.day:02d} {dts.hour:02d}:00"

    elif freq_group == FR_MIN:
        # fmt = b"%Y-%m-%d %H:%M"
        return (f"{dts.year}-{dts.month:02d}-{dts.day:02d} "
                f"{dts.hour:02d}:{dts.min:02d}")

    elif freq_group == FR_SEC:
        # fmt = b"%Y-%m-%d %H:%M:%S"
        return (f"{dts.year}-{dts.month:02d}-{dts.day:02d} "
                f"{dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}")

    elif freq_group == FR_MS:
        # fmt = b"%Y-%m-%d %H:%M:%S.%l"
        return (f"{dts.year}-{dts.month:02d}-{dts.day:02d} "
                f"{dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}"
                f".{(dts.us // 1_000):03d}")

    elif freq_group == FR_US:
        # fmt = b"%Y-%m-%d %H:%M:%S.%u"
        return (f"{dts.year}-{dts.month:02d}-{dts.day:02d} "
                f"{dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}"
                f".{(dts.us):06d}")

    elif freq_group == FR_NS:
        # fmt = b"%Y-%m-%d %H:%M:%S.%n"
        return (f"{dts.year}-{dts.month:02d}-{dts.day:02d} "
                f"{dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}"
                f".{((dts.us * 1000) + (dts.ps // 1000)):09d}")

    else:
        raise ValueError(f"Unknown freq: {freq}")
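
The idea behind this change can be sketched in plain Python: when no format is given, build the string directly from the date fields instead of going through a `strftime`-style format string. The code below is an illustration under assumptions, not the Cython implementation: it uses `datetime` in place of `npy_datetimestruct`, string labels such as `"SEC"` in place of the `FR_*` constants, and hypothetical function names. The quarterly and weekly branches are left out on purpose.

```python
from datetime import datetime


def default_format(dt, freq_group):
    """Build the default representation directly from the date fields."""
    date = f"{dt.year}-{dt.month:02d}-{dt.day:02d}"
    if freq_group == "ANN":
        return f"{dt.year}"
    if freq_group == "MTH":
        return f"{dt.year}-{dt.month:02d}"
    if freq_group in ("BUS", "DAY"):
        return date
    if freq_group == "HR":
        return f"{date} {dt.hour:02d}:00"
    if freq_group == "MIN":
        return f"{date} {dt.hour:02d}:{dt.minute:02d}"
    time = f"{dt.hour:02d}:{dt.minute:02d}:{dt.second:02d}"
    if freq_group == "SEC":
        return f"{date} {time}"
    if freq_group == "MS":
        # Milliseconds are the microseconds truncated to three digits.
        return f"{date} {time}.{dt.microsecond // 1_000:03d}"
    if freq_group == "US":
        return f"{date} {time}.{dt.microsecond:06d}"
    raise ValueError(f"Unknown freq group: {freq_group}")


def ns_fraction(us, ps):
    """Nine-digit nanosecond fraction from microseconds and picoseconds."""
    return f"{us * 1000 + ps // 1000:09d}"


def period_format(dt, freq_group, fmt=None):
    """Dispatch: fast path when no format is given, strftime otherwise."""
    if fmt is None:
        return default_format(dt, freq_group)
    return dt.strftime(fmt)


if __name__ == "__main__":
    dt = datetime(2000, 1, 2, 13, 5, 7, 123456)
    # The fast path must produce the same text as the explicit format string.
    assert default_format(dt, "SEC") == dt.strftime("%Y-%m-%d %H:%M:%S")
    print(period_format(dt, "MS"))
    print(ns_fraction(123456, 789000))
```

The design choice mirrored here is that the set of default formats is small and fixed, so each one can be hard-coded as a direct string build, while any user-supplied format still takes the general `strftime` path.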


cdef list extra_fmts = [(b"%q", b"^`AB`^"),