-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Performance issue with pandas/core/common.py -> maybe_box_datetimelike #30520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Can you post some profiling results for this? |
The following is the profiler output of the execution of my program with the current (non-optimized) version of the
Next is the profiler output of exactly the same code, but with the optimized (according to the proposed fix)
If you wish I can provide a specific code, which can reproduce the issue. Nevertheless, even if you don't see this as a performance issue, I think it still makes sense to fix it, as it is not necessary to convert |
That's a pretty compelling performance difference, thanks. PR would be welcome. |
pls show a specific reproducible example that exhibits the perf issue - it will be needed for asvs in any event |
Is the performance issue here really with From some quick adhoc timings it looks like >70% of the time spent in In [1]: import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+1469.ge817ffff3'
In [2]: ts = pd.Timestamp('2020')
In [3]: %timeit pd.Timestamp(ts)
856 ns ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit pd.core.common.maybe_box_datetimelike(ts)
1.18 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) Which looks like can be improved by a simple In [5]: def maybe_coerce_ts(ts):
...: if not isinstance(ts, pd.Timestamp):
...: return pd.Timestamp(ts)
...: return ts
...:
In [6]: %timeit maybe_coerce_ts(ts)
118 ns ± 0.631 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each) So, to me, it looks like the fix should go in the |
This seems reasonable. We would have to check that there aren't any other arguments passed, not sure what kind of hit that would entail. Might pick up some ground by not needing to create a new object. |
Ah, yeah, that makes it a little more convoluted. I'm fine with making both suggested changes, to the constructor and |
I'm currently trying to write the ASV performance test for this pull request. If I understand correctly, the test should be located under |
Depends a little bit on what you're profiling. If youre profiling only the Timestamp constructor, then it goes in benchmarks/tslibs/timestamp.py::TimestampConstruction. If you're profiling maybe_box_datetimelike, maybe inference.py? You've got some discretion in this case. |
this is subsumed by #30676 (though if you still think there is a perf benefit after that certainly ping) |
Code Sample, a copy-pastable example if possible
Problem description
This function determines whether
value
is of type(np.datetime64, datetime)
and if so, converts it intotslibs.Timestamp
. However, the classtslibs.Timestamp
is already a subclass ofdatetime
. Therefore, even if the objectvalue
is already of typetslibs.Timestamp
, it will be needlessly converted one more time. This issue has large performance, when working with large dataframes, which contain datet time objects. This issue could be fixed by changing the conditionif isinstance(value, (np.datetime64, datetime)):
to:
if isinstance(value, (np.datetime64, datetime)) and not isinstance(value, tslibs.Timestamp):
The text was updated successfully, but these errors were encountered: