PERF: BlockPlacement.copy() speed-up #10073
Copying of a `BlockPlacement` is currently slower than it could be:

* `copy()` cannot be accessed from python, one currently needs to re-implement `copy()` if one wants to duplicate a `BlockPlacement` instance
* `copy()` tries to infer a slice for any array contained in the `BlockPlacement`.
* After `copy()` passes an array to the constructor, a sanity check in the form of `np.require(val, dtype=np.int64, requirements='W')` checks that the array is a valid input for a `BlockPlacement`. This is unnecessary since the array originated from a `BlockPlacement` for which this check was already done.
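To illustrate the redundant-validation point, here is a rough sketch in plain Python. `PlacementSketch` and `_from_validated` are hypothetical names made up for this example; this is not pandas' actual Cython `BlockPlacement`, only the general idea of validating once in the constructor and skipping the check when copying.

```python
import numpy as np


class PlacementSketch:
    """Hypothetical stand-in for a placement object (not pandas' BlockPlacement)."""

    def __init__(self, val):
        # Validation path: coerce the input to a writeable int64 array,
        # using the same np.require call quoted in the description.
        self._arr = np.require(val, dtype=np.int64, requirements="W")

    @classmethod
    def _from_validated(cls, arr):
        # Fast path (assumption for illustration): skip np.require because
        # `arr` comes from an instance that already passed validation.
        obj = cls.__new__(cls)
        obj._arr = arr
        return obj

    def copy(self):
        # Duplicate the already-validated array and bypass the constructor check.
        return PlacementSketch._from_validated(self._arr.copy())


original = PlacementSketch([0, 2, 5])
duplicate = original.copy()
```

The copy stays a writeable `int64` array because `ndarray.copy()` preserves the dtype and returns a fresh, writeable buffer, so repeating the check adds cost without adding safety in this sketch.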
Can you run the perf suite and see where this helps? https://github.com/pydata/pandas/wiki/Performance-Testing
@jreback I apologise in advance for this disproportionately long post. I tried running vbench but have two problems: 1. I am on Windows and cannot use the script, but even after invoking vbench with what I think is the Windows equivalent, I am 2. not using git but `hg-git`. I tried to use git for Windows to re-download my fork from GitHub and run vbench on it, but
I am not sure which part of my setup is causing it. I know it is somewhat unlikely, but do you know of another pandas developer using Windows who could help me out? Reason for the PR: allow users to create custom high-performance dataframe constructors: I do not expect any significant speed improvements from this PR in the usual pandas usage (if any at all). Usually Index generation far outweighs
With both of these in place, dataframe instantiation time becomes then comparable to More importantly however, filling the thus allocated dataframe from a columnar database is x3 faster than filling an equivalent numpy structured array! Similarly, column-wise data analysis is faster with pandas dataframes than with numpy structured arrays. I know that for your use case, the instantiation speed of a dataframe is irrelevant, but it would be nice to give people the option to roll their own highly-optimized (and probably fragile ;-) ) dataframe constructors. This PR allows that. If you are concerned about performance regressions, I can restructure the PR such that Funky dataframe constructor using
Timing results comparing pandas dataframe to numpy structured array:
@ARF1 I think you are missing the point about incremental development. I certainly care about You may or may not have improvements with various PRs, but they each need to be proved incrementally. If that is not the case then you can simply bundle them and request that all the changes go in at once. However, the main reason for incremental changes is that it is far easier to review and think about the changes; if you are proposing a massive change then it will take quite some time to review (even after it passes all of the tests). That is why incremental changes are much better in a mature project like pandas.
@ARF1 you mention 'am I concerned about performance regressions': certainly. But you haven't proven the case either way. The only way is to test. Maybe @jorisvandenbossche can help you out on Windows with vbench. It should work if you have all of the deps installed. I do recall a fair number of users running vbench correctly on Windows.
Sorry, can't really help. I use both Windows and Linux, but always did my vbenches on Linux. I am not sure if it is supposed to work on Windows.
Not clear if this actually helps or hurts perf at all.