Improve DataFrame.select_dtypes scaling to wide data frames #28317
For up to 10k columns I saw the same behavior as the one you described: for 10k columns it took me 3 seconds. For 100k columns it takes 160 seconds (instead of the expected roughly 30 seconds). Profile
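The attached profile itself is not reproduced here; a rough sketch of how a comparable profile could be generated with the standard-library cProfile (the column counts and dtype mix below are assumptions, not the exact setup from this comment):

```python
# Illustrative only: build a wide two-dtype frame and profile select_dtypes.
import cProfile

import numpy as np
import pandas as pd

data = {f"i{k}": np.arange(3) for k in range(5_000)}      # int64 columns
data.update({f"f{k}": np.ones(3) for k in range(5_000)})  # float64 columns
df = pd.DataFrame(data)

# Sort by cumulative time so the expensive internals show up at the top.
cProfile.runctx('df.select_dtypes(include=["float64"])',
                globals(), locals(), sort="cumtime")
```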
So I was reading some portions of the code (starting from the top) and started thinking about this. Firstly, I don't think O(log n) is the right complexity :). We need to determine the dtype of every column, hence our complexity is O(n). Complexity-wise, the optimum we could achieve would be O(1), if we had a dictionary that contains a mapping of dtypes to columns. This, however, would essentially require something like static typing, or at least keeping track of type changes of columns after operations. I'm assuming we are not going to do this. So complexity-wise it's still O(n), but we can bring the constants down. However, I don't have real Cython experience here. Could somebody maybe provide some guidelines on how to tackle this issue?

A different approach: I'm assuming that to infer the dtype, a whole array is analyzed. One could maybe add an option
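A tiny illustration of the "mapping of dtypes to columns" idea; building the mapping is still O(n) in the number of columns, but each requested dtype is then a single lookup. The groupby one-liner below is just one possible way to build it, not how pandas does it internally:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})

# One pass over df.dtypes builds the mapping; keys are dtype objects,
# values are the column labels carrying that dtype.
by_dtype = df.columns.to_series().groupby(df.dtypes, sort=False).groups
print(by_dtype[np.dtype("int64")])   # Index(['a'], dtype='object')
```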
Thanks, I think you're right about the complexity stuff. Sorry if I led anyone astray there. I don't understand your comment about inference though. What exactly are we inferring? We shouldn't be passing the values of a Series / DataFrame to infer_dtype, as we already have the dtypes.
Maybe I don't know pandas internals well enough then. ;) (I was thinking too much in the direction of Python not having static typing, but that doesn't make much sense with e.g. numpy, I have to admit. ;)) The profile shows that we are wasting most of our time in infer_dtype. Why are we doing that if we know the dtypes? I mean, if we have the dtypes, e.g. in a list, it should take close to no time to get all the dtypes out. I think I'll try to investigate the code path to see where and why infer_dtype is called.
Thanks. Glancing at the implementation, we do … We may also call it inside the … FYI @datajanko, if you're looking into this I would recommend line_profiler.
gives
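The exact invocation and its output are not preserved above; a minimal sketch of pointing line_profiler at select_dtypes from a plain script might look like the following (in IPython, %load_ext line_profiler plus %lprun -f is the more common route). The frame below is a stand-in, not the one used in the thread:

```python
# Sketch, assuming the line_profiler package is installed.
import numpy as np
import pandas as pd
from line_profiler import LineProfiler

df = pd.DataFrame({f"c{i}": np.arange(3) for i in range(1_000)})

profiler = LineProfiler()
profiler.add_function(pd.DataFrame.select_dtypes)        # record per-line timings for this method
profiler.runcall(df.select_dtypes, include=["number"])   # run the call under the profiler
profiler.print_stats()                                   # hits / time per source line
```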
Thanks for the hint. I'll have a look into this.
From your example, we see that the include_these blocks (and probably exclude_these as well) take the longest. The starmap iteration over each column is inefficient. Actually, we only need to do this once per dtype in self.dtypes, so we would have at most something like 30 hits there. I'll work on the issue ASAP.
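A sketch of that "once per dtype in self.dtypes" idea, outside of the actual pandas code paths. The function name and the simplified include handling (exact dtype matches only, no "number"-style supertypes and no exclude) are assumptions for illustration:

```python
from collections import defaultdict

import numpy as np
import pandas as pd

def select_dtypes_by_group(df, include):
    # Normalise the requested dtypes once.
    include = {pd.api.types.pandas_dtype(d) for d in include}

    # Group column labels by their dtype: one cheap pass over df.dtypes,
    # so the membership test below runs once per distinct dtype, not per column.
    groups = defaultdict(list)
    for col, dtype in df.dtypes.items():
        groups[dtype].append(col)

    keep = [col for dtype, cols in groups.items() if dtype in include for col in cols]
    return df[keep]

wide = pd.DataFrame({f"c{i}": np.ones(3) for i in range(100)})
wide["label"] = "text"
print(select_dtypes_by_group(wide, include=["float64"]).shape)   # (3, 100)
```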
Okay, for small data, this can be easily improved:
changes to
Note that the last line looks awful, and the second line looks nice. What did I do:
So obviously, rewriting the values of the dict and appending one item to a list does not scale well here. A different approach I'll try next is to just … On a slightly different note: I'm not able to install line_profiler.
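Just to make the scaling point concrete, here is a toy contrast (not the WIP code being discussed): repeatedly rebuilding an immutable pandas Index copies it on every step, while accumulating labels in a plain list and converting once at the end stays cheap.

```python
from timeit import timeit

import pandas as pd

cols = [f"c{i}" for i in range(5_000)]

def rebuild_index_each_time():
    acc = pd.Index([])
    for c in cols:
        acc = acc.append(pd.Index([c]))   # copies the whole Index each iteration -> quadratic
    return acc

def append_then_convert_once():
    acc = []
    for c in cols:
        acc.append(c)                     # amortised O(1) per column
    return pd.Index(acc)

print(timeit(rebuild_index_each_time, number=1))
print(timeit(append_then_convert_once, number=1))
```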
Hmm, I'm not sure. Are you pip- or conda-installing it? It does have a C extension; not sure if they have a wheel. A WIP PR is just fine. Make sure to include a new ASV benchmark with a wide-ish DataFrame.
I tried both ways to install it, without success. I'll add an ASV benchmark, probably tomorrow.
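For reference, an ASV benchmark in the style of pandas' asv_bench suite might look roughly like this; the class and method names are made up, and the benchmark actually added for this issue may differ:

```python
import numpy as np
import pandas as pd

class WideSelectDtypes:
    # Parametrise over the number of columns to expose the scaling behaviour.
    params = [100, 1_000, 10_000]
    param_names = ["n_cols"]

    def setup(self, n_cols):
        self.df = pd.DataFrame(np.random.randn(10, n_cols))

    def time_select_dtypes(self, n_cols):
        self.df.select_dtypes(include="number")
```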
Running select_dtypes for a variety of lengths.
This looks O(n) in the number of columns. I think that can be improved (to whatever the complexity of set intersection is).
Edit: maybe it's O(log(n)), I never took CS :)
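A rough reconstruction of the kind of timing run described above; the exact column counts and dtypes behind the original numbers are not shown here, so these are assumptions:

```python
import time

import numpy as np
import pandas as pd

for n_cols in [100, 1_000, 10_000]:
    df = pd.DataFrame(np.ones((2, n_cols)))        # wide, all-float frame
    start = time.perf_counter()
    df.select_dtypes(include=["float64"])
    print(n_cols, round(time.perf_counter() - start, 4))
```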