-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
POC: ArrayManager -- array-based data manager for columnar store #36010
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC: ArrayManager -- array-based data manager for columnar store #36010
Conversation
Whats DataManager? Edit: nevermind, clear now that I look at the code. |
pandas/core/internals/managers.py
Outdated
do_integrity_check: bool = True, | ||
): | ||
self._axes = axes | ||
self.arrays = arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would have expected youd need to be backed by blocks to get e.g. replace
to work. is there a nice way around that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was my explicit goal of the current experiment to not use Blocks, because it gives another level of indirection / overhead. But for sure, there are currently certain algos like replace
that are defined on the Blocks (see also the inline comment at def replace
). So short term we can wrap the arrays in Blocks just for those operations when needed (to be clear, this would only be a hack to get a POC more fully working), and eventually I would abstract some of those algos into separate array-based algos (like we have for many others already which are not defined in the Blocks itself), and then those can be used by both the Block.replace as ArrayManager.replace.
So we will need to see a bit how many of those things from Blocks need reuse, but if it turns out possible (eg is limited to a set of algos that can be factored out), I would prefer to drop the Block concept in a potential ArrayManager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eventually I would abstract some of those algos into separate array-based algos
+1 for this idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poking at this, I think the hard part would be implementing can_hold_element as an array function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poking at this, I think the hard part would be implementing can_hold_element as an array function.
For numpy arrays, I think we could write a single helper function that works for all dtypes? And for ExtensionArrays, we could maybe add a similar method on the EA itself to the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right on both points: a helper for ndarrays and a method for EAs would both be good.
Spent some time on this yesterday and found other Block methods are necessary (putmask comes to mind), to the point where keeping them as methods feels more intuitive than standalone functions. It might be easiest to use the existing Block code for these methods, but delay wrapping the arrays in Blocks until they are needed (at least for the POC stage)
Overall, having a base class, BlockManager, and ArrayManager makes sense for easy prototyping / switching. That also lets us clearly define the interface between pandas' internals and the rest of pandas. For ease of review, can you split |
The diff should be clean right now. It's added in manager.py itself, but it's purely an addition of lines, so the diff should look the same as if it was a new file. |
For the purpose of seeing how close this is to working, would it make sense to use a pd.options flag to control ArrayManager vs BlockManager instead of a DataFrame keyword? This would make it straightforward to run all the tests using ArrayManager with few edits. |
Yes, that's a good idea (I now added a keyword to the DataFrame constructor, and for testing purposes switched the default. But indeed with an option it is easier to switch) |
@jorisvandenbossche im curious how this performs for the snippet discussed in #34683
|
It's a bit faster: In [5]: pd.options.mode.data_manager = 'block'
In [6]: ... # code to create df1 and df2
In [7]: %timeit pd.concat([df1, df2])
465 µs ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: pd.options.mode.data_manager = 'array'
In [9]: ... # code to create df1 and df2
In [10]: %timeit pd.concat([df1, df2])
298 µs ± 6.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) But: (it's also not directly related to "consolidation in reshape or not", because after the first iteration of the ``%timeit |
a7880e9
to
896080a
Compare
Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-12-12 19:03:25 UTC |
pandas/core/internals/managers.py
Outdated
# mgr_shape = self.shape | ||
# tot_items = sum(len(x.mgr_locs) for x in self.blocks) | ||
# for block in self.blocks: | ||
# if block._verify_integrity and block.shape[1:] != mgr_shape[1:]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this can be replaced with just checking that all of the array lengths match
@TomAugspurger can you repeat again what you exactly prefer regarding the diff (missed that point a bit on the call) ? |
No worries about the diff here, I thinks it’s fine.
… On Sep 4, 2020, at 15:47, Joris Van den Bossche ***@***.***> wrote:
@TomAugspurger can you repeat again what you exactly prefer regarding the diff (missed that point a bit on the call) ?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
Some updates:
Big chunks of failing tests relate to: tests with extension dtypes (since I am only handling np.ndarrays for now), some other functionality relying on the blocks (eg HDF IO), some not-yet-implemented functionality (quantile, unstack). |
When do you want to handle (/ only use) ExtensionArrays? Would that be done before merging? |
Yeah, I was also wondering that today. Given that we want that eventually, it might make sense to just do the switch now? |
My preference would be for all extension array-backed from the start. What
are the extra sources of overhead?
1. Constructing ~Pandas~NumPyArray instances
2. Extra attribute access
3. ... anything else?
…On Sat, Sep 5, 2020 at 9:43 AM Joris Van den Bossche < ***@***.***> wrote:
Yeah, I was also wondering that today. Given that we want that eventually,
it might make sense to just do the switch now?
It will make comparing performance a bit harder, but getting things
working correctly first is probably more important, can optimize later.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#36010 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIVCOR5YLTEELBOI4YDSEJFBNANCNFSM4QQSHDPA>
.
|
|
||
if ignore_index: | ||
axis = 1 if isinstance(self, ABCDataFrame) else 0 | ||
new_data.axes[axis] = ibase.default_index(len(indexer)) | ||
new_data.set_axis(axis, ibase.default_index(len(indexer))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you comment on why these need to be changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In hindsight, it's probably not needed anymore. Originally, I changed the axes
attribute on ArrayManager to be (index, columns) order. But afterwards, I changed this to be stored in _axes
, and having an axes
property that switches the internal order, to match the "public" interface of BlockManager.
So with that change, I could revert those edits (I think, but can check that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checked this again, and so this change is still needed (without it I get failures for the array manager).
The reason is that mgr.axes
is no longer the actual "storage" of the axes for the ArrayManager, but is there stored in _axes
. And because mgr.axes
returns a list, it's actually updating the list in place, and not the actual storage. Which is a bit of a gotcha now with the manager's axes
attribute.
But based on a quick check, this seems to be the only place where we assign to mgr.axes[i] = ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for explaining. we should make another attempt to get axes out of FooManager before too long
@jreback I tried to update according to your comments regarding the conversion of different types of managers: moved the bulk of the current implementation from frame.py to internals/construction.py, and simplified the code in frame.py. If can have another look (see second to last commit) |
thanks @jorisvandenbossche looks pretty good.
otherwise rebase and looks ok to merge. cc @jbrockmendel |
@@ -19,7 +19,8 @@ def test_to_numpy_dtype(self): | |||
|
|||
def test_to_numpy_copy(self): | |||
arr = np.random.randn(4, 3) | |||
df = DataFrame(arr) | |||
with option_context("mode.data_manager", "block"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would @td.skip_array_manager_invalid_test make more sense here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would @td.skip_array_manager_invalid_test make more sense here?
Indeed, will update (I think this was from before I added the decorators)
@@ -50,6 +52,21 @@ def concatenate_block_managers( | |||
------- | |||
BlockManager | |||
""" | |||
if isinstance(mgrs_indexers[0][0], ArrayManager): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we assuming that they are either all-ArrayManager or all-BlockManager?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we assuming that they are either all-ArrayManager or all-BlockManager?
Yes, and this concat implementation right now is very limited in general (eg only the simple case without any reindexing needed, several tests are skipped because of this, I put several TODO(ArrayManager) concat with reindexing
because of this).
Concatenation is one of the big areas of work for follow-up on this initial PR.
A handful of small comments, generally looks nice. Over the weekend I'll see if I can chip away at the disabled tests |
running |
Yes, json tests segfault (because the C code is expecting a BlockManager). I skipped them when I started this PR, but the one you reference was added more recently. Will add a skip for those as well (for CI, I am currently only running a subset of |
So regarding using
I think the only potentially impacted code path (for normal use, so without enabling ArrayManager) right now is the DataFrame construction. The inline comment is a bit hidden, but see #36010 (comment) for some timings related to that. Summary is that the checking of the option only has a small impact (ca 2 µs), while the constructor itself already takes relatively more time (eg a |
I think I addressed all remaining comments, so I am planning to merge this (it's getting a bit annoying to keep rebasing this, and I also don't plan to do substantial new feature work in this PR, there is plenty for follow-ups). I will make an overview of the different areas of work for follow-ups. |
Thank all for the reviews! I created an overview issue to track the different follow-ups here: #39146 |
Related to the discussion in #10556, and following up on the mailing list discussion "A case for a simplified (non-consolidating) BlockManager with 1D blocks" (archive).
This branch experiments with an "array manager", storing a list of 1D arrays instead of blocks.
The idea is that this
ArrayManager
could optionally be used instead ofBlockManager
. If we ensure the "DataManager" has a clear interface for the rest of pandas (and thus parts outside of the internals don't rely on details like block layout, xref #34669), this should be possible without much changes outside of /core/internals.Some notes on this experiment:
This is not a complete POC, not every aspect and behaviour of the BlockManager has already been replicated, and there are still places in pandas that rely on the blocks being present, so lots of tests are still failing (although changes in behaviour are also desired). That said, a lot of the basic operations do work. Two illustrations of this:
For now, I focused on an ArrayManager storing a list of numpy arrays. Of course we need to expand that to support ExtensionArrays as well (or ExtensionArrays only?), but the reason I limited to numpy arrays for now: besides making it a bit simpler to experiment with, this also gives a fairer comparison with the consolidated BlockManager (because it focuses on the numpy array being 1D vs 2D, and doesn't mix in performance/implementation differences of numpy array vs ExtensionArray).
Personally, I think this looks promising. Many of the methods are a lot simpler than the BlockManager equivalent (although not every aspect is implemented yet, that's correct). And for the case I showed in the notebook, performance looks also good. For the benchmark suite I ran, there are obviously slowdowns for the "wide dataframe" benchmarks.
There is still a lot of work needed to make this fully working with the rest of pandas, though ;)
Given the early proof of concept stage, detailed code feedback is not yet needed, but I would find it very useful to discuss the following aspects:
High-level feedback on the approach: does the approach of the two subclasses look interesting? The approach of the ArrayManager itself storing a list of arrays? ...
What to do with Series, which now is a SingleBlockManager inheriting from BlockManager (should we also have a "SingleArrayManager"?)
If we find this interesting, how can we go from here? How do we decide on this? (what aspects already need to work, how fast does it need to be?) I don't think getting a fully complete implementation passing all tests is is possible in a single PR. Are we fine with merging something partial in master and continue from there? Or a shared feature branch in upstream? ...
Benchmark results for asv_bench/arithmetic.py
As an example, I ran
asv continuous -f 1.1 upstream/master HEAD -b arithmetic
.The benchmarks with a slowdown bigger than a factor 2 can basically be brought back to two cases:
FrameWithFrameWide
using a case with n_cols > n_rows)IntFrameWithScalar
class: from a quick profile, it seems that the usage of numexpr is the cause, and disabling this seems to reduce the slowdown to a factor 2. The numexpr code (and checking if it should be used etc) apparently has a high overhead per call, which I assume is something that can be solved (moving those checks a level higher up, so we don't need to repeat it for each column)