POC: ArrayManager -- array-based data manager for columnar store #36010

jorisvandenbossche · 2020-08-31T14:17:54Z

Related to the discussion in #10556, and following up on the mailing list discussion "A case for a simplified (non-consolidating) BlockManager with 1D blocks" (archive).

This branch experiments with an "array manager", storing a list of 1D arrays instead of blocks.
The idea is that this ArrayManager could optionally be used instead of BlockManager. If we ensure the "DataManager" has a clear interface for the rest of pandas (and thus parts outside of the internals don't rely on details like block layout, xref #34669), this should be possible without many changes outside of /core/internals.
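
To make the idea concrete, here is a minimal sketch of what "a list of 1D arrays behind a small manager interface" could look like. It is not the code in this branch: the class name `ToyArrayManager`, its methods, and the use of plain Python lists in place of NumPy arrays are all assumptions, chosen to keep the example dependency-free.

```python
from typing import Any, Callable, List, Sequence, Tuple


class ToyArrayManager:
    """Hypothetical stand-in for an array-based manager: one 1D array per column."""

    def __init__(self, arrays: Sequence[Sequence[Any]], columns: Sequence[str]):
        if len(arrays) != len(columns):
            raise ValueError("need exactly one array per column label")
        # Each column is stored separately; nothing is consolidated into 2D blocks.
        self.arrays: List[List[Any]] = [list(arr) for arr in arrays]
        self.columns: List[str] = list(columns)

    @property
    def shape(self) -> Tuple[int, int]:
        n_rows = len(self.arrays[0]) if self.arrays else 0
        return (n_rows, len(self.arrays))

    def iget(self, i: int) -> List[Any]:
        """Return column i: a plain list lookup, no block-location bookkeeping."""
        return self.arrays[i]

    def apply(self, func: Callable[[List[Any]], List[Any]]) -> "ToyArrayManager":
        """Apply func column by column and return a new manager."""
        return ToyArrayManager([func(arr) for arr in self.arrays], self.columns)


mgr = ToyArrayManager([[1, 2, 3], [0.5, 1.5, 2.5]], ["a", "b"])
doubled = mgr.apply(lambda col: [2 * x for x in col])
print(doubled.shape, doubled.iget(1))
```

The point of the sketch is only that code outside the manager talks to `shape`, `iget` and `apply`, so it does not need to know whether columns live in separate arrays or in consolidated blocks.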

Some notes on this experiment:

This is not a complete POC: not every aspect and behaviour of the BlockManager has been replicated yet, and there are still places in pandas that rely on the blocks being present, so lots of tests are still failing (although some changes in behaviour are also desired). That said, a lot of the basic operations do work. Two illustrations of this:
- An updated version of the notebook I showed in the mailing list discussion as well: with a certain setup, comparing a set of operations between block vs array manager: https://nbviewer.jupyter.org/gist/jorisvandenbossche/f917d4301d21069e2be2e3b7c7aa4d07
- I ran the arithmetic.py benchmark file, comparing against master, see below for the results.
For now, I focused on an ArrayManager storing a list of numpy arrays. Of course we need to expand that to support ExtensionArrays as well (or ExtensionArrays only?), but the reason I limited to numpy arrays for now: besides making it a bit simpler to experiment with, this also gives a fairer comparison with the consolidated BlockManager (because it focuses on the numpy array being 1D vs 2D, and doesn't mix in performance/implementation differences of numpy array vs ExtensionArray).
Personally, I think this looks promising. Many of the methods are a lot simpler than the BlockManager equivalent (although not every aspect is implemented yet, that's correct). And for the case I showed in the notebook, performance also looks good. For the benchmark suite I ran, there are obviously slowdowns for the "wide dataframe" benchmarks.
There is still a lot of work needed to make this fully working with the rest of pandas, though ;)
Given the early proof of concept stage, detailed code feedback is not yet needed, but I would find it very useful to discuss the following aspects:
- High-level feedback on the approach: does the approach of the two subclasses look interesting? The approach of the ArrayManager itself storing a list of arrays? ...
- What to do with Series, which now is a SingleBlockManager inheriting from BlockManager (should we also have a "SingleArrayManager"?)
- If we find this interesting, how can we go from here? How do we decide on this? (what aspects already need to work, how fast does it need to be?) I don't think getting a fully complete implementation passing all tests is possible in a single PR. Are we fine with merging something partial in master and continue from there? Or a shared feature branch in upstream? ...

Benchmark results for asv_bench/arithmetic.py

As an example, I ran asv continuous -f 1.1 upstream/master HEAD -b arithmetic.

The benchmarks with a slowdown bigger than a factor 2 can basically be brought back to two cases:

- Benchmarks for "wide" dataframes (eg FrameWithFrameWide using a case with n_cols > n_rows)
- Benchmarks from the IntFrameWithScalar class: from a quick profile, it seems that the usage of numexpr is the cause, and disabling this seems to reduce the slowdown to a factor 2. The numexpr code (and checking if it should be used etc) apparently has a high overhead per call, which I assume is something that can be solved (moving those checks a level higher up, so we don't need to repeat it for each column)
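
As a rough illustration of "moving those checks a level higher up", the sketch below counts how often a dispatch check runs when it is done per column versus once per frame. It does not use numexpr or any pandas code: `should_use_fast_path`, its `min_elements` threshold, and both `op_check_*` helpers are hypothetical names invented for this example.

```python
import operator
from typing import Callable, List

check_calls = 0  # counts how often the (hypothetical) dispatch check runs


def should_use_fast_path(n_elements: int, min_elements: int = 10_000) -> bool:
    """Stand-in for a per-call "should an accelerated backend be used?" check."""
    global check_calls
    check_calls += 1
    return n_elements >= min_elements


def evaluate(op: Callable, column: List[float], scalar: float, fast: bool) -> List[float]:
    # Both paths compute the same thing here; a real backend would branch on `fast`.
    return [op(x, scalar) for x in column]


def op_check_per_column(columns, scalar, op):
    # The check runs once per column, so its overhead grows with the column count.
    return [evaluate(op, col, scalar, should_use_fast_path(len(col))) for col in columns]


def op_check_hoisted(columns, scalar, op):
    # The check runs once for the whole frame, based on the total element count.
    total = sum(len(col) for col in columns)
    fast = should_use_fast_path(total)
    return [evaluate(op, col, scalar, fast) for col in columns]


columns = [[float(i + j) for i in range(5)] for j in range(100)]

check_calls = 0
per_column = op_check_per_column(columns, 2.0, operator.add)
calls_per_column = check_calls

check_calls = 0
hoisted = op_check_hoisted(columns, 2.0, operator.add)
calls_hoisted = check_calls

print(calls_per_column, calls_hoisted)
```

Note that hoisting also changes what the decision is based on (total size instead of per-column size), which would need to be considered when doing this for real.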

       before           after         ratio
     [b45327f5]       [047f9091]
     <master>                   
!        40.6±6ms           failed      n/a  arithmetic.Ops.time_frame_multi_and(False, 'default')
!        32.7±2ms           failed      n/a  arithmetic.Ops.time_frame_multi_and(False, 1)
!        26.5±1ms           failed      n/a  arithmetic.Ops.time_frame_multi_and(True, 'default')
!        37.7±2ms           failed      n/a  arithmetic.Ops.time_frame_multi_and(True, 1)
+      1.06±0.3ms         93.5±7ms    88.57  arithmetic.FrameWithFrameWide.time_op_same_blocks(<built-in function gt>)
+      1.51±0.2ms         80.6±3ms    53.34  arithmetic.FrameWithFrameWide.time_op_same_blocks(<built-in function add>)
+     1.22±0.08ms         55.1±5ms    45.19  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('le')
+     1.30±0.07ms        55.6±20ms    42.83  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('ne')
+      2.12±0.4ms         90.1±4ms    42.47  arithmetic.FrameWithFrameWide.time_op_different_blocks(<built-in function gt>)
+     1.17±0.04ms         49.4±4ms    42.38  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('gt')
+     1.28±0.07ms         52.9±3ms    41.28  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('lt')
+      1.29±0.2ms       52.5±0.6ms    40.63  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('ge')
+     1.44±0.02ms         56.8±7ms    39.56  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('eq')
+      2.08±0.3ms        78.9±10ms    37.90  arithmetic.Ops2.time_frame_float_mod
+      2.34±0.1ms         78.3±4ms    33.51  arithmetic.FrameWithFrameWide.time_op_different_blocks(<built-in function add>)
+      1.66±0.2ms         46.6±1ms    28.00  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('mul')
+      1.78±0.2ms         48.2±5ms    27.02  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('truediv')
+     1.14±0.04ms         26.8±4ms    23.49  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function le>)
+      1.83±0.2ms         42.9±1ms    23.39  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('add')
+      1.94±0.3ms         45.1±4ms    23.29  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('sub')
+     1.23±0.07ms         23.0±3ms    18.65  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function ge>)
+     1.33±0.08ms         22.8±1ms    17.14  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function eq>)
+     1.03±0.05ms         17.6±2ms    17.13  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function ge>)
+      1.65±0.5ms         28.1±7ms    17.00  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function eq>)
+     1.21±0.05ms         20.1±3ms    16.67  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function gt>)
+     1.18±0.03ms       19.4±0.9ms    16.54  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function eq>)
+     1.08±0.07ms         17.8±1ms    16.53  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function lt>)
+     1.22±0.05ms         20.0±2ms    16.41  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function gt>)
+     1.30±0.06ms         21.2±3ms    16.28  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function ne>)
+     1.15±0.06ms         18.6±3ms    16.18  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function lt>)
+      1.42±0.1ms         22.6±1ms    15.96  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function lt>)
+     1.11±0.01ms       17.6±0.4ms    15.85  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function ne>)
+      5.30±0.8ms        81.7±20ms    15.40  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('lt')
+      1.37±0.2ms         20.7±3ms    15.09  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function gt>)
+     1.22±0.05ms         18.0±6ms    14.72  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function ge>)
+      1.28±0.1ms         18.6±3ms    14.55  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function gt>)
+     1.17±0.08ms         17.0±3ms    14.54  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function gt>)
+      1.22±0.1ms       17.6±0.8ms    14.44  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function eq>)
+      1.35±0.1ms         19.4±2ms    14.35  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function le>)
+      1.35±0.1ms         19.2±4ms    14.21  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function ge>)
+      4.36±0.3ms         61.8±8ms    14.17  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('le')
+      1.31±0.1ms         18.5±2ms    14.09  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function lt>)
+      4.48±0.5ms         62.9±5ms    14.06  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('ge')
+      1.15±0.1ms         16.1±1ms    14.01  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function ne>)
+      1.33±0.1ms         18.6±2ms    14.00  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function ge>)
+      4.37±0.4ms         58.9±2ms    13.48  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('ne')
+      1.22±0.2ms         16.2±3ms    13.25  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function le>)
+      1.25±0.1ms         16.5±1ms    13.13  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function le>)
+      1.44±0.2ms         18.6±4ms    12.90  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function ge>)
+      1.75±0.3ms         22.3±2ms    12.74  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function eq>)
+      1.42±0.3ms         18.0±7ms    12.68  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function gt>)
+      1.36±0.1ms         17.2±1ms    12.67  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function ne>)
+        440±30μs       5.57±0.1ms    12.65  arithmetic.Ops2.time_frame_series_dot
+      1.63±0.2ms         20.6±2ms    12.65  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function lt>)
+     1.35±0.07ms         17.0±3ms    12.58  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function le>)
+      1.34±0.2ms         16.7±1ms    12.46  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function eq>)
+      1.50±0.1ms         18.6±5ms    12.43  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function ge>)
+     1.35±0.07ms         16.8±1ms    12.42  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function ge>)
+      1.35±0.1ms         16.7±2ms    12.37  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function le>)
+      1.55±0.3ms         18.9±2ms    12.20  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function le>)
+      1.67±0.3ms         20.3±5ms    12.17  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function ne>)
+      1.55±0.2ms       18.5±0.7ms    11.94  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function le>)
+      5.05±0.5ms         59.1±3ms    11.70  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('gt')
+      1.51±0.2ms         17.6±2ms    11.66  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function lt>)
+     1.33±0.08ms         15.3±1ms    11.50  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function ne>)
+      4.47±0.1ms         51.2±1ms    11.45  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('eq')
+      1.35±0.1ms         15.4±2ms    11.45  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function lt>)
+      1.76±0.5ms         19.8±2ms    11.28  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function lt>)
+     1.55±0.09ms       16.8±0.3ms    10.86  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function ne>)
+      1.71±0.1ms         18.2±2ms    10.58  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function eq>)
+      1.51±0.2ms         15.9±3ms    10.54  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function eq>)
+      1.53±0.2ms       15.6±0.3ms    10.19  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function ne>)
+      1.95±0.2ms         19.7±5ms    10.08  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function gt>)
+     2.22±0.08ms         21.6±4ms     9.73  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function add>)
+     1.77±0.08ms         16.7±1ms     9.48  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function gt>)
+      2.19±0.1ms         19.9±2ms     9.08  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function mul>)
+     1.91±0.04ms         17.0±2ms     8.88  arithmetic.Ops.time_frame_comparison(True, 'default')
+      2.18±0.1ms         19.0±1ms     8.73  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function add>)
+     2.23±0.08ms         19.1±1ms     8.59  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function sub>)
+     2.24±0.07ms         19.0±3ms     8.47  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function mul>)
+     2.34±0.06ms         19.5±2ms     8.31  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function truediv>)
+      2.52±0.2ms         20.3±6ms     8.06  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function truediv>)
+      2.39±0.2ms         19.2±2ms     8.05  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function truediv>)
+      3.07±0.4ms         24.4±5ms     7.94  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function mod>)
+      2.24±0.1ms         17.5±2ms     7.85  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function add>)
+      2.24±0.2ms       17.4±0.7ms     7.79  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function sub>)
+      2.33±0.1ms         18.0±2ms     7.73  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function mul>)
+      2.15±0.1ms         16.4±4ms     7.60  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function sub>)
+     2.10±0.05ms         15.9±2ms     7.57  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function add>)
+      2.27±0.1ms         16.8±1ms     7.39  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function add>)
+      3.59±0.1ms         26.1±5ms     7.27  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function mod>)
+      2.32±0.1ms         16.8±3ms     7.25  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function sub>)
+     2.36±0.08ms       17.1±0.7ms     7.23  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function truediv>)
+      2.42±0.2ms         17.4±2ms     7.17  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function sub>)
+     2.31±0.09ms       16.4±0.9ms     7.11  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function add>)
+      7.34±0.9ms         52.2±2ms     7.10  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('add')
+      2.32±0.1ms       16.4±0.9ms     7.07  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function add>)
+      2.25±0.2ms         15.8±2ms     7.03  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function sub>)
+      2.51±0.5ms         17.3±2ms     6.91  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function add>)
+      2.43±0.1ms       16.7±0.8ms     6.84  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function mul>)
+      2.24±0.1ms         15.2±2ms     6.81  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function mul>)
+        7.81±1ms         52.9±4ms     6.78  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('sub')
+      2.48±0.2ms         16.4±2ms     6.62  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function mul>)
+        6.82±1ms       44.4±0.7ms     6.51  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('mul')
+     2.25±0.05ms       14.6±0.8ms     6.48  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function sub>)
+      3.14±0.7ms         19.8±2ms     6.30  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function mod>)
+      2.57±0.2ms         15.9±2ms     6.19  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function sub>)
+      2.57±0.1ms         15.8±2ms     6.16  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function truediv>)
+        7.70±1ms         47.2±3ms     6.13  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('truediv')
+      3.02±0.1ms         18.4±3ms     6.08  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function mod>)
+      2.79±0.2ms       16.8±0.8ms     6.04  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function truediv>)
+      3.16±0.3ms       19.1±0.7ms     6.04  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function mod>)
+      2.51±0.2ms       14.9±0.5ms     5.92  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function mul>)
+      2.71±0.1ms       15.9±0.8ms     5.86  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function mul>)
+      2.72±0.3ms         15.9±1ms     5.83  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function truediv>)
+        11.9±1ms         64.0±5ms     5.39  arithmetic.Ops2.time_frame_int_mod
+      3.59±0.4ms         19.1±5ms     5.33  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function mod>)
+      6.23±0.4ms         32.7±6ms     5.25  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function mod>)
+      3.28±0.2ms         17.2±2ms     5.23  arithmetic.Ops.time_frame_add(True, 'default')
+        23.7±6ms          112±7ms     4.70  arithmetic.FrameWithFrameWide.time_op_same_blocks(<built-in function floordiv>)
+      3.51±0.4ms       16.5±0.6ms     4.70  arithmetic.Ops.time_frame_mult(True, 'default')
+        3.61±2ms         16.3±1ms     4.52  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function truediv>)
+        45.8±4ms         194±20ms     4.25  arithmetic.FrameWithFrameWide.time_op_different_blocks(<built-in function floordiv>)
+      5.64±0.6ms         21.9±1ms     3.89  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function mod>)
+      3.13±0.1ms       11.4±0.5ms     3.63  arithmetic.Ops.time_frame_comparison(True, 1)
+      12.2±0.8ms         42.5±4ms     3.47  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function pow>)
+      4.03±0.7ms       11.2±0.3ms     2.79  arithmetic.Ops.time_frame_add(True, 1)
+        53.0±6ms         143±10ms     2.69  arithmetic.Ops2.time_frame_float_floor_by_zero
+      4.11±0.2ms         11.1±1ms     2.69  arithmetic.Ops.time_frame_mult(True, 1)
+        54.9±4ms          125±9ms     2.28  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('floordiv')
+      25.0±0.6ms         55.9±5ms     2.24  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function pow>)
+      2.42±0.2ms       5.21±0.6ms     2.16  arithmetic.Ops.time_frame_comparison(False, 'default')
+        16.2±1ms         31.9±3ms     1.97  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function pow>)
+        30.9±3ms        58.1±10ms     1.88  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function pow>)
+      3.36±0.3ms       5.76±0.4ms     1.71  arithmetic.Ops.time_frame_add(False, 'default')
+      3.10±0.3ms       5.03±0.3ms     1.62  arithmetic.Ops.time_frame_comparison(False, 1)
+        30.5±3ms         49.2±9ms     1.61  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function pow>)
+      3.42±0.3ms       5.51±0.4ms     1.61  arithmetic.Ops.time_frame_mult(False, 1)
+      3.52±0.2ms       5.63±0.1ms     1.60  arithmetic.Ops.time_frame_add(False, 1)
+      3.60±0.2ms       5.74±0.5ms     1.59  arithmetic.Ops.time_frame_mult(False, 'default')
+        57.9±1ms         89.7±6ms     1.55  arithmetic.Ops2.time_frame_float_div
+      32.1±0.5ms         48.7±2ms     1.52  arithmetic.Ops2.time_frame_dot
+     2.96±0.06ms       4.32±0.4ms     1.46  arithmetic.DateInferOps.time_add_timedeltas
+        65.9±2ms         93.8±1ms     1.42  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('pow')
+         106±2ms          132±3ms     1.25  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('pow')
+     1.33±0.01ms       1.64±0.2ms     1.24  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<YearEnd: month=12>)
+      7.09±0.2ms       8.49±0.5ms     1.20  arithmetic.DateInferOps.time_subtract_datetimes
+        1.13±0ms      1.33±0.09ms     1.18  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<YearBegin: month=1>)
+     1.25±0.02ms       1.47±0.1ms     1.18  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<SemiMonthEnd: day_of_month=15>)
+     2.52±0.04ms       2.97±0.2ms     1.18  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<BusinessDay>)
+     1.16±0.01ms      1.32±0.06ms     1.13  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<QuarterBegin: startingMonth=3>)
-      1.67±0.2ms      1.42±0.02ms     0.85  arithmetic.OffsetArrayArithmetic.time_add_dti_offset(<MonthEnd>)
-        282±20μs          230±5μs     0.81  arithmetic.NumericInferOps.time_subtract(<class 'numpy.int8'>)
-      4.36±0.2ms       3.54±0.3ms     0.81  arithmetic.NumericInferOps.time_modulo(<class 'numpy.uint16'>)
-      1.29±0.1ms      1.03±0.06ms     0.80  arithmetic.NumericInferOps.time_multiply(<class 'numpy.int64'>)
-     1.77±0.09ms      1.39±0.03ms     0.79  arithmetic.OffsetArrayArithmetic.time_add_dti_offset(<SemiMonthBegin: day_of_month=15>)
-      1.54±0.2ms      1.13±0.02ms     0.74  arithmetic.NumericInferOps.time_divide(<class 'numpy.int8'>)
-        301±40μs          221±4μs     0.73  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<Day>)
-      3.85±0.5ms       2.58±0.2ms     0.67  arithmetic.OffsetArrayArithmetic.time_add_dti_offset(<BusinessDay>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

jbrockmendel · 2020-08-31T17:53:28Z

What's DataManager?

Edit: nevermind, clear now that I look at the code.

jbrockmendel · 2020-08-31T18:08:59Z

pandas/core/internals/managers.py

+        do_integrity_check: bool = True,
+    ):
+        self._axes = axes
+        self.arrays = arrays


i would have expected youd need to be backed by blocks to get e.g. replace to work. is there a nice way around that?

It was my explicit goal of the current experiment to not use Blocks, because it gives another level of indirection / overhead. But for sure, there are currently certain algos like replace that are defined on the Blocks (see also the inline comment at def replace). So short term we can wrap the arrays in Blocks just for those operations when needed (to be clear, this would only be a hack to get a POC more fully working), and eventually I would abstract some of those algos into separate array-based algos (like we have for many others already which are not defined in the Blocks itself), and then those can be used by both Block.replace and ArrayManager.replace.

So we will need to see a bit how many of those things from Blocks need reuse, but if it turns out possible (eg is limited to a set of algos that can be factored out), I would prefer to drop the Block concept in a potential ArrayManager.
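
A minimal sketch of that "factor the algo out into an array-based function" direction, under the assumption that the algorithm only needs the values themselves. The names `replace_array`, `ToyBlock` and `ToyManager` are hypothetical, and plain lists stand in for NumPy arrays; the real replace logic in pandas handles far more cases.

```python
from typing import Any, List, Sequence


def replace_array(values: List[Any], to_replace: Any, value: Any) -> List[Any]:
    """Array-level algorithm: knows nothing about Blocks or managers."""
    return [value if x == to_replace else x for x in values]


class ToyBlock:
    """Hypothetical Block-like wrapper that delegates to the array function."""

    def __init__(self, values: Sequence[Any]):
        self.values: List[Any] = list(values)

    def replace(self, to_replace: Any, value: Any) -> "ToyBlock":
        return ToyBlock(replace_array(self.values, to_replace, value))


class ToyManager:
    """Hypothetical array manager that calls the same function once per column."""

    def __init__(self, arrays: Sequence[Sequence[Any]]):
        self.arrays: List[List[Any]] = [list(arr) for arr in arrays]

    def replace(self, to_replace: Any, value: Any) -> "ToyManager":
        return ToyManager([replace_array(arr, to_replace, value) for arr in self.arrays])


block = ToyBlock([1, -1, 3]).replace(-1, 0)
mgr = ToyManager([[1, -1], [-1, 4]]).replace(-1, 0)
print(block.values, mgr.arrays)
```

Both containers share one implementation, so neither needs to construct the other just to reach the algorithm.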

eventually I would abstract some of those algos into separate array-based algos

+1 for this idea

Poking at this, I think the hard part would be implementing can_hold_element as an array function.

Poking at this, I think the hard part would be implementing can_hold_element as an array function.

For numpy arrays, I think we could write a single helper function that works for all dtypes? And for ExtensionArrays, we could maybe add a similar method on the EA itself to the interface.

I think you're right on both points: a helper for ndarrays and a method for EAs would both be good.

Spent some time on this yesterday and found other Block methods are necessary (putmask comes to mind), to the point where keeping them as methods feels more intuitive than standalone functions. It might be easiest to use the existing Block code for these methods, but delay wrapping the arrays in Blocks until they are needed (at least for the POC stage)

TomAugspurger · 2020-08-31T18:13:49Z

Overall, having a base class, BlockManager, and ArrayManager makes sense for easy prototyping / switching. That also lets us clearly define the interface between pandas' internals and the rest of pandas.

For ease of review, can you split internals/managers.py into base / array / block? Or perhaps just leave internals/managers.py as is for this PR (other than inheriting from DataManager) and just add stuff so the diff is cleanest.

jorisvandenbossche · 2020-08-31T20:57:08Z

For ease of review, can you split internals/managers.py into base / array / block? Or perhaps just leave internals/managers.py as is for this PR (other than inheriting from DataManager) and just add stuff so the diff is cleanest.

The diff should be clean right now. It's added in managers.py itself, but it's purely an addition of lines, so the diff should look the same as if it was a new file.
(the base class itself right now doesn't have much content, but I agree that eventually it would make sense to split in multiple files)

jbrockmendel · 2020-08-31T21:55:03Z

For the purpose of seeing how close this is to working, would it make sense to use a pd.options flag to control ArrayManager vs BlockManager instead of a DataFrame keyword? This would make it straightforward to run all the tests using ArrayManager with few edits.

jorisvandenbossche · 2020-09-02T14:15:10Z

For the purpose of seeing how close this is to working, would it make sense to use a pd.options flag to control ArrayManager vs BlockManager instead of a DataFrame keyword? This would make it straightforward to run all the tests using ArrayManager with few edits.

Yes, that's a good idea (I now added a keyword to the DataFrame constructor, and for testing purposes switched the default. But indeed with an option it is easier to switch)
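
For illustration, switching via an option instead of a constructor keyword could be sketched roughly as below. This is not the pandas options machinery: the `options` dict, the two toy manager classes and `make_manager` are hypothetical, and the key name `mode.data_manager` simply mirrors the option used later in this thread.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical stand-in for a pandas option; not the real pd.options machinery.
options: Dict[str, str] = {"mode.data_manager": "block"}


class ToyBlockManager:
    def __init__(self, columns: Sequence[Sequence[Any]]):
        self.columns: List[List[Any]] = [list(c) for c in columns]


class ToyArrayManager:
    def __init__(self, columns: Sequence[Sequence[Any]]):
        self.columns: List[List[Any]] = [list(c) for c in columns]


_MANAGER_CLASSES = {"block": ToyBlockManager, "array": ToyArrayManager}


def make_manager(columns: Sequence[Sequence[Any]]):
    """Pick the manager class from the global option at construction time."""
    kind = options["mode.data_manager"]
    if kind not in _MANAGER_CLASSES:
        raise ValueError(f"unknown data manager: {kind!r}")
    return _MANAGER_CLASSES[kind](columns)


data = [[1, 2], [3, 4]]
default_mgr = make_manager(data)

options["mode.data_manager"] = "array"  # flip once, e.g. at the top of a test run
array_mgr = make_manager(data)

print(type(default_mgr).__name__, type(array_mgr).__name__)
```

Because the choice is made in one place, a whole test suite can be run against either manager without touching individual constructor calls.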

jbrockmendel · 2020-09-03T22:41:51Z

@jorisvandenbossche im curious how this performs for the snippet discussed in #34683

import numpy as np
import pandas as pd

df = pd.DataFrame(index=list(range(100)))

df1 = pd.DataFrame(index=list(range(100)))
df2 = pd.DataFrame(index=list(range(100)))

# Add 10 float columns to each frame, one column at a time
for i in range(10):
    df1[i] = np.random.randn(len(df))
    df2[i] = np.random.randn(len(df))


In [22]: %timeit pd.concat([df1, df2])

jorisvandenbossche · 2020-09-04T09:49:36Z

It's a bit faster:

In [5]: pd.options.mode.data_manager = 'block'   

In [6]: ... # code to create df1 and df2   

In [7]: %timeit pd.concat([df1, df2])   
465 µs ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: pd.options.mode.data_manager = 'array'  

In [9]: ... # code to create df1 and df2  

In [10]: %timeit pd.concat([df1, df2])   
298 µs ± 6.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But concat is not yet fully implemented (only for the simple cases like this), which might also mean it is taking some shortcuts that would otherwise need checks.

(it's also not directly related to "consolidation in reshape or not", because after the first iteration of the `%timeit`, `df1` and `df2` will actually be consolidated already, because this happens inplace, but that's for the discussion in the other issue)

pep8speaks · 2020-09-04T09:51:30Z

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-12 19:03:25 UTC

jbrockmendel · 2020-09-04T18:48:07Z

pandas/core/internals/managers.py

+        # mgr_shape = self.shape
+        # tot_items = sum(len(x.mgr_locs) for x in self.blocks)
+        # for block in self.blocks:
+        #     if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:


i think this can be replaced with just checking that all of the array lengths match
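
A minimal sketch of such a length-only check, assuming the manager has an index, a list of column labels and one 1D array per column. The function name and signature are hypothetical, not taken from the branch.

```python
from typing import Any, Optional, Sequence


def verify_integrity(
    arrays: Sequence[Sequence[Any]],
    index: Sequence[Any],
    columns: Sequence[Any],
) -> None:
    """Raise ValueError unless there is one array per column, each as long as the index."""
    if len(arrays) != len(columns):
        raise ValueError(f"got {len(arrays)} arrays for {len(columns)} columns")
    n_rows = len(index)
    for i, arr in enumerate(arrays):
        if len(arr) != n_rows:
            raise ValueError(f"array {i} has length {len(arr)}, expected {n_rows}")


# A consistent set of columns passes silently.
verify_integrity([[1, 2, 3], [4.0, 5.0, 6.0]], index=[0, 1, 2], columns=["a", "b"])

# A column that is too short is reported with its position.
message: Optional[str] = None
try:
    verify_integrity([[1, 2, 3], [4.0, 5.0]], index=[0, 1, 2], columns=["a", "b"])
except ValueError as err:
    message = str(err)
print(message)
```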

jorisvandenbossche · 2020-09-04T20:46:54Z

@TomAugspurger can you repeat again what you exactly prefer regarding the diff (missed that point a bit on the call) ?

TomAugspurger · 2020-09-04T21:11:31Z

No worries about the diff here, I think it’s fine.

jorisvandenbossche · 2020-09-05T14:28:34Z

Some updates:

Added an apply_with_block that wraps the array in a Block to use some Block functionality such as replace, where, putmask, etc. (which can be used temporarily as a workaround until we can factor things out of the blocks).
Skipped a bunch of JSON tests, because those otherwise segfault (instead of failing), which stops the test run. The segfault occurs because the JSON C code relies on accessing the blocks (-> INT: the json C code should not deal with blocks #27164)

Big chunks of failing tests relate to: tests with extension dtypes (since I am only handling np.ndarrays for now), some other functionality relying on the blocks (eg HDF IO), some not-yet-implemented functionality (quantile, unstack).
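
The wrap-and-delegate workaround described above can be sketched roughly as follows. `ToyBlock` and `ToyArrayManager` are hypothetical stand-ins written for this illustration; they are not the pandas classes, and the real apply_with_block may differ in signature and behaviour:

```python
class ToyBlock:
    """Hypothetical stand-in for a Block wrapping a single column of values."""

    def __init__(self, values):
        self.values = values

    def replace(self, to_replace, value):
        # Block-style method: returns a new block rather than mutating in place.
        return ToyBlock([value if v == to_replace else v for v in self.values])


class ToyArrayManager:
    """Hypothetical manager holding a list of 1D columns."""

    def __init__(self, arrays):
        self.arrays = arrays

    def apply_with_block(self, method_name, **kwargs):
        result = []
        for arr in self.arrays:
            block = ToyBlock(arr)  # wrap the column in a block
            new_block = getattr(block, method_name)(**kwargs)  # reuse block logic
            result.append(new_block.values)  # unwrap back to a plain column
        return ToyArrayManager(result)


mgr = ToyArrayManager([[1, 2, 1], [1, 5, 6]])
replaced = mgr.apply_with_block("replace", to_replace=1, value=0)
```

The manager itself never needs to know how `replace` works; it only wraps, delegates, and unwraps, which is why this pattern can serve as a stopgap until the logic is factored out of the blocks.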

TomAugspurger · 2020-09-05T14:39:58Z

When do you want to handle (/ only use) ExtensionArrays? Would that be done before merging?

jorisvandenbossche · 2020-09-05T14:43:21Z

Yeah, I was also wondering that today. Given that we want that eventually, it might make sense to just do the switch now?
It will make comparing performance a bit harder, but getting things working correctly first is probably more important, can optimize later.

TomAugspurger · 2020-09-05T15:52:35Z

My preference would be for everything to be extension array-backed from the start. What are the extra sources of overhead?
1. Constructing ~Pandas~NumPyArray instances
2. Extra attribute access
3. ... anything else?

jbrockmendel · 2020-12-21T17:58:30Z

pandas/core/generic.py


        if ignore_index:
            axis = 1 if isinstance(self, ABCDataFrame) else 0
-            new_data.axes[axis] = ibase.default_index(len(indexer))
+            new_data.set_axis(axis, ibase.default_index(len(indexer)))


can you comment on why these need to be changed?

In hindsight, it's probably not needed anymore. Originally, I changed the axes attribute on ArrayManager to be in (index, columns) order. But afterwards, I changed this so the axes are stored in _axes, with an axes property that switches the internal order to match the "public" interface of BlockManager.

So with that change, I could revert those edits (I think, but can check that)

Just checked this again, and this change is still needed (without it I get failures for the array manager).
The reason is that mgr.axes is no longer the actual "storage" of the axes for the ArrayManager; they are instead stored in _axes. And because mgr.axes returns a new list, assigning to mgr.axes[i] updates that list in place, and not the actual storage. Which is a bit of a gotcha now with the manager's axes attribute.

But based on a quick check, this seems to be the only place where we assign to mgr.axes[i] = ..

thanks for explaining. we should make another attempt to get axes out of FooManager before too long
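
The gotcha explained in this thread can be reproduced with a small toy class. `ToyManager` below is a hypothetical stand-in, not the ArrayManager from the PR; it only mimics the described layout (internal `_axes` in (index, columns) order, public `axes` property in the reversed order):

```python
class ToyManager:
    """Hypothetical stand-in: stores axes internally as (index, columns)."""

    def __init__(self, index, columns):
        self._axes = [index, columns]

    @property
    def axes(self):
        # Builds a NEW list in the "public" (columns, index) order on every access.
        return [self._axes[1], self._axes[0]]

    def set_axis(self, axis, new_labels):
        # Translate the public axis number to the internal storage position.
        self._axes[1 - axis] = new_labels


mgr = ToyManager(index=[10, 20, 30], columns=["a", "b"])

# Item assignment mutates only the temporary list returned by the property.
mgr.axes[1] = [0, 1, 2]
unchanged = mgr.axes[1]  # still [10, 20, 30]

# Going through the setter updates the real storage.
mgr.set_axis(1, [0, 1, 2])
updated = mgr.axes[1]  # now [0, 1, 2]
```

This is why the diff swaps `new_data.axes[axis] = ...` for `new_data.set_axis(axis, ...)`: the setter works regardless of whether `axes` is the real storage or a derived view.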

jorisvandenbossche · 2021-01-08T16:20:46Z

@jreback I tried to update according to your comments regarding the conversion of different types of managers: moved the bulk of the current implementation from frame.py to internals/construction.py, and simplified the code in frame.py. If you can have another look (see second to last commit)

jreback · 2021-01-08T23:16:54Z

thanks @jorisvandenbossche looks pretty good.

can you do the Manager=Union[ArrayManager, BlockManager] in typing? (you may have commented on why not but didn't see it)
can you benchmark key things (df construction and ops) to see what slowdown this code adds? I suspect it's just a very small amount because of the additional if checks, but would be nice to see

otherwise rebase and looks ok to merge. cc @jbrockmendel

jbrockmendel · 2021-01-09T00:05:47Z

pandas/tests/frame/methods/test_to_numpy.py

@@ -19,7 +19,8 @@ def test_to_numpy_dtype(self):

    def test_to_numpy_copy(self):
        arr = np.random.randn(4, 3)
-        df = DataFrame(arr)
+        with option_context("mode.data_manager", "block"):


would @td.skip_array_manager_invalid_test make more sense here?

Indeed, will update (I think this was from before I added the decorators)

jbrockmendel · 2021-01-09T00:08:18Z

pandas/core/internals/concat.py

@@ -50,6 +52,21 @@ def concatenate_block_managers(
    -------
    BlockManager
    """
+    if isinstance(mgrs_indexers[0][0], ArrayManager):


are we assuming that they are either all-ArrayManager or all-BlockManager?

Yes, and this concat implementation right now is very limited in general (e.g. it only handles the simple case without any reindexing; several tests are skipped because of this, marked with TODO(ArrayManager) concat with reindexing).
Concatenation is one of the big areas of work for follow-up on this initial PR.
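
The assumption confirmed above (inputs are either all ArrayManager or all BlockManager) could be made explicit with a check along these lines. This is only a sketch: the two classes are empty stand-ins and `common_manager_type` is a hypothetical helper, not something the PR adds:

```python
class ArrayManager:
    """Hypothetical stand-in, not the pandas class."""


class BlockManager:
    """Hypothetical stand-in, not the pandas class."""


def common_manager_type(mgrs):
    """Return the shared manager type, or raise if the inputs are mixed."""
    if not mgrs:
        raise ValueError("need at least one manager")
    first_type = type(mgrs[0])
    for mgr in mgrs[1:]:
        if type(mgr) is not first_type:
            raise TypeError(
                f"cannot concatenate {first_type.__name__} "
                f"with {type(mgr).__name__}"
            )
    return first_type


kind = common_manager_type([ArrayManager(), ArrayManager()])
```

Dispatching only on the first element, as the diff does, is cheaper but silently relies on the caller never mixing manager types.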

jbrockmendel · 2021-01-09T00:13:27Z

A handful of small comments, generally looks nice. Over the weekend I'll see if I can chip away at the disabled tests

jbrockmendel · 2021-01-09T16:56:40Z

running python3 -m pytest pandas/tests --skip-slow --skip-db --array-manager locally i got a segfault (MacOS 11.1, py3.9.1) in tests.io.test_fsspec

jorisvandenbossche · 2021-01-10T21:08:44Z

running python3 -m pytest pandas/tests --skip-slow --skip-db --array-manager locally i got a segfault (MacOS 11.1, py3.9.1) in tests.io.test_fsspec

Yes, json tests segfault (because the C code is expecting a BlockManager). I skipped them when I started this PR, but the one you reference was added more recently. Will add a skip for those as well (for CI, I am currently only running a subset of tests/frame/methods that passes)

jorisvandenbossche · 2021-01-11T07:42:53Z

@jreback

can you do the Manager=Union[ArrayManager, BlockManager] in typing? (you may have commented on why not but didn't see it)

Regarding using the DataManager base class in typing, see the explanation above: #36010 (comment) (it's probably possible, but quite some work to get working, so I would rather not do it for this PR).
But I suppose you only mean using an alias for the Union here? Added that in the latest commits.

can you benchmark key things (df construction and ops) to see what slowdown this code adds? I suspect it's just a very small amount because of the additional if checks, but would be nice to see

I think the only potentially impacted code path (for normal use, so without enabling ArrayManager) right now is the DataFrame construction. The inline comment is a bit hidden, but see #36010 (comment) for some timings related to that. Summary is that the checking of the option only has a small impact (ca 2 µs), while the constructor itself already takes relatively more time (eg a pd.DataFrame(np.random.randn(4,3)) takes 50 µs, a pd.DataFrame({'a': [1, 2, 3]}) takes 200 µs).
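
For reference, a Union alias of the kind requested in the first point can be sketched as below. The two classes are empty stand-ins and `describe` is a hypothetical function used only to show the alias in an annotation; where the alias lives in pandas and how it is imported are not shown in this thread:

```python
from typing import Union, get_args


class ArrayManager:
    """Hypothetical stand-in, not the pandas class."""


class BlockManager:
    """Hypothetical stand-in, not the pandas class."""


# One name for "either kind of manager", usable in annotations.
Manager = Union[ArrayManager, BlockManager]


def describe(mgr: Manager) -> str:
    """Return the concrete manager class name for either alias member."""
    return type(mgr).__name__


members = get_args(Manager)
label = describe(ArrayManager())
```

An alias like this keeps signatures short without requiring a shared base class, which is the more invasive option the reply defers.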

jorisvandenbossche · 2021-01-12T13:09:57Z

I think I addressed all remaining comments, so I am planning to merge this (it's getting a bit annoying to keep rebasing this, and I also don't plan to do substantial new feature work in this PR, there is plenty for follow-ups).
Getting more tests passing can also be done in targeted follow-ups (we need to skip many tests right now anyway, because of lacking features).

I will make an overview of the different areas of work for follow-ups.

jorisvandenbossche · 2021-01-13T14:10:36Z

Thanks all for the reviews!

I created an overview issue to track the different follow-ups here: #39146

jorisvandenbossche added the Refactor, Internals, and Needs Discussion labels Aug 31, 2020

jbrockmendel reviewed Aug 31, 2020

jorisvandenbossche added 3 commits September 4, 2020 11:51

POC: ArrayManager -- array-based data manager for columnar store

a51835b

Update with latest master + some fixes

591579b

add pd.options.mode.data_manager to switch

896080a

jorisvandenbossche force-pushed the array-manager branch from a7880e9 to 896080a Compare September 4, 2020 09:51

jbrockmendel reviewed Sep 4, 2020

jorisvandenbossche added 6 commits September 5, 2020 08:08

Merge remote-tracking branch 'upstream/master' into array-manager

f9c4dda

add apply_with_block workaround

d18082a

fix alignment in apply

cf3c07a

reorder methods to match BlockManager

b252c6d

skip json tests for now

0fb645e

skip more json tests + to_csv with to_native_types

eb55fef

jbrockmendel reviewed Dec 21, 2020

jorisvandenbossche added 3 commits January 8, 2021 16:09

Merge remote-tracking branch 'upstream/master' into array-manager

5c73688

move to internals/construction.py

a9a8c2d

update for latest changes - fix tests/mypy

c7898fb

jorisvandenbossche added 2 commits January 8, 2021 17:38

fix todo

3430307

fix import in tests

1a30013

jbrockmendel reviewed Jan 9, 2021

jbrockmendel reviewed Jan 9, 2021

jorisvandenbossche added 4 commits January 10, 2021 22:17

Merge remote-tracking branch 'upstream/master' into array-manager

ef86b1e

add union alias to typing

c5548d9

updates based on review

afe8f80

skip json tests to avoid segfaults

b88c757

jorisvandenbossche added 2 commits January 12, 2021 12:13

Merge remote-tracking branch 'upstream/master' into array-manager

ddc51d0

fix for Label -> Hashable change in master

9dc5600

jorisvandenbossche merged commit 4e93eb6 into pandas-dev:master Jan 13, 2021

jorisvandenbossche deleted the array-manager branch January 13, 2021 13:23

jorisvandenbossche added this to the 1.3 milestone Jan 13, 2021

jorisvandenbossche mentioned this pull request Jan 13, 2021

Refactor - ArrayManager overview issue #39146

Closed

11 tasks

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

POC: ArrayManager -- array-based data manager for columnar store (pan…

618ac6d

…das-dev#36010)

simonjayhawkins mentioned this pull request Feb 8, 2021

REGR: fix case all-NaN/numeric object column in groupby #39655

Merged
