PERF: BlockPlacement.copy() speed-up #10073

Closed · wants to merge 1 commit

Conversation


@ARF1 commented May 7, 2015

Copying of a `BlockPlacement` is currently slower than it could be:

* `copy()` cannot be accessed from Python; one currently needs to re-implement `copy()` to duplicate a `BlockPlacement` instance.
* `copy()` tries to infer a slice for any array contained in the `BlockPlacement`.
* After `copy()` passes an array to the constructor, a sanity check in the form of `np.require(val, dtype=np.int64, requirements='W')` verifies that the array is valid input for a `BlockPlacement`. This is unnecessary since the array originated from a `BlockPlacement`, for which this check was already done. (A sketch of the redundant round-trip follows below.)
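A minimal sketch of the redundant round-trip, in plain Python (it uses only the `BlockPlacement` constructor and its `as_array` property; the fastpath internals of the PR itself are not shown):

from pandas.lib import BlockPlacement

def copy_blockplacement_today(bp):
    # What user code must do while copy() is not exposed to Python:
    # round-trip through the constructor, which re-runs slice inference
    # and the np.require(val, dtype=np.int64, requirements='W') check on
    # an array that is already known to be valid.
    return BlockPlacement(bp.as_array.copy())

# With this PR, the round-trip disappears:
# bp2 = bp.copy()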
@ARF1 changed the title from "PERF: BlockPlacement.copy() & constructor fastpath" to "PERF: BlockPlacement.copy() speed-up" May 7, 2015
@jreback added the Performance (memory or execution speed) label May 7, 2015
@jreback (Contributor) commented May 7, 2015

can you run the perf suite and see where this helps? https://github.com/pydata/pandas/wiki/Performance-Testing

@ARF1 (Author) commented May 7, 2015

@jreback I apologise in advance for this disproportionately long post.

I tried running vbench but have two problems: (1) I am on Windows and cannot use the script, though I invoked vbench with what I think is the Windows equivalent; and (2) I am not using git but `hg-git`.

I tried to use Git for Windows to re-download my fork from GitHub and run vbench on it, but vbench produces the error message:

  File "C:\Anaconda\envs\pandas_dev\lib\subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'git rev-parse --short --verify master^{commit}' returned non-zero exit status 128

I am not sure which part of my setup is causing it. I know it is somewhat unlikely, but do you know of another pandas developer using Windows who could help me out?


Reason for the PR: allow users to create custom high-performance dataframe constructors:

I do not expect any significant speed improvements from this PR in typical pandas usage (if any at all): Index generation usually far outweighs BlockPlacement generation, and as far as I know one rarely happens without the other. This PR only comes into play in the following situation:

  1. RangeIndex (Introduction of RangeIndex #9977) is activated as default index and
  2. the user runs a truly funky, custom, high-performance dataframe constructor in order to use the pandas dataframe as a replacement for a numpy structured array data container.

With both of these in place, dataframe instantiation time then becomes comparable to numpy (a x28 speedup for small dataframes and x376 for large dataframes).

More importantly however, filling a dataframe allocated this way from a columnar database is x3 faster than filling an equivalent numpy structured array! Similarly, column-wise data analysis is faster with pandas dataframes than with numpy structured arrays.

I know that for your use case, the instantiation speed of a dataframe is irrelevant, but it would be nice to give people the option to roll their own highly-optimized (and probably fragile ;-) ) dataframe constructors. This PR allows that.

If you are concerned about performance regressions, I can restructure the PR so that copy() behaves identically to the current implementation (with slice inference) and the optimization is only enabled via an additional fastpath argument to copy(). That should eliminate the need to run vbench.
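From the caller's side, that opt-in variant could look like this (the `fastpath` keyword shown here is hypothetical, not part of the PR as submitted):

from pandas.lib import BlockPlacement

bp = BlockPlacement(slice(0, 5))
bp2 = bp.copy()                # would keep slice inference, as today
bp3 = bp.copy(fastpath=True)   # hypothetical flag: skip inference and the
                               # np.require() sanity check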


Funky dataframe constructor using BlockPlacement.copy():

import numpy as np

from pandas.core.internals import BlockManager
from pandas.core.frame import DataFrame
from pandas.core.generic import NDFrame
from pandas.core.common import CategoricalDtype
from pandas.core.categorical import Categorical
from pandas.lib import BlockPlacement
try:
    from pandas.core.index import Index, RangeIndex
except ImportError:
    try:
        from pandas.core.index import Int64Index
        def RangeIndex(start, stop, step, **kwargs):
            return Int64Index(np.arange(start, stop, step), **kwargs)
    except ImportError:
        pass


def allocate_like(df, size, keep_categories=False):
    """High-performance pandas dataframe constructor for dataframes consisting 
    only of numpy dtype columns. It creates a dataframe with the same columns as
    the provided template dataframe `df` and with `size` number of rows.

    ATTENTION: This constructor works ONLY for dataframes that contain ONLY columns 
               with NUMPY DTYPES. Date, Categorical, etc. columns are not supported!
    """

    # define axes
    # (ideally uses pandas/pandas#9977 for MUCH better performance with 
    #  large dataframes)
    axes = [df.columns, RangeIndex(0, size, 1, fastpath=True)]

    # allocate and create blocks
    blocks = []
    for block in df._data.blocks:
        new_shape = (block.values.shape[0], size)
        values = np.empty(shape=new_shape, dtype=block.dtype)

        # The following section is equivalent to:
        # block.make_block_same_class(values=values,
        #                             placement=block.mgr_locs.as_array)
        new_block = object.__new__(block.__class__)
        new_block.values = values
        new_block.ndim = values.ndim
        # uses pandas/pandas#10073
        new_block._mgr_locs = block.mgr_locs.copy()

        blocks.append(new_block)

    # create block manager
    # The following section is equivalent to:
    # mgr = BlockManager(blocks, axes, do_integrity_check=False, fastpath=True)
    mgr = object.__new__(BlockManager)
    mgr.axes = axes
    mgr.blocks = tuple(blocks)
    mgr._blknos = df._data._blknos.copy()
    mgr._blklocs = df._data._blklocs.copy()

    # create dataframe
    # The following section is equivalent to:
    # return DataFrame(mgr)
    result = object.__new__(DataFrame)
    #NDFrame.__init__(result, mgr, fastpath=True)
    object.__setattr__(result, 'is_copy', None)
    object.__setattr__(result, '_data', mgr)
    object.__setattr__(result, '_item_cache', {})
    return result
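A quick sanity check of the constructor above (a hypothetical session; it assumes the `copy()` exposure from this PR is in place):

import numpy as np
import pandas as pd

# Template frame with only numpy dtypes, then a larger allocation from it.
templ = pd.DataFrame(np.empty(0, dtype='i4,i4,f4,f4,f4'))
out = allocate_like(templ, size=10)

assert list(out.columns) == list(templ.columns)
assert list(out.dtypes) == list(templ.dtypes)
assert len(out) == 10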

Timing results comparing pandas dataframe to numpy structured array:

In [2]: import numpy as np

In [3]: import pandas as pd


# create template structures
In [4]: np_templ = np.empty(0, dtype='i4,i4,f4,f4,f4')

In [5]: pd_templ = pd.DataFrame(np_templ)


# small structure timings: x28 speedup
#
In [6]: %timeit np.empty(0, dtype=np_templ.dtype)
100000 loops, best of 3: 2.14 µs per loop

In [7]: %timeit np.empty(0, dtype='i4,i4,f4,f4,f4')
10000 loops, best of 3: 58 µs per loop

In [8]: %timeit pd.DataFrame(np.empty(0, dtype=np_templ.dtype))
1000 loops, best of 3: 786 µs per loop

In [8]: %timeit allocate_like(pd_templ, size=0)
10000 loops, best of 3: 28.3 µs per loop


# large structure timings: x376 speedup
#
In [9]: %timeit np.empty(int(1e6), dtype=np_templ.dtype)
10000 loops, best of 3: 100 µs per loop

In [10]: %timeit np.empty(int(1e6), dtype='i4,i4,f4,f4,f4')
10000 loops, best of 3: 159 µs per loop

In [11]: %timeit pd.DataFrame(np.empty(int(1e6), dtype=np_templ.dtype))
10 loops, best of 3: 56.4 ms per loop

In [12]: %timeit allocate_like(pd_templ, size=int(1e6))
10000 loops, best of 3: 150 µs per loop


# Simulate filling large mixed-dtype structures from a columnar database
#

In [13]: some_data = np.random.rand(1,int(1e6))

In [14]: arr = np.empty(int(1e6), dtype=np_templ.dtype)

In [15]: %%timeit
   ....: for name in arr.dtype.names:
   ....:     arr[name] = some_data
   ....:
10 loops, best of 3: 61.1 ms per loop

In [16]: df = pd.DataFrame(np.empty(int(1e6), dtype=np_templ.dtype))

In [17]: %%timeit
   ....: mgr = df._data
   ....: for name in arr.dtype.names:
   ....:     loc = mgr.items.get_loc(name)
   ....:     blkno = mgr._blknos[loc]
   ....:     blkloc = mgr._blklocs[loc]
   ....:     mgr.blocks[blkno].values[blkloc, :] = some_data
   ....:
10 loops, best of 3: 20.7 ms per loop
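For contrast, the idiomatic per-column fill below goes through `__setitem__` and its validation machinery, which is exactly what the block-level writes above bypass (not timed here):

# Idiomatic but slower equivalent of the block-level fill:
for name in df.columns:
    df[name] = some_data[0]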

@jreback (Contributor) commented May 7, 2015

@ARF1 I think you are missing the point about incremental development. RangeIndex is gr8, but it is not relevant to this particular discussion as it's not in the master branch. You are proposing a change which on the surface looks ok. If it has a perf speedup (or at least doesn't hurt anything), then that is important to know.

I certainly care about DataFrame construction performance. If you can improve it w/o breaking other things then gr8. My point from other threads is that it is relatively unimportant compared to most operations; that doesn't mean it doesn't matter, though. Even an incremental improvement at the end of the day is good.

You may or may not have improvements with various PRs, but they each need to be proven incrementally. If that is not possible, you can simply bundle them and request that all the changes go in at once.

However, the main reason for incremental changes is that it is far easier to review and think about the changes; if you are proposing a massive change then it will take quite some time to review (even after it passes all of the tests).

That is why incremental changes are much better in a mature project like pandas.

@jreback (Contributor) commented May 7, 2015

@ARF1 you ask whether I am concerned about performance regressions: certainly. But you haven't proven the case either way. The only way to know is to test.

maybe @jorisvandenbossche can help you out on Windows with vbench. It should work if you have all of the deps installed. I do recall a fair number of users running vbench correctly on Windows.

@jorisvandenbossche (Member) commented

sorry, can't really help. I use both Windows and Linux, but have always done my vbenches on Linux. I am not sure whether it is supposed to work on Windows.

@jreback (Contributor) commented Aug 5, 2015

not clear if this actually helps/hurts perf at all.

@jreback closed this Aug 5, 2015