Skip to content

PERF: DataFrame(ndarr, copy=True) #52440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Apr 5, 2023

Performance example:

>>> import numpy
>>> import pandas as pd, numpy as np
>>> data = numpy.random.rand(10_000_000, 2)
>>> df = pd.DataFrame(data, copy=True)
>>> %timeit df.sum()
164 ms ± 1.12 ms per loop  # main
13.2 ms ± 48.3 µs per loop   # this PR

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 on this without further investigation, this caused significant slowdowns in a lot of benchmarks when I looked at it a couple of weeks ago.

Please run them locally and report result

Edit: found my result

       before           after         ratio
     [26999ebe]       [47ed11f5]
+         763±6μs       1.68±0.1ms     2.20  arithmetic.Ops.time_frame_add(False, 1)
+        768±30μs      1.67±0.08ms     2.18  arithmetic.Ops.time_frame_add(False, 'default')
+        768±30μs      1.61±0.01ms     2.10  arithmetic.Ops.time_frame_mult(False, 'default')
+        783±20μs      1.57±0.01ms     2.01  arithmetic.Ops.time_frame_mult(False, 1)
+        784±30μs      1.33±0.03ms     1.69  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('add')
+         789±6μs      1.32±0.02ms     1.67  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('sub')
+         126±1μs          210±3μs     1.67  arithmetic.Ops2.time_frame_series_dot
+         485±2μs         786±10μs     1.62  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('eq')
+         485±3μs         783±20μs     1.61  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('ge')
+         488±8μs         784±20μs     1.61  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('le')
+        495±10μs         789±30μs     1.59  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('lt')
+        491±10μs         767±10μs     1.56  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('gt')
+         514±3μs         785±20μs     1.53  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('ne')
+     1.42±0.02ms      2.09±0.02ms     1.46  arithmetic.Ops.time_frame_mult(True, 1)
+     1.19±0.04ms      1.69±0.02ms     1.42  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('truediv')
+        971±40μs      1.34±0.04ms     1.38  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('mul')
+     1.74±0.01ms       2.39±0.1ms     1.37  arithmetic.Ops.time_frame_add(True, 1)
+      43.7±0.2μs       56.0±0.2μs     1.28  arithmetic.NumericInferOps.time_add(<class 'numpy.uint8'>)
+      43.6±0.1μs       55.4±0.2μs     1.27  arithmetic.NumericInferOps.time_multiply(<class 'numpy.int8'>)
+      43.8±0.3μs       55.1±0.2μs     1.26  arithmetic.NumericInferOps.time_multiply(<class 'numpy.uint8'>)
+      44.3±0.6μs       55.4±0.4μs     1.25  arithmetic.NumericInferOps.time_subtract(<class 'numpy.uint8'>)
+        44.2±1μs       55.3±0.1μs     1.25  arithmetic.NumericInferOps.time_add(<class 'numpy.int8'>)
+        967±20μs      1.20±0.08ms     1.24  arithmetic.Ops.time_frame_add(True, 'default')
+      58.3±0.2μs       70.5±0.3μs     1.21  arithmetic.NumericInferOps.time_add(<class 'numpy.int16'>)
+      58.5±0.1μs       70.4±0.5μs     1.20  arithmetic.NumericInferOps.time_multiply(<class 'numpy.int16'>)
+      58.6±0.6μs       70.4±0.4μs     1.20  arithmetic.NumericInferOps.time_subtract(<class 'numpy.uint16'>)
+      58.9±0.2μs       70.3±0.2μs     1.19  arithmetic.NumericInferOps.time_multiply(<class 'numpy.uint16'>)
+        968±40μs      1.14±0.03ms     1.17  arithmetic.Ops.time_frame_mult(True, 'default')
+        60.1±1μs       70.2±0.3μs     1.17  arithmetic.NumericInferOps.time_add(<class 'numpy.uint16'>)
-        45.9±2ms       35.3±0.4ms     0.77  arithmetic.BinaryOpsMultiIndex.time_binary_op_multiindex('add')
-        45.9±2ms       35.1±0.6ms     0.76  arithmetic.BinaryOpsMultiIndex.time_binary_op_multiindex('sub')
-        46.2±1ms       35.2±0.4ms     0.76  arithmetic.BinaryOpsMultiIndex.time_binary_op_multiindex('div')
-        46.1±2ms       35.0±0.4ms     0.76  arithmetic.BinaryOpsMultiIndex.time_binary_op_multiindex('mul')
-      2.91±0.2ms      2.20±0.03ms     0.76  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('sub')
-      3.09±0.5ms      2.31±0.07ms     0.75  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('mul')
-     5.81±0.09ms       4.05±0.4ms     0.70  arithmetic.Ops2.time_frame_int_mod
-       748±400μs          454±3μs     0.61  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function ne>)

this was only a subset of all benchmarks, so further investigation necessary

@topper-123
Copy link
Contributor Author

I've always had trouble using asv, IDK if'm doing it wrong? for example running asv continuous -f 1.1 upstream/main HEAD -b groupby.GroupByMethods takes 48 minutes on my macbook m1, which just isn't tenable for any serious benchmarking work.

How long does asv continuous -f 1.1 upstream/main HEAD -b groupby.GroupByMethods take for you?

@phofl
Copy link
Member

phofl commented Apr 6, 2023

Not sure I've ever ran the groupby ones, the frame benchmarks for example take around 5-10 minutes (not totally sure, would lean more towards 5) on my m2 air

@topper-123
Copy link
Contributor Author

So asv continuous -f 1.1 upstream/main HEAD -b frame_methods takes approx 5 min. in for you?

Did you install asv like in the docs i.e. (pip install git+https://github.com/airspeed-velocity/asv) or did you use conda or pip?

(I need to try figuring it out, so appreciate any help)

@phofl
Copy link
Member

phofl commented Apr 6, 2023

I think so yes, can re-run later today. It's definitely not terribly long

I used mamba (came when installing environment.yml)

@phofl
Copy link
Member

phofl commented Apr 6, 2023

5 was a bit optimistic, ran around 9ish minutes.

2 suggestions:

  • set the number of cores when compiling in asv.conf.json to the number of cores that your machine has (speeds it up quite nicely).
  • check that you are not on low power mode (battery settings). test suite takes 2.5 times as long for me when on low power mode

@topper-123
Copy link
Contributor Author

Great, thanks I will try that, alternatively do a new install using mamba.

@mroeschke mroeschke added Performance Memory or execution speed performance Copy / view semantics labels Apr 6, 2023
@topper-123
Copy link
Contributor Author

Ok, I've run asv tests, generally look good

asv continuous -f 1.1 upstream/main HEAD -b arithmetic

BENCHMARKS NOT SIGNIFICANTLY CHANGED.
asv continuous -f 1.1 upstream/main HEAD -b frame_ctor

BENCHMARKS NOT SIGNIFICANTLY CHANGED.
asv continuous -f 1.1 upstream/main HEAD -b stat_ops

BENCHMARKS NOT SIGNIFICANTLY CHANGED.
asv continuous -f 1.1 upstream/main HEAD -b groupby.AggFunctionsx

BENCHMARKS NOT SIGNIFICANTLY CHANGED.
asv continuous -f 1.1 upstream/main HEAD -b groupby

[100.00%] ··· groupby.MultiColumn.time_lambda_sum                                                                                                                                                  152±0.5ms
       before           after         ratio
     [c537b363]       [1160d3f6]
     <master>         <dataframe_2d_array_copy_true>
+      31.7±0.1μs         35.0±6μs     1.10  groupby.GroupByMethods.time_dtype_as_group('uint', 'cumsum', 'direct', 5)
-        310±80ms        246±0.6ms     0.79  groupby.AggEngine.time_dataframe_numba(True)

Everything else looks good so I guess it's jitters.

@topper-123
Copy link
Contributor Author

Generally, DataFrame construction with copy=True doesnøt seem to be suppported in the asv tests. Below are the main differencers that I've noticed.

>>> arr = np.random.randn(10_000_000, 2)
>>> %timeit df2 = pd.DataFrame(arr, copy=True)
13.5 ms ± 157 µs per loop  # main
21.5 ms ± 435 µs per loop  # this PR
>>> df.sum()
160 ms ± 469 µs per loop  # main
13.1 ms ± 57.9 µs per loop  # this PR

I.e. faster aggregation, but slower construction

@phofl
Copy link
Member

phofl commented Apr 12, 2023

Yeah sorry, running them agains main doesn't make much sense. You probably have to change the default to True and then run your change against that commit to get a better idea how things would change

@topper-123
Copy link
Contributor Author

topper-123 commented Apr 12, 2023

Hmm, setting the default copy=True and comparing against that will do a lot of other things also. In particular, every time DataFrame._constructor & Series._constructor_expanddim are called, a copy will be made, which will be quite costly. That isn't proposed in this PR, so is not really the same.

This PR doesn't do anything that DataFrame.copy doesn't do already, so will logically have comparable performance to dataframes that come out of DataFrame.copy(). So I don't see any possibility that e.g. arithmetic operations can be affected negatively from this.

I could add a test that does pd.DataFrame(data, copy=True/False).sum() somewhere to in the ASVs to document the case where this gets a performance boost?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 13, 2023
@mroeschke mroeschke requested a review from phofl August 1, 2023 17:24
@phofl
Copy link
Member

phofl commented Aug 1, 2023

I saw some very weird things when switching this for CoW. So I am not convinced that we should do this right now. I'd be more on-board with doing this in 3.0. I don't think that setting copy=True manually is a very common use-case anyway.

@mroeschke
Copy link
Member

Guess there's not much appetite for this before 3.0 so will mothball this PR in the meantime

@mroeschke mroeschke closed this Aug 7, 2023
@mroeschke mroeschke added the Mothballed Temporarily-closed PR the author plans to return to label Aug 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Copy / view semantics Mothballed Temporarily-closed PR the author plans to return to Performance Memory or execution speed performance Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: Inefficient data representation when building dataframe from NumPy array using copy=True
4 participants