0.13rc1+ vs. 0.12 vbench, perf regressions. #5660


Closed
ghost opened this issue Dec 7, 2013 · 28 comments
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version

@ghost commented Dec 7, 2013

30% slowdown and up.

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
melt_dataframe                               |   2.7920 |   2.1487 |   1.2994 |
concat_small_frames                          |  22.1807 |  17.0054 |   1.3043 |
dti_reset_index                              |   0.2950 |   0.2257 |   1.3070 |
frame_reindex_upcast                         |  28.8033 |  21.2270 |   1.3569 |
frame_fancy_lookup_all                       |  30.4507 |  22.1930 |   1.3721 |
timeseries_large_lookup_value                |   0.0383 |   0.0277 |   1.3811 |
frame_boolean_row_select                     |   0.4419 |   0.3170 |   1.3941 |
panel_from_dict_equiv_indexes                |  58.9810 |  41.6213 |   1.4171 |
panel_from_dict_same_index                   |  58.9364 |  41.5120 |   1.4197 |
frame_xs_row                                 |   0.0637 |   0.0443 |   1.4355 |
groupby_frame_singlekey_integer              |   4.3193 |   2.9940 |   1.4427 |
frame_ctor_nested_dict                       | 142.8297 |  98.7384 |   1.4465 |
frame_to_csv2                                | 191.4937 | 130.7974 |   1.4640 |
series_ctor_from_dict                        |   5.1107 |   3.4617 |   1.4764 |
groupby_multi_python                         | 175.8397 | 117.9717 |   1.4905 |
frame_to_csv                                 | 237.1720 | 154.0547 |   1.5395 |
series_string_vector_slice                   | 325.1186 | 208.9640 |   1.5559 |
indexing_panel_subset                        |   0.9351 |   0.5874 |   1.5919 |
frame_insert_500_columns_end                 | 198.1669 | 124.3424 |   1.5937 |
frame_ctor_nested_dict_int64                 | 173.6054 | 106.8807 |   1.6243 |
frame_loc_dups                               |   1.2674 |   0.7770 |   1.6311 |
frame_get_dtype_counts                       |   0.1843 |   0.1113 |   1.6552 |
ctor_index_array_string                      |   0.0393 |   0.0236 |   1.6667 |
frame_get_numeric_data                       |   0.1736 |   0.0997 |   1.7424 |
frame_constructor_ndarray                    |   0.1047 |   0.0587 |   1.7846 |
frame_iteritems_cached                       |   1.1443 |   0.6117 |   1.8707 |
indexing_dataframe_boolean_rows_object       |   0.9307 |   0.4853 |   1.9176 |
reshape_stack_simple                         |   2.6710 |   1.3684 |   1.9520 |
frame_iloc_dups                              |   0.5160 |   0.2420 |   2.1323 |
frame_xs_col                                 |   0.0720 |   0.0323 |   2.2260 |
dataframe_getitem_scalar                     |   0.0199 |   0.0087 |   2.3028 |
datamatrix_getitem_scalar                    |   0.0193 |   0.0083 |   2.3143 |
indexing_dataframe_boolean_rows              |   0.5593 |   0.2393 |   2.3374 |
series_constructor_ndarray                   |   0.0397 |   0.0167 |   2.3762 |
sparse_frame_constructor                     |  13.0450 |   4.7807 |   2.7287 |
frame_iteritems                              |  76.1917 |  27.6610 |   2.7545 |
indexing_frame_get_value_ix                  |   0.0123 |   0.0044 |   2.8182 |
series_getitem_scalar                        |   0.0199 |   0.0070 |   2.8523 |
frame_from_series                            |   0.1774 |   0.0534 |   3.3214 |
frame_fancy_lookup                           |   7.8727 |   1.7236 |   4.5675 |
series_get_value                             |   0.0123 |   0.0017 |   7.3810 |

Ruled out

frame_repr_tall                              |  49.2450 |   2.3817 |  20.6762 |
frame_repr_wide                              | 977.6090 |   1.1230 | 870.5710 |

#5550: truncated repr for large frames; a legitimate change.

@TomAugspurger (Contributor)

Do all of these need to be fixed before 0.13 is released? I've got an open issue about a bug in the Panel from_dict constructor, so I'm going to be poking around there anyway, but there's no way I'll be able to get to it within the next month.

@ghost (Author) commented Dec 7, 2013

Not all of these are legitimate regressions (some are false positives and some make sense, but some are unintentional). We can't know until we root-cause as many as possible. Often a single changeset affects multiple vbenches.

As to "blocker status": if a bunch of core operations are twice as slow, that definitely deserves some attention before cutting a release and accepting them as fait accompli.

Any help shrinking the pile is useful; simply bisecting to a commit is a lot of help and requires just a little git-fu. Feel free to add comments with what you find and we can triage the list as we go along.

@jreback (Contributor) commented Dec 7, 2013

Here I have run the benchmarks 5x, comparing 0.12 to current master, ignoring trivial ones (e.g. < 1 ms).

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
series_value_counts_int64                    |   2.6530 |   2.0230 |   1.3114 |
groupby_frame_apply_overhead                 |  10.9626 |   8.3477 |   1.3133 |
mask_floats                                  |   3.0997 |   2.3553 |   1.3160 |
reindex_multiindex                           |   1.4647 |   1.0816 |   1.3542 |
series_timestamp_compare                     |   2.5460 |   1.8457 |   1.3794 |
panel_from_dict_equiv_indexes                |  34.3850 |  24.6427 |   1.3953 |
panel_from_dict_same_index                   |  34.5701 |  24.7107 |   1.3990 |
period_setitem                               | 217.5270 | 139.4180 |   1.5602 | open issue for 0.14 about this to fix
groupby_multi_python                         | 121.4403 |  76.8107 |   1.5810 | unavoidable construction overhead
concat_series_axis1                          |  94.9897 |  59.8777 |   1.5864 |
join_dataframe_index_single_key_bigger       |   8.7953 |   5.3890 |   1.6321 |
frame_apply_user_func                        | 228.5024 | 139.5910 |   1.6369 | unavoidable construction overhead
reshape_stack_simple                         |   2.1490 |   1.1697 |   1.8373 | non-issue
concat_small_frames                          |  22.6560 |  12.0120 |   1.8861 | doing more than in 0.12
timeseries_period_downsample_mean            |  10.3310 |   5.4344 |   1.9011 |
reindex_daterange_backfill                   |   1.2367 |   0.6413 |   1.9283 | non-issue
reindex_daterange_pad                        |   1.2403 |   0.6430 |   1.9289 | non-issue
frame_iteritems                              |  39.3244 |  19.7013 |   1.9960 | has already been heavily optimized and just shows construction overhead
frame_fancy_lookup                           |   3.5677 |   1.6206 |   2.2015 | non-issue
sparse_frame_constructor                     |  12.2011 |   5.3713 |   2.2715 | unavoidable, constructor doing more work than in 0.12
frame_repr_tall                              |  23.2603 |   2.0727 |  11.2225 | recently addressed
frame_wide_repr                              | 511.4203 |   0.8390 | 609.5630 | recently addressed
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Most of these slowdowns come from a difference in construction overhead relative to 0.12. When doing an apply/groupby and splitting the object in Cython, a user function is called, so the object passed must be a Series. There is construction overhead for this.

Certainly would appreciate some eyes on this, but IMHO none of these are blockers.

@ghost (Author) commented Dec 7, 2013

The last two were not recently addressed; that's after the fixes. But they do more work (truncated repr rather than the info view), so no fix is required.

What does "non-issue" stand for? False positive?

@jreback (Contributor) commented Dec 7, 2013

I just realized that

@ghost (Author) commented Dec 7, 2013

I agree that < 1ms differences are often false positives, but a few µs-scale ops sit in inner loops, so the noise can actually have a large impact.

Seems like the constructor stuff has a large associated slowdown; can you name the PR? Frames and such are not often constructed, so if the overhead is O(1) and reasonably small, it's not an issue, as you say.

@jreback (Contributor) commented Dec 7, 2013

"non-issue" means it's there, but it's a less common operation (e.g. frame_fancy_lookup is actually the lookup call, not indexing), or the vbench already operates on a very large set (e.g. reindex_daterange...), so not a big deal

@jreback (Contributor) commented Dec 7, 2013

#3482 started it all (there are benchmarks there, about halfway down). It is possible things slipped a bit after that. A lot of the duplicate-handling code is slow, so we generally have a fast path for things like that (e.g. if your frame has duplicate labels, then columns need to be selected by iloc rather than the faster [])
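As a quick illustration of the duplicate-label slow path (a hedged sketch, not code from this thread; the frame and labels are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with duplicate column labels: 'A' appears twice,
# so a label no longer maps to exactly one column.
df = pd.DataFrame(np.random.randn(4, 3), columns=['A', 'A', 'B'])

# Label selection must gather every matching column, so df['A'] returns
# a DataFrame (two columns) instead of a single Series.
dup_sel = df['A']
print(type(dup_sel).__name__, dup_sel.shape)   # DataFrame (4, 2)

# Positional selection via iloc still addresses one column unambiguously,
# which is why the duplicate-handling code routes through it.
col0 = df.iloc[:, 0]
print(type(col0).__name__)                     # Series
```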

@jreback (Contributor) commented Dec 7, 2013

The construction overhead is on Series; it's still a few function calls slower than a straight ndarray construction (as it now uses a simplified BlockManager).
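A minimal sketch of measuring that overhead with the stdlib timeit module (assumes a modern pandas; absolute numbers are machine- and version-dependent, only the gap matters):

```python
import timeit

import numpy as np
import pandas as pd

arr = np.random.randn(10000)

# Time Series construction against a raw ndarray pass-through over the
# same data; the difference is the per-Series construction overhead.
t_series = timeit.timeit(lambda: pd.Series(arr), number=1000)
t_ndarray = timeit.timeit(lambda: np.asarray(arr), number=1000)

print('Series(arr):  %.1f us per call' % (t_series / 1000 * 1e6))
print('np.asarray(): %.1f us per call' % (t_ndarray / 1000 * 1e6))
```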

@ghost (Author) commented Dec 7, 2013

What's the panel to_dict and/or set_item issue? We'll add a perf notice to the issue tracking it and punt.

@jreback (Contributor) commented Dec 7, 2013

#5155

@ghost ghost mentioned this issue Dec 7, 2013
@jreback (Contributor) commented Dec 7, 2013

I don't see an issue about panel to_dict (from_dict is a small slowdown)

@ghost (Author) commented Dec 7, 2013

frame_iteritems has O(N) behavior in the input, so it's not constant overhead per frame. Granted, it's a smell (unvectorized), but existing code becoming ~20x as slow on average (more detailed measurements with summary stats below) is an issue.

Is this #3482 as well? What did users get in return?

%  ./test_perf.sh -H -r iteritems -N 10 -S 

0.12:

------------------------------------------------------------------------------------------------------------------------------------------------------------
Test name                                    |    #0    |    #1    |    #2    |    #3    |    #4    |    #5    |    #6    |    #7    |    #8    |    #9    |
------------------------------------------------------------------------------------------------------------------------------------------------------------
frame_iteritems_cached                       |   0.0810 |   0.0850 |   0.0813 |   0.0834 |   0.0817 |   0.0840 |   0.0823 |   0.0837 |   0.0819 |   0.0846 |
frame_iteritems                              |   2.8053 |   2.7980 |   2.7986 |   2.7970 |   2.7916 |   2.7956 |   2.8013 |   2.7913 |   2.7907 |   2.8013 |
------------------------------------------------------------------------------------------------------------------------------------------------------------
Test name                                    |    #0    |    #1    |    #2    |    #3    |    #4    |    #5    |    #6    |    #7    |    #8    |    #9    |
------------------------------------------------------------------------------------------------------------------------------------------------------------

Target [8c0a34f] : RLS: set released to True, edit release dates

                        count      mean       std       min       25%       50%       75%       max
frame_iteritems_cached     10  0.082898  0.001436  0.080983  0.081758  0.082850  0.083923  0.085036
frame_iteritems            10  2.797087  0.004857  2.790689  2.792637  2.797524  2.800604  2.805312

55e624d (0.13rc1+)

------------------------------------------------------------------------------------------------------------------------------------------------------------
Test name                                    |    #0    |    #1    |    #2    |    #3    |    #4    |    #5    |    #6    |    #7    |    #8    |    #9    |
------------------------------------------------------------------------------------------------------------------------------------------------------------
frame_iteritems_cached                       |   1.1127 |   1.1493 |   1.1156 |   1.1476 |   1.1197 |   1.1223 |   1.1117 |   1.1370 |   1.1060 |   1.1547 |
frame_iteritems                              |  73.0940 |  72.7213 |  72.8296 |  72.9273 |  73.0057 |  73.1320 |  72.8940 |  73.2087 |  73.0950 |  72.7540 |
------------------------------------------------------------------------------------------------------------------------------------------------------------
Test name                                    |    #0    |    #1    |    #2    |    #3    |    #4    |    #5    |    #6    |    #7    |    #8    |    #9    |
------------------------------------------------------------------------------------------------------------------------------------------------------------

Target [55e624d] : Merge pull request #5615 from gjreda/master

                        count       mean       std        min        25%       50%       75%       max
frame_iteritems_cached     10   1.127664  0.017851   1.106024   1.113435   1.121004   1.144946   1.154661
frame_iteritems            10  72.966162  0.167046  72.721322  72.845737  72.966496  73.094745  73.208729


@jreback (Contributor) commented Dec 7, 2013

I think we are running the same bench, but here is the code-vs-code

v0.12 (release)

In [9]: %timeit f()
10 loops, best of 3: 20.1 ms per loop

current master

In [2]: %timeit f()
10 loops, best of 3: 39.3 ms per loop
In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:df = DataFrame(randn(10000, 1000))
:df2 = DataFrame(randn(3000,1),columns=['A'])
:df3 = DataFrame(randn(3000,1))
:
:def f():
:    if hasattr(df, '_item_cache'):
:        df._item_cache.clear()
:    for name, col in df.iteritems():
:        pass
:
:def g():
:    for name, col in df.iteritems():
:        pass
:
:def h():
:    for i in xrange(10000):
:        df2['A']
:
:def j():
:    for i in xrange(10000):
:        df3[0]
:--
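For reference, a self-contained Python 3 version of the pasted benchmark (xrange → range, items() in place of the since-removed iteritems(), explicit imports, and sizes trimmed so it runs quickly; _item_cache is an internal attribute and may not exist in newer pandas, hence the hasattr guard):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 100))
df2 = pd.DataFrame(np.random.randn(3000, 1), columns=['A'])
df3 = pd.DataFrame(np.random.randn(3000, 1))

def f():
    # Clear the item cache so every column iteration pays the full
    # Series construction cost (the uncached vbench case).
    if hasattr(df, '_item_cache'):
        df._item_cache.clear()
    for name, col in df.items():
        pass

def g():
    # Cached variant: repeated iteration can reuse previously built Series.
    for name, col in df.items():
        pass

def h():
    # Repeated single-column lookup by string label.
    for i in range(10000):
        df2['A']

def j():
    # Repeated single-column lookup by integer label.
    for i in range(10000):
        df3[0]
```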

@jreback (Contributor) commented Dec 7, 2013

This measures pure Series construction overhead, which is basically 1.5x slower (and causes 99% of the slowdowns above)

0.12

In [4]: arr = randn(10000)

In [5]: %timeit Series(arr)
10000 loops, best of 3: 47.4 µs per loop

master

In [1]: arr = randn(10000)

In [2]: %timeit Series(arr)
10000 loops, best of 3: 33.3 µs per loop

@ghost (Author) commented Dec 7, 2013

2x vs. 20x? I'll remeasure, maybe I fumbled it.

@ghost (Author) commented Dec 7, 2013

Same thing again: ~25x

%  ./test_perf.sh -H -r iteritems -N 10 -S -q
0.12

                        count      mean       std       min       25%       50%       75%       max
frame_iteritems_cached     25  0.062415  0.006150  0.060002  0.060638  0.060956  0.061353  0.091632
frame_iteritems            25  2.746789  0.010132  2.732674  2.741337  2.745946  2.751033  2.784967


0.13.0rc1-36-g55e624d
                        count      mean       std       min       25%       50%       75%       max
frame_iteritems_cached     25   1.149158  0.014892   1.127323   1.140992   1.150370   1.153310   1.190027  
frame_iteritems            25  74.445496  0.260619  74.016333  74.275335   74.384054  74.534019  75.117668  

My original measurement showed 2x as well. Can't explain it.

@jreback (Contributor) commented Dec 7, 2013

It seems (and I may have been the culprit) that the test frame was DataFrame(10000, 100) in 0.12, but is DataFrame(10000, 1000) in master... so about a 2x slowdown (as the individual tests show)

@ghost (Author) commented Dec 7, 2013

Right, always check out vb_suite/ from master before running the tests on each version; my bad. I'll check again.

@ghost (Author) commented Dec 7, 2013

Yup. 2x is still a drop; what was the upside?

@jreback (Contributor) commented Dec 7, 2013

Series inherits from NDFrame, so there's consistency in methods across pandas.

@ghost (Author) commented Dec 7, 2013

hard sell.

@dragoljub

So I'm concerned about the following: are any of these going to slow down by more than O(1), e.g. get slower the more rows you process? Especially the apply and concat functions.

groupby_multi_python
frame_apply_user_func
concat_small_frames

@hayd (Contributor) commented Dec 10, 2013

@dragoljub Perhaps it's worth putting in some larger vbench examples of the things you're concerned about?

@ghost (Author) commented Dec 11, 2013

@dragoljub, as jreback suggests, you should not see a change in O(), just worse constants.
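That claim can be checked empirically along these lines (a sketch assuming current pandas, with items() in place of iteritems(); timings are noisy, so only the rough ratio is meaningful):

```python
import timeit

import numpy as np
import pandas as pd

def time_iteritems(nrows, ncols, repeat=3):
    """Best-of-`repeat` time for iterating all columns of a random frame."""
    df = pd.DataFrame(np.random.randn(nrows, ncols))
    return min(timeit.repeat(lambda: [c for _, c in df.items()],
                             number=10, repeat=repeat))

# If the overhead is a per-column constant (unchanged O()), doubling the
# column count should roughly double the runtime, not blow it up.
t1 = time_iteritems(1000, ncols=50)
t2 = time_iteritems(1000, ncols=100)
print('ratio for 2x columns: %.2f' % (t2 / t1))
```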

I can't locate further regressions beyond #3482. The jitteriness in vbench measurements doesn't help either, and the erratic behavior goes beyond just an outlier here and there. At least some of the perf differences are lower than suggested above, due to vbench noise.

I don't plan to chase this any more. @jreback, close?

@jreback (Contributor) commented Dec 11, 2013

Yep... unless someone has a specific perf regression.

@dragoljub

Here is some example code. I was wondering how the df.apply() slowdown would impact computing 20k chi-squared tests? Currently, using the apply method, it takes 1.58 seconds on my machine with pandas 0.12.0.

If it's a constant-factor slowdown then fine, but if it grows with more rows then it may become problematic.

import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency


chi2 = lambda x: chi2_contingency(x.reshape(2,2), correction=True)[1]

df_test = pd.DataFrame(np.random.randint(20000, size=(10000,4)), columns=['A','B', 'C', 'D'])

%timeit df_test.apply(chi2, raw=True, axis=1)
1 loops, best of 3: 1.58 s per loop

pd.__version__
Out[5]: '0.12.0'

np.__version__
Out[6]: '1.7.1'

@ghost (Author) commented Dec 11, 2013

Linear means it grows with more rows, but I understand what you mean.

Running that example in 0.12 vs. 0.13rc shows perf is linear in row count for both, as expected, and no real change in total runtime.

Closing now. If you come across specific regressions you're concerned about, please open an issue.

@ghost ghost closed this as completed Dec 11, 2013