0.13rc1+ vs. 0.12 vbench, perf regressions. #5660


Closed
ghost opened this issue Dec 7, 2013 · 28 comments
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version

@ghost commented Dec 7, 2013

30% slowdown and up.

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
melt_dataframe                               |   2.7920 |   2.1487 |   1.2994 |
concat_small_frames                          |  22.1807 |  17.0054 |   1.3043 |
dti_reset_index                              |   0.2950 |   0.2257 |   1.3070 |
frame_reindex_upcast                         |  28.8033 |  21.2270 |   1.3569 |
frame_fancy_lookup_all                       |  30.4507 |  22.1930 |   1.3721 |
timeseries_large_lookup_value                |   0.0383 |   0.0277 |   1.3811 |
frame_boolean_row_select                     |   0.4419 |   0.3170 |   1.3941 |
panel_from_dict_equiv_indexes                |  58.9810 |  41.6213 |   1.4171 |
panel_from_dict_same_index                   |  58.9364 |  41.5120 |   1.4197 |
frame_xs_row                                 |   0.0637 |   0.0443 |   1.4355 |
groupby_frame_singlekey_integer              |   4.3193 |   2.9940 |   1.4427 |
frame_ctor_nested_dict                       | 142.8297 |  98.7384 |   1.4465 |
frame_to_csv2                                | 191.4937 | 130.7974 |   1.4640 |
series_ctor_from_dict                        |   5.1107 |   3.4617 |   1.4764 |
groupby_multi_python                         | 175.8397 | 117.9717 |   1.4905 |
frame_to_csv                                 | 237.1720 | 154.0547 |   1.5395 |
series_string_vector_slice                   | 325.1186 | 208.9640 |   1.5559 |
indexing_panel_subset                        |   0.9351 |   0.5874 |   1.5919 |
frame_insert_500_columns_end                 | 198.1669 | 124.3424 |   1.5937 |
frame_ctor_nested_dict_int64                 | 173.6054 | 106.8807 |   1.6243 |
frame_loc_dups                               |   1.2674 |   0.7770 |   1.6311 |
frame_get_dtype_counts                       |   0.1843 |   0.1113 |   1.6552 |
ctor_index_array_string                      |   0.0393 |   0.0236 |   1.6667 |
frame_get_numeric_data                       |   0.1736 |   0.0997 |   1.7424 |
frame_constructor_ndarray                    |   0.1047 |   0.0587 |   1.7846 |
frame_iteritems_cached                       |   1.1443 |   0.6117 |   1.8707 |
indexing_dataframe_boolean_rows_object       |   0.9307 |   0.4853 |   1.9176 |
reshape_stack_simple                         |   2.6710 |   1.3684 |   1.9520 |
frame_iloc_dups                              |   0.5160 |   0.2420 |   2.1323 |
frame_xs_col                                 |   0.0720 |   0.0323 |   2.2260 |
dataframe_getitem_scalar                     |   0.0199 |   0.0087 |   2.3028 |
datamatrix_getitem_scalar                    |   0.0193 |   0.0083 |   2.3143 |
indexing_dataframe_boolean_rows              |   0.5593 |   0.2393 |   2.3374 |
series_constructor_ndarray                   |   0.0397 |   0.0167 |   2.3762 |
sparse_frame_constructor                     |  13.0450 |   4.7807 |   2.7287 |
frame_iteritems                              |  76.1917 |  27.6610 |   2.7545 |
indexing_frame_get_value_ix                  |   0.0123 |   0.0044 |   2.8182 |
series_getitem_scalar                        |   0.0199 |   0.0070 |   2.8523 |
frame_from_series                            |   0.1774 |   0.0534 |   3.3214 |
frame_fancy_lookup                           |   7.8727 |   1.7236 |   4.5675 |
series_get_value                             |   0.0123 |   0.0017 |   7.3810 |

Ruled out

frame_repr_tall                              |  49.2450 |   2.3817 |  20.6762 |
frame_repr_wide                              | 977.6090 |   1.1230 | 870.5710 |

#5550: truncated repr for large frames; a legitimate change.

@TomAugspurger (Contributor)

Do all of these need to be fixed before 0.13 is released? I've got an open issue about a bug in the Panel from_dict constructor, so I'm going to be poking around there anyway, but there's no way I'll be able to get to it within the next month.

@ghost (Author) commented Dec 7, 2013

Not all of these are legitimate regressions (some are false positives and some make sense, but some are unintentional). We can't know until we root-cause as many as possible. Often a single changeset affects multiple vbenches.

As to "blocker status": if a bunch of core operations are twice as slow, that definitely deserves some attention before cutting a release and accepting them as fait accompli.

Any help shrinking the pile is useful; simply bisecting to a commit is a lot of help and requires just a little git-fu. Feel free to add comments with what you find and we can triage the list as we go along.

@jreback (Contributor) commented Dec 7, 2013

Here I have run the benchmarks 5x, comparing 0.12 to current master, ignoring trivial ones (e.g. < 1 ms).

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
series_value_counts_int64                    |   2.6530 |   2.0230 |   1.3114 |
groupby_frame_apply_overhead                 |  10.9626 |   8.3477 |   1.3133 |
mask_floats                                  |   3.0997 |   2.3553 |   1.3160 |
reindex_multiindex                           |   1.4647 |   1.0816 |   1.3542 |
series_timestamp_compare                     |   2.5460 |   1.8457 |   1.3794 |
panel_from_dict_equiv_indexes                |  34.3850 |  24.6427 |   1.3953 |
panel_from_dict_same_index                   |  34.5701 |  24.7107 |   1.3990 |
period_setitem                               | 217.5270 | 139.4180 |   1.5602 | open issue for 0.14 about this to fix
groupby_multi_python                         | 121.4403 |  76.8107 |   1.5810 | unavoidable construction overhead
concat_series_axis1                          |  94.9897 |  59.8777 |   1.5864 |
join_dataframe_index_single_key_bigger       |   8.7953 |   5.3890 |   1.6321 |
frame_apply_user_func                        | 228.5024 | 139.5910 |   1.6369 | unavoidable construction overhead
reshape_stack_simple                         |   2.1490 |   1.1697 |   1.8373 | non-issue
concat_small_frames                          |  22.6560 |  12.0120 |   1.8861 | doing more than in 0.12
timeseries_period_downsample_mean            |  10.3310 |   5.4344 |   1.9011 |
reindex_daterange_backfill                   |   1.2367 |   0.6413 |   1.9283 | non-issue
reindex_daterange_pad                        |   1.2403 |   0.6430 |   1.9289 | non-issue
frame_iteritems                              |  39.3244 |  19.7013 |   1.9960 | has already been heavily optimized and just shows construction overhead
frame_fancy_lookup                           |   3.5677 |   1.6206 |   2.2015 | non-issue
sparse_frame_constructor                     |  12.2011 |   5.3713 |   2.2715 | unavoidable, constructor doing more work than in 0.12
frame_repr_tall                              |  23.2603 |   2.0727 |  11.2225 | recently addressed
frame_wide_repr                              | 511.4203 |   0.8390 | 609.5630 | recently addressed
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Most of these slowdowns come from a difference in construction overhead relative to 0.12. When doing an apply/groupby and splitting the object in Cython, a user function is called, so the object passed must be a Series. There is construction overhead for this.

Certainly would appreciate some eyes on this, but IMHO none of these are blockers.

@ghost (Author) commented Dec 7, 2013

The last two were not recently addressed; that's after the fixes. But they do more work (truncated repr rather than the info view), so no fix is required.

What does "non-issue" stand for? False positive?

@jreback (Contributor) commented Dec 7, 2013

I just realized that

@ghost (Author) commented Dec 7, 2013

I agree that < 1ms differences are often false positives, but a few µs-scale ops sit in inner loops, so the noise can actually have a large impact.

Seems like the constructor stuff has a large associated slowdown; can you name the PR? Frames and such are not often constructed, so if the overhead is O(1) and reasonably small, it's not an issue, as you say.

@jreback (Contributor) commented Dec 7, 2013

"non-issue" means it's there, but it's a less common operation (e.g. frame_fancy_lookup is actually the lookup call, not indexing), or the vbench already operates on a very large set (e.g. reindex_daterange...), so not a big deal

@jreback (Contributor) commented Dec 7, 2013

#3482 started it all (there are benchmarks there, about halfway down). It is possible things slipped a bit after that. A lot of the duplicate-handling code is slow, so we generally have a fast path for things like that (e.g. if your frame has duplicate labels, then columns need to be selected by iloc rather than the faster [])
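As a quick illustration of the duplicate-label slow path (a hedged sketch, not code from this thread; the frame and labels are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with duplicate column labels: 'A' appears twice,
# so a label no longer maps to exactly one column.
df = pd.DataFrame(np.random.randn(4, 3), columns=['A', 'A', 'B'])

# Label selection must gather every matching column, so df['A'] returns
# a DataFrame (two columns) instead of a single Series.
dup_sel = df['A']
print(type(dup_sel).__name__, dup_sel.shape)   # DataFrame (4, 2)

# Positional selection via iloc still addresses one column unambiguously,
# which is why the duplicate-handling code routes through it.
col0 = df.iloc[:, 0]
print(type(col0).__name__)                     # Series
```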

@jreback (Contributor) commented Dec 7, 2013

The construction overhead is on Series; it's still a few function calls slower than a straight ndarray construction (as it now uses a simplified BlockManager).
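A minimal sketch of measuring that overhead with the stdlib timeit module (assumes a modern pandas; absolute numbers are machine- and version-dependent, only the gap matters):

```python
import timeit

import numpy as np
import pandas as pd

arr = np.random.randn(10000)

# Time Series construction against a raw ndarray pass-through over the
# same data; the difference is the per-Series construction overhead.
t_series = timeit.timeit(lambda: pd.Series(arr), number=1000)
t_ndarray = timeit.timeit(lambda: np.asarray(arr), number=1000)

print('Series(arr):  %.1f us per call' % (t_series / 1000 * 1e6))
print('np.asarray(): %.1f us per call' % (t_ndarray / 1000 * 1e6))
```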

@ghost (Author) commented Dec 7, 2013

What's the panel to_dict and/or set_item issue? We'll add a perf notice to the issue tracking it and punt.

@jreback (Contributor) commented Dec 7, 2013

#5155

@ghost ghost mentioned this issue Dec 7, 2013
@jreback (Contributor) commented Dec 7, 2013

I don't see an issue about panel to_dict (from_dict is a small slowdown)

@ghost (Author) commented Dec 7, 2013

frame_iteritems has O(N) behavior in the input, so it's not constant overhead per frame. Granted, it's a smell (unvectorized), but existing code becoming ~20x as slow on average (more detailed measurements with summary stats below) is an issue.

Is this #3482 as well? What did users get in return?

%  ./test_perf.sh -H -r iteritems -N 10 -S 

0.12:

------------------------------------------------------------------------------------------------------------------------------------------------------------
Test name                                    |    #0    |    #1    |    #2    |    #3    |    #4    |    #5    |    #6    |    #7    |    #8    |    #9    |
------------------------------------------------------------------------------------------------------------------------------------------------------------
frame_iteritems_cached                       |   0.0810 |   0.0850 |   0.0813 |   0.0834 |   0.0817 |   0.0840 |   0.0823 |   0.0837 |   0.0819 |   0.0846 |
frame_iteritems                              |   2.8053 |   2.7980 |   2.7986 |   2.7970 |   2.7916 |   2.7956 |   2.8013 |   2.7913 |   2.7907 |   2.8013 |
------------------------------------------------------------------------------------------------------------------------------------------------------------
Test name                                    |    #0    |    #1    |    #2    |    #3    |    #4    |    #5    |    #6    |    #7    |    #8    |    #9    |
------------------------------------------------------------------------------------------------------------------------------------------------------------

Target [8c0a34f] : RLS: set released to True, edit release dates

                        count      mean       std       min       25%       50%       75%       max
frame_iteritems_cached     10  0.082898  0.001436  0.080983  0.081758  0.082850  0.083923  0.085036
frame_iteritems            10  2.797087  0.004857  2.790689  2.792637  2.797524  2.800604  2.805312

55e624d (0.13rc1+)

------------------------------------------------------------------------------------------------------------------------------------------------------------
Test name                                    |    #0    |    #1    |    #2    |    #3    |    #4    |    #5    |    #6    |    #7    |    #8    |    #9    |
------------------------------------------------------------------------------------------------------------------------------------------------------------
frame_iteritems_cached                       |   1.1127 |   1.1493 |   1.1156 |   1.1476 |   1.1197 |   1.1223 |   1.1117 |   1.1370 |   1.1060 |   1.1547 |
frame_iteritems                              |  73.0940 |  72.7213 |  72.8296 |  72.9273 |  73.0057 |  73.1320 |  72.8940 |  73.2087 |  73.0950 |  72.7540 |
------------------------------------------------------------------------------------------------------------------------------------------------------------
Test name                                    |    #0    |    #1    |    #2    |    #3    |    #4    |    #5    |    #6    |    #7    |    #8    |    #9    |
------------------------------------------------------------------------------------------------------------------------------------------------------------

Target [55e624d] : Merge pull request #5615 from gjreda/master

                        count       mean       std        min        25%       50%       75%       max
frame_iteritems_cached     10   1.127664  0.017851   1.106024   1.113435   1.121004   1.144946   1.154661
frame_iteritems            10  72.966162  0.167046  72.721322  72.845737  72.966496  73.094745  73.208729


@jreback (Contributor) commented Dec 7, 2013

I think we are running the same bench, but here is the code-vs-code

v0.12 (release)

In [9]: %timeit f()
10 loops, best of 3: 20.1 ms per loop

current master

In [2]: %timeit f()
10 loops, best of 3: 39.3 ms per loop
In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:df = DataFrame(randn(10000, 1000))
:df2 = DataFrame(randn(3000,1),columns=['A'])
:df3 = DataFrame(randn(3000,1))
:
:def f():
:    if hasattr(df, '_item_cache'):
:        df._item_cache.clear()
:    for name, col in df.iteritems():
:        pass
:
:def g():
:    for name, col in df.iteritems():
:        pass
:
:def h():
:    for i in xrange(10000):
:        df2['A']
:
:def j():
:    for i in xrange(10000):
:        df3[0]
:--
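For reference, a self-contained Python 3 version of the pasted benchmark (xrange → range, items() in place of the since-removed iteritems(), explicit imports, and sizes trimmed so it runs quickly; _item_cache is an internal attribute and may not exist in newer pandas, hence the hasattr guard):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 100))
df2 = pd.DataFrame(np.random.randn(3000, 1), columns=['A'])
df3 = pd.DataFrame(np.random.randn(3000, 1))

def f():
    # Clear the item cache so every column iteration pays the full
    # Series construction cost (the uncached vbench case).
    if hasattr(df, '_item_cache'):
        df._item_cache.clear()
    for name, col in df.items():
        pass

def g():
    # Cached variant: repeated iteration can reuse previously built Series.
    for name, col in df.items():
        pass

def h():
    # Repeated single-column lookup by string label.
    for i in range(10000):
        df2['A']

def j():
    # Repeated single-column lookup by integer label.
    for i in range(10000):
        df3[0]
```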

@jreback (Contributor) commented Dec 7, 2013

This measures pure Series construction overhead, which is basically 1.5x slower (and causes 99% of the slowdowns above)

0.12

In [4]: arr = randn(10000)

In [5]: %timeit Series(arr)
10000 loops, best of 3: 47.4 µs per loop

master

In [1]: arr = randn(10000)

In [2]: %timeit Series(arr)
10000 loops, best of 3: 33.3 µs per loop

@ghost (Author) commented Dec 7, 2013

2x vs. 20x? I'll remeasure, maybe I fumbled it.

@ghost (Author) commented Dec 7, 2013

Same thing again: ~25x

%  ./test_perf.sh -H -r iteritems -N 10 -S -q
0.12

                        count      mean       std       min       25%       50%       75%       max
frame_iteritems_cached     25  0.062415  0.006150  0.060002  0.060638  0.060956  0.061353  0.091632
frame_iteritems            25  2.746789  0.010132  2.732674  2.741337  2.745946  2.751033  2.784967


0.13.0rc1-36-g55e624d
                        count      mean       std       min       25%       50%       75%       max
frame_iteritems_cached     25   1.149158  0.014892   1.127323   1.140992   1.150370   1.153310   1.190027  
frame_iteritems            25  74.445496  0.260619  74.016333  74.275335   74.384054  74.534019  75.117668  

My original measurement showed 2x as well. Can't explain it.

@jreback (Contributor) commented Dec 7, 2013

It seems (and I may have been the culprit) that the test frame was DataFrame(10000, 100) in 0.12, but is DataFrame(10000, 1000) in master... so about a 2x slowdown (as the individual tests show)

@ghost (Author) commented Dec 7, 2013

Right, always check out vb_suite/ from master before running the tests on each version; my bad. I'll check again.

@ghost (Author) commented Dec 7, 2013

Yup. 2x is still a drop; what was the upside?

@jreback (Contributor) commented Dec 7, 2013

Series inherits from NDFrame, so there's consistency in methods across pandas.

@ghost (Author) commented Dec 7, 2013

hard sell.

@dragoljub

So I'm concerned about the following: are any of these going to slow down by more than O(1), e.g. get slower the more rows you process? Especially the apply and concat functions.

groupby_multi_python
frame_apply_user_func
concat_small_frames

@hayd (Contributor) commented Dec 10, 2013

@dragoljub Perhaps it's worth putting in some larger vbench examples of the things you're concerned about?

@ghost (Author) commented Dec 11, 2013

@dragoljub, as jreback suggests, you should not see a change in O(), just worse constants.
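That claim can be checked empirically along these lines (a sketch assuming current pandas, with items() in place of iteritems(); timings are noisy, so only the rough ratio is meaningful):

```python
import timeit

import numpy as np
import pandas as pd

def time_iteritems(nrows, ncols, repeat=3):
    """Best-of-`repeat` time for iterating all columns of a random frame."""
    df = pd.DataFrame(np.random.randn(nrows, ncols))
    return min(timeit.repeat(lambda: [c for _, c in df.items()],
                             number=10, repeat=repeat))

# If the overhead is a per-column constant (unchanged O()), doubling the
# column count should roughly double the runtime, not blow it up.
t1 = time_iteritems(1000, ncols=50)
t2 = time_iteritems(1000, ncols=100)
print('ratio for 2x columns: %.2f' % (t2 / t1))
```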

I can't locate further regressions beyond #3482. The jitteriness in vbench measurements doesn't help either, and the erratic behavior goes beyond just an outlier here and there. At least some of the perf differences are lower than suggested above, due to vbench noise.

I don't plan to chase this any more. @jreback, close?

@jreback (Contributor) commented Dec 11, 2013

Yep... unless someone has a specific perf regression.

@dragoljub

Here is some example code. I was wondering how the df.apply() slowdown would impact computing 20k chi-squared tests? Currently, using the apply method, it takes 1.58 seconds on my machine with pandas 0.12.0.

If it's a constant-factor slowdown then fine, but if it grows with more rows then it may become problematic.

import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency


chi2 = lambda x: chi2_contingency(x.reshape(2,2), correction=True)[1]

df_test = pd.DataFrame(np.random.randint(20000, size=(10000,4)), columns=['A','B', 'C', 'D'])

%timeit df_test.apply(chi2, raw=True, axis=1)
1 loops, best of 3: 1.58 s per loop

pd.__version__
Out[5]: '0.12.0'

np.__version__
Out[6]: '1.7.1'

@ghost (Author) commented Dec 11, 2013

Linear means it grows with more rows, but I understand what you mean.

Running that example in 0.12 vs. 0.13rc shows perf is linear in row count for both, as expected, and no real change in total runtime.

Closing now. If you come across specific regressions you're concerned about, please open an issue.

@ghost ghost closed this as completed Dec 11, 2013