Performance issue while groupby.shift if fill_value specified #26615

Closed
BeforeFlight opened this issue Jun 2, 2019 · 6 comments · Fixed by #41858
Labels: Groupby, Performance (Memory or execution speed performance)
Milestone: Contributions Welcome

Comments

@BeforeFlight
Contributor

BeforeFlight commented Jun 2, 2019

Test code:

    import numpy as np
    import pandas as pd

    SIZE_MULT = 5
    data = np.random.randint(0, 255, size=10**SIZE_MULT, dtype='uint8')
    index = pd.MultiIndex.from_product(
                [list(range(10**(SIZE_MULT-1))), list('ABCDEFGHIJ')],
                names=['d', 'l'])
    test = pd.DataFrame(data, index, columns=['data'])
    test.head()
    test['data'].dtype
             data
    d  l
    0  A      137
       B      156
       C       48
       D      186
       E      170

    dtype('uint8')

Now suppose we want to group by level 0 of the index and shift each group (by a step of 2, for example).

%%time
shifted = test.groupby(axis=0, level=[0]).shift(2)
print(shifted['data'].dtype)
float64
CPU times: user 9.43 ms, sys: 56 µs, total: 9.49 ms
Wall time: 8.29 ms

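The dtype changes here because the rows shifted in at the start of each group become NaN, and uint8 cannot represent NaN, so pandas upcasts the whole column to float64. A minimal illustration:

    import pandas as pd

    # NaN cannot be stored in an unsigned 8-bit integer column,
    # so a plain shift upcasts the result to float64.
    s = pd.Series([1, 2, 3], dtype='uint8')
    print(s.shift(1).dtype)  # float64
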
Now to the problem: if we want to preserve our 'uint8' dtype, we have to avoid the missing values, for example by setting the fill value to 0. But then the execution time becomes huge:

%%time
shifted = test.groupby(axis=0, level=[0]).shift(2, fill_value = 0)
shifted.head()
print(shifted['data'].dtype)
uint8
CPU times: user 5.9 s, sys: 38.4 ms, total: 5.94 s
Wall time: 5.89 s

The expected execution time should be comparable with the following: take the first shifted DataFrame (the one without fill_value) and add a few lines of code to achieve the same result:

%%time
shifted = test.groupby(axis=0, level=[0]).shift(2)
shifted.fillna(0, inplace=True)
shifted = shifted.astype(np.uint8)
print(shifted['data'].dtype)

Output:

uint8
CPU times: user 9.64 ms, sys: 3.68 ms, total: 13.3 ms
Wall time: 11.3 ms

This adds only a few milliseconds, not 5 seconds.

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-16-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd
Member

WillAyd commented Jun 2, 2019

Thanks for the report - the alternate solution you've identified is spot on. If you'd like to take a look at improving performance with the fill_value keyword we would certainly take PRs!

@WillAyd added the Groupby and Performance (Memory or execution speed performance) labels Jun 2, 2019
@WillAyd added this to the Contributions Welcome milestone Jun 2, 2019
@TomAugspurger
Contributor

@BeforeFlight are you interested in working on this?

@BeforeFlight
Contributor Author

BeforeFlight commented Jun 3, 2019

@TomAugspurger honestly, I'm quite a novice on GitHub (one might say in pandas as well), so I have almost never worked with the branches / pull-request workflow. It would take me quite some time to read the docs, set up and test an environment, fully understand TDD, and so on. Besides, I'm currently running into another pandas performance issue, with HDFStore, which slows my work down a great deal. But I'm not really sure it's a bug (someone could skim my 'investigations' to judge whether it's worth reporting at all), so I haven't opened another issue.

So I think the community just doesn't have that much time to wait for my (for now hypothetical) pull requests :). But if this issue somehow still isn't solved by then, it will become my first focus for sure.

@BeforeFlight
Contributor Author

BeforeFlight commented Jun 3, 2019

I can offer some considerations even now, which may be helpful.

I think the reason .shift() is so slow here is, as proposed by @Allolz, that .apply is used instead of ._get_cythonized_result when fill_value is specified (code).

As for the implementation, I don't think it should be done in several steps like at the end of my first comment (remember the dtype -> shift without fill_value -> fillna -> restore the dtype); I proposed that only to compare execution times. Rather, fill_value should be passed through as a parameter to whatever low-level function performs the shift.

@WillAyd
Member

WillAyd commented Jun 3, 2019

@BeforeFlight thanks for the initial insights. One possibility to make this significantly faster is to still use a vectorized solution even when fill_value is provided. This could be done either by calling fill on the result of the shift call or, alternatively, by providing a post_processing function to _get_cythonized_result which does the filling and maintains the dtype if possible.
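A minimal sketch of that vectorized idea at the user level (the helper name fast_group_shift is hypothetical, and it assumes the fill value fits the original dtype):

    import numpy as np
    import pandas as pd

    def fast_group_shift(df, level, periods, fill_value, dtype):
        # Take the fast cythonized shift path (no fill_value), then fill the
        # introduced NaNs and cast back in one vectorized pass, instead of
        # falling into the slow per-group apply path.
        shifted = df.groupby(level=level).shift(periods)
        return shifted.fillna(fill_value).astype(dtype)

    # e.g. fast_group_shift(test, 0, 2, 0, np.uint8) keeps the uint8 dtype.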

@chadaeschliman

Looks like _get_cythonized_result is using algorithms.take_nd under the hood to do the shift after computing the index. algorithms.take_nd takes a fill_value so this could be passed through.
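To make the take-based idea concrete, here is a rough NumPy-level sketch of a shift expressed as a take with a fill value, so no NaN is introduced and the original dtype survives (the function name and indexer convention are illustrative, not the actual pandas internals):

    import numpy as np

    def take_with_fill(values, indexer, fill_value):
        # Gather values by position; positions marked -1 (shifted in from
        # outside the group) receive fill_value directly, so the dtype
        # never has to be upcast to hold NaN.
        out = np.empty(len(indexer), dtype=values.dtype)
        mask = indexer == -1
        out[~mask] = values[indexer[~mask]]
        out[mask] = fill_value
        return out

    # A shift of 2 within a single group of length 5:
    vals = np.array([137, 156, 48, 186, 170], dtype='uint8')
    idx = np.array([-1, -1, 0, 1, 2])
    print(take_with_fill(vals, idx, 0))  # [  0   0 137 156  48], dtype uint8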
