Performance issue while groupby.shift if fill_value specified #26615

Closed
BeforeFlight opened this issue Jun 2, 2019 · 6 comments · Fixed by #41858
Labels: Groupby, Performance (Memory or execution speed performance)
Milestone: Contributions Welcome

Comments

@BeforeFlight
Contributor

BeforeFlight commented Jun 2, 2019

Test code:

    import numpy as np
    import pandas as pd

    SIZE_MULT = 5
    data = np.random.randint(0, 255, size=10**SIZE_MULT, dtype='uint8')
    index = pd.MultiIndex.from_product(
                [list(range(10**(SIZE_MULT-1))), list('ABCDEFGHIJ')],
                names=['d', 'l'])
    test = pd.DataFrame(data, index, columns=['data'])
    test.head()
    test['data'].dtype
             data
    d  l
    0  A      137
       B      156
       C       48
       D      186
       E      170

    dtype('uint8')

Now suppose we want to group by level 0 of the index and shift each group (by a step of 2, for example).

%%time
shifted = test.groupby(axis=0, level=[0]).shift(2)
print(shifted['data'].dtype)
float64
CPU times: user 9.43 ms, sys: 56 µs, total: 9.49 ms
Wall time: 8.29 ms

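The dtype changes here because the rows shifted in at the start of each group become NaN, and uint8 cannot represent NaN, so pandas upcasts the whole column to float64. A minimal illustration:

    import pandas as pd

    # NaN cannot be stored in an unsigned 8-bit integer column,
    # so a plain shift upcasts the result to float64.
    s = pd.Series([1, 2, 3], dtype='uint8')
    print(s.shift(1).dtype)  # float64
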
Now to the problem: if we want to preserve our 'uint8' dtype, we have to avoid the missing values, for example by setting the fill value to 0. But then the execution time becomes huge:

%%time
shifted = test.groupby(axis=0, level=[0]).shift(2, fill_value = 0)
shifted.head()
print(shifted['data'].dtype)
uint8
CPU times: user 5.9 s, sys: 38.4 ms, total: 5.94 s
Wall time: 5.89 s

The expected execution time should be comparable with the following: take the first shifted DataFrame (the one without fill_value) and add a few lines of code to achieve the same result:

%%time
shifted = test.groupby(axis=0, level=[0]).shift(2)
shifted.fillna(0, inplace=True)
shifted = shifted.astype(np.uint8)
print(shifted['data'].dtype)

Output:

uint8
CPU times: user 9.64 ms, sys: 3.68 ms, total: 13.3 ms
Wall time: 11.3 ms

This adds only a few milliseconds, not 5 seconds.

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-16-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd
Member

WillAyd commented Jun 2, 2019

Thanks for the report - the alternate solution you've identified is spot on. If you'd like to take a look at improving performance with the fill_value keyword we would certainly take PRs!

@WillAyd added the Groupby and Performance (Memory or execution speed performance) labels Jun 2, 2019
@WillAyd added this to the Contributions Welcome milestone Jun 2, 2019
@TomAugspurger
Contributor

@BeforeFlight are you interested in working on this?

@BeforeFlight
Contributor Author

BeforeFlight commented Jun 3, 2019

@TomAugspurger honestly, I'm quite a novice on GitHub (one might say in pandas as well), so I have almost never worked with the branches / pull-request workflow. It would take me quite some time to read the docs, set up and test an environment, fully understand TDD, and so on. Besides, I'm currently running into another pandas performance issue, with HDFStore, which slows my work down a great deal. But I'm not really sure it's a bug (someone could skim my 'investigations' to judge whether it's worth reporting at all), so I haven't opened another issue.

So I think the community just doesn't have that much time to wait for my (for now hypothetical) pull requests :). But if this issue somehow still isn't solved by then, it will become my first focus for sure.

@BeforeFlight
Contributor Author

BeforeFlight commented Jun 3, 2019

I can offer some considerations even now, which may be helpful.

I think the reason .shift() is so slow here is, as proposed by @Allolz, that .apply is used instead of ._get_cythonized_result when fill_value is specified (code).

As for the implementation, I don't think it should be done in several steps like at the end of my first comment (remember the dtype -> shift without fill_value -> fillna -> restore the dtype); I proposed that only to compare execution times. Rather, fill_value should be passed through as a parameter to whatever low-level function performs the shift.

@WillAyd
Member

WillAyd commented Jun 3, 2019

@BeforeFlight thanks for the initial insights. One possibility to make this significantly faster is to still use a vectorized solution even when fill_value is provided. This could be done either by calling fill on the result of the shift call or, alternatively, by providing a post_processing function to _get_cythonized_result which does the filling and maintains the dtype if possible.
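A minimal sketch of that vectorized idea at the user level (the helper name fast_group_shift is hypothetical, and it assumes the fill value fits the original dtype):

    import numpy as np
    import pandas as pd

    def fast_group_shift(df, level, periods, fill_value, dtype):
        # Take the fast cythonized shift path (no fill_value), then fill the
        # introduced NaNs and cast back in one vectorized pass, instead of
        # falling into the slow per-group apply path.
        shifted = df.groupby(level=level).shift(periods)
        return shifted.fillna(fill_value).astype(dtype)

    # e.g. fast_group_shift(test, 0, 2, 0, np.uint8) keeps the uint8 dtype.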

@chadaeschliman

Looks like _get_cythonized_result is using algorithms.take_nd under the hood to do the shift after computing the index. algorithms.take_nd takes a fill_value so this could be passed through.
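To make the take-based idea concrete, here is a rough NumPy-level sketch of a shift expressed as a take with a fill value, so no NaN is introduced and the original dtype survives (the function name and indexer convention are illustrative, not the actual pandas internals):

    import numpy as np

    def take_with_fill(values, indexer, fill_value):
        # Gather values by position; positions marked -1 (shifted in from
        # outside the group) receive fill_value directly, so the dtype
        # never has to be upcast to hold NaN.
        out = np.empty(len(indexer), dtype=values.dtype)
        mask = indexer == -1
        out[~mask] = values[indexer[~mask]]
        out[mask] = fill_value
        return out

    # A shift of 2 within a single group of length 5:
    vals = np.array([137, 156, 48, 186, 170], dtype='uint8')
    idx = np.array([-1, -1, 0, 1, 2])
    print(take_with_fill(vals, idx, 0))  # [  0   0 137 156  48], dtype uint8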
