Pandas MultiIndex causes out of memory error #36074

Closed
mahsa-ebrahimian opened this issue Sep 2, 2020 · 4 comments

Comments

@mahsa-ebrahimian
mahsa-ebrahimian commented Sep 2, 2020

I am using a MultiIndex in my code, and it causes an out-of-memory error.

import pandas as pd
import numpy as np
import io
import requests

url = "https://raw.githubusercontent.com/mahsa-ebrahimian/netflix_project/master/netflix_sample_complete.csv"
movie_db = pd.read_csv(url, error_bad_lines=False)
del movie_db['Unnamed: 0']

iix_n = pd.MultiIndex.from_product([np.unique(movie_db.user_id), np.unique(movie_db.date)])
arr = (
    movie_db.pivot_table('rating', ['user_id', 'date'], 'item_id', aggfunc='sum')
    .reindex(iix_n, copy=False)
    .to_numpy()
    .reshape(movie_db.user_id.nunique(), movie_db.date.nunique(), -1)
)

Any performance tips or alternative ways to reshape my data into the desired 3D array would be appreciated.

@jbrockmendel
Member

Can you give a copy/paste-able example (i.e. one that doesn't require downloading a zip file)? See https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@mahsa-ebrahimian
Author

Sure, I just updated it.

@attack68
Contributor

attack68 commented Sep 3, 2020

From your data:

len(df['date'].unique())  # 1951
len(df['user_id'].unique()) # 2000

So the MultiIndex built by from_product has 1951 × 2000 = 3,902,000 rows, almost 4 million, and reindexing the pivot table against it allocates that many rows.
Your data has only 404,000 rows, so the MultiIndex you are creating is far larger than your data needs.
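
To make the size argument concrete, here is a minimal sketch that compares the full Cartesian product with the (user_id, date) pairs that actually occur. The toy table and the variable names are hypothetical stand-ins for the data in this issue. Only the 1951 and 2000 counts come from the thread.

```python
import pandas as pd

# Hypothetical toy data standing in for the ratings table in the issue.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "date": ["2005-01-01", "2005-01-02", "2005-01-01", "2005-01-03", "2005-01-03"],
    "item_id": [10, 11, 10, 12, 13],
    "rating": [5, 3, 4, 2, 1],
})

# Full Cartesian product of users and dates, as MultiIndex.from_product builds it.
full = pd.MultiIndex.from_product([df["user_id"].unique(), df["date"].unique()])

# (user_id, date) pairs that actually occur in the data.
observed = df[["user_id", "date"]].drop_duplicates()

product_rows = len(full)        # n_users * n_dates
observed_rows = len(observed)   # pairs that carry data
fill_ratio = observed_rows / product_rows

# The same arithmetic with the counts quoted above.
issue_rows = 1951 * 2000        # 3,902,000

print(product_rows, observed_rows, round(fill_ratio, 2), issue_rows)
```

Every row of the product that has no matching pair in the data is filled with NaN by reindex, so a low fill ratio means most of the allocated memory holds no information.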

Alternatively:

df.set_index(['user_id', 'date'], inplace=True)
df.index
"""
MultiIndex([(1567167, '2005-09-11'),
            (1714116, '2005-08-09'),
            ...
            (1070701, '2004-11-04'),
            (1304720, '2005-01-31')],
           names=['user_id', 'date'], length=69710)
"""

The MultiIndex derived from your data contains only 69,710 unique (user_id, date) tuples.

I think what you are trying to achieve is doable, but I would ask on Stack Overflow; GitHub issues are not the ideal place for this.
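
As a rough illustration of one way the 3D reshaping could be done without the intermediate reindex, the sketch below factorizes each key and writes the summed ratings straight into a preallocated array. Nobody in this thread proposed it: the function name `ratings_to_3d`, the `float32` default and the `max_bytes` guard are all assumptions. Whether the dense result fits in memory still depends on the number of distinct items, which the thread does not state, so the function checks the estimated size before allocating.

```python
import math

import numpy as np
import pandas as pd


def ratings_to_3d(df, dtype=np.float32, max_bytes=2 * 1024**3):
    """Hypothetical helper: build a (users, dates, items) array of summed ratings."""
    # Sum duplicate (user, date, item) entries, mirroring aggfunc='sum'.
    summed = (
        df.groupby(["user_id", "date", "item_id"], sort=False)["rating"]
        .sum()
        .reset_index()
    )

    # Map each key to a dense integer position along its axis.
    u_codes, users = pd.factorize(summed["user_id"], sort=True)
    d_codes, dates = pd.factorize(summed["date"], sort=True)
    i_codes, items = pd.factorize(summed["item_id"], sort=True)

    # Refuse to allocate if the dense array would exceed the budget.
    shape = (len(users), len(dates), len(items))
    needed = math.prod(shape) * np.dtype(dtype).itemsize
    if needed > max_bytes:
        raise MemoryError(f"dense array needs {needed} bytes, limit is {max_bytes}")

    # Missing combinations stay NaN, as they do after reindex.
    arr = np.full(shape, np.nan, dtype=dtype)
    arr[u_codes, d_codes, i_codes] = summed["rating"].to_numpy()
    return arr, users, dates, items


# Hypothetical toy data; the last two rows share (user, date, item) and are summed.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "date": ["2005-01-01", "2005-01-02", "2005-01-01", "2005-01-03", "2005-01-03"],
    "item_id": [10, 11, 10, 12, 12],
    "rating": [5, 3, 4, 2, 1],
})
arr, users, dates, items = ratings_to_3d(toy)
print(arr.shape)
```

If the size check fails, a sparse representation or processing one user or one date at a time would be the next things to try.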

@jbrockmendel
Member

I think what you are trying to achieve is doable, but I would ask on Stack Overflow; GitHub issues are not the ideal place for this.

This is the right answer. Closing.
