
Significant performance degradation in 0.9.1 for SparseDataFrame methods like to_dense() and save() and for arithmetic operations #2273


Closed
bluefir opened this issue Nov 16, 2012 · 4 comments

bluefir commented Nov 16, 2012

This is what I have in version 0.9.0:

import pandas as pd
pd.__version__

'0.9.0'

barra_industry_exposures

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Data columns:
MINING_METALS 253738 non-null values
GOLD 253738 non-null values
FORESTRY_PAPER 253738 non-null values
CHEMICAL 253738 non-null values
ENERGY_RESERVES 253738 non-null values
OIL_REFINING 253738 non-null values
OIL_SERVICES 253738 non-null values
FOOD_BEVERAGES 253738 non-null values
ALCOHOL 253738 non-null values
TOBACCO 253738 non-null values
HOME_PRODUCTS 253738 non-null values
GROCERY_STORES 253738 non-null values
CONSUMER_DURABLES 253738 non-null values
MOTOR_VEHICLES 253738 non-null values
APPAREL_TEXTILES 253738 non-null values
CLOTHING_STORES 253738 non-null values
SPECIALTY_RETAIL 253738 non-null values
DEPARTMENT_STORES 253738 non-null values
CONSTRUCTION 253738 non-null values
PUBLISHING 253738 non-null values
MEDIA 253738 non-null values
HOTELS 253738 non-null values
RESTAURANTS 253738 non-null values
ENTERTAINMENT 253738 non-null values
LEISURE 253738 non-null values
ENVIRONMENTAL_SERVICES 253738 non-null values
HEAVY_ELECTRICAL_EQUIPMENT 253738 non-null values
HEAVY_MACHINERY 253738 non-null values
INDUSTRIAL_PARTS 253738 non-null values
ELECTRICAL_UTILITY 253738 non-null values
GAS_WATER_UTILITY 253738 non-null values
RAILROADS 253738 non-null values
AIRLINES 253738 non-null values
FREIGHT 253738 non-null values
MEDICAL_SERVICES 253738 non-null values
MEDICAL_PRODUCTS 253738 non-null values
DRUGS 253738 non-null values
ELECTRONIC_EQUIPMENT 253738 non-null values
SEMICONDUCTORS 253738 non-null values
COMPUTER_HARDWARE 253738 non-null values
COMPUTER_SOFTWARE 253738 non-null values
DEFENCE_AEROSPACE 253738 non-null values
TELEPHONE 253738 non-null values
WIRELESS 253738 non-null values
INFORMATION_SERVICES 253738 non-null values
INDUSTRIAL_SERVICES 253738 non-null values
LIFE_HEALTH_INSURANCE 253738 non-null values
PROPERTY_CASUALTY_INSURANCE 253738 non-null values
BANKS 253738 non-null values
THRIFTS 253738 non-null values
ASSET_MANAGEMENT 253738 non-null values
FINANCIAL_SERVICES 253738 non-null values
INTERNET 253738 non-null values
REITS 253738 non-null values
BIOTECH 253738 non-null values
dtypes: int64(55)

sparse = barra_industry_exposures.to_sparse(fill_value=0)
sparse

<class 'pandas.sparse.frame.SparseDataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Columns: 55 entries, AIRLINES to WIRELESS
dtypes: float64(55)

%timeit sparse / 100.

100 loops, best of 3: 6.64 ms per loop

%timeit sparse.to_dense()

10 loops, best of 3: 127 ms per loop

%timeit sparse.save('test.pkl')

1 loops, best of 3: 16.9 ms per loop

Now this is what I get in 0.9.1:

import pandas as pd
pd.__version__

'0.9.1'

%timeit sparse / 100.

1 loops, best of 3: 92.2 s per loop

%timeit sparse.to_dense()

1 loops, best of 3: 99.8 s per loop

%timeit sparse.save('test.pkl')

1 loops, best of 3: 100 s per loop

So, in the new version SparseDataFrame methods that used to run in 7–130 ms now take more than 90 s each. Ouch! What happened?
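(For readers on later pandas: `SparseDataFrame` and `DataFrame.to_sparse` were removed in pandas 1.0, so the transcript above no longer runs verbatim. A minimal sketch of the same conversion and round-trip, assuming the modern sparse-dtype API and a small synthetic stand-in for the 0/1 exposure matrix:)

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mostly-zero industry-exposure matrix above.
dense = pd.DataFrame(
    np.random.binomial(1, 0.02, size=(1000, 10)),
    columns=[f"IND_{i}" for i in range(10)],
)

# pandas 0.9 spelled this dense.to_sparse(fill_value=0); modern pandas
# uses sparse column dtypes instead of a separate SparseDataFrame class.
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))

# Round-trip back to dense, the modern analogue of sparse.to_dense().
roundtrip = sparse.sparse.to_dense()
assert roundtrip.equals(dense)

# Arithmetic on the sparse frame, as in the %timeit runs above.
scaled = sparse / 100.0
print(sparse.sparse.density)  # fraction of stored (non-fill) values
```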

changhiskhan (Contributor) commented

We need more performance benchmarks in the vbench suite. Thanks for the feedback. We'll investigate.
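(vbench benchmarks are defined as setup/statement pairs; plain stdlib `timeit` makes the same measurement without the suite machinery. A minimal sketch of such a regression guard, with illustrative sizes and threshold:)

```python
import timeit

# Setup string builds the fixture once; the statement is what gets timed.
setup = (
    "import numpy as np, pandas as pd;"
    "df = pd.DataFrame(np.random.randint(0, 2, size=(5000, 20)))"
)

n = 20
per_call = timeit.Timer("df / 100.0", setup=setup).timeit(number=n) / n
print(f"frame arithmetic: {per_call * 1e3:.2f} ms per loop")

# A suite would fail the run if per_call blows past a recorded baseline,
# catching a 6 ms -> 90 s change like the one reported here.
```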


ghost commented Nov 16, 2012

This looks like 4a5b75b, though I'm not sure why `take` is so expensive. The pending #2253 (3688e53) fixes the problem for me.

Testcase:

import pandas as pd
from random import randint

num = 250000
l1 = [randint(0, 1000) for x in range(num)]
l2 = [randint(0, 20000) for x in range(num)]
l3 = [randint(0, 20000) for x in range(num)]
l4 = [randint(0, 20000) for x in range(num)]
a = pd.DataFrame(dict(zip([0, 1, 2, 3], [l1, l2, l3, l4]))).set_index([0, 1])
b = a.to_sparse()
%timeit b / 100
%timeit b.to_dense()
%timeit b.save('test.pkl')

Edit: but perhaps there's another issue at play. I can't reproduce anything like 90s runtime on this data
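(When a regression like this can't be reproduced, profiling the slow call shows which internal function dominates. A minimal sketch with stdlib `cProfile`, run here against a dense frame for portability; the operation and sizes are illustrative:)

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(10000, 20)))

profiler = cProfile.Profile()
profiler.enable()
_ = df / 100.0  # the operation under suspicion
profiler.disable()

# Sort by cumulative time to see which internal call dominates --
# e.g. whether take (or, as it turned out, iteritems) is the hotspot.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```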

wesm (Member) commented Nov 17, 2012

Doh, this will teach me to review PRs more carefully; this is theoretically what vbench is for. I will fix

wesm (Member) commented Nov 17, 2012

Ugh, iteritems for all DataFrames has borked performance. Guess we're going to see 0.9.2 sooner rather than later
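(Context for why a slow `iteritems` hurts so broadly: it yields one `(name, Series)` pair per column, and many DataFrame methods are implemented as per-column loops over it, so each inherits its cost. A minimal sketch of the pattern, using `items()`, the modern pandas name for `iteritems`:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 5), columns=list("abcde"))

# Per-column iteration: any method built as a loop like this pays the
# cost of constructing one Series per column, so if that construction
# is slow, every such method is slow.
rebuilt = pd.DataFrame({name: series for name, series in df.items()})
assert rebuilt.equals(df)
```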

@wesm wesm closed this as completed in bdbca8e Nov 17, 2012
wesm pushed a commit that referenced this issue Nov 23, 2012
catch the regression noted in #2273 next time.