
Significant performance degradation in 0.9.1 for SparseDataFrame methods like to_dense() and save() and for arithmetic operations #2273


Closed
bluefir opened this issue Nov 16, 2012 · 4 comments

bluefir commented Nov 16, 2012

This is what I have in version 0.9.0:

import pandas as pd
pd.__version__

'0.9.0'

barra_industry_exposures

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Data columns:
MINING_METALS 253738 non-null values
GOLD 253738 non-null values
FORESTRY_PAPER 253738 non-null values
CHEMICAL 253738 non-null values
ENERGY_RESERVES 253738 non-null values
OIL_REFINING 253738 non-null values
OIL_SERVICES 253738 non-null values
FOOD_BEVERAGES 253738 non-null values
ALCOHOL 253738 non-null values
TOBACCO 253738 non-null values
HOME_PRODUCTS 253738 non-null values
GROCERY_STORES 253738 non-null values
CONSUMER_DURABLES 253738 non-null values
MOTOR_VEHICLES 253738 non-null values
APPAREL_TEXTILES 253738 non-null values
CLOTHING_STORES 253738 non-null values
SPECIALTY_RETAIL 253738 non-null values
DEPARTMENT_STORES 253738 non-null values
CONSTRUCTION 253738 non-null values
PUBLISHING 253738 non-null values
MEDIA 253738 non-null values
HOTELS 253738 non-null values
RESTAURANTS 253738 non-null values
ENTERTAINMENT 253738 non-null values
LEISURE 253738 non-null values
ENVIRONMENTAL_SERVICES 253738 non-null values
HEAVY_ELECTRICAL_EQUIPMENT 253738 non-null values
HEAVY_MACHINERY 253738 non-null values
INDUSTRIAL_PARTS 253738 non-null values
ELECTRICAL_UTILITY 253738 non-null values
GAS_WATER_UTILITY 253738 non-null values
RAILROADS 253738 non-null values
AIRLINES 253738 non-null values
FREIGHT 253738 non-null values
MEDICAL_SERVICES 253738 non-null values
MEDICAL_PRODUCTS 253738 non-null values
DRUGS 253738 non-null values
ELECTRONIC_EQUIPMENT 253738 non-null values
SEMICONDUCTORS 253738 non-null values
COMPUTER_HARDWARE 253738 non-null values
COMPUTER_SOFTWARE 253738 non-null values
DEFENCE_AEROSPACE 253738 non-null values
TELEPHONE 253738 non-null values
WIRELESS 253738 non-null values
INFORMATION_SERVICES 253738 non-null values
INDUSTRIAL_SERVICES 253738 non-null values
LIFE_HEALTH_INSURANCE 253738 non-null values
PROPERTY_CASUALTY_INSURANCE 253738 non-null values
BANKS 253738 non-null values
THRIFTS 253738 non-null values
ASSET_MANAGEMENT 253738 non-null values
FINANCIAL_SERVICES 253738 non-null values
INTERNET 253738 non-null values
REITS 253738 non-null values
BIOTECH 253738 non-null values
dtypes: int64(55)

sparse = barra_industry_exposures.to_sparse(fill_value=0)
sparse

<class 'pandas.sparse.frame.SparseDataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Columns: 55 entries, AIRLINES to WIRELESS
dtypes: float64(55)

%timeit sparse / 100.

100 loops, best of 3: 6.64 ms per loop

%timeit sparse.to_dense()

10 loops, best of 3: 127 ms per loop

%timeit sparse.save('test.pkl')

1 loops, best of 3: 16.9 ms per loop

Now this is what I get in 0.9.1:

import pandas as pd
pd.__version__

'0.9.1'

%timeit sparse / 100.

1 loops, best of 3: 92.2 s per loop

%timeit sparse.to_dense()

1 loops, best of 3: 99.8 s per loop

%timeit sparse.save('test.pkl')

1 loops, best of 3: 100 s per loop

So, in the new version SparseDataFrame methods that used to run in 7–130 ms now take more than 90 s each. Ouch! What happened?
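(For readers on later pandas: `SparseDataFrame` and `DataFrame.to_sparse` were removed in pandas 1.0, so the transcript above no longer runs verbatim. A minimal sketch of the same conversion and round-trip, assuming the modern sparse-dtype API and a small synthetic stand-in for the 0/1 exposure matrix:)

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mostly-zero industry-exposure matrix above.
dense = pd.DataFrame(
    np.random.binomial(1, 0.02, size=(1000, 10)),
    columns=[f"IND_{i}" for i in range(10)],
)

# pandas 0.9 spelled this dense.to_sparse(fill_value=0); modern pandas
# uses sparse column dtypes instead of a separate SparseDataFrame class.
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))

# Round-trip back to dense, the modern analogue of sparse.to_dense().
roundtrip = sparse.sparse.to_dense()
assert roundtrip.equals(dense)

# Arithmetic on the sparse frame, as in the %timeit runs above.
scaled = sparse / 100.0
print(sparse.sparse.density)  # fraction of stored (non-fill) values
```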

changhiskhan (Contributor) commented

We need more performance benchmarks in the vbench suite. Thanks for the feedback. We'll investigate.
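(vbench benchmarks are defined as setup/statement pairs; plain stdlib `timeit` makes the same measurement without the suite machinery. A minimal sketch of such a regression guard, with illustrative sizes and threshold:)

```python
import timeit

# Setup string builds the fixture once; the statement is what gets timed.
setup = (
    "import numpy as np, pandas as pd;"
    "df = pd.DataFrame(np.random.randint(0, 2, size=(5000, 20)))"
)

n = 20
per_call = timeit.Timer("df / 100.0", setup=setup).timeit(number=n) / n
print(f"frame arithmetic: {per_call * 1e3:.2f} ms per loop")

# A suite would fail the run if per_call blows past a recorded baseline,
# catching a 6 ms -> 90 s change like the one reported here.
```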


ghost commented Nov 16, 2012

This looks like 4a5b75b, though I'm not sure why `take` is so expensive. The pending #2253 (3688e53) fixes the problem for me.

Testcase:

import pandas as pd
from random import randint

num = 250000
l1 = [randint(0, 1000) for x in range(num)]
l2 = [randint(0, 20000) for x in range(num)]
l3 = [randint(0, 20000) for x in range(num)]
l4 = [randint(0, 20000) for x in range(num)]
a = pd.DataFrame(dict(zip([0, 1, 2, 3], [l1, l2, l3, l4]))).set_index([0, 1])
b = a.to_sparse()
%timeit b / 100
%timeit b.to_dense()
%timeit b.save('test.pkl')

Edit: but perhaps there's another issue at play. I can't reproduce anything like 90s runtime on this data
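(When a regression like this can't be reproduced, profiling the slow call shows which internal function dominates. A minimal sketch with stdlib `cProfile`, run here against a dense frame for portability; the operation and sizes are illustrative:)

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(10000, 20)))

profiler = cProfile.Profile()
profiler.enable()
_ = df / 100.0  # the operation under suspicion
profiler.disable()

# Sort by cumulative time to see which internal call dominates --
# e.g. whether take (or, as it turned out, iteritems) is the hotspot.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```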

wesm (Member) commented Nov 17, 2012

Doh, this will teach me to review PRs more carefully; this is theoretically what vbench is for. I will fix

wesm (Member) commented Nov 17, 2012

Ugh, iteritems for all DataFrames has borked performance. Guess we're going to see 0.9.2 sooner rather than later
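(Context for why a slow `iteritems` hurts so broadly: it yields one `(name, Series)` pair per column, and many DataFrame methods are implemented as per-column loops over it, so each inherits its cost. A minimal sketch of the pattern, using `items()`, the modern pandas name for `iteritems`:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 5), columns=list("abcde"))

# Per-column iteration: any method built as a loop like this pays the
# cost of constructing one Series per column, so if that construction
# is slow, every such method is slow.
rebuilt = pd.DataFrame({name: series for name, series in df.items()})
assert rebuilt.equals(df)
```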

@wesm wesm closed this as completed in bdbca8e Nov 17, 2012
wesm pushed a commit that referenced this issue Nov 23, 2012
catch the regression noted in #2273 next time.