transform in groupby throws TypeError when run with python -O option #2057

Closed
bluefir opened this issue Oct 11, 2012 · 4 comments

@bluefir

bluefir commented Oct 11, 2012

I have a script that works fine when run without any options but generates the following traceback when run with 'python -O':

File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 1745, in transform
return self._transform_item_by_item(obj, wrapper)
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 1777, in _transform_item_by_item
raise TypeError('Transform function invalid for data types')
TypeError: Transform function invalid for data types

I have Python 2.7.3 and pandas 0.9.0:

Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas
>>> pandas.__version__
'0.9.0'

@wesm
Member

wesm commented Oct 11, 2012

That's not cool. Do you have a self-contained reproduction you could post here?

@bluefir
Author

bluefir commented Oct 12, 2012

Well, sort of. I tried to reproduce and discovered another strange behavior. Here is the code for the strange behavior:

import numpy as np
from pandas import DataFrame, MultiIndex

def quantiles(df, q=0.5):
    print('Entered quantiles() with shape ' + str(df.shape))

    print('Calculating quantiles ' + str(q) + ' for each column')
    qtls = df.quantile(q)

    print('Building output data frame')
    df_zeros = DataFrame(np.zeros(df.shape), index=df.index, columns=df.columns)
    df_out = df_zeros.add(qtls, axis='columns')

    print('Output shape ' + str(df_out.shape))
    return df_out


midx = MultiIndex(levels=[[1, 2], ['a', 'b', 'c', 'd', 'e']],
                  labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
                          [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
                  names=['date', 'id'])
df = DataFrame(np.random.randn(10, 2), index=midx, columns=['col1', 'col2'])
print('\nInput data frame: ')
print(df.to_string())
print('\nCalculating medians for each column:')
qtls = df.groupby(level='date').transform(quantiles)
print('\nData frame with medians:')
print('Shape ' + str(qtls.shape))
print(qtls.to_string())

'python TestGroupbyTransformO.py' produces the expected outcome:

---begin console output-------------------------------------

Input data frame:
             col1      col2
date id
1    a   0.025334  0.468002
     b   1.307855  1.094578
     c   0.454256  0.711495
     d  -1.450975 -0.858718
     e   0.851123  0.878828
2    a   1.726560 -0.936486
     b   0.911542 -0.177365
     c  -1.078583  1.797866
     d   0.595278  1.683337
     e  -1.718456 -1.041106

Calculating medians for each column:
Entered quantiles() with shape (5L,)
Calculating quantiles 0.5 for each column
Building output data frame
Entered quantiles() with shape (5L,)
Calculating quantiles 0.5 for each column
Building output data frame
Entered quantiles() with shape (5, 2)
Calculating quantiles 0.5 for each column
Building output data frame
Output shape (5, 2)
Entered quantiles() with shape (5L,)
Calculating quantiles 0.5 for each column
Building output data frame
Entered quantiles() with shape (5L,)
Calculating quantiles 0.5 for each column
Building output data frame
Entered quantiles() with shape (5, 2)
Calculating quantiles 0.5 for each column
Building output data frame
Output shape (5, 2)

Data frame with medians:
Shape (10, 2)
             col1      col2
date id
1    a   0.454256  0.711495
     b   0.454256  0.711495
     c   0.454256  0.711495
     d   0.454256  0.711495
     e   0.454256  0.711495
2    a   0.595278 -0.177365
     b   0.595278 -0.177365
     c   0.595278 -0.177365
     d   0.595278 -0.177365
     e   0.595278 -0.177365

---end console output-------------------------------------------

'python -O TestGroupbyTransformO.py' produces this:

---begin console output------------------------------------------

Input data frame:
             col1      col2
date id
1    a   0.824534  0.258803
     b  -0.807477 -0.046351
     c  -0.243443 -0.887152
     d  -1.430488  1.675248
     e  -0.430917  1.466759
2    a   1.101497  0.738619
     b  -2.010792 -0.152976
     c  -1.757038  1.234569
     d  -0.081311 -1.690532
     e   0.696795  0.442808

Calculating medians for each column:
Entered quantiles() with shape (5L,)
Calculating quantiles 0.5 for each column
Building output data frame
Entered quantiles() with shape (5L,)
Calculating quantiles 0.5 for each column
Building output data frame
Entered quantiles() with shape (5, 2)
Calculating quantiles 0.5 for each column
Building output data frame
Output shape (5, 2)
Entered quantiles() with shape (5L,)
Calculating quantiles 0.5 for each column
Building output data frame
Entered quantiles() with shape (5L,)
Calculating quantiles 0.5 for each column
Building output data frame
Entered quantiles() with shape (5, 2)
Calculating quantiles 0.5 for each column
Building output data frame
Output shape (5, 2)

Data frame with medians:
Shape (10, 2)
Traceback (most recent call last):
File "TestGroupbyTransformO.py", line 29, in <module>
print(qtls.to_string())
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1267, in to_string
formatter.to_string(force_unicode=force_unicode)
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 279, in to_string
strcols = self._to_str_columns(force_unicode)
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 214, in _to_str_columns
str_columns = self._get_formatted_column_labels()
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 355, in _get_formatted_column_labels
dtypes = self.frame.dtypes
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1386, in dtypes
return self.apply(lambda x: x.dtype)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 3763, in apply
return self._apply_standard(f, axis)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 3831, in _apply_standard
k = res_index[i]
UnboundLocalError: local variable 'i' referenced before assignment

---end console output------------------------------------------------------------------------------------

This is not what I observe in my more complex problem. It has a bigger frame and a more complicated function, but it does calculate several quantiles per column during the first steps (I tried to isolate those steps and discovered the above behavior). A normal run produces something like this:

---begin console output-----------------------------------
...

Entered with (1996L,)
Entered with (1996L,)
Entered with (1996, 21)
Assertions done
Calculating quantiles
Quantiles calculated
Bounds calculated
Outliers detected
Output shape (1996, 21)
Entered with (1996L,)
Entered with (1996L,)
Entered with (1996, 21)
Assertions done
Calculating quantiles
Quantiles calculated
Bounds calculated
Outliers detected
Output shape (1996, 21)
Entered with (1996L,)
Entered with (1996L,)
Entered with (1996, 21)
Assertions done
Calculating quantiles
Quantiles calculated
Bounds calculated
Outliers detected
Output shape (1996, 21)
Entered with (1996L,)
Entered with (1996L,)
Entered with (1996, 21)
Assertions done
Calculating quantiles
Quantiles calculated
Bounds calculated
Outliers detected
Output shape (1996, 21)
...
---end console output------------------------------

The -O run produces this:

---begin console output-----------------------------
...
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Entered with (1996L,)
Assertions done
Calculating quantiles
Traceback (most recent call last):
File "InefficiencyScores.py", line 390, in <module>
returns_daily_no_outliers = returns_daily.groupby(level=field_date).transform(f_shrink_outliers)
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 1745, in transform
return self._transform_item_by_item(obj, wrapper)
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 1777, in _transform_item_by_item
raise TypeError('Transform function invalid for data types')
TypeError: Transform function invalid for data types

---end console output-----------------------------

As you can see, in my program the -O run never enters with the full shape (1996, 21) and doesn't seem to even get beyond quantile calculations:

print('Calculating quantiles')

# Calculate sample quantiles
q25 = df_out.quantile(q=0.25, axis=axis)
q75 = df_out.quantile(q=0.75, axis=axis)
if symmetric:
    midpoint = (q75 + q25) / 2.
else:
    midpoint = df_out.quantile(q=0.5, axis=axis)

print('Quantiles calculated')

I realize it's convoluted but I hope it helps. The original program is more complex and so far I haven't been able to simply reproduce the behavior I observe. But I did find another puzzling behavior! :-)
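
For readers puzzled by the UnboundLocalError in the simplified run, here is a minimal sketch of one common way that error arises: an exception handler reads a loop variable that was never assigned because the failure happened before the first iteration. This is only an illustration of the general failure mode, not a diagnosis of the pandas code in the traceback. It is written for Python 3, and `broken_items` and `index_of_failure` are hypothetical names invented for the example.

```python
def broken_items():
    """Generator that fails before yielding its first element."""
    raise ValueError("failed before the first item")
    yield  # unreachable, but makes this function a generator


def index_of_failure(items, func):
    """Hypothetical helper: apply func to every item and, if something
    raises, try to report the index of the item being processed."""
    try:
        for i, item in enumerate(items):
            func(item)
    except Exception:
        # Bug: if the exception fired before the loop assigned i even once,
        # i is still unbound here and this line raises UnboundLocalError.
        return i
    return None


try:
    index_of_failure(broken_items(), str)
    outcome = "no error"
except UnboundLocalError as exc:
    outcome = type(exc).__name__

print(outcome)
```

When the failure happens on a later item, the handler works as intended and returns that item's index, which is why this kind of bug can stay hidden until an unusual code path is taken.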

@wesm
Member

wesm commented Nov 19, 2012

wow, this is annoying. Apparently python -O removes assert statements in code
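
A minimal sketch of the behavior described above, written for Python 3 rather than the Python 2.7 used in the report. It runs the same two-line snippet in a fresh interpreter with and without `-O`. `SNIPPET` and `run_snippet` are hypothetical names introduced only for this illustration, and nothing here reconstructs the actual pandas code path.

```python
import subprocess
import sys

# Two statements: an assert that always fails, then a print.
SNIPPET = "assert False, 'guard tripped'\nprint('reached the line after the assert')"


def run_snippet(*flags):
    """Run SNIPPET in a fresh interpreter with the given flags.

    Returns (return code, stripped stdout)."""
    proc = subprocess.run(
        [sys.executable, *flags, "-c", SNIPPET],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


normal = run_snippet()         # asserts active: AssertionError, nothing printed
optimized = run_snippet("-O")  # asserts stripped: the print is reached

print(normal)
print(optimized)
```

Without flags the assert fails and the print is never reached. With `-O` the assert is compiled away and the print runs. Code that relies on an `AssertionError` being raised and caught therefore takes a different path under `-O`; a check that has to survive optimization is usually written as an explicit `if ...: raise ...`.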

@bluefir
Author

bluefir commented Nov 19, 2012

Yep, among other things. But that's the point! Faster code. -OO also removes docstrings. I am not sure how much it helps though. If it's hard to fix, fughetaboutit :-)
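
The difference between the two flags can be checked with a short sketch, again assuming Python 3. It defines a function with a docstring in a fresh interpreter and prints its `__doc__` together with the built-in `__debug__` constant. `SNIPPET` and `doc_and_debug` are hypothetical names used only for this example.

```python
import subprocess
import sys

# Define a function with a docstring, then report its __doc__ and __debug__.
SNIPPET = (
    "def f():\n"
    "    'docstring of f'\n"
    "    return 1\n"
    "print(repr(f.__doc__), __debug__)\n"
)


def doc_and_debug(*flags):
    """Run SNIPPET in a fresh interpreter with the given flags; return its output."""
    proc = subprocess.run(
        [sys.executable, *flags, "-c", SNIPPET],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.strip()


for flags in ((), ("-O",), ("-OO",)):
    print(flags, doc_and_debug(*flags))
```

With no flag the docstring is present and `__debug__` is True. `-O` keeps the docstring but sets `__debug__` to False, and `-OO` additionally leaves `f.__doc__` as None.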
