
Why does Series.transform() exist? #31937


Closed
UchuuStranger opened this issue Feb 12, 2020 · 8 comments

Comments

@UchuuStranger

UchuuStranger commented Feb 12, 2020

This is my first issue on GitHub, so apologies in advance if there's something wrong with the format.

My issue does not have any expected output; I just want to understand whether the Series.transform() method is redundant and, if not, why. Overall, the transform() methods are very similar to the apply() methods, and as I was trying to figure out the difference between them (this Stack Overflow topic was helpful), I identified 3 primary differences:

  1. When the DataFrame is grouped on several categories, apply() passes each entire sub-DataFrame to the function, while transform() passes each column of each sub-DataFrame separately. That's why the function can't access values in other columns within transform();
  2. When the input passed to the function is an iterable of a certain length, apply() can still have the output of any length, while transform() has a limitation of having to output an iterable of the same length as the input;
  3. When the function outputs a scalar, apply() returns that scalar, while transform() propagates that scalar to the iterable of the input length.
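Differences 2 and 3 are easiest to see on a grouped Series. The following is a minimal sketch, not taken from the original notebook; the names `grouped`, `per_group`, and `per_row` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3]})

grouped = df.groupby('State')['a']

# apply() with a scalar-returning function: one value per group.
per_group = grouped.apply(lambda s: s.sum())

# transform() with the same function: the scalar is broadcast back
# to every row of its group, so the result has the input's length.
per_row = grouped.transform(lambda s: s.sum())

print(per_group)
print(per_row)
```

Here `per_group` has one entry per state, while `per_row` keeps the four-row index of the original column.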

I conducted a series of experiments that test these three differences on each applicable pandas object type: Series, DataFrame, SeriesGroupBy, and DataFrameGroupBy. I can send my ipynb with the code and the results if necessary, but it would be sufficient to just look at the conclusion for the Series type:

1 – not applicable. In both cases the function has a scalar input.
2 – not applicable. No matter what the function returns, in both cases the result is assigned to a single cell, even if that means putting entire DataFrames inside the cells of a Series.
3 – not applicable. The input length is always "1" (it's considered "1" even when it's an iterable), so there's no need to propagate.

Inapplicability of 1 is self-explanatory. But 2 was a surprise. Below is the code I tried:

import pandas as pd

df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                   'a':[4,5,1,3], 'b':[6,10,3,11]})

def return_df(x):
    return pd.DataFrame([[4, 5], [3, 2]])

def return_series(x):
    return pd.Series([1, 2])

df['a'].transform(return_df)
df['a'].transform(return_series)

If you try this code, you'll see that it doesn't matter what the function returns. Whatever it is, it will be put inside the single Series cell in its entirety. Is this behavior intentional? It results in the output size being predetermined by the input size, so all the size checks that Series.transform() has within itself become redundant. I can't imagine any situation where Series.transform() could behave in a different way from Series.apply(). And that raises the question I posed: why does Series.transform() exist?

@Liam3851
Contributor

Your observation 1 is wrong. Series.transform can also take a function that takes a Series. Your problem is that in your examples the return values have only two rows while your df had 4.

In [20]: df.a.transform(lambda x: (x - x.mean()) / x.std())
Out[20]:
0    0.439155
1    1.024695
2   -1.317465
3   -0.146385
Name: a, dtype: float64

And you can also do multiple transformers:

In [27]: df.a.transform([np.sqrt, np.exp])
Out[27]:
       sqrt         exp
0  2.000000   54.598150
1  2.236068  148.413159
2  1.000000    2.718282
3  1.732051   20.085537

Neither of those are available with apply.
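For anyone who wants to reproduce the two sessions above outside IPython, here is a self-contained sketch. It assumes a pandas version that behaves like the one in this thread; the variable names `zscores` and `multi` are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 5, 1, 3], name='a')

# A function that only makes sense on the whole Series
# (it needs the mean and standard deviation of all values).
zscores = s.transform(lambda x: (x - x.mean()) / x.std())

# A list of element-wise functions: one output column per function.
multi = s.transform([np.sqrt, np.exp])

print(zscores)
print(multi)
```

Both results keep the length and index of `s`, which is the contract transform() enforces.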

@Liam3851
Contributor

@MarcoGorelli The question was about Series.transform, which does not allow aggregator broadcasting, unlike SeriesGroupBy.transform (for which aggregator broadcasting is the main use case).

In [3]: s1.transform('sum')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-6db3fc2c8d83> in <module>
----> 1 s1.transform('sum')

C:\Miniconda3\envs\bleeding\lib\site-packages\pandas\core\series.py in transform(self, func, axis, *args, **kwargs)
   3715         # Validate the axis parameter
   3716         self._get_axis_number(axis)
-> 3717         return super().transform(func, *args, **kwargs)
   3718
   3719     def apply(self, func, convert_dtype=True, args=(), **kwds):

C:\Miniconda3\envs\bleeding\lib\site-packages\pandas\core\generic.py in transform(self, func, *args, **kwargs)
  10427         result = self.agg(func, *args, **kwargs)
  10428         if is_scalar(result) or len(result) != len(self):
> 10429             raise ValueError("transforms cannot produce aggregated results")
  10430
  10431         return result

ValueError: transforms cannot produce aggregated results

Admittedly it's unclear to me why Series.transform does not support aggregator broadcasting, since I thought the point of adding agg, transform, and apply was to mimic the groupby versions.
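A small sketch of the same contrast that runs as a script. The data in `s1` is a stand-in, since the original `s1` is not shown in this thread, and the exact error message may differ between pandas versions, so only the exception type is checked.

```python
import pandas as pd

s1 = pd.Series([4, 5, 1, 3])  # stand-in data; the original s1 is not shown

# A reducing function is rejected by Series.transform ...
try:
    s1.transform('sum')
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# ... while a length-preserving one is accepted.
running = s1.transform('cumsum')
print(running.tolist())
```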

@MarcoGorelli
Member

@Liam3851 yes, you're right - will delete then as it's not relevant

@UchuuStranger
Author

UchuuStranger commented Feb 12, 2020

@Liam3851 I tried the first code. Fascinating. So the function treats the first "x" as a scalar but simultaneously treats the second "x" as a Series? How does it know which is which? Admittedly, when I tried to see what exactly gets passed to the function, I did notice that there were two prints, and the second one was the entire Series for some reason. But it's still not clear to me how this works. It's true this doesn't work with apply() though, I've just checked. So they are different after all.

Regarding your second example, apply() seems to work in my case.

@UchuuStranger
Author

UchuuStranger commented Feb 12, 2020

@Liam3851 I just realized that the first "x" is a Series too, and then the function just returns the resulting Series once. So I assume what it does is trying to work with "x" as a scalar and when it fails, it passes the entire Series and works with "x" as a Series then?

And now it's not clear to me why we need to use transform() here if we can do the same without using any methods at all.

@WillAyd
Member

WillAyd commented Feb 13, 2020

This was implemented in #14668 - you can see the reasoning in that and the associated notes if you want to see the history. Simply put, transform allows multiple functions as input.

@WillAyd
Member

WillAyd commented Feb 14, 2020

Closing as I don't know if there is anything to be done here, but if you disagree can certainly reopen. Thanks!

WillAyd closed this as completed Feb 14, 2020

@UchuuStranger
Author

@WillAyd The proof that the methods can show different behavior after all was given, so the issue can be considered closed, yes. But if possible, I'd still like to know a) how transform() decides whether to pass a scalar or a Series, and b) what's the purpose of passing Series if whatever the function does can be done directly on a Series without using transform() at all.

I appreciate your link to the relevant pull request, but I'm new to GitHub, so it's not quite clear to me what exactly I should look/click at to get answers to these two questions. I looked through all the messages that mention transform(), but it didn't really make things clear to me. I'm sorry.
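One way to explore question (a) empirically is to record what the function actually receives. The sketch below is only an observation aid, not a description of documented behaviour: `seen` and `zscore` are hypothetical names, and the sequence of calls may vary between pandas versions. On the versions discussed in this thread, transform() appears to try the function on individual values first and, if that raises, to call it once more on the whole Series.

```python
import pandas as pd

s = pd.Series([4, 5, 1, 3], name='a')
seen = []  # type names of every argument the function receives


def zscore(x):
    # Record the argument type, then attempt a Series-style computation.
    seen.append(type(x).__name__)
    return (x - x.mean()) / x.std()


result = s.transform(zscore)
print(seen)

# For comparison: the same computation written directly on the Series.
direct = (s - s.mean()) / s.std()
print(((result - direct).abs() < 1e-12).all())
```

For question (b), the comparison at the end suggests that for a single function there is no numerical difference from operating on the Series directly; the extra value of transform() is the length check and the ability to pass several functions at once.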


4 participants