Why does Series.transform() exist? #31937
Comments
Your observation 1 is wrong: `Series.transform` can also take a function that takes a Series. Your problem is that in your examples the return values have only two rows while your `df` had 4.

```
In [20]: df.a.transform(lambda x: (x - x.mean()) / x.std())
Out[20]:
0    0.439155
1    1.024695
2   -1.317465
3   -0.146385
Name: a, dtype: float64
```

And you can also pass multiple transformers:

```
In [27]: df.a.transform([np.sqrt, np.exp])
Out[27]:
       sqrt         exp
0  2.000000   54.598150
1  2.236068  148.413159
2  1.000000    2.718282
3  1.732051   20.085537
```

Neither of those is available with `apply`.
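A self-contained version of the second snippet above (the input values `[4, 5, 1, 3]` are inferred from the printed outputs, e.g. `sqrt(4) = 2.0` and `exp(4) ≈ 54.598`; they are not stated in the thread itself):

```python
import numpy as np
import pandas as pd

# Values inferred from Out[20]/Out[27] above, not from the original code.
s = pd.Series([4.0, 5.0, 1.0, 3.0], name="a")

# A list of functions is a transform-only feature: the result is a
# DataFrame with one column per function, named after each function.
multi = s.transform([np.sqrt, np.exp])
print(list(multi.columns))              # ['sqrt', 'exp']
print(multi["sqrt"].round(3).tolist())  # [2.0, 2.236, 1.0, 1.732]
```

The column names come from each function's `__name__`, which is why the output above is labelled `sqrt` and `exp`.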
@MarcoGorelli The question was about …

Admittedly it's unclear to me why …
@Liam3851 Yes, you're right; will delete then, as it's not relevant.
@Liam3851 Tried the first code. Fascinating. Regarding your second example, …
@Liam3851 I just realized that the first "x" is a Series too, and that the function simply returns the resulting Series once. So I assume it tries to work with "x" as a scalar, and when that fails, it passes the entire Series and works with "x" as a Series instead? And now it's not clear to me why we need to use …
This was implemented in #14668; you can see the reasoning in that PR and the associated notes if you want the history. Simply put, `transform` allows multiple functions as input.
Closing, as I don't know if there is anything to be done here; if you disagree, we can certainly reopen. Thanks!
@WillAyd The proof that the methods can show different behavior after all was given, so the issue can be considered closed, yes. But if possible, I'd still like to know a) how … I appreciate your link to the relevant pull request, but I'm new to GitHub, so it's not quite clear to me what exactly I should look at or click to get answers to these two questions. I looked through all the messages that mention …
This is my first issue on GitHub, so apologies in advance if there's something wrong with the format.

My issue does not have any expected output; I just really want to understand if and why the `Series.transform()` method is not redundant. Overall, the `transform()` methods are very similar to the `apply()` methods, and as I was trying to figure out what the difference between them is (this Stack Overflow topic was helpful), I managed to pinpoint 3 primary differences:

1. `apply()` sends the entire sub-DataFrames into the function, while `transform()` sends each column of each sub-DataFrame separately. That's why columns can't access values in other columns within `transform()`;
2. `apply()` can have output of any length, while `transform()` has the limitation of having to output an iterable of the same length as the input;
3. when the function returns a scalar, `apply()` returns that scalar, while `transform()` propagates that scalar to an iterable of the input length.

I conducted a series of experiments that test these three differences on each applicable pandas object type: Series, DataFrame, SeriesGroupBy, and DataFrameGroupBy. I can send my ipynb with the code and the results if necessary, but it is sufficient to just look at the conclusions for the Series type:
1 – not applicable: in both cases the function receives a scalar input.
2 – not applicable: no matter what the function returns, in both cases the result is assigned to the single cell, even if that means entire DataFrames inside cells of a Series.
3 – not applicable: the input length is always 1 (it counts as 1 even when it's an iterable), so there is nothing to propagate.
Inapplicability of 1 is self-explanatory. But 2 was a surprise. Below is the code I tried:
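The snippet itself did not survive the page extraction. Below is a minimal sketch of the kind of experiment described (an assumed reconstruction, not the author's exact code), shown with `Series.apply`, which stores whatever the function returns in a single output cell:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Whatever the function returns -- here a whole list per element --
# lands in one cell of the output, so the output length always
# matches the input length:
nested = s.apply(lambda x: [int(x), int(x) * 10])
print(nested.iloc[0])   # [1, 10]
print(len(nested))      # 3

# The report (written against pandas 1.0) observed the same
# single-cell behavior from Series.transform; newer pandas versions
# validate transform results more strictly, so this may differ there.
```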
If you try this code, you'll see that it doesn't matter what the function returns: whatever it is, it is put inside the single Series cell in its entirety. Is this behavior intentional? It results in the output size being predetermined by the input size, so all the size checks that `Series.transform()` performs internally become redundant. I can't imagine any situation where `Series.transform()` could behave differently from `Series.apply()`. And that raises the question I posed: why does `Series.transform()` exist?