-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
QST: Comments and feedback on rolling own dataframe functionality and syntax? #40522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I don't really understand why you need this. Surely this is similar enough to your pattern without having to account all of the various different and complex methods of looking up data:
|
@CharlesHe16 this just isn't valid Python syntax: df.row_A = 5 if row_B == 10
df.row_A = 5 if row_B == 10 else row_A is at least valid Python but it doesn't give you the semantic you want, the Python interpreter will make row_B == 10 a boolean and not do what you want, not to mention that neither row_A nor row_B are variables in scope. This isn't Scala, you can't just write a function that takes a random block of code following and changes its semantics. Couple of existing options you may prefer given your aesthetic desires: df['row_A'] = np.where(df.row_B == 10, 5, df.row_A) numexpr (note this does not actually exist since numexpr where doesn't take an array for its third arg, but you could contribute it over there perhaps if you really wanted it and they were amenable) # this exists
df['row_A'] = ne.evaluate('where(row_B == 10, 5, 2)')
# this does not exist either in numexpr (doesn't support arrays in 3rd arg)
# or pandas (doesn't support numexpr where in eval)
# but maybe it could if you wanted to write it
df.eval('row_A=where(row_B ==10, 5, row_A)') None of these to me beat df.loc[df.row_B == 10, 'row_A'] = 5 but beauty, beholders, etc. |
Apart from the issues shared above, I can only add my own principle when dealing with similar problems: convenience methods for aesthetic purposes are only worth it, when you can code them in an error-prone way in an hour. |
This seems to be a more complex example: you are doing a double substitution contingent on both columns of another row being a certain value. In additional to this being orthogonal, it's an interesting but also rare to me: This is rare in my domains and mode of use: most data is time series or else just a pool of observations (people, cities), where each row is an instance or observation. The use case is for data cleaning or constructing variables or logic based on conditional logic is across columns, and that is my focus for this "ticket". (This reply is orthogonal, but so is your example.) |
Everything you said seems right to me. I would need to override/supplant the "if" statement somehow and also the Your suggestions below are great:
This is really useful to know. Of course, I've used df['elderly'] = np.where(df['age']>=75, 'yes', 'no') I've never used numexpr before, this seem brilliant too, I'll check it out.
Also, since my previous comment, I found http://tomerfiliba.com/blog/Infix-Operators/ Here's some ideas for uses of it. https://stackoverflow.com/questions/29219590/python-infix-forward-pipe It seems like using this would create an abomination, but there might be some way to adapt it to write the code I want. However, it may fail. I don't know how to deal with the |
It's true, some things are subjective. But I think we can still reason through the aesthestics: df.row_A = 5 if row_B = 10 In my (magical, maybe unworkable) statement, a user wants to modify a column by setting it equal to a value. So they do it. df.row_A = 5 Also, they want to make this command conditional on another column in the same row being the same value. So it is done too, by appending it to the statement. ... if row_B = 10 The above is clean. The user is focused on the objects and uses almost no extra syntax. The conditional logic is a natural extension of the base statement. <BEGIN RHETORIC> Consider this statement: df.loc[df.row_B == 10, 'row_A'] = 5 Starting off, we have Ok. So we're in some sort of syntax where we pass something that looks like parameters. Except it's not a function, where we normally pass parameters inside round brackets of Square brackets are for indexing. Sure, we are giving indexes, but they look really different from the usual integers or counter Also, we have had to cite the our dataframe Why is the position of What exactly is <END RHETORIC> |
you might be interested in #40322 Selecting data in an efficient way over a MultiIndex using labels or integer lookup or Booleans is a complex task. Don't underestimate the number of cases you must consider for a well structured and argued alternative |
Thanks, I'll check this out! Re;
and also
Basically, the issues motivating this "ticket" this is something that has been noticeable to me for 5+ years since moving to Pandas and Python. Initially I never considered editing things, because of almost exactly the two classes of responses in this thread ( [editing syntax, antipatterns, fighting Python, literally impossible] and [huge amounts work to usefully implement] ). These still might make it impossible. But there are situations where having a few simple tools is highly valuable and not just adding to the noise. Maybe being extremely focused alleviates some of these issues. |
Update: I did a quick scout. 1. Can't fight Python?Basically some strategy with For example, in the code below, where df.row_A = 5 |operator| 'row_B' == 10 This code won't work because, as @Liam3851 suggests, my That is, I think Now, there seems like there is some other strategy where I intercept/override the attribute/column assignment inside Pandas itself, but this is much closer to "fighting Pandas" territory. 2. Wrapper with string statements insteadInstead, if I get around to it, I might try a homebrew a wrapper of df.set('row_A = 5 if row_B = 10') This is just a wrapper for This seems about as much useful (?) content I can provide on this and I don't expect more progress unless someone else can provide input. |
Thanks for the report, but overall, in addition to what other's have mentioned, you have the liberty to modify syntax and roll functionality as far as the Python language will allow you. It may be better to also get advice from Stack Overflow as well with API design. Closing, and I don't see this being actionable toward the pandas library end. |
I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.
Question about pandas
I want to roll my own dataframe functionality and syntax. Below is an example with a ternary-like operator
Example A:
Let's say I want to set ‘row_A’ equal to 5 if ‘row_B’ is 10.
Right now, this has to be written like this:
df.loc[df.row_1 == ‘10’, 'row_1'] = 5
But I want this to work:
df.row_A = 5 if row_B = 10
My question is, is it feasible possible to add / overload the functionality implied? Has anyone tried this?
Example A above would just be the beginning.
While subjective, I think this deals with a large class of "not quite" esoteric operations, that are in fact arcane and esoteric to execute concisely in Pandas.
Note that I’m in principle happy to share / learn / complement / accept someone else's implementation too.
Please let me know if this is insanity.
The text was updated successfully, but these errors were encountered: