Skip to content

QST: Comments and feedback on rolling own dataframe functionality and syntax? #40522

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
1 of 2 tasks
CharlesHe16 opened this issue Mar 19, 2021 · 10 comments
Closed
1 of 2 tasks

Comments

@CharlesHe16
Copy link

CharlesHe16 commented Mar 19, 2021

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.


Question about pandas

I want to roll my own dataframe functionality and syntax. Below is an example with a ternary-like operator

Example A:

Let's say I want to set ‘row_A’ equal to 5 if ‘row_B’ is 10.

Right now, this has to be written like this:

df.loc[df.row_1 == ‘10’, 'row_1'] = 5

But I want this to work:

df.row_A = 5 if row_B = 10

My question is, is it feasible possible to add / overload the functionality implied? Has anyone tried this?

Example A above would just be the beginning.

While subjective, I think this deals with a large class of "not quite" esoteric operations, that are in fact arcane and esoteric to execute concisely in Pandas.

Note that I’m in principle happy to share / learn / complement / accept someone else's implementation too.

Please let me know if this is insanity.

@CharlesHe16 CharlesHe16 added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Mar 19, 2021
@attack68
Copy link
Contributor

I don't really understand why you need this. Surely this is similar enough to your pattern without having to account all of the various different and complex methods of looking up data:

df = DataFrame([[1, 3], [2, 2], [3, 3]], index=['a', 'b', 'c'], columns=['z', 'y'])

df.loc['b'] = [9,9] if all(df.loc['a'] == [1,3]) else df.loc['b']
print(df.loc['b'])
z 9
y 9
Name: b, dtype: int64

@Liam3851
Copy link
Contributor

Liam3851 commented Mar 20, 2021

@CharlesHe16 this just isn't valid Python syntax:

df.row_A = 5 if row_B == 10

5 if row_B == 10 would be evaluated first as a unit and that just doesn't work. You can't add it to pandas (or subclass pandas or whatever) if it's not Python.

df.row_A = 5 if row_B == 10 else row_A

is at least valid Python but it doesn't give you the semantic you want, the Python interpreter will make row_B == 10 a boolean and not do what you want, not to mention that neither row_A nor row_B are variables in scope. This isn't Scala, you can't just write a function that takes a random block of code following and changes its semantics.

Couple of existing options you may prefer given your aesthetic desires:

df['row_A'] = np.where(df.row_B == 10, 5, df.row_A)

numexpr (note this does not actually exist since numexpr where doesn't take an array for its third arg, but you could contribute it over there perhaps if you really wanted it and they were amenable)

# this exists 
df['row_A'] = ne.evaluate('where(row_B == 10, 5, 2)')

# this does not exist either in numexpr (doesn't support arrays in 3rd arg)
# or pandas (doesn't support numexpr where in eval)
# but maybe it could if you wanted to write it
df.eval('row_A=where(row_B ==10, 5, row_A)')

None of these to me beat

df.loc[df.row_B == 10, 'row_A'] = 5

but beauty, beholders, etc.

@AlexKirko
Copy link
Member

Apart from the issues shared above, I can only add my own principle when dealing with similar problems: convenience methods for aesthetic purposes are only worth it, when you can code them in an error-prone way in an hour.

@CharlesHe16
Copy link
Author

I don't really understand why you need this. Surely this is similar enough to your pattern without having to account all of the various different and complex methods of looking up data:

df = DataFrame([[1, 3], [2, 2], [3, 3]], index=['a', 'b', 'c'], columns=['z', 'y'])

df.loc['b'] = [9,9] if all(df.loc['a'] == [1,3]) else df.loc['b']
print(df.loc['b'])
z 9
y 9
Name: b, dtype: int64

This seems to be a more complex example: you are doing a double substitution contingent on both columns of another row being a certain value.

In additional to this being orthogonal, it's an interesting but also rare to me:

This is rare in my domains and mode of use: most data is time series or else just a pool of observations (people, cities), where each row is an instance or observation.

The use case is for data cleaning or constructing variables or logic based on conditional logic is across columns, and that is my focus for this "ticket".

(This reply is orthogonal, but so is your example.)

@CharlesHe16
Copy link
Author

CharlesHe16 commented Mar 20, 2021

@CharlesHe16 this just isn't valid Python syntax:
...

Everything you said seems right to me.

I would need to override/supplant the "if" statement somehow and also the ==/= or eq logic at the same time. Doing this seems like a lot of fighting.

Your suggestions below are great:

Couple of existing options you may prefer given your aesthetic desires:

df['row_A'] = np.where(df.row_B == 10, 5, df.row_A)

numexpr (note this does not actually exist since numexpr where doesn't take an array for its third arg, but you could contribute it over there perhaps if you really wanted it and they were amenable)

# this exists 
df['row_A'] = ne.evaluate('where(row_B == 10, 5, 2)')

# this does not exist either in numexpr (doesn't support arrays in 3rd arg)
# or pandas (doesn't support numexpr where in eval)
# but maybe it could if you wanted to write it
df.eval('row_A=where(row_B ==10, 5, row_A)')

This is really useful to know.

Of course, I've used np.where, but for some reason it never occurred to me to use it like this in place of the loc. Instead this is how I use it, which is awfully close:

df['elderly'] = np.where(df['age']>=75, 'yes', 'no')

I've never used numexpr before, this seem brilliant too, I'll check it out.

pandas.Dataframe.query or eval is something I've used before, and is much closer than what I want, but sticking things into strings seems somewhat unnatural and breaks linting and my intellij IDE.

Also, since my previous comment, I found infix

http://tomerfiliba.com/blog/Infix-Operators/

Here's some ideas for uses of it. https://stackoverflow.com/questions/29219590/python-infix-forward-pipe

It seems like using this would create an abomination, but there might be some way to adapt it to write the code I want.

However, it may fail. I don't know how to deal with the if statement.

@CharlesHe16
Copy link
Author

CharlesHe16 commented Mar 20, 2021

None of these to me beat

df.loc[df.row_B == 10, 'row_A'] = 5

but beauty, beholders, etc.

It's true, some things are subjective.

But I think we can still reason through the aesthestics:

df.row_A = 5 if row_B = 10

In my (magical, maybe unworkable) statement, a user wants to modify a column by setting it equal to a value.

So they do it.

df.row_A = 5

Also, they want to make this command conditional on another column in the same row being the same value.

So it is done too, by appending it to the statement.

... if row_B = 10

The above is clean. The user is focused on the objects and uses almost no extra syntax. The conditional logic is a natural extension of the base statement.

<BEGIN RHETORIC>

Consider this statement:

df.loc[df.row_B == 10, 'row_A'] = 5

Starting off, we have loc and then a square bracket.

Ok.

So we're in some sort of syntax where we pass something that looks like parameters.

Except it's not a function, where we normally pass parameters inside round brackets of ( and ) .

Square brackets are for indexing. Sure, we are giving indexes, but they look really different from the usual integers or counter i variables that we use. Here we pass in a string ('row_A') and some operation (df.row_B == 10). The later basically evaluates to a boolean index (which isn't obvious to a newcomer).

Also, we have had to cite the our dataframe df twice now. Why? So we can support the use of the indexing functionality basically. Except for many people and many times, we only care about data from the original dataframe.

Why is the position of row_A, which is the focus of the command, and the index reversed?

What exactly is loc? Are we using it for its intended purpose or hijacking some sort of "location" function to edit our dataframe?

<END RHETORIC>

@attack68
Copy link
Contributor

you might be interested in #40322

Selecting data in an efficient way over a MultiIndex using labels or integer lookup or Booleans is a complex task. Don't underestimate the number of cases you must consider for a well structured and argued alternative

@AlexKirko AlexKirko removed the Needs Triage Issue that has not been reviewed by a pandas team member label Mar 20, 2021
@CharlesHe16
Copy link
Author

CharlesHe16 commented Mar 20, 2021

you might be interested in #40322

Thanks, I'll check this out!

Re;

Selecting data in an efficient way over a MultiIndex using labels or integer lookup or Booleans is a complex task. Don't underestimate the number of cases you must consider for a well structured and argued alternative

and also

Apart from the issues shared above, I can only add my own principle when dealing with similar problems: convenience methods for aesthetic purposes are only worth it, when you can code them in an error-prone way in an hour.

Basically, the issues motivating this "ticket" this is something that has been noticeable to me for 5+ years since moving to Pandas and Python.

Initially I never considered editing things, because of almost exactly the two classes of responses in this thread ( [editing syntax, antipatterns, fighting Python, literally impossible] and [huge amounts work to usefully implement] ).

These still might make it impossible. But there are situations where having a few simple tools is highly valuable and not just adding to the noise. Maybe being extremely focused alleviates some of these issues.

@CharlesHe16
Copy link
Author

Update:

I did a quick scout.

1. Can't fight Python?

Basically some strategy with infix isn't going to work because it's not Python.

For example, in the code below, where |operator| is some "Infix" operator,

df.row_A = 5 |operator| 'row_B' == 10

This code won't work because, as @Liam3851 suggests, my |operator| cannot obtain the full logic on either side.

That is, I think |operator| just sees 5 and 'row_B', and not other content, because Python.

Now, there seems like there is some other strategy where I intercept/override the attribute/column assignment inside Pandas itself, but this is much closer to "fighting Pandas" territory.

2. Wrapper with string statements instead

Instead, if I get around to it, I might try a homebrew a wrapper of loc or very canonical Pandas code using strings, along the spirit of eval or query, e.g:

df.set('row_A = 5 if row_B = 10')

This is just a wrapper for loc basically, so it reduces the costs @attack68 and @AlexKirko warn about (maybe?).

This seems about as much useful (?) content I can provide on this and I don't expect more progress unless someone else can provide input.

@mroeschke
Copy link
Member

Thanks for the report, but overall, in addition to what other's have mentioned, you have the liberty to modify syntax and roll functionality as far as the Python language will allow you. It may be better to also get advice from Stack Overflow as well with API design. Closing, and I don't see this being actionable toward the pandas library end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants