QST: Comments and feedback on rolling own dataframe functionality and syntax? #40522

CharlesHe16 · 2021-03-19T22:08:33Z

I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.

Question about pandas

I want to roll my own dataframe functionality and syntax. Below is an example with a ternary-like operator

Example A:

Let's say I want to set ‘row_A’ equal to 5 if ‘row_B’ is 10.

Right now, this has to be written like this:

df.loc[df.row_1 == ‘10’, 'row_1'] = 5

But I want this to work:

df.row_A = 5 if row_B = 10

My question is, is it feasible possible to add / overload the functionality implied? Has anyone tried this?

Example A above would just be the beginning.

While subjective, I think this deals with a large class of "not quite" esoteric operations, that are in fact arcane and esoteric to execute concisely in Pandas.

Note that I’m in principle happy to share / learn / complement / accept someone else's implementation too.

Please let me know if this is insanity.

The text was updated successfully, but these errors were encountered:

attack68 · 2021-03-19T23:23:45Z

I don't really understand why you need this. Surely this is similar enough to your pattern without having to account all of the various different and complex methods of looking up data:

df = DataFrame([[1, 3], [2, 2], [3, 3]], index=['a', 'b', 'c'], columns=['z', 'y'])

df.loc['b'] = [9,9] if all(df.loc['a'] == [1,3]) else df.loc['b']
print(df.loc['b'])
z 9
y 9
Name: b, dtype: int64

Liam3851 · 2021-03-20T02:27:02Z

@CharlesHe16 this just isn't valid Python syntax:

df.row_A = 5 if row_B == 10

5 if row_B == 10 would be evaluated first as a unit and that just doesn't work. You can't add it to pandas (or subclass pandas or whatever) if it's not Python.

df.row_A = 5 if row_B == 10 else row_A

is at least valid Python but it doesn't give you the semantic you want, the Python interpreter will make row_B == 10 a boolean and not do what you want, not to mention that neither row_A nor row_B are variables in scope. This isn't Scala, you can't just write a function that takes a random block of code following and changes its semantics.

Couple of existing options you may prefer given your aesthetic desires:

df['row_A'] = np.where(df.row_B == 10, 5, df.row_A)

numexpr (note this does not actually exist since numexpr where doesn't take an array for its third arg, but you could contribute it over there perhaps if you really wanted it and they were amenable)

# this exists 
df['row_A'] = ne.evaluate('where(row_B == 10, 5, 2)')

# this does not exist either in numexpr (doesn't support arrays in 3rd arg)
# or pandas (doesn't support numexpr where in eval)
# but maybe it could if you wanted to write it
df.eval('row_A=where(row_B ==10, 5, row_A)')

None of these to me beat

df.loc[df.row_B == 10, 'row_A'] = 5

but beauty, beholders, etc.

AlexKirko · 2021-03-20T03:54:07Z

Apart from the issues shared above, I can only add my own principle when dealing with similar problems: convenience methods for aesthetic purposes are only worth it, when you can code them in an error-prone way in an hour.

CharlesHe16 · 2021-03-20T04:41:54Z

I don't really understand why you need this. Surely this is similar enough to your pattern without having to account all of the various different and complex methods of looking up data:
df = DataFrame([[1, 3], [2, 2], [3, 3]], index=['a', 'b', 'c'], columns=['z', 'y'])

df.loc['b'] = [9,9] if all(df.loc['a'] == [1,3]) else df.loc['b']
print(df.loc['b'])
z 9
y 9
Name: b, dtype: int64

This seems to be a more complex example: you are doing a double substitution contingent on both columns of another row being a certain value.

In additional to this being orthogonal, it's an interesting but also rare to me:

This is rare in my domains and mode of use: most data is time series or else just a pool of observations (people, cities), where each row is an instance or observation.

The use case is for data cleaning or constructing variables or logic based on conditional logic is across columns, and that is my focus for this "ticket".

(This reply is orthogonal, but so is your example.)

CharlesHe16 · 2021-03-20T05:05:09Z

@CharlesHe16 this just isn't valid Python syntax:
...

Everything you said seems right to me.

I would need to override/supplant the "if" statement somehow and also the ==/= or eq logic at the same time. Doing this seems like a lot of fighting.

Your suggestions below are great:

Couple of existing options you may prefer given your aesthetic desires:
df['row_A'] = np.where(df.row_B == 10, 5, df.row_A)
numexpr (note this does not actually exist since numexpr where doesn't take an array for its third arg, but you could contribute it over there perhaps if you really wanted it and they were amenable)
# this exists 
df['row_A'] = ne.evaluate('where(row_B == 10, 5, 2)')

# this does not exist either in numexpr (doesn't support arrays in 3rd arg)
# or pandas (doesn't support numexpr where in eval)
# but maybe it could if you wanted to write it
df.eval('row_A=where(row_B ==10, 5, row_A)')

This is really useful to know.

Of course, I've used np.where, but for some reason it never occurred to me to use it like this in place of the loc. Instead this is how I use it, which is awfully close:

df['elderly'] = np.where(df['age']>=75, 'yes', 'no')

I've never used numexpr before, this seem brilliant too, I'll check it out.

pandas.Dataframe.query or eval is something I've used before, and is much closer than what I want, but sticking things into strings seems somewhat unnatural and breaks linting and my intellij IDE.

Also, since my previous comment, I found infix

http://tomerfiliba.com/blog/Infix-Operators/

Here's some ideas for uses of it. https://stackoverflow.com/questions/29219590/python-infix-forward-pipe

It seems like using this would create an abomination, but there might be some way to adapt it to write the code I want.

However, it may fail. I don't know how to deal with the if statement.

CharlesHe16 · 2021-03-20T05:28:18Z

None of these to me beat
df.loc[df.row_B == 10, 'row_A'] = 5
but beauty, beholders, etc.

It's true, some things are subjective.

But I think we can still reason through the aesthestics:

df.row_A = 5 if row_B = 10

In my (magical, maybe unworkable) statement, a user wants to modify a column by setting it equal to a value.

So they do it.

df.row_A = 5

Also, they want to make this command conditional on another column in the same row being the same value.

So it is done too, by appending it to the statement.

... if row_B = 10

The above is clean. The user is focused on the objects and uses almost no extra syntax. The conditional logic is a natural extension of the base statement.

<BEGIN RHETORIC>

Consider this statement:

df.loc[df.row_B == 10, 'row_A'] = 5

Starting off, we have loc and then a square bracket.

Ok.

So we're in some sort of syntax where we pass something that looks like parameters.

Except it's not a function, where we normally pass parameters inside round brackets of ( and ) .

Square brackets are for indexing. Sure, we are giving indexes, but they look really different from the usual integers or counter i variables that we use. Here we pass in a string ('row_A') and some operation (df.row_B == 10). The later basically evaluates to a boolean index (which isn't obvious to a newcomer).

Also, we have had to cite the our dataframe df twice now. Why? So we can support the use of the indexing functionality basically. Except for many people and many times, we only care about data from the original dataframe.

Why is the position of row_A, which is the focus of the command, and the index reversed?

What exactly is loc? Are we using it for its intended purpose or hijacking some sort of "location" function to edit our dataframe?

<END RHETORIC>

attack68 · 2021-03-20T09:02:35Z

1. Can't fight Python?

Basically some strategy with infix isn't going to work because it's not Python.

For example, in the code below, where |operator| is some "Infix" operator,

df.row_A = 5 |operator| 'row_B' == 10

This code won't work because, as @Liam3851 suggests, my |operator| cannot obtain the full logic on either side.

That is, I think |operator| just sees 5 and 'row_B', and not other content, because Python.

Now, there seems like there is some other strategy where I intercept/override the attribute/column assignment inside Pandas itself, but this is much closer to "fighting Pandas" territory.

2. Wrapper with string statements instead

Instead, if I get around to it, I might try a homebrew a wrapper of loc or very canonical Pandas code using strings, along the spirit of eval or query, e.g:

df.set('row_A = 5 if row_B = 10')

This is just a wrapper for loc basically, so it reduces the costs @attack68 and @AlexKirko warn about (maybe?).

This seems about as much useful (?) content I can provide on this and I don't expect more progress unless someone else can provide input.

mroeschke · 2021-08-19T02:47:13Z

Thanks for the report, but overall, in addition to what other's have mentioned, you have the liberty to modify syntax and roll functionality as far as the Python language will allow you. It may be better to also get advice from Stack Overflow as well with API design. Closing, and I don't see this being actionable toward the pandas library end.

CharlesHe16 added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Mar 19, 2021

AlexKirko removed the Needs Triage Issue that has not been reviewed by a pandas team member label Mar 20, 2021

mroeschke closed this as completed Aug 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QST: Comments and feedback on rolling own dataframe functionality and syntax? #40522

QST: Comments and feedback on rolling own dataframe functionality and syntax? #40522

CharlesHe16 commented Mar 19, 2021 •

edited

Loading

attack68 commented Mar 19, 2021

Liam3851 commented Mar 20, 2021 •

edited

Loading

AlexKirko commented Mar 20, 2021

CharlesHe16 commented Mar 20, 2021

CharlesHe16 commented Mar 20, 2021 •

edited

Loading

CharlesHe16 commented Mar 20, 2021 •

edited

Loading

attack68 commented Mar 20, 2021

CharlesHe16 commented Mar 20, 2021 •

edited

Loading

CharlesHe16 commented Mar 20, 2021

mroeschke commented Aug 19, 2021

QST: Comments and feedback on rolling own dataframe functionality and syntax? #40522

QST: Comments and feedback on rolling own dataframe functionality and syntax? #40522

Comments

CharlesHe16 commented Mar 19, 2021 • edited Loading

Question about pandas

attack68 commented Mar 19, 2021

Liam3851 commented Mar 20, 2021 • edited Loading

AlexKirko commented Mar 20, 2021

CharlesHe16 commented Mar 20, 2021

CharlesHe16 commented Mar 20, 2021 • edited Loading

CharlesHe16 commented Mar 20, 2021 • edited Loading

attack68 commented Mar 20, 2021

CharlesHe16 commented Mar 20, 2021 • edited Loading

CharlesHe16 commented Mar 20, 2021

1. Can't fight Python?

2. Wrapper with string statements instead

mroeschke commented Aug 19, 2021

CharlesHe16 commented Mar 19, 2021 •

edited

Loading

Liam3851 commented Mar 20, 2021 •

edited

Loading

CharlesHe16 commented Mar 20, 2021 •

edited

Loading

CharlesHe16 commented Mar 20, 2021 •

edited

Loading

CharlesHe16 commented Mar 20, 2021 •

edited

Loading