Skip to content

The input column name in query contains special characters #27017

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
solivehong opened this issue Jun 24, 2019 · 20 comments · Fixed by #28215
Closed

The input column name in query contains special characters #27017

solivehong opened this issue Jun 24, 2019 · 20 comments · Fixed by #28215
Labels
Milestone

Comments

@solivehong
Copy link

The input column name in pandas.dataframe.query() contains special characters.
I saw the change in 0.25, but still have . \ / 等问题
And main problem is that I can't restore these characters after converting them to "_" , which is a very serious problem.

I have an idea

inputpandas.dataframe
Intermediate processing`x2E`(Hex)-> like pandas`x2E`dataframe(Add backslashes after processing the data) ->pandas\x2Edataframe
output pandas\x2Edataframe
@WillAyd
Copy link
Member

WillAyd commented Jun 27, 2019

@solivehong
Copy link
Author

this is my say like this @WillAyd @hwalinga

import pandas as pd
import re
a = pd.DataFrame({'foo.bar':[11111,2222,333333],})
def in_columns_data(col):
    space = re.findall('(\W)',list(col)[0])[0]

    if space=='.':
        new = list(col)[0].replace('{space}'.format(**locals()),'x2E')
        control=1
        print (new)
        # new = list(col)[0].encode('hex')
        return new,control
    elif space==' ':
        new = list(col)[0].replace('{space}'.format(**locals()),'x20')
        control=2
        print (new)
        # new = list(col)[0].encode('hex')
        return new,control

name,control = in_columns_data(a.columns)
a.columns = [name]
print (a)
a = a.query("{name} >2222".format(**locals()))
def out_columns_data(col,control):
    if control==1:
        new = list(col)[0].replace('x2E','.')
        print (new)
        # new = list(col)[0].encode('hex')
        return [new]
a.columns = out_columns_data(a.columns,control)
print (a)

the result is

foox2Ebar
   foox2Ebar
0      11111
1       2222
2     333333
foo.bar
   foo.bar
0    11111
2   333333

@WillAyd
Copy link
Member

WillAyd commented Jul 5, 2019

Can you try and make the example minimal? This link might help:

http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@hwalinga
Copy link
Contributor

hwalinga commented Jul 5, 2019

@WillAyd

This is what he means:

import pandas as pd
a = pd.DataFrame({'foo.bar':[11111,2222,333333],})
a.query("foo.bar > 2222")
# fails 

(NB, Chinese just works fine in Python3, and since new releases don't support Python2 anyway, that issue can be dropped.)

He is coming from #6508 which I solved for spaces in #24955

The question that is now on the table: Do we want to support other forbidden characters (next to space) as well?

The problem is that we need to work around the tokenizer, but since the tokenizer recognizes python operators that can be dealt with. (So . / - + *) However, \ is an escape character, which is probably a lot harder, I don't know if even possible.

So, shall I make a pull request for this, or isn't this necessary enough to pollute the code base further with? (I estimate I need to add one new function and alter two existing ones, from what I remember from the last PR I did on this.)

@jreback has reviewed the previous PR and may want to have a word on this before I start.

@jreback
Copy link
Contributor

jreback commented Jul 5, 2019

I am probably -1 on this; we by definition only support python tokens

you can always not use query which is a convenience method

@solivehong
Copy link
Author

I think the input str in query and eval is a convenient way to input, and it can improve the speed of processing data. I have been entangled in this problem because I found that strings can use regular expressions, format, etc. Method, this is too convenient@jreback

@jreback
Copy link
Contributor

jreback commented Jul 5, 2019

I think the input str in query and eval is a convenient way to input, and it can improve the speed of processing data. I have been entangled in this problem because I found that strings can use regular expressions, format, etc. Method, this is too convenient@jreback

this does not improve processing at all unless you are doing very simple operations

changing to non python semantics is cause for confusion

@hwalinga
Copy link
Contributor

hwalinga commented Jul 5, 2019

@danielhrisca @dgua

These people might also have a word on this, since they requested the same in the issue about the spaces.

@hwalinga
Copy link
Contributor

I thought you would be skeptical @jreback but let's view it this way: We already have a way to distinguish between valid python names and invalid python names. That is the backtick quoting I have implemented to allow spaces. So, there isn't much of a barrier left to next to allow `this kind of names` to also allow `this.kind.of.names` or `this-kind-of-names`.

Not everybody in the world is used to snake_case, and the dots in the names would probably cater to the people coming from R. (In which having dots in your identifiers is basically the equivalent of underscores in python.) And who knows, maybe there is a webdev very happy to write his/her names as CSS class names.

If you are worried how horrible the code will look like. It is not that bad. I implemented it already. Take a look:
https://github.com/hwalinga/pandas/tree/allow-special-characters-query

@solivehong
Copy link
Author

excellent job @hwalinga
Eval and query are the two best methods I use in use
Looking forward to updating this part of 0.25, I will download the test first.

@hwalinga
Copy link
Contributor

@jreback Do you want me to open a PR, or will you close this as a 'wontfix'?

@JivanRoquet
Copy link

@hwalinga I second that. One more use case besides this.kind.of.name is also when a column name starts with a digit (which is not a valid Python variable name), like 60d or 30d.

Currently (as of 0.25.1) even using backticks doesn't work with columns starting with a digit.

df.eval("`column 60d` + `column 30d`") # works
df.eval("`60d` + `30d`") # fails

It seems that under the hood, Pandas replaces spaces by underscores and removes the backticks like this:

`hello there` + `foo bar`
# becomes
hello_there_BACKTICK_QUOTED_STRING + foo_bar_BACKTICK_QUOTED_STRING
# which is a valid Python expression

and

`30d` + `60d`
# becomes
30d_BACKTICK_QUOTED_STRING + 60d_BACKTICK_QUOTED_STRING
# which is NOT a valid Python expression

As a solution, one might suggest always prepending a _ in front of a backticked-column (instead of only appending _BACKTICK_QUOTED_STRING like now), like the following:

`30d` + `60d`
# would become
_30d_BACKTICK_QUOTED_STRING + _60d_BACKTICK_QUOTED_STRING
# which is a valid Python expression

@JivanRoquet
Copy link

@jreback

changing to non python semantics is cause for confusion

I don't think this is right. Datasets' column names (and the people who come up with them) couldn't care less about Python semantics, and it seems reasonable enough to think that Pandas users would expect eval and query to work out-of-the-box with any kind of dataset column names.

Another way to put it is that the analytics tool (here, Pandas) has to adapt to real-life datasets, instead of the other way around. The "other way around" will simply never happen, and if it doesn't happen either on Pandas' side, then it's up to the Pandas analyst to deal with the nitty-gritty hoops and bumps of dealing with exotic column names. Which is far from ideal.

@hwalinga
Copy link
Contributor

Thanks for the feedback.

I can understand jreback's perspective. Allowing all these kind of edge cases brings in quite some ugly code to the codebase. Unless you have your own language parser build-in. You aren't really solving it very elegantly.

Even if we allow for these kind of edge cases to be valid with the aforementioned hacks, there will all kinds of ways to break it.

Some different approach that has also been discussed was to use uuid instead. That would probably be most elegant, but would make exceptions harder to read.

Since there are now multiple people having trouble (or expecting too much) from the backticks, maybe the backtick approach is something worth reimplementing, even though the exceptions become harder to read?

@zhaohongqiangsoliva maybe this new activity makes it worth reopening again?

@TomAugspurger
Copy link
Contributor

@JivanRoquet Are non-python identifiers even feasible with how query / eval works? I'm not too familiar with it, but my understanding is that we use the ast module to build up an expression that's passed to numexpr.

@solivehong
Copy link
Author

solivehong commented Aug 28, 2019

@hwalinga I won't open a closed question,I searched using google, but I didn't find a way.

@solivehong solivehong reopened this Aug 28, 2019
@dgua
Copy link

dgua commented Aug 28, 2019

"I don't think this is right. Datasets' column names (and the people who come up with them) couldn't care less about Python semantics, and it seems reasonable enough to think that Pandas users would expect eval and query to work out-of-the-box with any kind of dataset column names."

100% agree with @JivanRoquet. This needs to work for all of us who use real life data files.
This issue needs to be resolved.

@TomAugspurger
Copy link
Contributor

@dgua do you have a design proposal?

@hwalinga
Copy link
Contributor

@TomAugspurger To give you little background. Using non-python identifiers was solved in #24955 . However, it was only solved for spaces. This issue talks about extending that functionality. I actually already implemented it here: https://github.com/hwalinga/pandas/tree/allow-special-characters-query

Shall I make a PR? (I will then also make the change to allow numbers in the beginning.)

@jreback
Copy link
Contributor

jreback commented Aug 28, 2019

@hwalinga your implementation looks ok; even though i am basically 0- on generally using query / eval (as its very opaque to readers); would be ok with this change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants