The input column name in query contains special characters #27017
Do you have a minimally reproducible example for this? |
@WillAyd @hwalinga This is what I do; something like this:

```python
import pandas as pd
import re

a = pd.DataFrame({'foo.bar': [11111, 2222, 333333]})

def in_columns_data(col):
    # Find the first non-word character in the first column name.
    space = re.findall(r'(\W)', list(col)[0])[0]
    if space == '.':
        # Replace the dot with a placeholder so query() can parse the name.
        new = list(col)[0].replace(space, 'x2E')
        control = 1
        print(new)
        return new, control
    elif space == ' ':
        # Replace the space with a placeholder.
        new = list(col)[0].replace(space, 'x20')
        control = 2
        print(new)
        return new, control

name, control = in_columns_data(a.columns)
a.columns = [name]
print(a)
a = a.query("{name} > 2222".format(**locals()))

def out_columns_data(col, control):
    if control == 1:
        # Restore the original dot in the column name.
        new = list(col)[0].replace('x2E', '.')
        print(new)
        return [new]

a.columns = out_columns_data(a.columns, control)
print(a)
```

the result is
|
Can you try and make the example minimal? This link might help: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports |
This is what he means:
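Something like the following would be a minimal reproducer (a sketch; the exact exception depends on the pandas version):

```python
import pandas as pd

df = pd.DataFrame({"foo.bar": [1, 2, 3]})

# The dot makes 'foo.bar' an invalid Python identifier, so query() cannot
# reference the column directly and raises an error.
df.query("foo.bar > 1")
```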
(NB: Chinese column names work fine in Python 3, and since new releases don't support Python 2 anyway, that part of the issue can be dropped.)

He is coming from #6508, which I solved for spaces in #24955.

The question now on the table: do we want to support other forbidden characters (besides the space) as well? The problem is that we need to work around the tokenizer, but since the tokenizer recognizes Python operators, those can be dealt with (so . / - + *). However, \ is an escape character, which is probably a lot harder; I don't know if it is even possible.

So, shall I make a pull request for this, or is it not needed badly enough to justify polluting the code base further? (I estimate I need to add one new function and alter two existing ones, from what I remember from the last PR I did on this.) @jreback reviewed the previous PR and may want to have a word on this before I start. |
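To make the state at the time concrete, a small sketch (assuming pandas 0.25, where #24955 added backtick quoting for spaces only; other characters are exactly what this issue is about):

```python
import pandas as pd

df = pd.DataFrame({"foo bar": [1, 2], "foo.bar": [3, 4]})

print(df.query("`foo bar` > 1"))      # works in 0.25: backticks handle the space

try:
    print(df.query("`foo.bar` > 3"))  # the open question: '.' inside backticks
except Exception as exc:              # fails in 0.25; extending this is the proposal
    print(type(exc).__name__, exc)
```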
I am probably -1 on this; by definition we only support Python tokens. You can always simply not use query, which is a convenience method. |
I think passing a string to query and eval is a convenient way to express the operation, and it can speed up data processing. I have been stuck on this problem because I found that the strings can be built with regular expressions, format, and other string methods, which is just too convenient. @jreback |
This does not improve processing at all unless you are doing very simple operations. Changing to non-Python semantics is a cause for confusion. |
These people might also have a word on this, since they requested the same in the issue about the spaces. |
I thought you would be skeptical @jreback, but let's view it this way: we already have a way to distinguish between valid Python names and invalid Python names, namely the backtick quoting I implemented to allow spaces. So there isn't much of a barrier left to also allowing other special characters.

Not everybody in the world is used to snake_case, and dots in names would probably cater to people coming from R (where dots in identifiers are basically the equivalent of underscores in Python). And who knows, maybe there is a webdev who is very happy to write his/her names as CSS class names.

If you are worried about how horrible the code will look: it is not that bad. I implemented it already. Take a look: |
excellent job @hwalinga |
@jreback Do you want me to open a PR, or will you close this as a 'wontfix'? |
@hwalinga I second that. One more use case besides the ones already mentioned: currently (as of 0.25.1) even using backticks doesn't work with columns starting with a digit.

```python
df.eval("`column 60d` + `column 30d`")  # works
df.eval("`60d` + `30d`")                # fails
```

It seems that under the hood, Pandas replaces spaces by underscores and removes the backticks, like this:
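Presumably the rewritten expressions end up looking roughly like this (an assumption about the internals, not something verified against the pandas source):

```python
# Apparent preprocessing of the backtick-quoted expressions (assumption):
# spaces become underscores, backticks are dropped.
"`column 60d` + `column 30d`"   # -> "column_60d + column_30d"  (valid Python names)
"`60d` + `30d`"                 # -> "60d + 30d"                ('60d' is not a valid Python name)
```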
As a solution, one might suggest always prepending a
|
I don't think this is right. Datasets' column names (and the people who come up with them) couldn't care less about Python semantics, and it seems reasonable enough to think that Pandas users would expect query and eval to handle such names regardless.

Another way to put it is that the analytics tool (here, Pandas) has to adapt to real-life datasets, instead of the other way around. The "other way around" will simply never happen, and if it doesn't happen on Pandas' side either, then it's up to the Pandas analyst to deal with the nitty-gritty hoops and bumps of exotic column names. Which is far from ideal. |
Thanks for the feedback. I can understand jreback's perspective. Allowing all these kinds of edge cases brings quite some ugly code into the codebase; unless you have your own language parser built in, you aren't really solving it very elegantly. Even if we make these edge cases valid with the aforementioned hacks, there will be all kinds of ways to break it.

A different approach that has also been discussed is to use a uuid instead. That would probably be the most elegant, but it would make exceptions harder to read. Since there are now multiple people having trouble with (or expecting too much from) the backticks, maybe the backtick approach is worth reimplementing that way, even though the exceptions become harder to read?

@zhaohongqiangsoliva maybe this new activity makes it worth reopening again? |
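For what it's worth, the uuid idea could look roughly like this (a sketch under the assumption that awkward names are temporarily replaced by generated identifiers; this is not how pandas is implemented):

```python
import uuid

import pandas as pd

def query_with_any_names(df, expr, names):
    # Temporarily rename awkward columns to unique, valid identifiers,
    # rewrite the backtick-quoted expression, run query, then restore names.
    mapping = {name: "_" + uuid.uuid4().hex for name in names}
    for original, alias in mapping.items():
        expr = expr.replace("`{}`".format(original), alias)
    result = df.rename(columns=mapping).query(expr)
    return result.rename(columns={alias: original for original, alias in mapping.items()})

df = pd.DataFrame({"foo.bar": [11111, 2222, 333333]})
print(query_with_any_names(df, "`foo.bar` > 2222", ["foo.bar"]))
```

The trade-off mentioned above shows up here: any error raised inside query reports the generated alias rather than the original column name.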
@JivanRoquet Are non-python identifiers even feasible with how query / eval works? I'm not too familiar with it, but my understanding is that we use the |
@hwalinga I don't know how to reopen a closed issue; I searched using Google, but I didn't find a way. |
100% agree with @JivanRoquet. This needs to work for all of us who use real life data files. |
@dgua do you have a design proposal? |
@TomAugspurger To give you a little background: using non-Python identifiers was solved in #24955. However, it was only solved for spaces. This issue talks about extending that functionality. I actually already implemented it here: https://github.com/hwalinga/pandas/tree/allow-special-characters-query Shall I make a PR? (I will then also make the change to allow numbers at the beginning.) |
@hwalinga your implementation looks ok; even though I am basically -0 on generally using query / eval (as it's very opaque to readers), I would be ok with this change. |
The input column name in pandas.dataframe.query() contains special characters.
I saw the change in 0.25, but characters such as . \ / are still a problem.
And the main problem is that I can't restore these characters after converting them to "_", which is a very serious problem.
I have an idea
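One possible workaround is a reversible rename mapping, so the original names can always be recovered (a sketch, not a pandas feature, and not necessarily the idea referred to above):

```python
import pandas as pd

df = pd.DataFrame({"foo.bar": [11111, 2222, 333333]})

# Keep an explicit mapping instead of overwriting characters in place,
# so the original names can be restored afterwards.
safe = {col: col.replace(".", "_").replace(" ", "_") for col in df.columns}
restore = {alias: original for original, alias in safe.items()}

out = df.rename(columns=safe).query("foo_bar > 2222").rename(columns=restore)
print(out)  # the column is named 'foo.bar' again
```

This assumes the replacement does not make two different columns collide on the same safe name.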