The input column name in query contains special characters #27017

solivehong · 2019-06-24T13:23:19Z

The input column name in pandas.dataframe.query() contains special characters.
I saw the change in 0.25, but still have . \ / 等问题
And main problem is that I can't restore these characters after converting them to "_" , which is a very serious problem.

I have an idea

input： pandas.dataframe
Intermediate processing：
`x2E`(Hex)-> like pandas`x2E`dataframe(Add backslashes after processing the data) ->pandas\x2Edataframe
output pandas\x2Edataframe

WillAyd · 2019-06-27T03:43:28Z

Do you have a minimally reproducible example for this?

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#bug-reports-and-enhancement-requests

solivehong · 2019-07-05T14:27:31Z

this is my say like this @WillAyd @hwalinga

import pandas as pd
import re
a = pd.DataFrame({'foo.bar':[11111,2222,333333],})
def in_columns_data(col):
    space = re.findall('(\W)',list(col)[0])[0]

    if space=='.':
        new = list(col)[0].replace('{space}'.format(**locals()),'x2E')
        control=1
        print (new)
        # new = list(col)[0].encode('hex')
        return new,control
    elif space==' ':
        new = list(col)[0].replace('{space}'.format(**locals()),'x20')
        control=2
        print (new)
        # new = list(col)[0].encode('hex')
        return new,control

name,control = in_columns_data(a.columns)
a.columns = [name]
print (a)
a = a.query("{name} >2222".format(**locals()))
def out_columns_data(col,control):
    if control==1:
        new = list(col)[0].replace('x2E','.')
        print (new)
        # new = list(col)[0].encode('hex')
        return [new]
a.columns = out_columns_data(a.columns,control)
print (a)

the result is

foox2Ebar
   foox2Ebar
0      11111
1       2222
2     333333
foo.bar
   foo.bar
0    11111
2   333333

WillAyd · 2019-07-05T15:15:25Z

Can you try and make the example minimal? This link might help:

http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

hwalinga · 2019-07-05T16:07:05Z

@WillAyd

This is what he means:

import pandas as pd
a = pd.DataFrame({'foo.bar':[11111,2222,333333],})
a.query("foo.bar > 2222")
# fails

(NB, Chinese just works fine in Python3, and since new releases don't support Python2 anyway, that issue can be dropped.)

He is coming from #6508 which I solved for spaces in #24955

The question that is now on the table: Do we want to support other forbidden characters (next to space) as well?

The problem is that we need to work around the tokenizer, but since the tokenizer recognizes python operators that can be dealt with. (So . / - + *) However, \ is an escape character, which is probably a lot harder, I don't know if even possible.

So, shall I make a pull request for this, or isn't this necessary enough to pollute the code base further with? (I estimate I need to add one new function and alter two existing ones, from what I remember from the last PR I did on this.)

@jreback has reviewed the previous PR and may want to have a word on this before I start.

jreback · 2019-07-05T16:14:19Z

I am probably -1 on this; we by definition only support python tokens

you can always not use query which is a convenience method

solivehong · 2019-07-05T16:44:31Z

I think the input str in query and eval is a convenient way to input, and it can improve the speed of processing data. I have been entangled in this problem because I found that strings can use regular expressions, format, etc. Method, this is too convenient@jreback

jreback · 2019-07-05T17:02:34Z

I think the input str in query and eval is a convenient way to input, and it can improve the speed of processing data. I have been entangled in this problem because I found that strings can use regular expressions, format, etc. Method, this is too convenient@jreback

this does not improve processing at all unless you are doing very simple operations

changing to non python semantics is cause for confusion

hwalinga · 2019-07-05T17:17:49Z

@danielhrisca @dgua

These people might also have a word on this, since they requested the same in the issue about the spaces.

hwalinga · 2019-07-10T22:35:08Z

I thought you would be skeptical @jreback but let's view it this way: We already have a way to distinguish between valid python names and invalid python names. That is the backtick quoting I have implemented to allow spaces. So, there isn't much of a barrier left to next to allow `this kind of names` to also allow `this.kind.of.names` or `this-kind-of-names`.

Not everybody in the world is used to snake_case, and the dots in the names would probably cater to the people coming from R. (In which having dots in your identifiers is basically the equivalent of underscores in python.) And who knows, maybe there is a webdev very happy to write his/her names as CSS class names.

If you are worried how horrible the code will look like. It is not that bad. I implemented it already. Take a look:
https://github.com/hwalinga/pandas/tree/allow-special-characters-query

solivehong · 2019-07-11T03:39:48Z

excellent job @hwalinga
Eval and query are the two best methods I use in use
Looking forward to updating this part of 0.25, I will download the test first.

hwalinga · 2019-08-21T09:41:13Z

@jreback Do you want me to open a PR, or will you close this as a 'wontfix'?

JivanRoquet · 2019-08-27T11:25:40Z

@hwalinga I second that. One more use case besides this.kind.of.name is also when a column name starts with a digit (which is not a valid Python variable name), like 60d or 30d.

Currently (as of 0.25.1) even using backticks doesn't work with columns starting with a digit.

df.eval("`column 60d` + `column 30d`") # works
df.eval("`60d` + `30d`") # fails

It seems that under the hood, Pandas replaces spaces by underscores and removes the backticks like this:

`hello there` + `foo bar`
# becomes
hello_there_BACKTICK_QUOTED_STRING + foo_bar_BACKTICK_QUOTED_STRING
# which is a valid Python expression

and

`30d` + `60d`
# becomes
30d_BACKTICK_QUOTED_STRING + 60d_BACKTICK_QUOTED_STRING
# which is NOT a valid Python expression

As a solution, one might suggest always prepending a _ in front of a backticked-column (instead of only appending _BACKTICK_QUOTED_STRING like now), like the following:

`30d` + `60d`
# would become
_30d_BACKTICK_QUOTED_STRING + _60d_BACKTICK_QUOTED_STRING
# which is a valid Python expression

JivanRoquet · 2019-08-27T11:33:39Z

@jreback

changing to non python semantics is cause for confusion

I don't think this is right. Datasets' column names (and the people who come up with them) couldn't care less about Python semantics, and it seems reasonable enough to think that Pandas users would expect eval and query to work out-of-the-box with any kind of dataset column names.

Another way to put it is that the analytics tool (here, Pandas) has to adapt to real-life datasets, instead of the other way around. The "other way around" will simply never happen, and if it doesn't happen either on Pandas' side, then it's up to the Pandas analyst to deal with the nitty-gritty hoops and bumps of dealing with exotic column names. Which is far from ideal.

hwalinga · 2019-08-27T12:58:41Z

Thanks for the feedback.

I can understand jreback's perspective. Allowing all these kind of edge cases brings in quite some ugly code to the codebase. Unless you have your own language parser build-in. You aren't really solving it very elegantly.

Even if we allow for these kind of edge cases to be valid with the aforementioned hacks, there will all kinds of ways to break it.

Some different approach that has also been discussed was to use uuid instead. That would probably be most elegant, but would make exceptions harder to read.

Since there are now multiple people having trouble (or expecting too much) from the backticks, maybe the backtick approach is something worth reimplementing, even though the exceptions become harder to read?

@zhaohongqiangsoliva maybe this new activity makes it worth reopening again?

TomAugspurger · 2019-08-27T20:31:32Z

@JivanRoquet Are non-python identifiers even feasible with how query / eval works? I'm not too familiar with it, but my understanding is that we use the ast module to build up an expression that's passed to numexpr.

solivehong · 2019-08-28T01:02:18Z

@hwalinga I won't open a closed question，I searched using google, but I didn't find a way.

dgua · 2019-08-28T02:45:38Z

"I don't think this is right. Datasets' column names (and the people who come up with them) couldn't care less about Python semantics, and it seems reasonable enough to think that Pandas users would expect eval and query to work out-of-the-box with any kind of dataset column names."

100% agree with @JivanRoquet. This needs to work for all of us who use real life data files.
This issue needs to be resolved.

TomAugspurger · 2019-08-28T03:09:35Z

@dgua do you have a design proposal?

hwalinga · 2019-08-28T08:11:04Z

@TomAugspurger To give you little background. Using non-python identifiers was solved in #24955 . However, it was only solved for spaces. This issue talks about extending that functionality. I actually already implemented it here: https://github.com/hwalinga/pandas/tree/allow-special-characters-query

Shall I make a PR? (I will then also make the change to allow numbers in the beginning.)

jreback · 2019-08-28T15:30:20Z

@hwalinga your implementation looks ok; even though i am basically 0- on generally using query / eval (as its very opaque to readers); would be ok with this change

solivehong closed this as completed Aug 21, 2019

solivehong reopened this Aug 28, 2019

hwalinga mentioned this issue Aug 29, 2019

Add function to clean up column names with special characters #28215

Merged

5 tasks

mroeschke added Enhancement expressions pd.eval, query labels Nov 2, 2019

jreback added this to the 1.0 milestone Dec 10, 2019

jreback closed this as completed in #28215 Jan 4, 2020

aschonfeld mentioned this issue Feb 18, 2020

Periods in column names cause page to go blank man-group/dtale#63

Closed

stevedodson mentioned this issue Mar 26, 2020

Query on existing field and value fails with AttributeError: 'numpy.bool_' object has no attribute 'empty' elastic/eland#146

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The input column name in query contains special characters #27017

The input column name in query contains special characters #27017

solivehong commented Jun 24, 2019

WillAyd commented Jun 27, 2019

solivehong commented Jul 5, 2019

WillAyd commented Jul 5, 2019

hwalinga commented Jul 5, 2019

jreback commented Jul 5, 2019

solivehong commented Jul 5, 2019

jreback commented Jul 5, 2019

hwalinga commented Jul 5, 2019

hwalinga commented Jul 10, 2019

solivehong commented Jul 11, 2019

hwalinga commented Aug 21, 2019

JivanRoquet commented Aug 27, 2019

JivanRoquet commented Aug 27, 2019

hwalinga commented Aug 27, 2019

TomAugspurger commented Aug 27, 2019

solivehong commented Aug 28, 2019 •

edited

Loading

dgua commented Aug 28, 2019

TomAugspurger commented Aug 28, 2019

hwalinga commented Aug 28, 2019

jreback commented Aug 28, 2019

The input column name in query contains special characters #27017

The input column name in query contains special characters #27017

Comments

solivehong commented Jun 24, 2019

WillAyd commented Jun 27, 2019

solivehong commented Jul 5, 2019

WillAyd commented Jul 5, 2019

hwalinga commented Jul 5, 2019

jreback commented Jul 5, 2019

solivehong commented Jul 5, 2019

jreback commented Jul 5, 2019

hwalinga commented Jul 5, 2019

hwalinga commented Jul 10, 2019

solivehong commented Jul 11, 2019

hwalinga commented Aug 21, 2019

JivanRoquet commented Aug 27, 2019

JivanRoquet commented Aug 27, 2019

hwalinga commented Aug 27, 2019

TomAugspurger commented Aug 27, 2019

solivehong commented Aug 28, 2019 • edited Loading

dgua commented Aug 28, 2019

TomAugspurger commented Aug 28, 2019

hwalinga commented Aug 28, 2019

jreback commented Aug 28, 2019

solivehong commented Aug 28, 2019 •

edited

Loading