
Add function to clean up column names with special characters #28215


Merged: 1 commit merged on Jan 4, 2020

Conversation

hwalinga
Contributor

Changed the backtick quoting functions so that backtick quoting can also be used for invalid Python identifiers, such as names that start with a digit or that contain operators instead of spaces.
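
A minimal sketch of what this enables (column names invented for illustration):

import pandas as pd

# Neither name is a valid Python identifier: one starts with a digit,
# the other contains an operator character.
df = pd.DataFrame({"1st place": [1, 2], "a+b": [3, 4]})

# With this change, backtick quoting covers both cases in query():
print(df.query("`1st place` == 2 and `a+b` == 4"))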

As this builds upon #24955, I think @jreback would again be the right person for the code review.

@hwalinga
Contributor Author

I don't know how I can solve this issue: pandas/core/computation/common.py:4: error: Module 'tokenize' has no attribute 'EXACT_TOKEN_TYPES', since the tokenize module does have this attribute. (This is in the "Type validation" check.)

@simonjayhawkins
Member

https://docs.python.org/3/library/tokenize.html#module-tokenize does not refer to EXACT_TOKEN_TYPES, and typeshed does not have a definition for EXACT_TOKEN_TYPES: https://github.com/python/typeshed/blob/master/stdlib/3/tokenize.pyi

my guess is it's not part of the public API for tokenize.

@hwalinga
Contributor Author

@simonjayhawkins I see. Shall I write out this dictionary instead of importing it?

@simonjayhawkins
Member

just reading the docs.. "All constants from the token module are also exported from tokenize." https://docs.python.org/3/library/tokenize.html#tokenizing-input

so it would perhaps appear that the omission of EXACT_TOKEN_TYPES is an oversight in typeshed.

so probably best to add # type: ignore and a comment.

@simonjayhawkins
Member

> just reading the docs.. "All constants from the token module are also exported from tokenize." https://docs.python.org/3/library/tokenize.html#tokenizing-input
>
> so it would perhaps appear that the omission of EXACT_TOKEN_TYPES is an oversight in typeshed.
>
> so probably best to add # type: ignore and a comment.

Forget that. I misread the docs.

from the source..

import token
__all__ = token.__all__ + ["tokenize", "detect_encoding",
                           "untokenize", "TokenInfo"]

@simonjayhawkins
Member

can you use tokenize.tokenize instead? The returned named tuple has an additional property named exact_type that contains the exact operator type for OP tokens. For all other token types exact_type equals the named tuple type field.
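
For reference, a small demo of exact_type, using generate_tokens (the str-based variant of tokenize.tokenize):

import io
import tokenize

source = "a + b"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # For OP tokens, exact_type names the specific operator
    # (e.g. tokenize.PLUS); for all other tokens it equals tok.type.
    print(tok.string, tok.type, tok.exact_type)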

@hwalinga
Contributor Author

@simonjayhawkins
Well, I basically need a list of the operators first, and then I need the number associated with each operator. I don't see exactly how I would do that with exact_type.

So, I guess I either add the # type: ignore, or write them out manually.
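
The two options would look roughly like this (a sketch, not the PR's final code):

# Option 1: keep the import and silence the typeshed gap.
from tokenize import EXACT_TOKEN_TYPES  # type: ignore  # missing from typeshed

# Option 2: write the operator table out manually (illustrative subset).
import token

MANUAL_TOKEN_TYPES = {"+": token.PLUS, "-": token.MINUS, "*": token.STAR}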

@hwalinga
Contributor Author

Changed code so that the PR now also closes #28576.

@hwalinga
Contributor Author

hwalinga commented Sep 24, 2019

EDIT: I just moved it all to expr.py. That made more sense.

Also, I moved some more functions from pandas/core/computation/expr.py to pandas/core/computation/common.py. But the latter no longer seems like a good place for them. It is probably better to move them all back to expr.py, but maybe it is also a good idea to put them in a new file, backtickquoting.py? (About 100 lines.)

@hwalinga
Contributor Author

Implemented code that raises an exception if there is ambiguity. (Ambiguity should be very rare.)

Also added an exception for the case where the conversion somehow fails to create a valid Python identifier. (I currently cannot think of one.)
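
A sketch of the conversion-plus-error idea described above (the replacement table is an illustrative subset, not the PR's full mapping):

import keyword

def create_valid_python_identifier(name: str) -> str:
    # Names that are already usable identifiers pass through unchanged.
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    # Replace characters that cannot appear in a Python identifier.
    replacements = {" ": "_", "+": "_PLUS_", "-": "_MINUS_"}  # illustrative subset
    cleaned = "".join(replacements.get(ch, ch) for ch in name)
    if not cleaned.isidentifier():
        # The rare case mentioned above: conversion failed anyway.
        raise SyntaxError(f"Could not convert {name!r} to a valid Python identifier.")
    return cleaned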

Also moved some functions around.

Probably needs a new code review. @WillAyd do you want to take a look again?

Member

@WillAyd WillAyd left a comment

OK, thanks. I think this is moving in the right direction; just hoping we can improve readability.

@hwalinga
Contributor Author

hwalinga commented Oct 2, 2019

@WillAyd What do you think of it now? (I am not entirely sure why Travis fails; it says something about TestS3, so maybe a temporary problem with AWS?)

@hwalinga
Contributor Author

hwalinga commented Oct 8, 2019

@WillAyd What do you think? Is it ready, and should I maybe do a squash or something?

Member

@WillAyd WillAyd left a comment

Sorry, I know this is probably taking longer than you hoped, but there is a lot of ambiguity in query as is, so I just really want to keep this as simple and succinct as possible. Some new comments to that effect.

@hwalinga
Contributor Author

hwalinga commented Oct 8, 2019

@WillAyd No need to apologize. I know pandas is a very popular Python package, so it is good to take the time for this.

@hwalinga hwalinga requested a review from WillAyd October 9, 2019 13:55
Member

@WillAyd WillAyd left a comment

Thanks this is looking a lot simpler. cc @jreback for thoughts as well

@hwalinga
Contributor Author

Well, I had some doubts about that filter / map construct. Usually comprehensions are favored over those, and the __contains__ isn't so nice either. But for people who are a bit more familiar with functional programming, I think it reads a lot better than the previous version.

I have changed the docs, but I will wait to push. (I don't want to put too much load on your automated pipelines, as I expect to still have some more improvements and will probably also squash the commits, etc.)
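
For concreteness, the kind of trade-off being weighed, with hypothetical names rather than the PR's actual code:

names = ["`x + y`", "plain", "`1st`"]

# Functional style: reads well if you think in map/filter terms.
quoted = list(map(lambda s: s.strip("`"), filter(lambda s: "`" in s, names)))

# Comprehension style: the form usually favored in the codebase.
quoted = [s.strip("`") for s in names if "`" in s]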

Contributor

@jreback jreback left a comment

i'll have to review, but I am not sure I am on board with the rationale behind this. .query doesn't need to grow a completely different DSL than Python; this makes maintenance very burdensome. You can always use quoted strings without query to do this, so I am leaning towards not expanding this scope.

@hwalinga
Contributor Author

hwalinga commented Oct 11, 2019

@jreback

This PR does not only add the functionality of using special characters in backtick quoted identifiers. It also improves on a wide range of issues with my previous PR (#24955).

To summarize, this PR consists of the following improvements:

  • Raise an error when it is not possible to convert a backtick quoted string to a valid Python identifier. (These cases are rare, but it is good to still raise an understandable error.) (_create_valid_python_identifier)
  • Raise an error when the conversion of the column resolvers to valid Python identifiers creates an ambiguity. This has to do with how Python parses spaces. This issue was already present in the previous PR, but wasn't handled. Although I doubt anyone would ever differentiate their column names based only on a difference in whitespace, it is still good to raise an error, as it would otherwise pass silently. (_get_space_character_free_column_resolvers; see the sketch after this list)
  • Moved all code from common.py to expr.py. (I realized it is better to have it all in one place.)
  • The actual improvement of allowing special characters in backtick quoted strings. (_remove_spaces_column_name -> _create_valid_python_identifier)
  • Tests to cover the above.
  • The backtick quoted string as well as the column name undergo the same process (tokenizing and cleaning), so that special cases where tokenizing removes too much whitespace still get resolved in the end. (See for example "whitespaces in column name handled differently than input to eval/query" #28893.)
  • Improved documentation that also covers the edge cases.
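
A sketch of the whitespace ambiguity the second bullet points to (later comments note this case was eventually handled by slicing the source directly):

import pandas as pd

# The two names differ only in internal whitespace, so both would be
# cleaned to the same Python identifier.
df = pd.DataFrame({"a b": [1], "a  b": [2]})

# Rather than silently resolving to one of the two columns, the PR as
# described here raises an error for this ambiguity.
df.query("`a b` == 1")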

I think the PR improves a lot of things, not just adding more coverage for different cases of backtick quoted strings.

Also, keep in mind that we are not expanding the DSL of the .query function. We are only adding more coverage for different cases in naming, i.e. special characters in addition to spaces.

@hwalinga
Contributor Author

Yet another improvement: I was able to fix the problem with the multiple spaces. I made use of the indices of the token_generator to extract the exact indices of the backtick quoted string, so any undesired manipulation done to the tokens is completely circumvented.

This also simplifies the PR, because it removes the need to document that edge case and to catch an error when it occurs (that was a significant portion of the code).
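
A sketch of that index-based extraction, simplified from the parsing code (the sentinel token number is an assumption here):

import tokenize
from typing import Iterator, Tuple

BACKTICK_QUOTED_STRING = 100  # assumed sentinel token number

def tokenize_backtick_quoted_string(
    token_generator: Iterator[tokenize.TokenInfo], source: str, string_start: int
) -> Tuple[int, str]:
    # Walk tokens until the closing backtick; its start column marks the
    # exact end of the quoted name in the original source, so internal
    # whitespace survives untouched by the tokenizer.
    for _, tokval, start, _, _ in token_generator:
        if tokval == "`":
            string_end = start[1]
            break
    else:
        raise SyntaxError("unterminated backtick quote")  # defensive; assumption
    return BACKTICK_QUOTED_STRING, source[string_start:string_end]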

@hwalinga
Contributor Author

hwalinga commented Nov 2, 2019

@jreback Can you do the code review?

The current failed test is just some linting in an unrelated file.

@hwalinga hwalinga requested a review from jreback November 9, 2019 14:05
@jreback
Contributor

jreback commented Dec 10, 2019

@hwalinga

so I think your code cleanup is +1. I don't really think supporting special characters aside from space (e.g. $) is worth it, but it's not much code, so OK.

I would like to see a follow-up that handles a SyntaxError in the numexpr engine and then just falls back to the python engine.
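
At the call site, the fallback described here would look something like this (a sketch; the follow-up would presumably do this inside pandas itself):

import pandas as pd

df = pd.DataFrame({"a+b": [1, 2]})
expr = "`a+b` > 1"
try:
    result = df.query(expr, engine="numexpr")
except SyntaxError:
    # Expressions numexpr cannot parse are retried with the slower
    # pure-Python engine, as suggested above.
    result = df.query(expr, engine="python")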


Member

@WillAyd WillAyd left a comment

Updating approval per previous comments but lgtm

@jreback
Contributor

jreback commented Dec 12, 2019

@hwalinga can you merge master?

@hwalinga hwalinga force-pushed the allow-special-characters-query branch from da4d585 to c521be8 on December 12, 2019 15:39
@hwalinga
Contributor Author

@jreback I have merged master and squashed the commits. There are some build problems on 32-bit Linux, and some problems with building the docs. These seem unrelated.

@hwalinga hwalinga force-pushed the allow-special-characters-query branch 3 times, most recently from 3dd7599 to 0f3331f on December 19, 2019 21:30
@hwalinga
Contributor Author

@jreback All is green. I think you can merge this now.

@hwalinga
Contributor Author

Hold it. Something seems to have gone wrong while squashing the commits.

@hwalinga hwalinga force-pushed the allow-special-characters-query branch from 0f3331f to c25fc84 on December 20, 2019 13:45
@hwalinga
Contributor Author

@jreback Everything is correct now; ready for merge.

@jreback
Contributor

jreback commented Dec 27, 2019

can you merge master and ping on green.

@hwalinga hwalinga force-pushed the allow-special-characters-query branch from c25fc84 to 2552ab1 on December 30, 2019 16:58
@hwalinga
Contributor Author

@jreback Yes, all green, squashed, ready for the merge.

@jreback jreback added this to the 1.0 milestone Jan 1, 2020
Contributor

@jreback jreback left a comment

some comments, pls rebase, ping on green and we can get this in.

return toknum, tokval


def clean_column_names(name: str) -> str:
Contributor

can you call this clean_column_name (not the plural) to reflect its action

def tokenize_backtick_quoted_string(
token_generator: Iterator[tokenize.TokenInfo], source: str, string_start: int
) -> Tok:
"""
Contributor

can you add Parameters / Returns here


def tokenize_backtick_quoted_string(
token_generator: Iterator[tokenize.TokenInfo], source: str, string_start: int
) -> Tok:
Contributor

this return annotation is not correct; it's a Tuple[...] (be specific inside the tuple)

Contributor Author

I created Tok as an alias for Tuple[int, str], but as this did not seem clear, I made them all explicit.

----------
source : str
A Python source code string
"""
Contributor

can you add Returns

@@ -453,22 +453,29 @@ def _get_axis_resolvers(self, axis):
d[axis] = dindex
return d

def _get_index_resolvers(self):
d = {}
def _get_index_resolvers(self) -> Dict[str, ABCSeries]:
Contributor

as a follow-up I'd like to clean up all of these functions; I think we can move them entirely to pandas/computation/ (including the axis resolvers, which is just plain wrong now the way it's implemented). can you create an issue for this


Member

@simonjayhawkins simonjayhawkins Jan 5, 2020

and replace ABCSeries in type annotations. ABCSeries resolves to Any.

Contributor Author

Replace with what? Also, let's continue this in #30683

Member

I think return type is Dict[str, Union["Series", MultiIndex]]

Member

> Also, let's continue this in #30683

I don't think the typing discussion is relevant there.

Contributor Author

Okay, I will see if that works.

The issue is meant for cleaning up these functions, so why not fix the type hints as well?

…function.

Clean up the code that is used for using spaces in the query function, and extend the ability to also use special characters that are not allowed in Python identifiers.

All code related to this functionality now lives in pandas/core/computation/parsing.py.
@hwalinga
Contributor Author

hwalinga commented Jan 4, 2020

@jreback

Improved docstrings and rebased.

Contributor

@jreback jreback left a comment

thanks @hwalinga i know this has taken a while!

@jreback jreback merged commit fffb978 into pandas-dev:master Jan 4, 2020

def _get_space_character_free_column_resolvers(self):
"""Return the space character free column resolvers of a dataframe.
return {clean_column_name(k): v for k, v in d.items() if k is not int}
Member

should this be if not isinstance(k, int)? this is causing mypy failures in #30694.

Contributor

yes that looks like it needs changing

Contributor Author

Ooh, that's really embarrassing. Yes, it should be if not isinstance(k, int).

What would be the best way to fix this? Should I make a PR for it?
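
For clarity, the bug in the quoted line: k is not int compares each key against the type object int by identity, which is true for every real column label, so integer keys slip through. A runnable sketch with a stand-in helper:

def clean_column_name(name):  # stand-in for pandas' helper, for illustration
    return str(name).replace(" ", "_")

d = {"a b": [1], 0: [2]}

# Wrong: identity comparison with the type object; keeps the key 0.
wrong = {clean_column_name(k): v for k, v in d.items() if k is not int}

# Intended: actually skip integer column labels.
right = {clean_column_name(k): v for k, v in d.items() if not isinstance(k, int)}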


Labels: DataFrame (DataFrame data structure)

Successfully merging this pull request may close these issues: The input column name in query contains special characters

7 participants