
Add function to clean up column names with special characters #28215


Merged: 1 commit merged on Jan 4, 2020

Conversation

hwalinga
Contributor

Changed the backtick quoting functions so that backtick quoting can also be used for invalid Python identifiers, such as names that start with a digit or that contain operators instead of spaces.
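
A minimal sketch of what this enables (column names invented for illustration):

import pandas as pd

# Neither name is a valid Python identifier: one starts with a digit,
# the other contains an operator character.
df = pd.DataFrame({"1st place": [1, 2], "a+b": [3, 4]})

# With this change, backtick quoting covers both cases in query():
print(df.query("`1st place` == 2 and `a+b` == 4"))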

As this builds upon #24955, I think @jreback would again be the right person for the code review.

@hwalinga
Contributor Author

I don't know how I can solve this issue: pandas/core/computation/common.py:4: error: Module 'tokenize' has no attribute 'EXACT_TOKEN_TYPES', since the tokenize module does have this attribute. (This is in the "Type validation" check.)

@simonjayhawkins
Member

https://docs.python.org/3/library/tokenize.html#module-tokenize does not refer to EXACT_TOKEN_TYPES, and typeshed does not have a definition for EXACT_TOKEN_TYPES: https://github.com/python/typeshed/blob/master/stdlib/3/tokenize.pyi

my guess is it's not part of the public API for tokenize.

@hwalinga
Contributor Author

@simonjayhawkins I see. Shall I write out this dictionary instead of importing it?

@simonjayhawkins
Member

just reading the docs.. "All constants from the token module are also exported from tokenize." https://docs.python.org/3/library/tokenize.html#tokenizing-input

so it would perhaps appear that the omission of EXACT_TOKEN_TYPES is an oversight in typeshed.

so probably best to add # type: ignore and a comment.

@simonjayhawkins
Member

> just reading the docs.. "All constants from the token module are also exported from tokenize." https://docs.python.org/3/library/tokenize.html#tokenizing-input
>
> so it would perhaps appear that the omission of EXACT_TOKEN_TYPES is an oversight in typeshed.
>
> so probably best to add # type: ignore and a comment.

Forget that. I misread the docs.

from the source..

import token
__all__ = token.__all__ + ["tokenize", "detect_encoding",
                           "untokenize", "TokenInfo"]

@simonjayhawkins
Member

can you use tokenize.tokenize instead? The returned named tuple has an additional property named exact_type that contains the exact operator type for OP tokens. For all other token types exact_type equals the named tuple type field.
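
For reference, a small demo of exact_type, using generate_tokens (the str-based variant of tokenize.tokenize):

import io
import tokenize

source = "a + b"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # For OP tokens, exact_type names the specific operator
    # (e.g. tokenize.PLUS); for all other tokens it equals tok.type.
    print(tok.string, tok.type, tok.exact_type)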

@hwalinga
Contributor Author

@simonjayhawkins
Well, I basically need a list of the operators first, and then I need the number associated with each operator. I don't see exactly how I would do that with exact_type.

So, I guess I either add the # type: ignore, or write them out manually.
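
The two options would look roughly like this (a sketch, not the PR's final code):

# Option 1: keep the import and silence the typeshed gap.
from tokenize import EXACT_TOKEN_TYPES  # type: ignore  # missing from typeshed

# Option 2: write the operator table out manually (illustrative subset).
import token

MANUAL_TOKEN_TYPES = {"+": token.PLUS, "-": token.MINUS, "*": token.STAR}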

@hwalinga
Contributor Author

Changed code so that the PR now also closes #28576.

@hwalinga
Contributor Author

hwalinga commented Sep 24, 2019

EDIT: I just moved it all to expr.py. That made more sense.

Also, I moved some more functions from pandas/core/computation/expr.py to pandas/core/computation/common.py. But the latter no longer seems like a good place for them. It is probably better to move them all back to expr.py, but maybe it is also a good idea to put them in a new file, backtickquoting.py? (About 100 lines.)

@hwalinga
Contributor Author

Implemented code that raises an exception if there is ambiguity. (Ambiguity should be very rare.)

Also added an exception for the case where the conversion somehow fails to create a valid Python identifier. (I currently cannot think of one.)
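
A sketch of the conversion-plus-error idea described above (the replacement table is an illustrative subset, not the PR's full mapping):

import keyword

def create_valid_python_identifier(name: str) -> str:
    # Names that are already usable identifiers pass through unchanged.
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    # Replace characters that cannot appear in a Python identifier.
    replacements = {" ": "_", "+": "_PLUS_", "-": "_MINUS_"}  # illustrative subset
    cleaned = "".join(replacements.get(ch, ch) for ch in name)
    if not cleaned.isidentifier():
        # The rare case mentioned above: conversion failed anyway.
        raise SyntaxError(f"Could not convert {name!r} to a valid Python identifier.")
    return cleaned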

Also moved some functions around.

Probably needs a new code review. @WillAyd do you want to take a look again?

Member

@WillAyd WillAyd left a comment

OK, thanks. I think this is moving in the right direction; just hoping we can improve readability.

@hwalinga
Contributor Author

hwalinga commented Oct 2, 2019

@WillAyd What do you think of it now? (I am not entirely sure why Travis fails; it says something about TestS3, so maybe a temporary problem with AWS?)

@hwalinga
Contributor Author

hwalinga commented Oct 8, 2019

@WillAyd What do you think? Is it ready, and should I maybe do a squash or something?

Member

@WillAyd WillAyd left a comment

Sorry, I know this is probably taking longer than you hoped, but there is a lot of ambiguity in query as is, so I just really want to keep this as simple and succinct as possible. Some new comments to that effect.

@hwalinga
Contributor Author

hwalinga commented Oct 8, 2019

@WillAyd No need to apologize. I know pandas is a very popular Python package, so it is good to take the time for this.

@hwalinga hwalinga requested a review from WillAyd October 9, 2019 13:55
Member

@WillAyd WillAyd left a comment

Thanks this is looking a lot simpler. cc @jreback for thoughts as well

@hwalinga
Contributor Author

Well, I had some doubts about that filter / map construct. Usually comprehensions are favored over those, and the __contains__ isn't so nice either. But for people who are a bit more familiar with functional programming, I think it reads a lot better than the previous version.

I have changed the docs, but I will wait to push. (I don't want to put too much load on your automated pipelines, as I expect to still have some more improvements and will probably also squash the commits, etc.)
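
For concreteness, the kind of trade-off being weighed, with hypothetical names rather than the PR's actual code:

names = ["`x + y`", "plain", "`1st`"]

# Functional style: reads well if you think in map/filter terms.
quoted = list(map(lambda s: s.strip("`"), filter(lambda s: "`" in s, names)))

# Comprehension style: the form usually favored in the codebase.
quoted = [s.strip("`") for s in names if "`" in s]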

Contributor

@jreback jreback left a comment

i'll have to review, but I am not sure I am on board with the rationale behind this. .query doesn't need to grow a completely different DSL than Python; this makes maintenance very burdensome. You can always use quoted strings without query to do this, so I am leaning towards not expanding this scope.

@hwalinga
Contributor Author

hwalinga commented Oct 11, 2019

@jreback

This PR does not only add the functionality of using special characters in backtick quoted identifiers. It also improves on a wide range of issues with my previous PR (#24955).

To summarize, this PR consists of the following improvements:

  • Raise an error when it is not possible to convert a backtick quoted string to a valid Python identifier. (These cases are rare, but it is good to still raise an understandable error.) (_create_valid_python_identifier)
  • Raise an error when the conversion of the column resolvers to valid Python identifiers creates an ambiguity. This has to do with how Python parses spaces. This issue was already present in the previous PR, but wasn't handled. Although I doubt anyone would ever differentiate their column names based only on a difference in whitespace, it is still good to raise an error, as it would otherwise pass silently. (_get_space_character_free_column_resolvers; see the sketch after this list)
  • Moved all code from common.py to expr.py. (I realized it is better to have it all in one place.)
  • The actual improvement of allowing special characters in backtick quoted strings. (_remove_spaces_column_name -> _create_valid_python_identifier)
  • Tests to cover the above.
  • The backtick quoted string as well as the column name undergo the same process (tokenizing and cleaning), so that special cases where tokenizing removes too much whitespace still get resolved in the end. (See for example "whitespaces in column name handled differently than input to eval/query" #28893.)
  • Improved documentation that also covers the edge cases.
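
A sketch of the whitespace ambiguity the second bullet points to (later comments note this case was eventually handled by slicing the source directly):

import pandas as pd

# The two names differ only in internal whitespace, so both would be
# cleaned to the same Python identifier.
df = pd.DataFrame({"a b": [1], "a  b": [2]})

# Rather than silently resolving to one of the two columns, the PR as
# described here raises an error for this ambiguity.
df.query("`a b` == 1")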

I think the PR improves a lot of things, not just adding more coverage for different cases of backtick quoted strings.

Also, keep in mind that we are not expanding the DSL of the .query function. We are only adding more coverage for different cases in naming, i.e. special characters in addition to spaces.

@hwalinga
Contributor Author

Yet another improvement: I was able to fix the problem with the multiple spaces. I made use of the indices of the token_generator to extract the exact indices of the backtick quoted string, so any undesired manipulation done to the tokens is completely circumvented.

This also simplifies the PR, because it removes the need to document that edge case and to catch an error when it occurs (that was a significant portion of the code).
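
A sketch of that index-based extraction, simplified from the parsing code (the sentinel token number is an assumption here):

import tokenize
from typing import Iterator, Tuple

BACKTICK_QUOTED_STRING = 100  # assumed sentinel token number

def tokenize_backtick_quoted_string(
    token_generator: Iterator[tokenize.TokenInfo], source: str, string_start: int
) -> Tuple[int, str]:
    # Walk tokens until the closing backtick; its start column marks the
    # exact end of the quoted name in the original source, so internal
    # whitespace survives untouched by the tokenizer.
    for _, tokval, start, _, _ in token_generator:
        if tokval == "`":
            string_end = start[1]
            break
    else:
        raise SyntaxError("unterminated backtick quote")  # defensive; assumption
    return BACKTICK_QUOTED_STRING, source[string_start:string_end]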

@hwalinga
Contributor Author

hwalinga commented Nov 2, 2019

@jreback Can you do the code review?

The current failed test is just some linting in an unrelated file.

@hwalinga hwalinga requested a review from jreback November 9, 2019 14:05
@jreback
Contributor

jreback commented Dec 10, 2019

@hwalinga

so I think your code cleanup is +1. I don't really think supporting special characters aside from space (e.g. $) is worth it, but it's not much code, so OK.

I would like to see a follow-up that handles a SyntaxError in the numexpr engine and then just falls back to the python engine.
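
At the call site, the fallback described here would look something like this (a sketch; the follow-up would presumably do this inside pandas itself):

import pandas as pd

df = pd.DataFrame({"a+b": [1, 2]})
expr = "`a+b` > 1"
try:
    result = df.query(expr, engine="numexpr")
except SyntaxError:
    # Expressions numexpr cannot parse are retried with the slower
    # pure-Python engine, as suggested above.
    result = df.query(expr, engine="python")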


Member

@WillAyd WillAyd left a comment

Updating approval per previous comments but lgtm

@jreback
Contributor

jreback commented Dec 12, 2019

@hwalinga can you merge master?

@hwalinga hwalinga force-pushed the allow-special-characters-query branch from da4d585 to c521be8 on December 12, 2019 15:39
@hwalinga
Contributor Author

@jreback I have merged master and squashed the commits. There are some build problems on 32-bit Linux, and some problems with building the docs. These seem unrelated.

@hwalinga hwalinga force-pushed the allow-special-characters-query branch 3 times, most recently from 3dd7599 to 0f3331f on December 19, 2019 21:30
@hwalinga
Contributor Author

@jreback All is green. I think you can merge this now.

@hwalinga
Contributor Author

Hold it. Something seems to have gone wrong while squashing the commits.

@hwalinga hwalinga force-pushed the allow-special-characters-query branch from 0f3331f to c25fc84 on December 20, 2019 13:45
@hwalinga
Contributor Author

@jreback Everything is correct now; ready for merge.

@jreback
Contributor

jreback commented Dec 27, 2019

can you merge master and ping on green.

@hwalinga hwalinga force-pushed the allow-special-characters-query branch from c25fc84 to 2552ab1 on December 30, 2019 16:58
@hwalinga
Contributor Author

@jreback Yes, all green, squashed, ready for the merge.

@jreback jreback added this to the 1.0 milestone Jan 1, 2020
Contributor

@jreback jreback left a comment

some comments, pls rebase, ping on green and we can get this in.

return toknum, tokval


def clean_column_names(name: str) -> str:
Contributor

can you call this clean_column_name (not the plural) to reflect its action

def tokenize_backtick_quoted_string(
token_generator: Iterator[tokenize.TokenInfo], source: str, string_start: int
) -> Tok:
"""
Contributor

can you add Parameters / Returns here


def tokenize_backtick_quoted_string(
token_generator: Iterator[tokenize.TokenInfo], source: str, string_start: int
) -> Tok:
Contributor

this return annotation is not correct; it's a Tuple[...] (be specific inside the tuple)

Contributor Author

I created Tok as an alias for Tuple[int, str], but as this did not seem clear, I made them all explicit.

----------
source : str
A Python source code string
"""
Contributor

can you add Returns

@@ -453,22 +453,29 @@ def _get_axis_resolvers(self, axis):
d[axis] = dindex
return d

def _get_index_resolvers(self):
d = {}
def _get_index_resolvers(self) -> Dict[str, ABCSeries]:
Contributor

as a follow-up I'd like to clean up all of these functions; I think we can move them entirely to pandas/computation/ (including the axis resolvers, which is just plain wrong now the way it's implemented). can you create an issue for this


Member

@simonjayhawkins simonjayhawkins Jan 5, 2020

and replace ABCSeries in type annotations. ABCSeries resolves to Any.

Contributor Author

Replace with what? Also, let's continue this in #30683

Member

I think return type is Dict[str, Union["Series", MultiIndex]]

Member

> Also, let's continue this in #30683

I don't think the typing discussion is relevant there.

Contributor Author

Okay, I will see if that works.

The issue is meant for cleaning up these functions, so why not fix the type hints as well?

…function.

Clean up the code that is used for using spaces in the query function, and extend the ability to also use special characters that are not allowed in Python identifiers.

All code related to this functionality now lives in pandas/core/computation/parsing.py.
@hwalinga
Contributor Author

hwalinga commented Jan 4, 2020

@jreback

Improved docstrings and rebased.

Contributor

@jreback jreback left a comment

thanks @hwalinga i know this has taken a while!

@jreback jreback merged commit fffb978 into pandas-dev:master Jan 4, 2020

def _get_space_character_free_column_resolvers(self):
"""Return the space character free column resolvers of a dataframe.
return {clean_column_name(k): v for k, v in d.items() if k is not int}
Member

should this be if not isinstance(k, int)? this is causing mypy failures in #30694.

Contributor

yes that looks like it needs changing

Contributor Author

Ooh, that's really embarrassing. Yes, it should be if not isinstance(k, int).

What would be the best way to fix this? Should I make a PR for it?
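
For clarity, the bug in the quoted line: k is not int compares each key against the type object int by identity, which is true for every real column label, so integer keys slip through. A runnable sketch with a stand-in helper:

def clean_column_name(name):  # stand-in for pandas' helper, for illustration
    return str(name).replace(" ", "_")

d = {"a b": [1], 0: [2]}

# Wrong: identity comparison with the type object; keeps the key 0.
wrong = {clean_column_name(k): v for k, v in d.items() if k is not int}

# Intended: actually skip integer column labels.
right = {clean_column_name(k): v for k, v in d.items() if not isinstance(k, int)}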


Labels: DataFrame (DataFrame data structure)

Successfully merging this pull request may close these issues: The input column name in query contains special characters

7 participants