Add function to clean up column names with special characters #28215
Conversation
I don't know how I can solve this issue.
https://docs.python.org/3/library/tokenize.html#module-tokenize does not refer to EXACT_TOKEN_TYPES, and typeshed does not have a definition for EXACT_TOKEN_TYPES (https://github.com/python/typeshed/blob/master/stdlib/3/tokenize.pyi). My guess is it's not part of the public API for tokenize.
@simonjayhawkins I see. Shall I write out this dictionary instead of importing it?
Just reading the docs: "All constants from the token module are also exported from tokenize." (https://docs.python.org/3/library/tokenize.html#tokenizing-input) So it would perhaps appear that the omission of EXACT_TOKEN_TYPES is an oversight in typeshed, so probably best to add it.
Forget that. I misread the docs; from the source...
Can you use tokenize.tokenize instead? "The returned named tuple has an additional property named exact_type that contains the exact operator type for OP tokens. For all other token types exact_type equals the named tuple type field."
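For reference, a minimal sketch (not the PR's code) of what that exact_type property exposes; the source string here is just an illustration:

```python
import io
import tokenize

source = "a + b * (c - 1)"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type == tokenize.OP:
        # exact_type distinguishes the precise operator (PLUS, STAR, LPAR, ...)
        # without touching the EXACT_TOKEN_TYPES mapping directly
        print(tok.string, tokenize.tok_name[tok.exact_type])
```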
@simonjayhawkins So I guess I either ignore it or write them out manually.
Changed code so that the PR now also closes #28576.
EDIT: I just moved all of it to expr.py. Made more sense. Also, I moved some more functions from pandas/core/computation/expr.py to pandas/core/computation/common.py, but the latter is no longer a good place for that. It is probably better to move them all to expr.py again, but maybe it is also a good idea to put them in a new file backtickquoting.py? (About 100 lines)
Implemented code that raises an exception if there is ambiguity. (Ambiguity should be very rare.) Also added an exception in case there somehow exists a way the conversion fails to create a valid Python identifier. (Currently I cannot think of one.) Also moved some functions around. Probably needs a new code review. @WillAyd do you want to take a look again?
OK thanks. I think this is moving in the right direction; just hoping we can improve readability.
@WillAyd What do you think of it now? (I am not entirely sure why Travis fails; it says something about TestS3, so maybe a temporary problem with AWS?)
@WillAyd What do you think? Is it ready, and should I maybe do a squash or something?
Sorry, I know this is probably taking longer than you hoped, but there is a lot of ambiguity to query as is, so I just really want to keep this as simple and succinct as possible. Some new comments to that effect.
@WillAyd No need to apologize. I know pandas is a very popular Python package, so it is good to take the time for this.
Thanks this is looking a lot simpler. cc @jreback for thoughts as well
Well, I had some doubts about that filter/map construct; usually comprehensions are favored over those. I have changed the doc, but I will wait with the push, since I don't want to put too much load on your automated pipelines and I expect to still have some more improvements and will probably also squash the commits, etc.
I'll have to review, but I am not sure I am on board with the rationale behind this. .query doesn't need to grow a completely different DSL than Python; this makes maintenance very burdensome. You can always use quoted strings without query to do this, so I am leaning towards not expanding this scope.
This PR does not only add the functionality of using special characters in backtick quoted identifiers. It improves on a wide range of issues with my latest PR (#24955). To summarize, this PR consists of the following improvements:
I think the PR improves a lot of things, and does not only add more coverage for different cases of backtick quoted strings. Also, keep in mind that we are not expanding the DSL of the .query function. We are only adding more coverage for different cases in naming, i.e. special characters in addition to spaces.
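For illustration, roughly the kind of usage this covers (the column names here are made up, and the special-character case needs a pandas version that includes this PR):

```python
import pandas as pd

df = pd.DataFrame({"B B": [5, 6], "C&C": [7, 4]})

# Spaces in backtick-quoted names already worked (#24955); this PR extends
# backtick quoting to other characters that are invalid in Python identifiers.
print(df.query("`B B` > `C&C`"))
```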
Yet another improvement. I was able to fix the problem with the multiple spaces. I just made use of the indices from the token_generator to extract the exact indices of the backtick quoted string, so you completely circumvent any undesired manipulation done to the tokens. This also simplifies the PR, because it removes the need to document that edge case and to catch an error when it occurs (that was a significant portion of the code).
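A rough standalone sketch of that idea, assuming a single-line expression (the helper name and details are illustrative, not the PR's exact code):

```python
import io
import tokenize


def backtick_quoted_substrings(source: str):
    """Yield the raw text between pairs of backticks in a one-line source,
    sliced out of the original string so repeated spaces are preserved."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for _, tokval, start, _, _ in tokens:
        if tokval == "`":
            string_start = start[1] + 1  # first char after the opening backtick
            for _, tokval_inner, _, end, _ in tokens:
                if tokval_inner == "`":
                    yield source[string_start:end[1] - 1]
                    break


print(list(backtick_quoted_substrings("`column  with  spaces` > 3")))
# ['column  with  spaces']
```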
@jreback Can you do the code review? The currently failing test is just some linting in an unrelated file.
So I think your code cleanup is +1. I don't really think supporting special characters aside from space (e.g. …). I would like to see a follow-on that handles a SyntaxError in the numexpr engine and then just falls back to the python engine.
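Something along these lines, as a hedged sketch of that follow-on (not part of this PR; the helper name is made up):

```python
def query_with_fallback(df, expr, **kwargs):
    """Hypothetical helper: try the default (numexpr) engine first and fall
    back to the slower but more permissive python engine if it rejects the
    expression."""
    try:
        return df.query(expr, **kwargs)
    except (SyntaxError, ValueError):
        kwargs["engine"] = "python"
        return df.query(expr, **kwargs)
```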
@jbrockmendel @jorisvandenbossche @datapythonista comments.
Updating approval per previous comments but lgtm
@hwalinga can you merge master |
Force-pushed from da4d585 to c521be8
@jreback I have merged master and squashed the commits. There are some build problems on 32-bit Linux and some problems with building the docs. Seems unrelated.
Force-pushed from 3dd7599 to 0f3331f
@jreback All is green. I think you can merge this now.
Hold it. Something seems to have gone wrong in squashing the commits.
Force-pushed from 0f3331f to c25fc84
@jreback Everything correct now, ready for merge.
can you merge master and ping on green. |
Force-pushed from c25fc84 to 2552ab1
@jreback Yes, all green, squashed, ready for the merge.
some comments, pls rebase, ping on green and we can get this in.
pandas/core/computation/parsing.py (outdated):

    return toknum, tokval


def clean_column_names(name: str) -> str:
can you call this clean_column_name (not the plural) to reflect its action
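For intuition, a purely illustrative sketch of what a column-name cleaning helper could do; the actual helper in pandas/core/computation/parsing.py works differently in detail (the PR discussion above mentions raising on ambiguity rather than silently colliding):

```python
import keyword


def clean_column_name(name: str) -> str:
    """Illustrative only: replace characters that cannot appear in a Python
    identifier with underscores, and prefix names that still are not valid
    (e.g. names that start with a digit or collide with a keyword)."""
    cleaned = "".join(ch if ("_" + ch).isidentifier() else "_" for ch in name)
    if not cleaned.isidentifier() or keyword.iskeyword(cleaned):
        cleaned = "_" + cleaned
    return cleaned


print(clean_column_name("B B"), clean_column_name("C&C"), clean_column_name("1st"))
# B_B C_C _1st
```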
def tokenize_backtick_quoted_string(
    token_generator: Iterator[tokenize.TokenInfo], source: str, string_start: int
) -> Tok:
    """
can you add Parameters / Returns here
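Something like this numpydoc-style skeleton would address that; the parameter descriptions here are my reading of the code, not necessarily the wording that was merged:

```python
def tokenize_backtick_quoted_string(token_generator, source, string_start):
    """
    Create a token from a backtick quoted string.

    Parameters
    ----------
    token_generator : Iterator[tokenize.TokenInfo]
        Token stream positioned just after the opening backtick.
    source : str
        The Python source code string being tokenized.
    string_start : int
        Offset of the first character after the opening backtick.

    Returns
    -------
    tok : Tuple[int, str]
        The token number and the cleaned token value.
    """
```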
pandas/core/computation/parsing.py (outdated):

def tokenize_backtick_quoted_string(
    token_generator: Iterator[tokenize.TokenInfo], source: str, string_start: int
) -> Tok:
this return annotation is not correct; it's a Tuple[...] (be specific inside the tuple)
I created Tok as an alias for Tuple[int, str], but as this did not seem clear, I made them all explicit.
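In other words, a small illustration of the two spellings (the alias is what got dropped):

```python
from typing import Iterator, Tuple
import tokenize

Tok = Tuple[int, str]  # the alias that was removed in favour of being explicit


def tokenize_backtick_quoted_string(
    token_generator: Iterator[tokenize.TokenInfo], source: str, string_start: int
) -> Tuple[int, str]:  # spelled out, instead of "-> Tok"
    ...
```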
    ----------
    source : str
        A Python source code string
    """
can you add Returns
@@ -453,22 +453,29 @@ def _get_axis_resolvers(self, axis):
             d[axis] = dindex
         return d

-    def _get_index_resolvers(self):
-        d = {}
+    def _get_index_resolvers(self) -> Dict[str, ABCSeries]:
As a followup I'd like to clean up all of these functions; I think we can move them entirely to pandas/computation/ (including the axis resolvers, which is just plain wrong now the way it's implemented). Can you create an issue for this?
and replace ABCSeries in type annotations. ABCSeries resolves to Any.
Replace with what? Also, let's continue this in #30683
I think return type is Dict[str, Union["Series", MultiIndex]]
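For example, something along these lines would keep that annotation precise without a runtime import of Series; this is only a sketch, not necessarily what the follow-up in #30683 ends up doing:

```python
from typing import TYPE_CHECKING, Dict, Union

if TYPE_CHECKING:
    # imported only for the type checker, to avoid circular imports at runtime
    from pandas import MultiIndex, Series


class NDFrame:
    def _get_index_resolvers(self) -> "Dict[str, Union[Series, MultiIndex]]":
        ...
```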
Also, let's continue this in #30683
I don't think the typing discussion is relevant there.
Okay, I will see if that works.
The issue is meant for cleaning up these functions, so why not fix the type hints as well.
…function. Cleanup of the code that is used for using spaces in the query function, and extension of the ability to also use special characters that are not allowed in Python identifiers. All code related to this functionality is now in the pandas/core/computation/parsing.py file.
Force-pushed from 2552ab1 to 3fc1bdd
Improved docstrings and rebased.
thanks @hwalinga i know this has taken a while!
def _get_space_character_free_column_resolvers(self):
    """Return the space character free column resolvers of a dataframe.

    return {clean_column_name(k): v for k, v in d.items() if k is not int}
should this be if not isinstance(k, int)? this is causing mypy failures in #30694.
yes that looks like it needs changing
Oeh, that's really embarrassing. Yes, it should be if not isinstance(k, int).
What would be the best way to fix this? Should I make a PR for this?
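For the record, the difference the two checks make (this is why mypy complains: "k is not int" compares the label against the type object itself, so it is True for almost every label, including integer ones):

```python
for k in ["price", 0, 1]:
    # identity check against the type object vs. a real isinstance check
    print(repr(k), k is not int, not isinstance(k, int))
# 'price' True True
# 0 True False
# 1 True False
```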
Changed the backtick quoting functions so that you can also use backtick quoting for invalid Python identifiers, like ones that start with a digit or are separated by operators instead of spaces.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
As this builds upon #24955, I think @jreback would again be the right person for the code review.