-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
ENH: Add decimal and thousand separator params to to_numeric() #56934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add decimal and thousand separator params to to_numeric() #56934
Conversation
update from main
…odgson/pandas-alex-hodgson into feat/to-numeric-decimal-seperators
Added the whatsnew changes to the 3.0.0 file for now, will there be another minor version before then or is that the correct file to document the changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments on the implementation but generally I am a little unsure this is even worth doing in pandas. To handle all the edge cases correctly would be a lot of work.
What is the use-case for having this as part of the API? Does Python itself offer something similar with float <> string conversions?
@@ -2204,6 +2205,8 @@ def maybe_convert_numeric( | |||
bint convert_empty=True, | |||
bint coerce_numeric=False, | |||
bint convert_to_masked_nullable=False, | |||
str thousands=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these need to be str
in the first place or would a declaration of char
simplify things? I am not sure how Cython works in this case but it seems like it might be able to automatically handle that object -> ctype conversion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your suggestion was my original implementation, and it ran fine, but I ran into issues with the mypy tests. It wasn't matching the stub function to its implentation here as the parameter types were different: I couldn't use char
dtype in lib.pyi
, and I was trying to declare it as bytes
. It may work to import the cython dtypes into the stub file but I'm not sure if this is the best option or there is another neat method.
if thousands is None: | ||
tsep = "\0" | ||
else: | ||
bytes_tsep = thousands.encode("UTF-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't thousands.encode
return a PyObject *
? Assigning that to tsep
does not seem correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cython handles the conversion when assigning bytes_tsep
to tsep
. I followed the cython documentation for converting str to bytes here: https://cython.readthedocs.io/en/latest/src/tutorial/strings.html#encoding-text-to-bytes
If there is a way solve the char
dtype issue as mentioned above then this can be simplified though.
cdef char* dsep | ||
# Use null char to represent lack of separator | ||
if thousands is None: | ||
tsep = "\0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is confusing to assign the nul byte instead of just assigning to NULL, especially with how Cython handles char *
like this (although the comment above about using char declarations in the function signature would probably simplify this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way as precise_xstrtod()
in the C parser uses the null byte to represent the lack of a separator when processing a string (see line 1633 of tokenizer.c), previously this was just hardcoded as one of its parameters, but unless we want to change the C implementation then tsep needs to be set to the null char at some point, the assignment could be done in C rather than cython if you think that's neater.
pandas/_libs/lib.pyx
Outdated
dsep = bytes_dsep | ||
|
||
# Validate separators | ||
if len(tsep) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this would be useful but raising like this prevents a multi-byte character from being used as a separator. That should be tested either way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using wide characters for these separators does indeed seem pretty odd and an unlikely use case, if we do need to support this then it would require a change of approach as the C parser uses a single char
to represent the thousasnds separator. Otherwise I can make the error more specific and document that it must be single width.
@@ -2354,8 +2387,7 @@ def maybe_convert_numeric( | |||
seen.float_ = True | |||
else: | |||
try: | |||
floatify(val, &fval, &maybe_int) | |||
|
|||
floatify(val, &fval, &maybe_int, dsep[0], tsep[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this reachable if the length of dsep or tsep is every zero? i.e. if someone did thousands=""
seems like this could segfault?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're right, I'll add a validation check to make sure both tsep
and dsep
will be at least length 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the check on line 2227 to filter out zero length separators, the validation checks may need changing based on the decision re. multi width characters but the segfault should be fixed.
@WillAyd Thanks for the feedback, I'll have a look into the changes shortly. The idea behind this is to mimic some of the functionality of functions such as Regarding python handling this, the |
…odgson/pandas-alex-hodgson into feat/to-numeric-decimal-seperators
…odgson/pandas-alex-hodgson into feat/to-numeric-decimal-seperators
Thanks for the PR here @AlexHodgson but I am also hesitant if there's enough core dev team support for this feature to go through. I think it would be good to see if there more support in the original issue from the core team in the issue before moving forward with PR so closing for now. But if there's that interest we can reopen this PR |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Added the option to specify a different decimal point when using to_numeric() as described in #4674. Also added an option to specify the thousasnds separator, similar to functions such as read_csv(). This removes the need for users to change the string manually before calling to_numeric(), and is especially useful for processing numbers in international formats.