Skip to content

ENH: Add decimal and thousand separator params to to_numeric() #56934

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

AlexHodgson
Copy link

@AlexHodgson AlexHodgson commented Jan 17, 2024

Added the option to specify a different decimal point when using to_numeric() as described in #4674. Also added an option to specify the thousasnds separator, similar to functions such as read_csv(). This removes the need for users to change the string manually before calling to_numeric(), and is especially useful for processing numbers in international formats.

@AlexHodgson AlexHodgson changed the title Add decimal and thousand separator to to_numeric Add decimal and thousand separator params to to_numeric Jan 18, 2024
@AlexHodgson AlexHodgson changed the title Add decimal and thousand separator params to to_numeric Add decimal and thousand separator params to to_numeric() Jan 18, 2024
@simonjayhawkins simonjayhawkins added Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations labels Feb 7, 2024
@AlexHodgson
Copy link
Author

Added the whatsnew changes to the 3.0.0 file for now, will there be another minor version before then or is that the correct file to document the changes?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments on the implementation but generally I am a little unsure this is even worth doing in pandas. To handle all the edge cases correctly would be a lot of work.

What is the use-case for having this as part of the API? Does Python itself offer something similar with float <> string conversions?

@@ -2204,6 +2205,8 @@ def maybe_convert_numeric(
bint convert_empty=True,
bint coerce_numeric=False,
bint convert_to_masked_nullable=False,
str thousands=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need to be str in the first place or would a declaration of char simplify things? I am not sure how Cython works in this case but it seems like it might be able to automatically handle that object -> ctype conversion

Copy link
Author

@AlexHodgson AlexHodgson Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your suggestion was my original implementation, and it ran fine, but I ran into issues with the mypy tests. It wasn't matching the stub function to its implentation here as the parameter types were different: I couldn't use char dtype in lib.pyi, and I was trying to declare it as bytes. It may work to import the cython dtypes into the stub file but I'm not sure if this is the best option or there is another neat method.

if thousands is None:
tsep = "\0"
else:
bytes_tsep = thousands.encode("UTF-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't thousands.encode return a PyObject *? Assigning that to tsep does not seem correct

Copy link
Author

@AlexHodgson AlexHodgson Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cython handles the conversion when assigning bytes_tsep to tsep. I followed the cython documentation for converting str to bytes here: https://cython.readthedocs.io/en/latest/src/tutorial/strings.html#encoding-text-to-bytes
If there is a way solve the char dtype issue as mentioned above then this can be simplified though.

cdef char* dsep
# Use null char to represent lack of separator
if thousands is None:
tsep = "\0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is confusing to assign the nul byte instead of just assigning to NULL, especially with how Cython handles char * like this (although the comment above about using char declarations in the function signature would probably simplify this)

Copy link
Author

@AlexHodgson AlexHodgson Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way as precise_xstrtod() in the C parser uses the null byte to represent the lack of a separator when processing a string (see line 1633 of tokenizer.c), previously this was just hardcoded as one of its parameters, but unless we want to change the C implementation then tsep needs to be set to the null char at some point, the assignment could be done in C rather than cython if you think that's neater.

dsep = bytes_dsep

# Validate separators
if len(tsep) > 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this would be useful but raising like this prevents a multi-byte character from being used as a separator. That should be tested either way

Copy link
Author

@AlexHodgson AlexHodgson Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using wide characters for these separators does indeed seem pretty odd and an unlikely use case, if we do need to support this then it would require a change of approach as the C parser uses a single char to represent the thousasnds separator. Otherwise I can make the error more specific and document that it must be single width.

@@ -2354,8 +2387,7 @@ def maybe_convert_numeric(
seen.float_ = True
else:
try:
floatify(val, &fval, &maybe_int)

floatify(val, &fval, &maybe_int, dsep[0], tsep[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this reachable if the length of dsep or tsep is every zero? i.e. if someone did thousands="" seems like this could segfault?

Copy link
Author

@AlexHodgson AlexHodgson Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right, I'll add a validation check to make sure both tsep and dsep will be at least length 1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the check on line 2227 to filter out zero length separators, the validation checks may need changing based on the decision re. multi width characters but the segfault should be fixed.

@AlexHodgson
Copy link
Author

AlexHodgson commented Feb 13, 2024

@WillAyd Thanks for the feedback, I'll have a look into the changes shortly.

The idea behind this is to mimic some of the functionality of functions such as read_csv() where you can define your own thousands and decimal seperators, mainly for ingestion of numerical data in international formats. If you are reading data directly into pandas through one of the read_xxx() functions then you usually have this custom separator functionality, but if for any reason the numeric data is ingested as a string into python first, and then try to "ingest" into pandas and convert with to_numeric() you don't have that functionality. The specific issue I had with this was when crawling data from websites that present numbers in the european format. For example data on websites that is not in a table readable by read_html(), and in some cases the data had to be obtained by parsing numbers from a pdf.

Regarding python handling this, the float() function can only handle different seperators if you mess around with locales, as described and not recommended here: https://stackoverflow.com/a/6633912 . This is global and not really usable in production situations, and means you will then be locked into processing data in the format as defined by the locale (Also somewhat system dependent). Having these parameters in to_numeric() would allow for processing different columns in different formats, eg. one column of US style numbers and one of European. And doesn't risk unintended consequences like locale changes and different behaviour on different PCs.

@mroeschke
Copy link
Member

Thanks for the PR here @AlexHodgson but I am also hesitant if there's enough core dev team support for this feature to go through. I think it would be good to see if there more support in the original issue from the core team in the issue before moving forward with PR so closing for now. But if there's that interest we can reopen this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add decimal parameter to to_numeric
4 participants