ENH: Add decimal and thousand separator params to to_numeric() #56934

AlexHodgson · 2024-01-17T22:14:40Z

closes ENH: Add decimal parameter to to_numeric #4674
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Added the option to specify a different decimal point when using to_numeric(), as described in #4674. Also added an option to specify the thousands separator, similar to functions such as read_csv(). This removes the need for users to change the string manually before calling to_numeric(), and is especially useful for processing numbers in international formats.
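
For context, the manual workaround that this PR aims to make unnecessary looks roughly like the sketch below. It is plain Python rather than pandas code, and the helper name `normalize_number` is hypothetical; it only rewrites each string into the form that `float()` accepts.

```python
from typing import Optional


def normalize_number(text: str, decimal: str = ".", thousands: Optional[str] = None) -> str:
    """Rewrite a formatted number so that float() can parse it.

    Hypothetical helper for illustration; not part of pandas.
    """
    if thousands is not None:
        # Drop the grouping separator first so it cannot be confused
        # with the decimal point after the next replacement.
        text = text.replace(thousands, "")
    if decimal != ".":
        text = text.replace(decimal, ".")
    return text


european = ["1.234,56", "7,5", "-12.000"]
values = [float(normalize_number(s, decimal=",", thousands=".")) for s in european]
print(values)
```

The order of the two replacements matters here: swapping them would turn the decimal comma into a full stop and then strip it along with the grouping separators.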

AlexHodgson · 2024-02-10T22:24:20Z

I added the whatsnew changes to the 3.0.0 file for now. Will there be another minor version before then, or is that the correct file to document the changes?

WillAyd

Added some comments on the implementation but generally I am a little unsure this is even worth doing in pandas. To handle all the edge cases correctly would be a lot of work.

What is the use-case for having this as part of the API? Does Python itself offer something similar with float <> string conversions?

WillAyd · 2024-02-13T13:39:26Z

pandas/_libs/lib.pyx

@@ -2204,6 +2205,8 @@ def maybe_convert_numeric(
    bint convert_empty=True,
    bint coerce_numeric=False,
    bint convert_to_masked_nullable=False,
+    str thousands=None,


Do these need to be str in the first place or would a declaration of char simplify things? I am not sure how Cython works in this case but it seems like it might be able to automatically handle that object -> ctype conversion

Your suggestion was my original implementation, and it ran fine, but I ran into issues with the mypy tests. It wasn't matching the stub function to its implementation here because the parameter types were different: I couldn't use the char dtype in lib.pyi, and I was trying to declare it as bytes. It may work to import the Cython dtypes into the stub file, but I'm not sure whether this is the best option or whether there is another neat method.

WillAyd · 2024-02-13T13:40:33Z

pandas/_libs/lib.pyx

+    if thousands is None:
+        tsep = "\0"
+    else:
+        bytes_tsep = thousands.encode("UTF-8")


Doesn't thousands.encode return a PyObject *? Assigning that to tsep does not seem correct

Cython handles the conversion when assigning bytes_tsep to tsep. I followed the Cython documentation for converting str to bytes here: https://cython.readthedocs.io/en/latest/src/tutorial/strings.html#encoding-text-to-bytes
If there is a way to solve the char dtype issue mentioned above, then this can be simplified though.
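
As a rough illustration of the conversion step described above, the plain-Python sketch below shows what `str.encode` produces. Indexing the `bytes` object stands in for reading `tsep[0]` on the Cython side; that substitution is an assumption made so the example runs without Cython.

```python
thousands = ","

# str.encode returns a Python bytes object, not a C pointer.
bytes_tsep = thousands.encode("UTF-8")
assert isinstance(bytes_tsep, bytes)

# Indexing bytes yields the integer value of that byte, which is
# the Python analogue of reading a single C char.
first_byte = bytes_tsep[0]
print(bytes_tsep, first_byte)
```

In Cython, a `char*` obtained from a `bytes` object is only valid while that object is alive, which is presumably why the diff keeps `bytes_tsep` in a named variable.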

WillAyd · 2024-02-13T13:41:43Z

pandas/_libs/lib.pyx

+    cdef char* dsep
+    # Use null char to represent lack of separator
+    if thousands is None:
+        tsep = "\0"


I think this is confusing to assign the nul byte instead of just assigning to NULL, especially with how Cython handles char * like this (although the comment above about using char declarations in the function signature would probably simplify this)

I did it this way because precise_xstrtod() in the C parser uses the null byte to represent the lack of a separator when processing a string (see line 1633 of tokenizer.c). Previously this was just hardcoded as one of its parameters. Unless we want to change the C implementation, tsep needs to be set to the null char at some point; the assignment could be done in C rather than Cython if you think that's neater.
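
The sentinel convention described above can be sketched in plain Python. This is not the code of precise_xstrtod(); the function `parse_with_sentinel` and its parameters are hypothetical, and it only mirrors the idea that a NUL thousands separator means "no separator".

```python
NO_SEPARATOR = "\0"  # sentinel: a NUL character means "no thousands separator"


def parse_with_sentinel(text: str, decimal: str = ".", tsep: str = NO_SEPARATOR) -> float:
    """Parse a number, skipping tsep only when it is not the sentinel."""
    chars = []
    for ch in text:
        if tsep != NO_SEPARATOR and ch == tsep:
            continue  # grouping character: ignore it
        # Map the configured decimal separator to the one float() expects.
        chars.append("." if ch == decimal else ch)
    return float("".join(chars))


print(parse_with_sentinel("1,234.5", tsep=","))
print(parse_with_sentinel("1234,5", decimal=","))
```

With the default sentinel, a string such as "1,234" is not silently accepted: the comma is passed through and `float()` raises ValueError.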

WillAyd · 2024-02-13T13:42:29Z

pandas/_libs/lib.pyx

+    dsep = bytes_dsep
+
+    # Validate separators
+    if len(tsep) > 1:


Not sure why this would be useful but raising like this prevents a multi-byte character from being used as a separator. That should be tested either way

Using wide characters for these separators does indeed seem pretty odd and an unlikely use case. If we do need to support this, it would require a change of approach, as the C parser uses a single char to represent the thousands separator. Otherwise I can make the error more specific and document that the separator must be single width.
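
To make the single-byte constraint concrete, the sketch below prints the UTF-8 byte length of a few candidate separators. The choice of characters is illustrative only; the no-break spaces are included as examples of non-ASCII characters that would not fit in a single C char.

```python
candidates = {
    "comma": ",",
    "full stop": ".",
    "no-break space (U+00A0)": "\u00a0",
    "narrow no-break space (U+202F)": "\u202f",
}

# A separator fits in one C char only if its UTF-8 encoding is one byte long.
byte_lengths = {name: len(ch.encode("UTF-8")) for name, ch in candidates.items()}

for name, n in byte_lengths.items():
    print(f"{name}: {n} byte(s)")
```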

WillAyd · 2024-02-13T13:44:37Z

pandas/_libs/lib.pyx

@@ -2354,8 +2387,7 @@ def maybe_convert_numeric(
            seen.float_ = True
        else:
            try:
-                floatify(val, &fval, &maybe_int)
-
+                floatify(val, &fval, &maybe_int, dsep[0], tsep[0])


Is this reachable if the length of dsep or tsep is ever zero? i.e. if someone did thousands="", it seems like this could segfault?

Yes you're right, I'll add a validation check to make sure both tsep and dsep will be at least length 1

Changed the check on line 2227 to filter out zero length separators, the validation checks may need changing based on the decision re. multi width characters but the segfault should be fixed.
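
A validation step along the lines discussed in this thread might look like the following plain-Python sketch. The function name `validate_separator` and the error messages are hypothetical and are not taken from the PR.

```python
def validate_separator(name: str, value: str) -> bytes:
    """Return the separator as a single byte, or raise ValueError."""
    encoded = value.encode("UTF-8")
    if len(encoded) == 0:
        # An empty separator would leave nothing to read at index 0.
        raise ValueError(f"{name} separator must not be empty")
    if len(encoded) > 1:
        raise ValueError(f"{name} separator must be a single-byte character")
    return encoded


print(validate_separator("decimal", ","))

for bad in ("", "\u00a0"):
    try:
        validate_separator("thousands", bad)
    except ValueError as exc:
        print(f"rejected {bad!r}: {exc}")
```

Checking the encoded length rather than `len(value)` means a one-character string that encodes to several bytes is rejected as well.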

AlexHodgson · 2024-02-13T22:30:44Z

@WillAyd Thanks for the feedback, I'll have a look into the changes shortly.

The idea behind this is to mimic some of the functionality of functions such as read_csv(), where you can define your own thousands and decimal separators, mainly for ingestion of numerical data in international formats. If you are reading data directly into pandas through one of the read_xxx() functions then you usually have this custom separator functionality, but if for any reason the numeric data is ingested as a string into Python first, and you then try to "ingest" it into pandas and convert it with to_numeric(), you don't have that functionality. The specific issue I had was when crawling data from websites that present numbers in the European format, for example data on websites that is not in a table readable by read_html(); in some cases the data had to be obtained by parsing numbers from a PDF.

Regarding Python handling this, the float() function can only handle different separators if you mess around with locales, as described and not recommended here: https://stackoverflow.com/a/6633912 . This is global and not really usable in production situations, and means you will then be locked into processing data in the format defined by the locale (which is also somewhat system dependent). Having these parameters in to_numeric() would allow for processing different columns in different formats, e.g. one column of US-style numbers and one of European. It also doesn't risk unintended consequences such as locale changes and different behaviour on different PCs.
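
The per-column use case can be approximated today without touching the locale, for example with `str.translate`. The sketch below is plain Python; the column names, the sample data and the helper `make_table` are made up for illustration.

```python
def make_table(decimal: str, thousands: str) -> dict:
    """Build a translation table that drops the thousands separator and
    maps the decimal separator to '.' (hypothetical helper)."""
    return str.maketrans({thousands: None, decimal: "."})


# Each column declares its own format; no global locale is changed.
formats = {
    "price_us": make_table(decimal=".", thousands=","),
    "price_eu": make_table(decimal=",", thousands="."),
}
columns = {
    "price_us": ["1,234.56", "0.99"],
    "price_eu": ["1.234,56", "0,99"],
}

parsed = {
    name: [float(text.translate(formats[name])) for text in values]
    for name, values in columns.items()
}
print(parsed)
```

`str.translate` applies both substitutions in a single pass, so the result does not depend on the order in which the separators are handled.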

mroeschke · 2024-04-23T18:49:07Z

Thanks for the PR here @AlexHodgson, but I am also hesitant about whether there's enough core dev team support for this feature to go through. I think it would be good to see if there is more support from the core team in the original issue before moving forward with the PR, so closing for now. But if there's that interest we can reopen this PR.

AlexHodgson and others added 24 commits October 9, 2023 18:12

Pass thousand and decimal params down to c parser

fd7c1ff

fix build errors

4bceef2

documentation

ea36c28

debugging

64096ef

clean up print contents

ae82729

string stuff

6fb82cb

use char* to handle string passing

04c25b5

Remove debug print

3c74258

Correct default thousand separator

6ac1be8

Remove old constant decimal separator comment

fd822c6

test cases

26a1295

parameter type hints

b8a5248

Better separator validation errors

77f0174

Remove unneeded check

32e2c1b

use int() on val processed by floatify

b381201

Merge branch 'main' into feat/to-numeric-decimal-seperators

90ec1a1

whitespace and formatting fix

282ba6b

Fix missing var declaration from merge

7bcaa6e

Change var from dec to decimal

df7f4f1

Merge pull request #1 from AlexHodgson/main

c810498

update from main

Merge branch 'main' into feat/to-numeric-decimal-seperators

3096fc9

Merge branch 'main' into feat/to-numeric-decimal-seperators

38189a5

match function parameters in header file

d5f5fed

formatting changes

67758c6

AlexHodgson changed the title ~~Add decimal and thousand separator to to_numeric~~ Add decimal and thousand separator params to to_numeric Jan 18, 2024

AlexHodgson changed the title ~~Add decimal and thousand separator params to to_numeric~~ Add decimal and thousand separator params to to_numeric() Jan 18, 2024

AlexHodgson added 4 commits January 18, 2024 18:03

Change param type in stub file to bytearray

f94f560

Docstring fixes

06e406d

Change separator param dtype to str

c837279

Change dtype again, to bytes and char*

cde703a

AlexHodgson and others added 3 commits February 6, 2024 11:36

Merge branch 'main' into feat/to-numeric-decimal-seperators

2d78e0f

Add more test cases

a24a0d2

Merge branch 'main' into feat/to-numeric-decimal-seperators

9c39b4d

simonjayhawkins added the Enhancement and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Feb 7, 2024

AlexHodgson and others added 7 commits February 9, 2024 12:37

Merge branch 'main' into feat/to-numeric-decimal-seperators

cb1ef67

Quotation marks

87093fd

Merge branch 'main' into feat/to-numeric-decimal-seperators

0243ccb

Move documentation to whatsnew 3.0.0

0bfeb50

Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…

72c93cc

…odgson/pandas-alex-hodgson into feat/to-numeric-decimal-seperators

fix messy merge on whatsnew file

f33fbf5

try again

e696a92

Merge branch 'main' into feat/to-numeric-decimal-seperators

83d9004

WillAyd requested changes Feb 13, 2024

AlexHodgson and others added 11 commits February 16, 2024 13:46

Ensure 0 length separators are not passed in

493e86b

Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…

a8e4845

…odgson/pandas-alex-hodgson into feat/to-numeric-decimal-seperators

Merge branch 'main' into feat/to-numeric-decimal-seperators

837a990

Remove debug print

a2c0ad2

Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…

17f5e7e

…odgson/pandas-alex-hodgson into feat/to-numeric-decimal-seperators

Merge branch 'main' into feat/to-numeric-decimal-seperators

565f148

Merge branch 'main' into feat/to-numeric-decimal-seperators

52605cb

Merge branch 'main' into feat/to-numeric-decimal-seperators

dddd04e

Merge branch 'main' into feat/to-numeric-decimal-seperators

8f987ad

Merge branch 'main' into feat/to-numeric-decimal-seperators

51f8b06

Merge branch 'main' into feat/to-numeric-decimal-seperators

35d8757

mroeschke closed this Apr 23, 2024

Liam3851 mentioned this pull request Feb 24, 2025

ENH: Add to_numeric_br() function to convert Brazilian-formatted numbers #60998

Open

3 tasks
