ENH: Add decimal and thousand separator params to to_numeric() #56934

Closed
Changes from 71 commits
Commits
82 commits
fd7c1ff
Pass thousand and decimal params down to c parser
AlexHodgson Oct 9, 2023
4bceef2
fix build errors
AlexHodgson Oct 10, 2023
ea36c28
documentation
AlexHodgson Oct 11, 2023
64096ef
debugging
AlexHodgson Oct 11, 2023
ae82729
clean up print contents
AlexHodgson Oct 27, 2023
6fb82cb
string stuff
AlexHodgson Oct 28, 2023
04c25b5
use char* to handle string passing
AlexHodgson Oct 30, 2023
3c74258
Remove debug print
AlexHodgson Oct 30, 2023
6ac1be8
Correct default thousand separator
AlexHodgson Oct 30, 2023
fd822c6
Remove old constant decimal separator comment
AlexHodgson Nov 3, 2023
26a1295
test cases
AlexHodgson Nov 3, 2023
b8a5248
parameter type hints
AlexHodgson Nov 3, 2023
77f0174
Better separator validation errors
AlexHodgson Nov 3, 2023
32e2c1b
Remove unneeded check
AlexHodgson Dec 14, 2023
b381201
use int() on val processed by floatify
AlexHodgson Jan 2, 2024
90ec1a1
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 17, 2024
282ba6b
whitespace and formatting fix
AlexHodgson Jan 17, 2024
7bcaa6e
Fix missing var declaration from merge
AlexHodgson Jan 17, 2024
df7f4f1
Change var from dec to decimal
AlexHodgson Jan 17, 2024
c810498
Merge pull request #1 from AlexHodgson/main
AlexHodgson Jan 17, 2024
3096fc9
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 17, 2024
38189a5
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 18, 2024
d5f5fed
match function parameters in header file
AlexHodgson Jan 18, 2024
67758c6
formatting changes
AlexHodgson Jan 18, 2024
f94f560
Change param type in stub file to bytearray
AlexHodgson Jan 18, 2024
06e406d
Docstring fixes
AlexHodgson Jan 18, 2024
c837279
Change separator param dtype to str
AlexHodgson Jan 18, 2024
cde703a
Change dtype again, to bytes and char*
AlexHodgson Jan 18, 2024
d69b588
Merge branch 'pandas-dev:main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 18, 2024
53ffb96
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 18, 2024
1d3b9e7
Add quotes to default value
AlexHodgson Jan 19, 2024
5d625bb
Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…
AlexHodgson Jan 19, 2024
acce835
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 19, 2024
a7e3cf8
Some changes with passing chars to c
AlexHodgson Jan 24, 2024
2977c63
Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…
AlexHodgson Jan 24, 2024
1593e42
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 24, 2024
093a1d6
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 27, 2024
d0fa3db
Use string replace in case of int with thousand separator
AlexHodgson Jan 27, 2024
347ef91
Update docstrings
AlexHodgson Jan 27, 2024
1d8e146
Remove debug print statements
AlexHodgson Jan 28, 2024
574fb10
Don't call encode() on none type
AlexHodgson Jan 28, 2024
aae7e0a
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 29, 2024
2327c18
Split conversion function call in 2 based on thousand param
AlexHodgson Jan 29, 2024
a7034c9
Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…
AlexHodgson Jan 29, 2024
c8b3ac0
Specify possible null type as parameter for maybe_convert_numeric
AlexHodgson Jan 29, 2024
f8a6407
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 29, 2024
f6c735a
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Jan 29, 2024
4ecfb6f
Merge pull request #2 from AlexHodgson/main
AlexHodgson Jan 31, 2024
8474575
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 1, 2024
4bb2a94
Decimal stub file type bytes
AlexHodgson Feb 1, 2024
91195b8
Use string dtype to pass paramaters to cython
AlexHodgson Feb 4, 2024
0c33992
debug prints
AlexHodgson Feb 4, 2024
f79b91a
Cleanup functions now they use str dtype
AlexHodgson Feb 4, 2024
350e5d1
Merge pull request #3 from AlexHodgson/feat/use-str-dtype-cython
AlexHodgson Feb 4, 2024
ffd50de
Add entry to whatsnew
AlexHodgson Feb 4, 2024
a2cc191
Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…
AlexHodgson Feb 4, 2024
d7890f8
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 4, 2024
15825a7
whatsnew formatting fixes
AlexHodgson Feb 5, 2024
c5ac8cb
Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…
AlexHodgson Feb 5, 2024
d257d97
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 5, 2024
2d78e0f
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 6, 2024
a24a0d2
Add more test cases
AlexHodgson Feb 7, 2024
9c39b4d
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 7, 2024
cb1ef67
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 9, 2024
87093fd
Quotation marks
AlexHodgson Feb 9, 2024
0243ccb
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 10, 2024
0bfeb50
Move documentation to whatsnew 3.0.0
AlexHodgson Feb 10, 2024
72c93cc
Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…
AlexHodgson Feb 10, 2024
f33fbf5
fix messy merge on whatsnew file
AlexHodgson Feb 10, 2024
e696a92
try again
AlexHodgson Feb 10, 2024
83d9004
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 13, 2024
493e86b
Ensure 0 length separators are not passed in
AlexHodgson Feb 16, 2024
a8e4845
Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…
AlexHodgson Feb 16, 2024
837a990
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 16, 2024
a2c0ad2
Remove debug print
AlexHodgson Feb 16, 2024
17f5e7e
Merge branch 'feat/to-numeric-decimal-seperators' of github.com:AlexH…
AlexHodgson Feb 16, 2024
565f148
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 22, 2024
52605cb
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Feb 29, 2024
dddd04e
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Mar 6, 2024
8f987ad
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Mar 13, 2024
51f8b06
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Mar 18, 2024
35d8757
Merge branch 'main' into feat/to-numeric-decimal-seperators
AlexHodgson Mar 30, 2024
19 changes: 18 additions & 1 deletion doc/source/whatsnew/v3.0.0.rst
@@ -14,7 +14,24 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

.. _whatsnew_300.enhancements.enhancement1:
.. _whatsnew_300.enhancements.to_numeric_separators:

Custom Thousand and Decimal Separators added to :func:`to_numeric`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Added two new parameters, ``decimal`` and ``thousands``, to :func:`to_numeric`, allowing users to specify custom decimal points and thousand separators (:issue:`4674`).
``decimal`` defaults to ``'.'`` and ``thousands`` defaults to ``None``, meaning the input is expected to use ``.`` as the decimal point and no symbol to demarcate groups of thousands. This default behaviour matches :func:`to_numeric` before the parameters were added.
:func:`to_numeric` can now parse number strings such as ``'1,000,000'`` or ``'1.000,5'`` when the user provides the matching parameters.

An example of the new functionality:

.. code-block:: python

    >>> s = pd.Series(['1,5', '2.000.000', -3])
    >>> pd.to_numeric(s, thousands='.', decimal=',')
    0          1.5
    1    2000000.0
    2         -3.0
    dtype: float64
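
For readers who want the intended semantics without the C parser, the following is a minimal pure-Python sketch. `parse_number` is a hypothetical helper, not part of pandas or of this PR, and it is laxer than the real parser: it does not check that the thousands separator appears in valid grouping positions.

```python
from typing import Optional


def parse_number(text: str, thousands: Optional[str] = None, decimal: str = ".") -> float:
    """Hypothetical reference for the separator semantics (not pandas code)."""
    if thousands is not None:
        # Drop grouping characters first so they cannot be mistaken for the decimal point.
        text = text.replace(thousands, "")
    if decimal != ".":
        # float() only understands "." as the decimal point.
        text = text.replace(decimal, ".")
    return float(text)


values = [parse_number(v, thousands=".", decimal=",") for v in ["1,5", "2.000.000", "-3"]]
print(values)  # [1.5, 2000000.0, -3.0]
```

The order of the two replacements matters: stripping the thousands separator before rewriting the decimal separator avoids turning ``2.000.000`` into a string with several decimal points.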

enhancement1
^^^^^^^^^^^^
8 changes: 4 additions & 4 deletions pandas/_libs/include/pandas/parser/pd_parser.h
@@ -17,8 +17,8 @@ extern "C" {
#include <Python.h>

typedef struct {
int (*to_double)(char *, double *, char, char, int *);
int (*floatify)(PyObject *, double *, int *);
int (*to_double)(char *, double *, char, char, char, int *);
int (*floatify)(PyObject *, double *, int *, char, char);
void *(*new_rd_source)(PyObject *);
void (*del_rd_source)(void *);
char *(*buffer_rd_bytes)(void *, size_t, size_t *, int *, const char *);
@@ -58,8 +58,8 @@ static PandasParser_CAPI *PandasParserAPI = NULL;

#define to_double(item, p_value, sci, decimal, maybe_int) \
PandasParserAPI->to_double((item), (p_value), (sci), (decimal), (maybe_int))
#define floatify(str, result, maybe_int) \
PandasParserAPI->floatify((str), (result), (maybe_int))
#define floatify(str, result, maybe_int, dec, tsep) \
PandasParserAPI->floatify((str), (result), (maybe_int), (dec), (tsep))
#define new_rd_source(obj) PandasParserAPI->new_rd_source((obj))
#define del_rd_source(src) PandasParserAPI->del_rd_source((src))
#define buffer_rd_bytes(source, nbytes, bytes_read, status, encoding_errors) \
4 changes: 4 additions & 0 deletions pandas/_libs/lib.pyi
@@ -124,6 +124,8 @@ def maybe_convert_numeric(
convert_empty: bool = ...,
coerce_numeric: bool = ...,
convert_to_masked_nullable: Literal[False] = ...,
thousands: str | None = ...,
decimal: str = ...,
) -> tuple[np.ndarray, None]: ...
@overload
def maybe_convert_numeric(
@@ -133,6 +135,8 @@ def maybe_convert_numeric(
coerce_numeric: bool = ...,
*,
convert_to_masked_nullable: Literal[True],
thousands: str | None = ...,
decimal: str = ...,
) -> tuple[np.ndarray, np.ndarray]: ...

# TODO: restrict `arr`?
43 changes: 39 additions & 4 deletions pandas/_libs/lib.pyx
@@ -69,7 +69,8 @@ from pandas._libs.interval import Interval


cdef extern from "pandas/parser/pd_parser.h":
int floatify(object, float64_t *result, int *maybe_int) except -1
int floatify(object, float64_t *result, int *maybe_int,
char dec, char tsep) except -1
void PandasParser_IMPORT()

PandasParser_IMPORT
@@ -2204,6 +2205,8 @@ def maybe_convert_numeric(
bint convert_empty=True,
bint coerce_numeric=False,
bint convert_to_masked_nullable=False,
str thousands=None,
Member

Do these need to be str in the first place or would a declaration of char simplify things? I am not sure how Cython works in this case but it seems like it might be able to automatically handle that object -> ctype conversion

Author

@AlexHodgson AlexHodgson Feb 13, 2024

Your suggestion was my original implementation, and it ran fine, but I ran into issues with the mypy tests. Mypy couldn't match the stub function to its implementation here because the parameter types differed: I couldn't use a char dtype in lib.pyi, so I tried declaring it as bytes. Importing the Cython dtypes into the stub file might work, but I'm not sure whether that is the best option or whether there is a neater method.

str decimal="."
) -> tuple[np.ndarray, np.ndarray | None]:
"""
Convert object array to a numeric array if possible.
@@ -2231,6 +2234,14 @@ def maybe_convert_numeric(
convert_to_masked_nullable : bool, default False
Whether to return a mask for the converted values. This also disables
upcasting for ints with nulls to float64.
thousands : str, default None
Character used to separate groups of thousands for readability,
e.g. ',' in 1,000,000
Must only be 1 character long.
decimal : str, default '.'
Character used to separate decimal section from the integer
section of the number, e.g., '.' in 12.45
Must only be 1 character long.
Returns
-------
np.ndarray
@@ -2247,6 +2258,28 @@ def maybe_convert_numeric(
cdef:
object val = values[0]

# Convert python strings into ones readable by C

cdef char* tsep
cdef char* dsep
# Use null char to represent lack of separator
if thousands is None:
tsep = "\0"
Member

I think this is confusing to assign the nul byte instead of just assigning to NULL, especially with how Cython handles char * like this (although the comment above about using char declarations in the function signature would probably simplify this)

Author

@AlexHodgson AlexHodgson Feb 13, 2024

I did it this way because precise_xstrtod() in the C parser uses the null byte to represent the lack of a separator when processing a string (see line 1633 of tokenizer.c). Previously this was hardcoded as one of its arguments. Unless we want to change the C implementation, tsep needs to be set to the null char at some point; the assignment could be done in C rather than Cython if you think that's neater.
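
To make the sentinel convention concrete, here is a small Python sketch of the mapping described in this reply. `separator_to_byte` and `NO_SEPARATOR` are hypothetical names used only for illustration; the PR itself performs this step in Cython.

```python
from typing import Optional

NO_SEPARATOR = 0  # byte value of "\0", the "no thousands separator" sentinel


def separator_to_byte(sep: Optional[str]) -> int:
    """Hypothetical helper: map an optional one-character separator to a byte value."""
    if sep is None:
        return NO_SEPARATOR
    encoded = sep.encode("utf-8")
    if len(encoded) != 1:
        raise ValueError("separator must encode to exactly one byte")
    return encoded[0]


print(separator_to_byte(None))  # 0
print(separator_to_byte(","))   # 44
```

One consequence of using the null byte as the sentinel is that a caller passing `"\0"` explicitly would be indistinguishable from passing `None`.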

else:
bytes_tsep = thousands.encode("UTF-8")
Member

Doesn't thousands.encode return a PyObject *? Assigning that to tsep does not seem correct

Author

@AlexHodgson AlexHodgson Feb 13, 2024

Cython handles the conversion when assigning bytes_tsep to tsep. I followed the cython documentation for converting str to bytes here: https://cython.readthedocs.io/en/latest/src/tutorial/strings.html#encoding-text-to-bytes
If there is a way to solve the char dtype issue mentioned above, then this can be simplified.
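
As a rough analogy in plain Python, the same str to bytes to C string chain can be reproduced with `ctypes`. This is only an illustration under the assumption that `ctypes.c_char_p` is an acceptable stand-in for Cython's automatic `bytes` to `char*` coercion; it is not what the PR executes. It also shows why the intermediate `bytes_tsep` variable is kept: the pointer borrows the buffer of the bytes object, so that object must stay referenced.

```python
import ctypes

thousands = ","

# Step 1: text -> bytes. Keep this object referenced for as long as the pointer is used.
bytes_tsep = thousands.encode("utf-8")

# Step 2: bytes -> C string pointer (ctypes stands in for Cython's coercion here).
tsep = ctypes.c_char_p(bytes_tsep)

# Step 3: read the first byte, as the PR does with tsep[0].
first_byte = tsep.value[0]
print(first_byte, chr(first_byte))  # 44 ,
```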

tsep = bytes_tsep

bytes_dsep = decimal.encode("UTF-8")
dsep = bytes_dsep

# Validate separators
if len(tsep) > 1:
Member

Not sure why this would be useful but raising like this prevents a multi-byte character from being used as a separator. That should be tested either way

Author

@AlexHodgson AlexHodgson Feb 13, 2024

Using wide characters for these separators does indeed seem pretty odd and an unlikely use case. If we do need to support it, that would require a change of approach, as the C parser uses a single char to represent the thousands separator. Otherwise I can make the error more specific and document that the separator must be single width.
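
A quick way to see which separators fit in a single C `char` is to count their UTF-8 bytes. The candidate list below is an assumption chosen for illustration (separators that appear in some locale conventions), not something taken from the PR.

```python
# Code points beyond ASCII need more than one UTF-8 byte,
# so a check on the encoded length rejects them.
candidates = {
    "comma": ",",
    "full stop": ".",
    "no-break space": "\u00a0",
    "narrow no-break space": "\u202f",
    "Arabic decimal separator": "\u066b",
}

byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in candidates.items()}
for name, n in byte_lengths.items():
    print(f"{name}: {n} byte(s)")
```

Only the two ASCII entries encode to one byte; each of the others is a single character in Python yet would trip a length check performed on the encoded bytes.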

raise ValueError("Thousands separator must not exceed length 1")
if len(dsep) > 1:
raise ValueError("Decimal separator must have length 1")
if tsep == dsep:
raise ValueError("Decimal and thousand separators must not be the same")

if util.is_integer_object(val):
try:
maybe_ints = values.astype("i8")
@@ -2354,8 +2387,7 @@ def maybe_convert_numeric(
seen.float_ = True
else:
try:
floatify(val, &fval, &maybe_int)

floatify(val, &fval, &maybe_int, dsep[0], tsep[0])
Member

Is this reachable if the length of dsep or tsep is ever zero? i.e. if someone did thousands="", it seems like this could segfault?

Author

@AlexHodgson AlexHodgson Feb 13, 2024

Yes you're right, I'll add a validation check to make sure both tsep and dsep will be at least length 1

Author

Changed the check on line 2227 to filter out zero length separators, the validation checks may need changing based on the decision re. multi width characters but the segfault should be fixed.
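
A minimal sketch of what such a guard could look like if written at the Python level, before any value reaches C. `validate_separators` is a hypothetical function, and the wording of the two length errors is an assumption; only the "must not be the same" message is taken from the diff.

```python
from typing import Optional


def validate_separators(thousands: Optional[str], decimal: str) -> None:
    """Hypothetical Python-level validation (not the PR's Cython code)."""
    if thousands is not None and len(thousands.encode("utf-8")) != 1:
        # Rejects "" (zero bytes) as well as multi-byte or multi-character input.
        raise ValueError("Thousands separator must be exactly one byte long")
    if len(decimal.encode("utf-8")) != 1:
        raise ValueError("Decimal separator must be exactly one byte long")
    if thousands == decimal:
        raise ValueError("Decimal and thousand separators must not be the same")


def is_valid(thousands: Optional[str], decimal: str) -> bool:
    """Convenience wrapper so the outcomes can be printed side by side."""
    try:
        validate_separators(thousands, decimal)
    except ValueError:
        return False
    return True


for pair in [(None, "."), (".", ","), ("", "."), ("test", "."), (",", ",")]:
    print(pair, is_valid(*pair))
```

Checking for a length of exactly one, rather than "greater than one", is what closes the empty-string case raised in this thread.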

if fval in na_values:
seen.saw_null()
floats[i] = complexes[i] = NaN
@@ -2368,7 +2400,10 @@ def maybe_convert_numeric(
floats[i] = fval

if maybe_int:
as_int = int(val)
if thousands is None:
as_int = int(val)
else:
as_int = int(val.replace(thousands, ""))

if as_int in na_values:
mask[i] = 1
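
The `replace` call in the hunk above is needed because Python's `int()` does not accept grouping characters. The standalone sketch below illustrates this with an assumed sample value, and also shows why converting through `int` rather than reusing the float result matters for large integers.

```python
val = "9,007,199,254,740,993"  # 2**53 + 1, written with "," as the thousands separator
thousands = ","

# int() rejects the grouped string outright.
try:
    int(val)
    rejected = False
except ValueError:
    rejected = True

# Removing the separator first gives an exact integer.
as_int = int(val.replace(thousands, ""))

# Going through float instead would silently lose the last digit.
via_float = int(float(val.replace(thousands, "")))

print(rejected)   # True
print(as_int)     # 9007199254740993
print(via_float)  # 9007199254740992
```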
10 changes: 5 additions & 5 deletions pandas/_libs/src/parser/pd_parser.c
@@ -13,22 +13,22 @@ Distributed under the terms of the BSD Simplified License.
#include "pandas/portable.h"

static int to_double(char *item, double *p_value, char sci, char decimal,
int *maybe_int) {
char tsep, int *maybe_int) {
char *p_end = NULL;
int error = 0;

/* Switch to precise xstrtod GH 31364 */
*p_value =
precise_xstrtod(item, &p_end, decimal, sci, '\0', 1, &error, maybe_int);
precise_xstrtod(item, &p_end, decimal, sci, tsep, 1, &error, maybe_int);

return (error == 0) && (!*p_end);
}

static int floatify(PyObject *str, double *result, int *maybe_int) {
static int floatify(PyObject *str, double *result, int *maybe_int, char dec,
char tsep) {
char *data;
PyObject *tmp = NULL;
const char sci = 'E';
const char dec = '.';

if (PyBytes_Check(str)) {
data = PyBytes_AS_STRING(str);
@@ -43,7 +43,7 @@ static int floatify(PyObject *str, double *result, int *maybe_int) {
return -1;
}

const int status = to_double(data, result, sci, dec, maybe_int);
const int status = to_double(data, result, sci, dec, tsep, maybe_int);

if (!status) {
/* handle inf/-inf infinity/-infinity */
28 changes: 28 additions & 0 deletions pandas/core/tools/numeric.py
@@ -44,6 +44,8 @@ def to_numeric(
errors: DateTimeErrorChoices = "raise",
downcast: Literal["integer", "signed", "unsigned", "float"] | None = None,
dtype_backend: DtypeBackend | lib.NoDefault = lib.no_default,
thousands: str | None = None,
decimal: str = ".",
):
"""
Convert argument to a numeric type.
@@ -99,6 +101,20 @@ def to_numeric(

.. versionadded:: 2.0

thousands : str | None, default None
Character used to separate groups of thousands for readability,
e.g. ',' in 1,000,000
Must only be 1 character long.

.. versionadded:: 3.0.0

decimal : str, default '.'
Character used to separate decimal section from the integer
section of the number, e.g., '.' in 12.45
Must only be 1 character long.

.. versionadded:: 3.0.0

Returns
-------
ret
@@ -155,6 +171,15 @@ def to_numeric(
1 2.1
2 3.0
dtype: Float32

Handling of data with non standard decimal or thousand separators

>>> s = pd.Series(["1,5", "2.000.000", -3])
>>> pd.to_numeric(s, thousands=".", decimal=",")
0 1.5
1 2000000.0
2 -3.0
dtype: float64
"""
if downcast not in (None, "integer", "signed", "unsigned", "float"):
raise ValueError("invalid downcasting method provided")
@@ -203,6 +228,7 @@ def to_numeric(
mask = values.isna()
values = values.dropna().to_numpy()
new_mask: np.ndarray | None = None

if is_numeric_dtype(values_dtype):
pass
elif lib.is_np_dtype(values_dtype, "mM"):
@@ -217,6 +243,8 @@ def to_numeric(
convert_to_masked_nullable=dtype_backend is not lib.no_default
or isinstance(values_dtype, StringDtype)
and not values_dtype.storage == "pyarrow_numpy",
thousands=thousands,
decimal=decimal,
)

if new_mask is not None:
46 changes: 46 additions & 0 deletions pandas/tests/tools/test_to_numeric.py
@@ -904,3 +904,49 @@ def test_coerce_pyarrow_backend():
result = to_numeric(ser, errors="coerce", dtype_backend="pyarrow")
expected = Series([1, 2, None], dtype=ArrowDtype(pa.int64()))
tm.assert_series_equal(result, expected)


def test_custom_decimals():
    # GH 4674
    ser = Series(["1,5", "20,005", -3])
    result = to_numeric(ser, decimal=",")
    expected = Series([1.5, 20.005, -3])
    tm.assert_series_equal(result, expected)


def test_custom_thousands():
    # GH 4674
    ser = Series(["1,001", "2,000,000", -3])
    result = to_numeric(ser, thousands=",")
    expected = Series([1001, 2000000, -3])
    tm.assert_series_equal(result, expected)


def test_custom_thousands_and_decimals():
    # GH 4674
    ser = Series(["1.000,0", "2.000.000,5", "2,5", "3"])
    result = to_numeric(ser, decimal=",", thousands=".")
    expected = Series([1000.0, 2000000.5, 2.5, 3])
    tm.assert_series_equal(result, expected)


def test_separator_validation():
    # GH 4674
    ser = Series(["1", "2", "3"])
    with pytest.raises(
        ValueError, match="Decimal and thousand separators must not be the same"
    ):
        to_numeric(ser, thousands=".")

    with pytest.raises(
        ValueError, match="Decimal and thousand separators must not be the same"
    ):
        to_numeric(ser, thousands=",", decimal=",")

    with pytest.raises(
        ValueError, match="Thousands separator must not exceed length 1"
    ):
        to_numeric(ser, thousands="test")

    with pytest.raises(ValueError, match="Decimal separator must have length 1"):
        to_numeric(ser, decimal="test")