-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
ENH: Add decimal and thousand separator params to to_numeric() #56934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 71 commits
fd7c1ff
4bceef2
ea36c28
64096ef
ae82729
6fb82cb
04c25b5
3c74258
6ac1be8
fd822c6
26a1295
b8a5248
77f0174
32e2c1b
b381201
90ec1a1
282ba6b
7bcaa6e
df7f4f1
c810498
3096fc9
38189a5
d5f5fed
67758c6
f94f560
06e406d
c837279
cde703a
d69b588
53ffb96
1d3b9e7
5d625bb
acce835
a7e3cf8
2977c63
1593e42
093a1d6
d0fa3db
347ef91
1d8e146
574fb10
aae7e0a
2327c18
a7034c9
c8b3ac0
f8a6407
f6c735a
4ecfb6f
8474575
4bb2a94
91195b8
0c33992
f79b91a
350e5d1
ffd50de
a2cc191
d7890f8
15825a7
c5ac8cb
d257d97
2d78e0f
a24a0d2
9c39b4d
cb1ef67
87093fd
0243ccb
0bfeb50
72c93cc
f33fbf5
e696a92
83d9004
493e86b
a8e4845
837a990
a2c0ad2
17f5e7e
565f148
52605cb
dddd04e
8f987ad
51f8b06
35d8757
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -69,7 +69,8 @@ from pandas._libs.interval import Interval | |
|
||
|
||
cdef extern from "pandas/parser/pd_parser.h": | ||
int floatify(object, float64_t *result, int *maybe_int) except -1 | ||
int floatify(object, float64_t *result, int *maybe_int, | ||
char dec, char tsep) except -1 | ||
void PandasParser_IMPORT() | ||
|
||
PandasParser_IMPORT | ||
|
@@ -2204,6 +2205,8 @@ def maybe_convert_numeric( | |
bint convert_empty=True, | ||
bint coerce_numeric=False, | ||
bint convert_to_masked_nullable=False, | ||
str thousands=None, | ||
str decimal="." | ||
) -> tuple[np.ndarray, np.ndarray | None]: | ||
""" | ||
Convert object array to a numeric array if possible. | ||
|
@@ -2231,6 +2234,14 @@ def maybe_convert_numeric( | |
convert_to_masked_nullable : bool, default False | ||
Whether to return a mask for the converted values. This also disables | ||
upcasting for ints with nulls to float64. | ||
thousands : str, default None | ||
Character used to separate groups of thousands for readability, | ||
e.g. ',' in 1,000,000 | ||
Must only be 1 character long. | ||
decimal : str, default '.' | ||
Character used to separate decimal section from the integer | ||
section of the number, e.g., '.' in 12.45 | ||
Must only be 1 character long. | ||
Returns | ||
------- | ||
np.ndarray | ||
|
@@ -2247,6 +2258,28 @@ def maybe_convert_numeric( | |
cdef: | ||
object val = values[0] | ||
|
||
# Convert python strings into ones readable by C | ||
|
||
cdef char* tsep | ||
cdef char* dsep | ||
# Use null char to represent lack of separator | ||
if thousands is None: | ||
tsep = "\0" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is confusing to assign the nul byte instead of just assigning to NULL, especially with how Cython handles There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did it this way as |
||
else: | ||
bytes_tsep = thousands.encode("UTF-8") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Doesn't There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Cython handles the conversion when assigning |
||
tsep = bytes_tsep | ||
|
||
bytes_dsep = decimal.encode("UTF-8") | ||
dsep = bytes_dsep | ||
|
||
# Validate separators | ||
if len(tsep) > 1: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure why this would be useful but raising like this prevents a multi-byte character from being used as a separator. That should be tested either way There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Using wide characters for these separators does indeed seem pretty odd and an unlikely use case, if we do need to support this then it would require a change of approach as the C parser uses a single |
||
raise ValueError("Thousands separator must not exceed length 1") | ||
if len(dsep) > 1: | ||
raise ValueError("Decimal separator must have length 1") | ||
if tsep == dsep: | ||
raise ValueError("Decimal and thousand separators must not be the same") | ||
|
||
if util.is_integer_object(val): | ||
try: | ||
maybe_ints = values.astype("i8") | ||
|
@@ -2354,8 +2387,7 @@ def maybe_convert_numeric( | |
seen.float_ = True | ||
else: | ||
try: | ||
floatify(val, &fval, &maybe_int) | ||
|
||
floatify(val, &fval, &maybe_int, dsep[0], tsep[0]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this reachable if the length of dsep or tsep is every zero? i.e. if someone did There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes you're right, I'll add a validation check to make sure both There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Changed the check on line 2227 to filter out zero length separators, the validation checks may need changing based on the decision re. multi width characters but the segfault should be fixed. |
||
if fval in na_values: | ||
seen.saw_null() | ||
floats[i] = complexes[i] = NaN | ||
|
@@ -2368,7 +2400,10 @@ def maybe_convert_numeric( | |
floats[i] = fval | ||
|
||
if maybe_int: | ||
as_int = int(val) | ||
if thousands is None: | ||
as_int = int(val) | ||
else: | ||
as_int = int(val.replace(thousands, "")) | ||
|
||
if as_int in na_values: | ||
mask[i] = 1 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these need to be
str
in the first place or would a declaration ofchar
simplify things? I am not sure how Cython works in this case but it seems like it might be able to automatically handle that object -> ctype conversionUh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your suggestion was my original implementation, and it ran fine, but I ran into issues with the mypy tests. It wasn't matching the stub function to its implentation here as the parameter types were different: I couldn't use
char
dtype inlib.pyi
, and I was trying to declare it asbytes
. It may work to import the cython dtypes into the stub file but I'm not sure if this is the best option or there is another neat method.