[FIX] Handle decimal and thousand separator in 'round_trip' converer #35377

ales-erjavec · 2020-07-22T11:32:33Z

closes BUG: read_csv fails with float_precision="round_trip" and decimal="," #35365
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

In case of non c-locale decimal and tsep, copy and fixup the source string before passing it to PyOS_string_to_double

pandas/tests/io/parser/test_c_parser_only.py

pandas/_libs/src/parser/tokenizer.c

pep8speaks · 2020-07-23T11:59:20Z

Hello @ales-erjavec! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-10 10:12:58 UTC

pandas/_libs/src/parser/tokenizer.c

jreback

nice test cases. can you merge master & add a whatsnew note in 1.2, I/O section. ping on green.

jreback · 2020-08-06T23:41:26Z

@gfyoung or @WillAyd if comments.

jreback · 2020-08-07T12:31:27Z

@ales-erjavec can you also run the current csv asv's to see if perf is changed; we might need additional coverage for thousands sep as well.

WillAyd

Looks really good - great job progressing this along.

A few more questions / comments I think are minor in nature

WillAyd · 2020-08-07T15:25:19Z

pandas/_libs/src/parser/tokenizer.c

+       p++;
+    }
+    // Copy the remainder of the string as is.
+    memcpy(dst, p, length + 1 - (p - s));


Can you use strncpy here instead? Would make this slightly more readable and then you won't need to manually add a null byte

WillAyd · 2020-08-07T15:25:28Z

pandas/_libs/src/parser/tokenizer.c

+    // Copy integer part dropping `tsep`
+    while (isdigit_ascii(*p)) {
+        *dst++ = *p++;
+        p += (tsep != '\0' && *p == tsep);


Does the null byte check here serve a purpose?

Yes. tsep is null when no thousands separator is defined. But the loop must not skip the terminating null byte in the input string (the second condition) which would happen on e.g. '123\0'

pandas/_libs/src/parser/tokenizer.c

WillAyd · 2020-08-07T15:28:42Z

pandas/_libs/src/parser/tokenizer.c

+    } else {
+        *error = -1;
+        if (q != NULL) {
+           // p and pc are different len due to tsep removal. Can't report


Is there a test that hits this?

Yes. The "1,2e1_0" for instance. I.e something that is prefixed with a number but has extra trailing characters.

In case of non c-locale decimal and tsep, copy and fixup the source string before passing it to PyOS_string_to_double

Always make a copy of input string.

ales-erjavec · 2020-08-10T11:23:24Z

... can you also run the current csv asv's to see if perf is changed

asv continuous -f 1.1 -E virtualenv --python 3.7 origin/master round-trip-decimal -b ^io.csv

· No executable found for python 3.6
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt.
·· Installing 8703ad6e <round-trip-decimal> into virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt..
· Running 48 total benchmarks (2 commits * 1 environments * 24 benchmarks)
[  0.00%] · For pandas commit aefae55e <master> (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[  1.04%] ··· Running (io.csv.ParseDateComparison.time_read_csv_dayfirst--)..............
[ 16.67%] ··· Running (io.csv.ReadCSVParseSpecialDate.time_read_special_date--).........
[ 25.00%] · For pandas commit 8703ad6e <round-trip-decimal> (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 26.04%] ··· Running (io.csv.ParseDateComparison.time_read_csv_dayfirst--)..............
[ 41.67%] ··· Running (io.csv.ReadCSVParseSpecialDate.time_read_special_date--).........
[ 50.00%] · For pandas commit 8703ad6e <round-trip-decimal> (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 51.04%] ··· io.csv.ParseDateComparison.time_read_csv_dayfirst                                                                                           ok
[ 51.04%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      7.97±0.1ms 
                   True      3.94±0.1ms 
              ============= ============

[ 52.08%] ··· io.csv.ParseDateComparison.time_to_datetime_dayfirst                                                                                        ok
[ 52.08%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      8.42±0.2ms 
                   True      3.83±0.2ms 
              ============= ============

[ 53.12%] ··· io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY                                                                               ok
[ 53.12%] ··· ============= =============
               cache_dates               
              ------------- -------------
                  False        25.8±1ms  
                   True      3.76±0.05ms 
              ============= =============

[ 54.17%] ··· io.csv.ReadCSVCachedParseDates.time_read_csv_cached                                                                                         ok
[ 54.17%] ··· ========== =============
               do_cache               
              ---------- -------------
                 True     1.81±0.03ms 
                False     1.80±0.01ms 
              ========== =============

[ 55.21%] ··· io.csv.ReadCSVCategorical.time_convert_direct                                                                                       29.9±0.4ms
[ 56.25%] ··· io.csv.ReadCSVCategorical.time_convert_post                                                                                         46.6±0.9ms
[ 57.29%] ··· io.csv.ReadCSVComment.time_comment                                                                                                    24.8±2ms
[ 58.33%] ··· io.csv.ReadCSVConcatDatetime.time_read_csv                                                                                            30.7±2ms
[ 59.38%] ··· io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv                                                                                      ok
[ 59.38%] ··· ================ ============
               bad_date_value              
              ---------------- ------------
                    nan         14.2±0.8ms 
                     0          11.7±0.2ms 
                     ''         12.0±0.3ms 
              ================ ============

[ 60.42%] ··· io.csv.ReadCSVDInferDatetimeFormat.time_read_csv                                                                                            ok
[ 60.42%] ··· ======================= ============= ============= =============
              --                                        format                 
              ----------------------- -----------------------------------------
               infer_datetime_format      custom       iso8601         ymd     
              ======================= ============= ============= =============
                        True           5.54±0.06ms   2.05±0.02ms    2.07±0.1ms 
                       False             101±4ms     1.66±0.05ms   1.55±0.01ms 
              ======================= ============= ============= =============

[ 61.46%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv                                                                                                  ok
[ 61.46%] ··· ===== ============= ============= ================ ============= ============= ================
              --                                    decimal / float_precision                                
              ----- -----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip     _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============= ============= ================
                ,    1.69±0.04ms   1.64±0.04ms    2.96±0.07ms     1.83±0.08ms   1.82±0.04ms    1.87±0.06ms   
                ;    1.70±0.08ms   1.65±0.06ms    2.98±0.08ms     1.82±0.05ms   1.86±0.07ms     1.90±0.1ms   
              ===== ============= ============= ================ ============= ============= ================

[ 62.50%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine                                                                                    ok
[ 62.50%] ··· ===== ============= ============= ================ ============= ============= ================
              --                                    decimal / float_precision                                
              ----- -----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip     _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============= ============= ================
                ,    4.42±0.05ms   4.44±0.08ms    4.47±0.05ms     3.77±0.06ms   3.87±0.03ms    3.85±0.07ms   
                ;    4.42±0.02ms   4.45±0.08ms    4.42±0.02ms      3.98±0.1ms   3.82±0.05ms    3.82±0.05ms   
              ===== ============= ============= ================ ============= ============= ================

[ 63.54%] ··· io.csv.ReadCSVMemoryGrowth.mem_parser_chunks                                                                                                 0
[ 64.58%] ··· io.csv.ReadCSVParseDates.time_baseline                                                                                             1.70±0.01ms
[ 65.62%] ··· io.csv.ReadCSVParseDates.time_multiple_date                                                                                        2.02±0.02ms
[ 66.67%] ··· io.csv.ReadCSVParseSpecialDate.time_read_special_date                                                                                       ok
[ 66.67%] ··· ======= =============
               value               
              ------- -------------
                 mY     7.76±0.3ms 
                mdY    3.90±0.05ms 
                 hm    3.43±0.05ms 
              ======= =============

[ 67.71%] ··· io.csv.ReadCSVSkipRows.time_skipprows                                                                                                       ok
[ 67.71%] ··· ========== =============
               skiprows               
              ---------- -------------
                 None      17.1±0.1ms 
                10000     11.8±0.06ms 
              ========== =============

[ 68.75%] ··· io.csv.ReadCSVThousands.time_thousands                                                                                                      ok
[ 68.75%] ··· ===== ============= ============
              --            thousands         
              ----- --------------------------
               sep       None          ,      
              ===== ============= ============
                ,    15.8±0.09ms   15.1±0.3ms 
                |     15.8±0.1ms   15.5±0.4ms 
              ===== ============= ============

[ 69.79%] ··· io.csv.ReadUint64Integers.time_read_uint64                                                                                          3.14±0.2ms
[ 70.83%] ··· io.csv.ReadUint64Integers.time_read_uint64_na_values                                                                                5.54±0.3ms
[ 71.88%] ··· io.csv.ReadUint64Integers.time_read_uint64_neg_values                                                                               5.32±0.3ms
[ 72.92%] ··· io.csv.ToCSV.time_frame                                                                                                                     ok
[ 72.92%] ··· ======= ============
                kind              
              ------- ------------
                wide    149±2ms   
                long    215±3ms   
               mixed   26.2±0.2ms 
              ======= ============

[ 73.96%] ··· io.csv.ToCSVDatetime.time_frame_date_formatting                                                                                     12.0±0.1ms
[ 75.00%] ··· io.csv.ToCSVDatetimeBig.time_frame                                                                                                          ok
[ 75.00%] ··· ======== =============
                obs                 
              -------- -------------
                1000    7.22±0.08ms 
               10000     64.7±0.4ms 
               100000     647±6ms   
              ======== =============

[ 75.00%] · For pandas commit aefae55e <master> (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 76.04%] ··· io.csv.ParseDateComparison.time_read_csv_dayfirst                                                                                           ok
[ 76.04%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      8.22±0.1ms 
                   True      3.88±0.2ms 
              ============= ============

[ 77.08%] ··· io.csv.ParseDateComparison.time_to_datetime_dayfirst                                                                                        ok
[ 77.08%] ··· ============= =============
               cache_dates               
              ------------- -------------
                  False       8.74±0.3ms 
                   True      3.89±0.09ms 
              ============= =============

[ 78.12%] ··· io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY                                                                               ok
[ 78.12%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False       23.0±1ms  
                   True      3.89±0.1ms 
              ============= ============

[ 79.17%] ··· io.csv.ReadCSVCachedParseDates.time_read_csv_cached                                                                                         ok
[ 79.17%] ··· ========== =============
               do_cache               
              ---------- -------------
                 True     1.88±0.05ms 
                False     1.83±0.03ms 
              ========== =============

[ 80.21%] ··· io.csv.ReadCSVCategorical.time_convert_direct                                                                                       30.4±0.6ms
[ 81.25%] ··· io.csv.ReadCSVCategorical.time_convert_post                                                                                         46.8±0.4ms
[ 82.29%] ··· io.csv.ReadCSVComment.time_comment                                                                                                  25.4±0.4ms
[ 83.33%] ··· io.csv.ReadCSVConcatDatetime.time_read_csv                                                                                            32.0±1ms
[ 84.38%] ··· io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv                                                                                      ok
[ 84.38%] ··· ================ ============
               bad_date_value              
              ---------------- ------------
                    nan         14.5±0.6ms 
                     0          11.9±0.3ms 
                     ''         13.4±0.7ms 
              ================ ============

[ 85.42%] ··· io.csv.ReadCSVDInferDatetimeFormat.time_read_csv                                                                                            ok
[ 85.42%] ··· ======================= ============= ============= =============
              --                                        format                 
              ----------------------- -----------------------------------------
               infer_datetime_format      custom       iso8601         ymd     
              ======================= ============= ============= =============
                        True           5.54±0.08ms   2.06±0.03ms   2.09±0.02ms 
                       False             102±6ms     1.65±0.03ms   1.61±0.07ms 
              ======================= ============= ============= =============

[ 86.46%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv                                                                                                  ok
[ 86.46%] ··· ===== ============= ============= ================ ============= ============= ================
              --                                    decimal / float_precision                                
              ----- -----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip     _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============= ============= ================
                ,    1.70±0.05ms   1.65±0.05ms    2.70±0.08ms     1.86±0.04ms   1.82±0.06ms    1.84±0.05ms   
                ;    1.73±0.04ms   1.68±0.07ms    2.74±0.06ms     1.83±0.03ms   1.83±0.05ms    1.86±0.06ms   
              ===== ============= ============= ================ ============= ============= ================

[ 87.50%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine                                                                                    ok
[ 87.50%] ··· ===== ============= ============= ================ ============ ============= ================
              --                                   decimal / float_precision                                
              ----- ----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip    _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============ ============= ================
                ,     4.52±0.1ms    4.51±0.1ms     4.51±0.1ms     4.26±0.5ms   3.83±0.05ms    3.85±0.07ms   
                ;    4.37±0.07ms   4.42±0.03ms    4.42±0.08ms     3.83±0.1ms    3.98±0.2ms    3.86±0.06ms   
              ===== ============= ============= ================ ============ ============= ================

[ 88.54%] ··· io.csv.ReadCSVMemoryGrowth.mem_parser_chunks                                                                                                 0
[ 89.58%] ··· io.csv.ReadCSVParseDates.time_baseline                                                                                             1.70±0.02ms
[ 90.62%] ··· io.csv.ReadCSVParseDates.time_multiple_date                                                                                        2.06±0.03ms
[ 91.67%] ··· io.csv.ReadCSVParseSpecialDate.time_read_special_date                                                                                       ok
[ 91.67%] ··· ======= =============
               value               
              ------- -------------
                 mY     7.92±0.4ms 
                mdY    3.88±0.04ms 
                 hm    3.41±0.06ms 
              ======= =============

[ 92.71%] ··· io.csv.ReadCSVSkipRows.time_skipprows                                                                                                       ok
[ 92.71%] ··· ========== ============
               skiprows              
              ---------- ------------
                 None     17.1±0.2ms 
                10000     11.9±0.1ms 
              ========== ============

[ 93.75%] ··· io.csv.ReadCSVThousands.time_thousands                                                                                                      ok
[ 93.75%] ··· ===== ============= ============
              --            thousands         
              ----- --------------------------
               sep       None          ,      
              ===== ============= ============
                ,    15.9±0.06ms   15.3±0.3ms 
                |     16.0±0.2ms   15.7±0.3ms 
              ===== ============= ============

[ 94.79%] ··· io.csv.ReadUint64Integers.time_read_uint64                                                                                          3.13±0.3ms
[ 95.83%] ··· io.csv.ReadUint64Integers.time_read_uint64_na_values                                                                                5.54±0.3ms
[ 96.88%] ··· io.csv.ReadUint64Integers.time_read_uint64_neg_values                                                                               5.35±0.2ms
[ 97.92%] ··· io.csv.ToCSV.time_frame                                                                                                                     ok
[ 97.92%] ··· ======= ============
                kind              
              ------- ------------
                wide   146±0.7ms  
                long   212±0.4ms  
               mixed   25.9±0.1ms 
              ======= ============

[ 98.96%] ··· io.csv.ToCSVDatetime.time_frame_date_formatting                                                                                     11.9±0.1ms
[100.00%] ··· io.csv.ToCSVDatetimeBig.time_frame                                                                                                          ok
[100.00%] ··· ======== =============
                obs                 
              -------- -------------
                1000    7.25±0.05ms 
               10000     64.6±0.4ms 
               100000    690±100ms  
              ======== =============


BENCHMARKS NOT SIGNIFICANTLY CHANGED.

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

WillAyd

looks good - very nicely done. @jreback

jreback · 2020-08-11T00:36:31Z

thanks @ales-erjavec

ales-erjavec force-pushed the round-trip-decimal branch 2 times, most recently from 5020dfd to aab2f52 Compare July 22, 2020 14:31

simonjayhawkins added Bug IO CSV read_csv, to_csv labels Jul 22, 2020

WillAyd requested changes Jul 22, 2020

View reviewed changes

pandas/tests/io/parser/test_c_parser_only.py Outdated Show resolved Hide resolved

pandas/_libs/src/parser/tokenizer.c Outdated Show resolved Hide resolved

ales-erjavec force-pushed the round-trip-decimal branch from aab2f52 to cc0c228 Compare July 23, 2020 11:59

ales-erjavec force-pushed the round-trip-decimal branch from cc0c228 to e4bca08 Compare July 23, 2020 12:03

WillAyd reviewed Jul 23, 2020

View reviewed changes

pandas/_libs/src/parser/tokenizer.c Outdated Show resolved Hide resolved

ales-erjavec force-pushed the round-trip-decimal branch from e4bca08 to 9cf0755 Compare July 27, 2020 14:18

ales-erjavec requested a review from WillAyd August 5, 2020 09:33

jreback added this to the 1.2 milestone Aug 6, 2020

jreback requested changes Aug 6, 2020

View reviewed changes

jreback requested a review from gfyoung August 6, 2020 23:41

gfyoung approved these changes Aug 7, 2020

View reviewed changes

WillAyd requested changes Aug 7, 2020

View reviewed changes

ales-erjavec added 7 commits August 10, 2020 10:34

Handle decimal and tsep in round_trip converter (pandas-dev#35365)

2da9aa7

In case of non c-locale decimal and tsep, copy and fixup the source string before passing it to PyOS_string_to_double

Simplify str_copy_decimal_str.

dcd8de1

Add test for float_precision parsers equality

cd23ba1

Simplify round_trip converter

bee7fdf

Always make a copy of input string.

Use strncpy

66f8be5

Just define the endptr and endpc

42c509c

Add whatsnew entry

8703ad6

ales-erjavec force-pushed the round-trip-decimal branch from 74be91c to 8703ad6 Compare August 10, 2020 10:12

ales-erjavec requested a review from WillAyd August 10, 2020 11:24

WillAyd approved these changes Aug 10, 2020

View reviewed changes

jreback approved these changes Aug 11, 2020

View reviewed changes

jreback merged commit c87e40c into pandas-dev:master Aug 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FIX] Handle decimal and thousand separator in 'round_trip' converer #35377

[FIX] Handle decimal and thousand separator in 'round_trip' converer #35377

ales-erjavec commented Jul 22, 2020 •

edited

Loading

pep8speaks commented Jul 23, 2020 •

edited

Loading

jreback left a comment

jreback commented Aug 6, 2020

jreback commented Aug 7, 2020

WillAyd left a comment

WillAyd Aug 7, 2020

ales-erjavec Aug 10, 2020

WillAyd Aug 7, 2020

ales-erjavec Aug 10, 2020

WillAyd Aug 7, 2020

ales-erjavec Aug 10, 2020

ales-erjavec commented Aug 10, 2020

WillAyd left a comment

jreback commented Aug 11, 2020

[FIX] Handle decimal and thousand separator in 'round_trip' converer #35377

[FIX] Handle decimal and thousand separator in 'round_trip' converer #35377

Conversation

ales-erjavec commented Jul 22, 2020 • edited Loading

pep8speaks commented Jul 23, 2020 • edited Loading

Comment last updated at 2020-08-10 10:12:58 UTC

jreback left a comment

Choose a reason for hiding this comment

jreback commented Aug 6, 2020

jreback commented Aug 7, 2020

WillAyd left a comment

Choose a reason for hiding this comment

WillAyd Aug 7, 2020

Choose a reason for hiding this comment

ales-erjavec Aug 10, 2020

Choose a reason for hiding this comment

WillAyd Aug 7, 2020

Choose a reason for hiding this comment

ales-erjavec Aug 10, 2020

Choose a reason for hiding this comment

WillAyd Aug 7, 2020

Choose a reason for hiding this comment

ales-erjavec Aug 10, 2020

Choose a reason for hiding this comment

ales-erjavec commented Aug 10, 2020

WillAyd left a comment

Choose a reason for hiding this comment

jreback commented Aug 11, 2020

ales-erjavec commented Jul 22, 2020 •

edited

Loading

pep8speaks commented Jul 23, 2020 •

edited

Loading