Skip to content

[FIX] Handle decimal and thousand separator in 'round_trip' converer #35377

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Aug 11, 2020

Conversation

ales-erjavec
Copy link
Contributor

@ales-erjavec ales-erjavec commented Jul 22, 2020

In case of non c-locale decimal and tsep, copy and fixup the source string before passing it to PyOS_string_to_double

@ales-erjavec ales-erjavec force-pushed the round-trip-decimal branch 2 times, most recently from 5020dfd to aab2f52 Compare July 22, 2020 14:31
@simonjayhawkins simonjayhawkins added Bug IO CSV read_csv, to_csv labels Jul 22, 2020
@pep8speaks
Copy link

pep8speaks commented Jul 23, 2020

Hello @ales-erjavec! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-10 10:12:58 UTC

@ales-erjavec ales-erjavec requested a review from WillAyd August 5, 2020 09:33
@jreback jreback added this to the 1.2 milestone Aug 6, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice test cases. can you merge master & add a whatsnew note in 1.2, I/O section. ping on green.

@jreback jreback requested a review from gfyoung August 6, 2020 23:41
@jreback
Copy link
Contributor

jreback commented Aug 6, 2020

@gfyoung or @WillAyd if comments.

@jreback
Copy link
Contributor

jreback commented Aug 7, 2020

@ales-erjavec can you also run the current csv asv's to see if perf is changed; we might need additional coverage for thousands sep as well.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good - great job progressing this along.

A few more questions / comments I think are minor in nature

p++;
}
// Copy the remainder of the string as is.
memcpy(dst, p, length + 1 - (p - s));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use strncpy here instead? Would make this slightly more readable and then you won't need to manually add a null byte

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Copy integer part dropping `tsep`
while (isdigit_ascii(*p)) {
*dst++ = *p++;
p += (tsep != '\0' && *p == tsep);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the null byte check here serve a purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. tsep is null when no thousands separator is defined. But the loop must not skip the terminating null byte in the input string (the second condition) which would happen on e.g. '123\0'

} else {
*error = -1;
if (q != NULL) {
// p and pc are different len due to tsep removal. Can't report
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test that hits this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The "1,2e1_0" for instance. I.e something that is prefixed with a number but has extra trailing characters.

@ales-erjavec
Copy link
Contributor Author

... can you also run the current csv asv's to see if perf is changed

asv continuous -f 1.1 -E virtualenv --python 3.7 origin/master round-trip-decimal -b ^io.csv

· No executable found for python 3.6
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt.
·· Installing 8703ad6e <round-trip-decimal> into virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt..
· Running 48 total benchmarks (2 commits * 1 environments * 24 benchmarks)
[  0.00%] · For pandas commit aefae55e <master> (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[  1.04%] ··· Running (io.csv.ParseDateComparison.time_read_csv_dayfirst--)..............
[ 16.67%] ··· Running (io.csv.ReadCSVParseSpecialDate.time_read_special_date--).........
[ 25.00%] · For pandas commit 8703ad6e <round-trip-decimal> (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 26.04%] ··· Running (io.csv.ParseDateComparison.time_read_csv_dayfirst--)..............
[ 41.67%] ··· Running (io.csv.ReadCSVParseSpecialDate.time_read_special_date--).........
[ 50.00%] · For pandas commit 8703ad6e <round-trip-decimal> (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 51.04%] ··· io.csv.ParseDateComparison.time_read_csv_dayfirst                                                                                           ok
[ 51.04%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      7.97±0.1ms 
                   True      3.94±0.1ms 
              ============= ============

[ 52.08%] ··· io.csv.ParseDateComparison.time_to_datetime_dayfirst                                                                                        ok
[ 52.08%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      8.42±0.2ms 
                   True      3.83±0.2ms 
              ============= ============

[ 53.12%] ··· io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY                                                                               ok
[ 53.12%] ··· ============= =============
               cache_dates               
              ------------- -------------
                  False        25.8±1ms  
                   True      3.76±0.05ms 
              ============= =============

[ 54.17%] ··· io.csv.ReadCSVCachedParseDates.time_read_csv_cached                                                                                         ok
[ 54.17%] ··· ========== =============
               do_cache               
              ---------- -------------
                 True     1.81±0.03ms 
                False     1.80±0.01ms 
              ========== =============

[ 55.21%] ··· io.csv.ReadCSVCategorical.time_convert_direct                                                                                       29.9±0.4ms
[ 56.25%] ··· io.csv.ReadCSVCategorical.time_convert_post                                                                                         46.6±0.9ms
[ 57.29%] ··· io.csv.ReadCSVComment.time_comment                                                                                                    24.8±2ms
[ 58.33%] ··· io.csv.ReadCSVConcatDatetime.time_read_csv                                                                                            30.7±2ms
[ 59.38%] ··· io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv                                                                                      ok
[ 59.38%] ··· ================ ============
               bad_date_value              
              ---------------- ------------
                    nan         14.2±0.8ms 
                     0          11.7±0.2ms 
                     ''         12.0±0.3ms 
              ================ ============

[ 60.42%] ··· io.csv.ReadCSVDInferDatetimeFormat.time_read_csv                                                                                            ok
[ 60.42%] ··· ======================= ============= ============= =============
              --                                        format                 
              ----------------------- -----------------------------------------
               infer_datetime_format      custom       iso8601         ymd     
              ======================= ============= ============= =============
                        True           5.54±0.06ms   2.05±0.02ms    2.07±0.1ms 
                       False             101±4ms     1.66±0.05ms   1.55±0.01ms 
              ======================= ============= ============= =============

[ 61.46%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv                                                                                                  ok
[ 61.46%] ··· ===== ============= ============= ================ ============= ============= ================
              --                                    decimal / float_precision                                
              ----- -----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip     _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============= ============= ================
                ,    1.69±0.04ms   1.64±0.04ms    2.96±0.07ms     1.83±0.08ms   1.82±0.04ms    1.87±0.06ms   
                ;    1.70±0.08ms   1.65±0.06ms    2.98±0.08ms     1.82±0.05ms   1.86±0.07ms     1.90±0.1ms   
              ===== ============= ============= ================ ============= ============= ================

[ 62.50%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine                                                                                    ok
[ 62.50%] ··· ===== ============= ============= ================ ============= ============= ================
              --                                    decimal / float_precision                                
              ----- -----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip     _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============= ============= ================
                ,    4.42±0.05ms   4.44±0.08ms    4.47±0.05ms     3.77±0.06ms   3.87±0.03ms    3.85±0.07ms   
                ;    4.42±0.02ms   4.45±0.08ms    4.42±0.02ms      3.98±0.1ms   3.82±0.05ms    3.82±0.05ms   
              ===== ============= ============= ================ ============= ============= ================

[ 63.54%] ··· io.csv.ReadCSVMemoryGrowth.mem_parser_chunks                                                                                                 0
[ 64.58%] ··· io.csv.ReadCSVParseDates.time_baseline                                                                                             1.70±0.01ms
[ 65.62%] ··· io.csv.ReadCSVParseDates.time_multiple_date                                                                                        2.02±0.02ms
[ 66.67%] ··· io.csv.ReadCSVParseSpecialDate.time_read_special_date                                                                                       ok
[ 66.67%] ··· ======= =============
               value               
              ------- -------------
                 mY     7.76±0.3ms 
                mdY    3.90±0.05ms 
                 hm    3.43±0.05ms 
              ======= =============

[ 67.71%] ··· io.csv.ReadCSVSkipRows.time_skipprows                                                                                                       ok
[ 67.71%] ··· ========== =============
               skiprows               
              ---------- -------------
                 None      17.1±0.1ms 
                10000     11.8±0.06ms 
              ========== =============

[ 68.75%] ··· io.csv.ReadCSVThousands.time_thousands                                                                                                      ok
[ 68.75%] ··· ===== ============= ============
              --            thousands         
              ----- --------------------------
               sep       None          ,      
              ===== ============= ============
                ,    15.8±0.09ms   15.1±0.3ms 
                |     15.8±0.1ms   15.5±0.4ms 
              ===== ============= ============

[ 69.79%] ··· io.csv.ReadUint64Integers.time_read_uint64                                                                                          3.14±0.2ms
[ 70.83%] ··· io.csv.ReadUint64Integers.time_read_uint64_na_values                                                                                5.54±0.3ms
[ 71.88%] ··· io.csv.ReadUint64Integers.time_read_uint64_neg_values                                                                               5.32±0.3ms
[ 72.92%] ··· io.csv.ToCSV.time_frame                                                                                                                     ok
[ 72.92%] ··· ======= ============
                kind              
              ------- ------------
                wide    149±2ms   
                long    215±3ms   
               mixed   26.2±0.2ms 
              ======= ============

[ 73.96%] ··· io.csv.ToCSVDatetime.time_frame_date_formatting                                                                                     12.0±0.1ms
[ 75.00%] ··· io.csv.ToCSVDatetimeBig.time_frame                                                                                                          ok
[ 75.00%] ··· ======== =============
                obs                 
              -------- -------------
                1000    7.22±0.08ms 
               10000     64.7±0.4ms 
               100000     647±6ms   
              ======== =============

[ 75.00%] · For pandas commit aefae55e <master> (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 76.04%] ··· io.csv.ParseDateComparison.time_read_csv_dayfirst                                                                                           ok
[ 76.04%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      8.22±0.1ms 
                   True      3.88±0.2ms 
              ============= ============

[ 77.08%] ··· io.csv.ParseDateComparison.time_to_datetime_dayfirst                                                                                        ok
[ 77.08%] ··· ============= =============
               cache_dates               
              ------------- -------------
                  False       8.74±0.3ms 
                   True      3.89±0.09ms 
              ============= =============

[ 78.12%] ··· io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY                                                                               ok
[ 78.12%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False       23.0±1ms  
                   True      3.89±0.1ms 
              ============= ============

[ 79.17%] ··· io.csv.ReadCSVCachedParseDates.time_read_csv_cached                                                                                         ok
[ 79.17%] ··· ========== =============
               do_cache               
              ---------- -------------
                 True     1.88±0.05ms 
                False     1.83±0.03ms 
              ========== =============

[ 80.21%] ··· io.csv.ReadCSVCategorical.time_convert_direct                                                                                       30.4±0.6ms
[ 81.25%] ··· io.csv.ReadCSVCategorical.time_convert_post                                                                                         46.8±0.4ms
[ 82.29%] ··· io.csv.ReadCSVComment.time_comment                                                                                                  25.4±0.4ms
[ 83.33%] ··· io.csv.ReadCSVConcatDatetime.time_read_csv                                                                                            32.0±1ms
[ 84.38%] ··· io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv                                                                                      ok
[ 84.38%] ··· ================ ============
               bad_date_value              
              ---------------- ------------
                    nan         14.5±0.6ms 
                     0          11.9±0.3ms 
                     ''         13.4±0.7ms 
              ================ ============

[ 85.42%] ··· io.csv.ReadCSVDInferDatetimeFormat.time_read_csv                                                                                            ok
[ 85.42%] ··· ======================= ============= ============= =============
              --                                        format                 
              ----------------------- -----------------------------------------
               infer_datetime_format      custom       iso8601         ymd     
              ======================= ============= ============= =============
                        True           5.54±0.08ms   2.06±0.03ms   2.09±0.02ms 
                       False             102±6ms     1.65±0.03ms   1.61±0.07ms 
              ======================= ============= ============= =============

[ 86.46%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv                                                                                                  ok
[ 86.46%] ··· ===== ============= ============= ================ ============= ============= ================
              --                                    decimal / float_precision                                
              ----- -----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip     _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============= ============= ================
                ,    1.70±0.05ms   1.65±0.05ms    2.70±0.08ms     1.86±0.04ms   1.82±0.06ms    1.84±0.05ms   
                ;    1.73±0.04ms   1.68±0.07ms    2.74±0.06ms     1.83±0.03ms   1.83±0.05ms    1.86±0.06ms   
              ===== ============= ============= ================ ============= ============= ================

[ 87.50%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine                                                                                    ok
[ 87.50%] ··· ===== ============= ============= ================ ============ ============= ================
              --                                   decimal / float_precision                                
              ----- ----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip    _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============ ============= ================
                ,     4.52±0.1ms    4.51±0.1ms     4.51±0.1ms     4.26±0.5ms   3.83±0.05ms    3.85±0.07ms   
                ;    4.37±0.07ms   4.42±0.03ms    4.42±0.08ms     3.83±0.1ms    3.98±0.2ms    3.86±0.06ms   
              ===== ============= ============= ================ ============ ============= ================

[ 88.54%] ··· io.csv.ReadCSVMemoryGrowth.mem_parser_chunks                                                                                                 0
[ 89.58%] ··· io.csv.ReadCSVParseDates.time_baseline                                                                                             1.70±0.02ms
[ 90.62%] ··· io.csv.ReadCSVParseDates.time_multiple_date                                                                                        2.06±0.03ms
[ 91.67%] ··· io.csv.ReadCSVParseSpecialDate.time_read_special_date                                                                                       ok
[ 91.67%] ··· ======= =============
               value               
              ------- -------------
                 mY     7.92±0.4ms 
                mdY    3.88±0.04ms 
                 hm    3.41±0.06ms 
              ======= =============

[ 92.71%] ··· io.csv.ReadCSVSkipRows.time_skipprows                                                                                                       ok
[ 92.71%] ··· ========== ============
               skiprows              
              ---------- ------------
                 None     17.1±0.2ms 
                10000     11.9±0.1ms 
              ========== ============

[ 93.75%] ··· io.csv.ReadCSVThousands.time_thousands                                                                                                      ok
[ 93.75%] ··· ===== ============= ============
              --            thousands         
              ----- --------------------------
               sep       None          ,      
              ===== ============= ============
                ,    15.9±0.06ms   15.3±0.3ms 
                |     16.0±0.2ms   15.7±0.3ms 
              ===== ============= ============

[ 94.79%] ··· io.csv.ReadUint64Integers.time_read_uint64                                                                                          3.13±0.3ms
[ 95.83%] ··· io.csv.ReadUint64Integers.time_read_uint64_na_values                                                                                5.54±0.3ms
[ 96.88%] ··· io.csv.ReadUint64Integers.time_read_uint64_neg_values                                                                               5.35±0.2ms
[ 97.92%] ··· io.csv.ToCSV.time_frame                                                                                                                     ok
[ 97.92%] ··· ======= ============
                kind              
              ------- ------------
                wide   146±0.7ms  
                long   212±0.4ms  
               mixed   25.9±0.1ms 
              ======= ============

[ 98.96%] ··· io.csv.ToCSVDatetime.time_frame_date_formatting                                                                                     11.9±0.1ms
[100.00%] ··· io.csv.ToCSVDatetimeBig.time_frame                                                                                                          ok
[100.00%] ··· ======== =============
                obs                 
              -------- -------------
                1000    7.25±0.05ms 
               10000     64.6±0.4ms 
               100000    690±100ms  
              ======== =============


BENCHMARKS NOT SIGNIFICANTLY CHANGED.

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@ales-erjavec ales-erjavec requested a review from WillAyd August 10, 2020 11:24
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good - very nicely done. @jreback

@jreback jreback merged commit c87e40c into pandas-dev:master Aug 11, 2020
@jreback
Copy link
Contributor

jreback commented Aug 11, 2020

thanks @ales-erjavec

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_csv fails with float_precision="round_trip" and decimal=","
6 participants