BUG: round_trip parser initial/trailing whitespace #43714
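
To make the title concrete, here is a minimal pure-Python sketch of the behaviour it describes: a strict float parser that rejects tokens with initial or trailing whitespace, and a tolerant wrapper that skips that whitespace first. It is only a model written for illustration, not the pandas tokenizer code; the names `strict_round_trip` and `tolerant_round_trip` are hypothetical.

```python
import re

# Hypothetical model of a strict parser: the whole token must be a plain
# decimal number, so any leading or trailing blank makes it fail.
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def strict_round_trip(token: str) -> float:
    """Parse a token that must contain nothing but the number itself."""
    if _NUMBER.fullmatch(token) is None:
        raise ValueError(f"not a float: {token!r}")
    # Delegate the actual conversion to Python's built-in float().
    return float(token)


def tolerant_round_trip(token: str) -> float:
    """Skip initial and trailing spaces/tabs, then parse strictly."""
    return strict_round_trip(token.strip(" \t"))
```

With this model, `tolerant_round_trip("  0.1 ")` yields the same value as `strict_round_trip("0.1")`, while `strict_round_trip(" 0.1")` raises `ValueError`. Stripping only at the two ends keeps tokens such as `"0.1 x"` invalid.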

Merged

ales-erjavec
Contributor

ales-erjavec force-pushed the round-trip-parser-trailing-space branch from 1c5d7b8 to 9112165 on September 23, 2021 10:31
ales-erjavec force-pushed the round-trip-parser-trailing-space branch from 9112165 to 1ef7d82 on September 23, 2021 10:49
jreback added the IO CSV read_csv, to_csv label Sep 28, 2021
Contributor

@jreback jreback left a comment

can you run the asv suite for all csv parsing. i would expect this to have some sort of perf hit.

@pep8speaks

pep8speaks commented Oct 1, 2021

Hello @ales-erjavec! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-01 19:35:16 UTC

ales-erjavec force-pushed the round-trip-parser-trailing-space branch from 91ab7be to 32305bc on October 1, 2021 19:19
@ales-erjavec
Contributor Author

> can you run the asv suite for all csv parsing. i would expect this to have some sort of perf hit.

asv continuous -f 1.1 -E virtualenv --python 3.9 origin/master round-trip-parser-trailing-space -b "^io.csv"

· No executable found for python 3.8
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.9-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
·· Installing 1ef7d82a <round-trip-parser-trailing-space> into virtualenv-py3.9-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt.
· Running 58 total benchmarks (2 commits * 1 environments * 29 benchmarks)
[  0.00%] · For pandas commit 36d5c9b8 <master> (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.9-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt..
[  0.00%] ·· Benchmarking virtualenv-py3.9-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[  0.86%] ··· Running (io.csv.ParseDateComparison.time_read_csv_dayfirst--)......................
[ 20.69%] ··· Running (io.csv.ToCSV.time_frame--)......
[ 25.00%] · For pandas commit 1ef7d82a <round-trip-parser-trailing-space> (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.9-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt..
[ 25.00%] ·· Benchmarking virtualenv-py3.9-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 25.86%] ··· Running (io.csv.ParseDateComparison.time_read_csv_dayfirst--)......................
[ 45.69%] ··· Running (io.csv.ToCSV.time_frame--)......
[ 50.00%] · For pandas commit 1ef7d82a <round-trip-parser-trailing-space> (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.9-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 50.86%] ··· io.csv.ParseDateComparison.time_read_csv_dayfirst                                                                    ok
[ 50.86%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      5.01±0.2ms 
                   True      2.62±0.1ms 
              ============= ============

[ 51.72%] ··· io.csv.ParseDateComparison.time_to_datetime_dayfirst                                                                 ok
[ 51.72%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      5.04±0.1ms 
                   True      2.49±0.3ms 
              ============= ============

[ 52.59%] ··· io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY                                                        ok
[ 52.59%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      12.8±0.4ms 
                   True      2.88±0.3ms 
              ============= ============

[ 53.45%] ··· io.csv.ReadCSVCachedParseDates.time_read_csv_cached                                                                  ok
[ 53.45%] ··· ========== ============= =============
              --                    engine          
              ---------- ---------------------------
               do_cache        c           python   
              ========== ============= =============
                 True      1.54±0.1ms    1.93±0.1ms 
                False     1.28±0.04ms   2.03±0.07ms 
              ========== ============= =============

[ 54.31%] ··· io.csv.ReadCSVCategorical.time_convert_direct                                                                        ok
[ 54.31%] ··· ======== ============
               engine              
              -------- ------------
                 c      19.2±0.7ms 
               python    113±3ms   
              ======== ============

[ 55.17%] ··· io.csv.ReadCSVCategorical.time_convert_post                                                                          ok
[ 55.17%] ··· ======== ============
               engine              
              -------- ------------
                 c      29.3±0.4ms 
               python    101±2ms   
              ======== ============

[ 56.03%] ··· io.csv.ReadCSVComment.time_comment                                                                                   ok
[ 56.03%] ··· ======== ============
               engine              
              -------- ------------
                 c      13.4±0.5ms 
               python   12.5±0.7ms 
              ======== ============

[ 56.90%] ··· io.csv.ReadCSVConcatDatetime.time_read_csv                                                                   17.3±0.8ms
[ 57.76%] ··· io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv                                                               ok
[ 57.76%] ··· ================ ============
               bad_date_value              
              ---------------- ------------
                    nan         8.91±0.4ms 
                     0          6.30±0.3ms 
                     ''         7.34±0.2ms 
              ================ ============

[ 58.62%] ··· io.csv.ReadCSVDInferDatetimeFormat.time_read_csv                                                                     ok
[ 58.62%] ··· ======================= ============ ============= =============
              --                                       format                 
              ----------------------- ----------------------------------------
               infer_datetime_format     custom       iso8601         ymd     
              ======================= ============ ============= =============
                        True           3.41±0.1ms    1.57±0.1ms    1.56±0.1ms 
                       False            65.1±2ms    1.19±0.03ms   1.15±0.09ms 
              ======================= ============ ============= =============

[ 59.48%] ··· io.csv.ReadCSVEngine.time_read_bytescsv                                                                              ok
[ 59.48%] ··· ========= ============
                engine              
              --------- ------------
                  c      11.7±0.3ms 
                python   160±0.7ms  
               pyarrow   5.03±0.2ms 
              ========= ============

[ 60.34%] ··· io.csv.ReadCSVEngine.time_read_stringcsv                                                                             ok
[ 60.34%] ··· ========= ============
                engine              
              --------- ------------
                  c      11.9±0.2ms 
                python    161±2ms   
               pyarrow   6.19±0.3ms 
              ========= ============

[ 61.21%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv                                                                           ok
[ 61.21%] ··· ===== ============= ============= ================ ============= ============= ================
              --                                    decimal / float_precision                                
              ----- -----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip     _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============= ============= ================
                ,    1.06±0.03ms     995±40μs      1.75±0.1ms     1.08±0.06ms   1.07±0.01ms    1.16±0.07ms   
                ;    1.07±0.03ms   1.02±0.04ms    1.77±0.06ms     1.13±0.06ms   1.09±0.05ms    1.09±0.06ms   
              ===== ============= ============= ================ ============= ============= ================

[ 62.07%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine                                                             ok
[ 62.07%] ··· ===== ============= ============= ================ ============= ============= ================
              --                                    decimal / float_precision                                
              ----- -----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip     _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============= ============= ================
                ,    2.55±0.09ms   2.52±0.07ms     2.53±0.1ms     2.13±0.05ms   2.07±0.06ms    2.25±0.05ms   
                ;    2.45±0.09ms    2.62±0.1ms    2.56±0.05ms     2.07±0.07ms   2.08±0.06ms    2.10±0.06ms   
              ===== ============= ============= ================ ============= ============= ================

[ 62.93%] ··· io.csv.ReadCSVMemoryGrowth.mem_parser_chunks                                                                         ok
[ 62.93%] ··· ======== ===
               engine     
              -------- ---
                 c      0 
               python   0 
              ======== ===

[ 63.79%] ··· io.csv.ReadCSVParseDates.time_baseline                                                                               ok
[ 63.79%] ··· ======== =============
               engine               
              -------- -------------
                 c      1.12±0.03ms 
               python   1.15±0.05ms 
              ======== =============

[ 64.66%] ··· io.csv.ReadCSVParseDates.time_multiple_date                                                                          ok
[ 64.66%] ··· ======== =============
               engine               
              -------- -------------
                 c      1.36±0.09ms 
               python   1.34±0.07ms 
              ======== =============

[ 65.52%] ··· io.csv.ReadCSVParseSpecialDate.time_read_special_date                                                                ok
[ 65.52%] ··· ======= ============= ============
              --                engine          
              ------- --------------------------
               value        c          python   
              ======= ============= ============
                 mY    5.23±0.09ms   22.8±0.3ms 
                mdY    2.54±0.07ms   9.02±0.3ms 
                 hm    2.31±0.04ms   8.45±0.2ms 
              ======= ============= ============

[ 66.38%] ··· io.csv.ReadCSVSkipRows.time_skipprows                                                                                ok
[ 66.38%] ··· ========== ============ ============ ============
              --                         engine                
              ---------- --------------------------------------
               skiprows       c          python      pyarrow   
              ========== ============ ============ ============
                 None     9.73±0.2ms   44.2±0.8ms   7.54±0.4ms 
                10000     8.11±0.2ms   32.6±0.5ms   7.60±0.2ms 
              ========== ============ ============ ============

[ 67.24%] ··· io.csv.ReadCSVThousands.time_thousands                                                                               ok
[ 67.24%] ··· ===== ============= =============== ============ ============
              --                       thousands / engine                  
              ----- -------------------------------------------------------
               sep     None / c    None / python     , / c      , / python 
              ===== ============= =============== ============ ============
                ,     7.74±0.2ms     41.8±0.3ms    10.2±0.1ms    99.9±2ms  
                |    7.81±0.05ms     42.6±0.4ms    9.34±0.1ms    99.8±2ms  
              ===== ============= =============== ============ ============

[ 68.10%] ··· io.csv.ReadUint64Integers.time_read_uint64                                                                   2.47±0.1ms
[ 68.97%] ··· io.csv.ReadUint64Integers.time_read_uint64_na_values                                                         4.17±0.2ms
[ 69.83%] ··· io.csv.ReadUint64Integers.time_read_uint64_neg_values                                                        3.97±0.1ms
[ 70.69%] ··· io.csv.ToCSV.time_frame                                                                                              ok
[ 70.69%] ··· ======= ============
                kind              
              ------- ------------
                wide   58.5±0.9ms 
                long   73.5±0.5ms 
               mixed   10.5±0.1ms 
              ======= ============

[ 71.55%] ··· io.csv.ToCSVDatetime.time_frame_date_formatting                                                             9.44±0.08ms
[ 72.41%] ··· io.csv.ToCSVDatetimeBig.time_frame                                                                                   ok
[ 72.41%] ··· ======== =============
                obs                 
              -------- -------------
                1000    2.93±0.06ms 
               10000     24.9±0.4ms 
               100000    246±0.5ms  
              ======== =============

[ 73.28%] ··· io.csv.ToCSVIndexes.time_head_of_multiindex                                                                  1.25±0.02s
[ 74.14%] ··· io.csv.ToCSVIndexes.time_multiindex                                                                             480±2ms
[ 75.00%] ··· io.csv.ToCSVIndexes.time_standard_index                                                                         375±5ms
[ 75.00%] · For pandas commit 36d5c9b8 <master> (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.9-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt..
[ 75.00%] ·· Benchmarking virtualenv-py3.9-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 75.86%] ··· io.csv.ParseDateComparison.time_read_csv_dayfirst                                                                    ok
[ 75.86%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      5.16±0.3ms 
                   True      2.76±0.1ms 
              ============= ============

[ 76.72%] ··· io.csv.ParseDateComparison.time_to_datetime_dayfirst                                                                 ok
[ 76.72%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      5.34±0.1ms 
                   True      2.73±0.2ms 
              ============= ============

[ 77.59%] ··· io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY                                                        ok
[ 77.59%] ··· ============= ============
               cache_dates              
              ------------- ------------
                  False      13.1±0.3ms 
                   True      2.82±0.2ms 
              ============= ============

[ 78.45%] ··· io.csv.ReadCSVCachedParseDates.time_read_csv_cached                                                                  ok
[ 78.45%] ··· ========== ============= ============
              --                   engine          
              ---------- --------------------------
               do_cache        c          python   
              ========== ============= ============
                 True     1.34±0.06ms   1.99±0.2ms 
                False      1.47±0.1ms   2.12±0.2ms 
              ========== ============= ============

[ 79.31%] ··· io.csv.ReadCSVCategorical.time_convert_direct                                                                        ok
[ 79.31%] ··· ======== ============
               engine              
              -------- ------------
                 c      20.1±0.6ms 
               python    116±2ms   
              ======== ============

[ 80.17%] ··· io.csv.ReadCSVCategorical.time_convert_post                                                                          ok
[ 80.17%] ··· ======== ============
               engine              
              -------- ------------
                 c      30.3±0.2ms 
               python    105±2ms   
              ======== ============

[ 81.03%] ··· io.csv.ReadCSVComment.time_comment                                                                                   ok
[ 81.03%] ··· ======== ============
               engine              
              -------- ------------
                 c      13.6±0.7ms 
               python   13.6±0.8ms 
              ======== ============

[ 81.90%] ··· io.csv.ReadCSVConcatDatetime.time_read_csv                                                                     18.1±1ms
[ 82.76%] ··· io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv                                                               ok
[ 82.76%] ··· ================ ============
               bad_date_value              
              ---------------- ------------
                    nan         9.72±0.5ms 
                     0          6.76±0.9ms 
                     ''         7.87±0.4ms 
              ================ ============

[ 83.62%] ··· io.csv.ReadCSVDInferDatetimeFormat.time_read_csv                                                                     ok
[ 83.62%] ··· ======================= ============ ============= ============
              --                                       format                
              ----------------------- ---------------------------------------
               infer_datetime_format     custom       iso8601        ymd     
              ======================= ============ ============= ============
                        True           3.77±0.3ms    1.68±0.1ms   1.57±0.1ms 
                       False            70.0±5ms    1.40±0.06ms   1.21±0.2ms 
              ======================= ============ ============= ============

[ 84.48%] ··· io.csv.ReadCSVEngine.time_read_bytescsv                                                                              ok
[ 84.48%] ··· ========= ============
                engine              
              --------- ------------
                  c      11.6±0.6ms 
                python    167±4ms   
               pyarrow   5.60±0.2ms 
              ========= ============

[ 85.34%] ··· io.csv.ReadCSVEngine.time_read_stringcsv                                                                             ok
[ 85.34%] ··· ========= ============
                engine              
              --------- ------------
                  c      12.4±0.8ms 
                python    166±5ms   
               pyarrow   5.44±0.2ms 
              ========= ============

[ 86.21%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv                                                                           ok
[ 86.21%] ··· ===== ============= ============= ================ ============ ============= ================
              --                                   decimal / float_precision                                
              ----- ----------------------------------------------------------------------------------------
               sep     . / None      . / high    . / round_trip    _ / None      _ / high    _ / round_trip 
              ===== ============= ============= ================ ============ ============= ================
                ,     1.08±0.2ms    1.13±0.1ms    1.73±0.08ms     1.14±0.1ms   1.17±0.04ms    1.12±0.07ms   
                ;    1.05±0.09ms   1.09±0.05ms    1.70±0.07ms     1.12±0.1ms    1.22±0.2ms    1.24±0.09ms   
              ===== ============= ============= ================ ============ ============= ================

[ 87.07%] ··· io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine                                                             ok
[ 87.07%] ··· ===== ============ ============ ================ ============= ============ ================
              --                                  decimal / float_precision                               
              ----- --------------------------------------------------------------------------------------
               sep    . / None     . / high    . / round_trip     _ / None     _ / high    _ / round_trip 
              ===== ============ ============ ================ ============= ============ ================
                ,    2.70±0.2ms   2.72±0.1ms     2.66±0.1ms      2.24±0.2ms   2.45±0.1ms     2.30±0.2ms   
                ;    2.53±0.1ms   2.73±0.1ms     2.62±0.2ms     2.35±0.09ms   2.31±0.1ms     2.17±0.1ms   
              ===== ============ ============ ================ ============= ============ ================

[ 87.93%] ··· io.csv.ReadCSVMemoryGrowth.mem_parser_chunks                                                                         ok
[ 87.93%] ··· ======== ===
               engine     
              -------- ---
                 c      0 
               python   0 
              ======== ===

[ 88.79%] ··· io.csv.ReadCSVParseDates.time_baseline                                                                               ok
[ 88.79%] ··· ======== =============
               engine               
              -------- -------------
                 c      1.21±0.07ms 
               python    1.23±0.2ms 
              ======== =============

[ 89.66%] ··· io.csv.ReadCSVParseDates.time_multiple_date                                                                          ok
[ 89.66%] ··· ======== =============
               engine               
              -------- -------------
                 c      1.54±0.05ms 
               python    1.56±0.1ms 
              ======== =============

[ 90.52%] ··· io.csv.ReadCSVParseSpecialDate.time_read_special_date                                                                ok
[ 90.52%] ··· ======= ============ ============
              --                engine         
              ------- -------------------------
               value       c          python   
              ======= ============ ============
                 mY    5.48±0.1ms    23.9±1ms  
                mdY    3.05±0.1ms   8.99±0.7ms 
                 hm    2.49±0.3ms   9.10±0.3ms 
              ======= ============ ============

[ 91.38%] ··· io.csv.ReadCSVSkipRows.time_skipprows                                                                                ok
[ 91.38%] ··· ========== ============ ============ ============
              --                         engine                
              ---------- --------------------------------------
               skiprows       c          python      pyarrow   
              ========== ============ ============ ============
                 None     10.6±0.5ms   45.1±0.8ms   7.97±0.5ms 
                10000     7.90±0.2ms    33.8±1ms    7.59±0.1ms 
              ========== ============ ============ ============

[ 92.24%] ··· io.csv.ReadCSVThousands.time_thousands                                                                               ok
[ 92.24%] ··· ===== ============ =============== ============ ============
              --                      thousands / engine                  
              ----- ------------------------------------------------------
               sep    None / c    None / python     , / c      , / python 
              ===== ============ =============== ============ ============
                ,    8.07±0.3ms      43.6±1ms     10.8±0.4ms    101±3ms   
                |    8.29±0.5ms      43.7±1ms     9.54±0.2ms    103±2ms   
              ===== ============ =============== ============ ============

[ 93.10%] ··· io.csv.ReadUint64Integers.time_read_uint64                                                                   2.61±0.1ms
[ 93.97%] ··· io.csv.ReadUint64Integers.time_read_uint64_na_values                                                         4.36±0.5ms
[ 94.83%] ··· io.csv.ReadUint64Integers.time_read_uint64_neg_values                                                        4.19±0.3ms
[ 95.69%] ··· io.csv.ToCSV.time_frame                                                                                              ok
[ 95.69%] ··· ======= ============
                kind              
              ------- ------------
                wide   59.8±0.3ms 
                long   75.9±0.5ms 
               mixed   10.8±0.3ms 
              ======= ============

[ 96.55%] ··· io.csv.ToCSVDatetime.time_frame_date_formatting                                                              9.84±0.2ms
[ 97.41%] ··· io.csv.ToCSVDatetimeBig.time_frame                                                                                   ok
[ 97.41%] ··· ======== ============
                obs                
              -------- ------------
                1000    2.99±0.1ms 
               10000    25.3±0.3ms 
               100000    249±5ms   
              ======== ============

[ 98.28%] ··· io.csv.ToCSVIndexes.time_head_of_multiindex                                                                  1.27±0.01s
[ 99.14%] ··· io.csv.ToCSVIndexes.time_multiindex                                                                             482±3ms
[100.00%] ··· io.csv.ToCSVIndexes.time_standard_index                                                                         374±3ms

BENCHMARKS NOT SIGNIFICANTLY CHANGED.
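
The `-f 1.1` option in the command above asks asv to flag only benchmarks whose timing changes by more than a factor of 1.1 in either direction. As a rough sanity check of the summary line, the sketch below applies that threshold to the `. / round_trip` column of `io.csv.ReadCSVFloatPrecision.time_read_csv`, using the round 2/2 mean timings from the log for master and for the branch. The helper `is_flagged` is a hypothetical name, and the sketch ignores the ± noise estimates that asv itself also considers.

```python
FACTOR = 1.1  # value passed to `asv continuous -f`

# (sep, float_precision) -> (master mean in ms, branch mean in ms),
# copied from the round 2/2 tables in the log above.
timings_ms = {
    (",", "round_trip"): (1.73, 1.75),
    (";", "round_trip"): (1.70, 1.77),
}


def is_flagged(before: float, after: float, factor: float = FACTOR) -> bool:
    """Return True if the ratio after/before falls outside [1/factor, factor]."""
    ratio = after / before
    return ratio > factor or ratio < 1.0 / factor


for key, (before, after) in timings_ms.items():
    # Print the ratio and whether it would cross the threshold.
    print(key, round(after / before, 3), is_flagged(before, after))
```

Both ratios stay well inside the 1.1 threshold, which is consistent with the "not significantly changed" summary for the code path this PR touches.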


ales-erjavec force-pushed the round-trip-parser-trailing-space branch from 32305bc to 8b611d0 on October 1, 2021 19:35
jreback added this to the 1.4 milestone Oct 2, 2021
jreback merged commit bd94bb1 into pandas-dev:master Oct 2, 2021
@jreback
Contributor

jreback commented Oct 2, 2021

thanks @ales-erjavec very nice!

gasparitiago pushed a commit to gasparitiago/pandas that referenced this pull request Oct 9, 2021
Labels
IO CSV read_csv, to_csv
Development

Successfully merging this pull request may close these issues.

BUG: read_csv float_precision="round_trip" parser does not handle initial/trailing spaces
3 participants