BENCH/REF: parametrize CSV benchmarks on engine #38442

Merged: 11 commits merged into pandas-dev:master on Dec 17, 2020

arw2019
Member

@arw2019 arw2019 commented Dec 13, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

The diff in the basic PR implementing the pyarrow-based CSV engine (#38370) is quite big. Part of that PR is a small refactor of the CSV I/O benchmarks so that they take engine as a parameter. Most of that refactor does not depend on having the pyarrow engine, so I'm spinning it off here to de-bloat #38370.
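
For readers less familiar with asv, here is a minimal sketch of what "taking engine as a parameter" can look like. The class name `ReadCSVEngineSketch` and the in-memory data are hypothetical and are not the code in this PR; the sketch only assumes asv's convention of passing each value in `params` to `setup` and to every `time_*` method.

```python
from io import StringIO

import pandas as pd


class ReadCSVEngineSketch:
    # Hypothetical benchmark class, not taken from the pandas benchmark suite.
    # asv passes each value in `params` to setup() and to every time_* method.
    params = ["c", "python"]
    param_names = ["engine"]

    def setup(self, engine):
        # Build a small CSV in memory so the sketch touches no files on disk.
        df = pd.DataFrame({"a": range(1000), "b": range(1000)})
        self.data = df.to_csv(index=False)

    def time_read_csv(self, engine):
        # Only the parser engine varies between runs of this method.
        pd.read_csv(StringIO(self.data), engine=engine)
```

Under this pattern, adding another engine later should only require extending `params`, provided the benchmark uses options that engine supports.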

@jreback jreback added Benchmark Performance (ASV) benchmarks IO CSV read_csv, to_csv labels Dec 13, 2020
@jreback jreback added this to the 1.3 milestone Dec 13, 2020
@jreback
Contributor

jreback commented Dec 13, 2020

looks fine, checks are failing though, ping on green.

cc @gfyoung @WillAyd if comments.

@arw2019
Member Author

arw2019 commented Dec 13, 2020

Atm I'm not sure what's going on with the benchmark failure. The failure is in asv_bench/benchmarks/tslibs/timedelta.py, which doesn't get touched here.

@arw2019 arw2019 closed this Dec 13, 2020
@arw2019 arw2019 reopened this Dec 13, 2020
@mroeschke
Member

#38476 should fix the unrelated benchmark failure.

Member

@WillAyd WillAyd left a comment

Looks good, small question.

@@ -146,10 +146,10 @@ def time_read_csv(self, bad_date_value):
 class ReadCSVSkipRows(BaseIO):

     fname = "__test__.csv"
-    params = [None, 10000]
-    param_names = ["skiprows"]
+    params = ([None, 10000], ["c"])
Member

Is this supposed to be c and python or just c?

Member Author

On current master we only have the default (I think that's c)

We have c and python in all the other benchmarks so it seems reasonable to add that here. If there's a reason we don't want that I'll revert
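
To make the effect of adding `"python"` to ReadCSVSkipRows concrete: asv runs a parametrized benchmark once for every combination of parameter values, so a second engine doubles the number of runs for this class. A small sketch, using `itertools.product` as a stand-in for asv's own expansion; the `param_names` value here is an assumption about the final class, not something shown in the diff above.

```python
from itertools import product

# Parameter grid as it would look with both engines enabled.
params = ([None, 10000], ["c", "python"])
# Assumed names; the diff excerpt in this thread does not show param_names.
param_names = ["skiprows", "engine"]

# One dict per benchmark run: the Cartesian product of all parameter values.
combos = [dict(zip(param_names, values)) for values in product(*params)]

for combo in combos:
    print(combo)
```

With `["c"]` alone the same expansion yields two combinations; with both engines it yields four.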

@arw2019
Member Author

arw2019 commented Dec 15, 2020

> #38476 should fix the unrelated benchmark failure.

Thanks @mroeschke!!

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks!

Is there some logic in which classes are parameterized and which not?



class ReadCSVCategorical(BaseIO):
    fname = "__test__.csv"

Member

could also parametrize this one?

Member Author

Done.

@arw2019
Member Author

arw2019 commented Dec 15, 2020

> Is there some logic in which classes are parameterized and which not?

Not that I know of, beyond the fact that (AFAIK) the python engine supports all of these options, whereas the c/pyarrow engines only support a proper subset.

I'll do a pass through the entire file and parametrize where both c and python engines are supported.
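
One way to check whether a given benchmark can be parametrized on both engines is to probe `read_csv` directly. The sketch below is illustrative only: the helper `supports` and the sample data are hypothetical, and it relies on pandas raising `ValueError` when an explicitly requested engine does not support an option. `skipfooter` is used as an example of an option the c engine does not support.

```python
from io import StringIO

import pandas as pd

# Tiny sample with a trailing summary row, so skipfooter has something to skip.
CSV = "a,b\n1,2\n3,4\ntotal,6\n"


def supports(engine, **kwargs):
    """Return True if read_csv accepts these options with the given engine.

    Hypothetical helper: it treats a ValueError from read_csv as
    "this engine does not support this combination of options".
    """
    try:
        pd.read_csv(StringIO(CSV), engine=engine, **kwargs)
    except ValueError:
        return False
    return True


for option in ({"skiprows": 1}, {"skipfooter": 1}):
    for engine in ("c", "python"):
        print(option, engine, supports(engine, **option))
```

A benchmark whose options pass this check for both engines is a candidate for `params` that includes `"c"` and `"python"`; one that fails for `"c"` would stay python-only.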

@arw2019 arw2019 closed this Dec 16, 2020
@arw2019 arw2019 reopened this Dec 16, 2020
@arw2019
Member Author

arw2019 commented Dec 17, 2020

CI green

@jreback jreback merged commit 7d8a052 into pandas-dev:master Dec 17, 2020
@jreback
Contributor

jreback commented Dec 17, 2020

thanks @arw2019

@arw2019 arw2019 deleted the csv-bench branch December 17, 2020 01:43
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Labels
Benchmark Performance (ASV) benchmarks IO CSV read_csv, to_csv
5 participants