Skip to content

ENH: added new cache_dates parameter for read_csv func #25990

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 7, 2019

Conversation

anmyachev
Copy link
Contributor

@anmyachev anmyachev commented Apr 4, 2019

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This change allows the user to use caching when parsing dates, which is provided by to_datetime function

@pep8speaks
Copy link

pep8speaks commented Apr 4, 2019

Hello @anmyachev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-07 08:30:55 UTC

@codecov
Copy link

codecov bot commented Apr 4, 2019

Codecov Report

Merging #25990 into master will decrease coverage by 49.95%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25990       +/-   ##
===========================================
- Coverage   91.86%    41.9%   -49.96%     
===========================================
  Files         175      175               
  Lines       52543    52544        +1     
===========================================
- Hits        48267    22021    -26246     
- Misses       4276    30523    +26247
Flag Coverage Δ
#multiple ?
#single 41.9% <100%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 48.24% <100%> (-47.55%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.36%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
... and 131 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 70773d9...c9b258f. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 4, 2019

Codecov Report

Merging #25990 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25990      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52377    52378       +1     
==========================================
- Hits        48182    48180       -2     
- Misses       4195     4198       +3
Flag Coverage Δ
#multiple 90.53% <100%> (ø) ⬆️
#single 40.77% <100%> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.84% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.7% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64c1127...687d0c4. Read the comment docs.

@gfyoung gfyoung added Enhancement IO CSV read_csv, to_csv Performance Memory or execution speed performance Datetime Datetime data dtype labels Apr 4, 2019
@jreback
Copy link
Contributor

jreback commented Apr 5, 2019

while we are doing this may as well flip the flag in pd.to_datetime as well to True. (it could be a separate PR), but need to run asv's.

@anmyachev anmyachev force-pushed the cache_dates_api_change branch from 9b0a443 to 2fd52d0 Compare April 5, 2019 06:43
@anmyachev
Copy link
Contributor Author

@jreback I made sure that changing to_datetime signature should be done in a separate request, because, unfortunately, it is not enough to change it alone, I will have to change the signatures of other functions. So last commit will be removed from this PR.

@anmyachev anmyachev force-pushed the cache_dates_api_change branch from 0925e56 to 2fd52d0 Compare April 5, 2019 08:46
@anmyachev
Copy link
Contributor Author

Note for reviewers: it seems that this PR have the same errors as master branch

@anmyachev anmyachev force-pushed the cache_dates_api_change branch from 7ca68a1 to 002c031 Compare April 5, 2019 16:58
@gfyoung
Copy link
Member

gfyoung commented Apr 8, 2019

We already have so many datetime parameters at this point, but I'm thinking at some point we should condense them into a dict. This would allow for ease of parameterization without bloat.

@jreback : What do you think?

@vnlitvinov
Copy link
Contributor

I'm thinking at some point we should condense them into a dict.

Maybe declare a collections.namedtuple for those arguments, so they're fix-named?

@gfyoung
Copy link
Member

gfyoung commented Apr 8, 2019

Maybe declare a collections.namedtuple for those arguments, so they're fix-named?

@vnlitvin : I'm actually inclined to leave it as dict for the time being. I would like to keep that portion of the API flexible so that the only limits on how people want to parse dates is the signature underlying our datetime parsing.

@anmyachev
Copy link
Contributor Author

asv continuous -f 1.1 origin/master cache_dates_api_change -a warmup_time=1 -a sample_time=1

master our_patch ratio test_name
1.45±0.01ms 1.87±0.01ms 1.29 io.csv.ReadCSVParseDates.time_baseline
1.74±0.02ms 2.24±0.02ms 1.29 io.csv.ReadCSVParseDates.time_multiple_date
1.41±0.01ms 1.56±0.02ms 1.11 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'iso8601')
1.34±0ms 1.47±0.02ms 1.10 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')
47.3±0.3ms 6.45±0.04ms 0.14 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True)

@anmyachev
Copy link
Contributor Author

@gfyoung condensation of parameters into dict should be in a separate PR, shouldn't it?

@gfyoung
Copy link
Member

gfyoung commented Apr 9, 2019

@anmyachev : Absolutely! Separate PR would be most appropriate.

@anmyachev
Copy link
Contributor Author

@gfyoung ok. Anything left to fix in this PR?

Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Over to you @jreback

@jreback jreback added this to the 0.25.0 milestone Apr 9, 2019
@jreback
Copy link
Contributor

jreback commented Apr 9, 2019

small comments, otherwise lgtm. Is that an issue for changing this in pd.to_datetime as well? (the default)

@anmyachev
Copy link
Contributor Author

@jreback when changing the flag in pd.to_datetime to true, errors occur on some platforms. I do not remember exactly what the problem was there. But in the near future I will create the PR right with this change. That's when I can answer your question.

@anmyachev anmyachev force-pushed the cache_dates_api_change branch from ed521d6 to 94c0eb5 Compare April 10, 2019 06:59
@jreback
Copy link
Contributor

jreback commented Apr 10, 2019

@anmyachev this is a troubling comment. #25990 (comment)

can you show me an example?

@anmyachev
Copy link
Contributor Author

anmyachev commented Apr 10, 2019

@jreback after getting result of CI tests for #26043 I can show you an example.
I think we can continue our discussion there.

@vnlitvinov
Copy link
Contributor

@jreback is there anything left to fix before this could be merged?

@vnlitvinov vnlitvinov force-pushed the cache_dates_api_change branch from 94c0eb5 to 1c85780 Compare April 18, 2019 15:14
@vnlitvinov
Copy link
Contributor

@jreback

@anmyachev this is a troubling comment. #25990 (comment)

can you show me an example?

Does our example (#26097) satisfy you here? I feel that it shouldn't be a blocker for this PR.

P.S. I've rebased on master to get rid of merge conflicts (and to have a clean history).

@vnlitvinov
Copy link
Contributor

@jreback @gfyoung anything needed from us?

@jreback
Copy link
Contributor

jreback commented Apr 28, 2019

is there an issue for this?

@anmyachev
Copy link
Contributor Author

is there an issue for this?

@jreback no errors detected for this.

@anmyachev anmyachev force-pushed the cache_dates_api_change branch from ee0b37f to af757f2 Compare April 28, 2019 18:24
@jreback
Copy link
Contributor

jreback commented Apr 28, 2019

no errors detected for this.

what does this mean?

@anmyachev
Copy link
Contributor Author

no errors detected for this.

what does this mean?

I mean: I not found issue for this.

@vnlitvinov
Copy link
Contributor

is there an issue for this?

If you mean Github issue then no, we didn't file one and we didn't fine any.
That's why this PR is used as a reference in whatsnew.

@jreback
Copy link
Contributor

jreback commented May 7, 2019

can you merge master.

@anmyachev anmyachev force-pushed the cache_dates_api_change branch from af757f2 to 687d0c4 Compare May 7, 2019 08:30
@anmyachev
Copy link
Contributor Author

@jreback Done & green

@jreback jreback merged commit f65742c into pandas-dev:master May 7, 2019
@jreback
Copy link
Contributor

jreback commented May 7, 2019

thanks @anmyachev

@TomAugspurger
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Enhancement IO CSV read_csv, to_csv Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants