read_html and read_json should provide a cache #6456

Closed
c0indev3l opened this issue Feb 23, 2014 · 8 comments
Labels
Docs, IO HTML, IO JSON

Comments

@c0indev3l

Hello,

pandas.io.html.read_html should provide a cache, such as the one requests_cache provides:
https://requests-cache.readthedocs.org/en/latest/

Persistence to SQLite, MongoDB, Redis, etc. is very convenient.

Maybe some other pandas functions could also use this cache mechanism; pandas.io.json.read_json, for example.

This is what I do:

import requests
import requests_cache

expire_after = 15 * 60  # expire cached entries after 15 minutes (use None to never expire)
requests_cache.install_cache('req_cache', backend='sqlite', expire_after=expire_after)

import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO

req = requests.get(url)  # url: the page to fetch, defined elsewhere
io = StringIO(req.text)  # .text decodes the response; .content is raw bytes
df = pd.read_html(io)
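
For read_json, a minimal sketch of the same pattern (not in the original comment; json_url is a placeholder):

# the installed cache intercepts any requests.get call, so a JSON endpoint
# benefits from the same requests_cache setup
req = requests.get(json_url)  # json_url: a JSON API endpoint (placeholder)
df = pd.read_json(StringIO(req.text))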

Kind regards

@jreback
Contributor

jreback commented Feb 23, 2014

This is a nice feature, but I don't think it should be part of core pandas. requests is not currently a dependency of pandas.

A cookbook recipe or a small section in the io.rst docs would be acceptable.

@jreback jreback added the Docs label Feb 23, 2014
@jreback jreback added this to the Someday milestone Feb 23, 2014
@c0indev3l
Author

Lots of people use pandas inside the IPython notebook (which is a great tool), so they don't have to worry about an API's request limits (the number of times you can send a request to a server). But when you are working on a script that you keep modifying and re-running, it's quite different, and in that case a cache can be very useful. Sometimes a cache can be dangerous because you may get stale data, but that's the risk.

As for requests: it's a great library, and I like how it simplifies things, but I also understand that adding a dependency must be a deliberate decision.

I also wonder if you know of other pandas functions that could benefit from such a feature.

@dalejung
Contributor

Instead of executing the script repeatedly in a new kernel, I would suggest using IPython and %run. You can then split your heavy data calls out into a data.py module, which gets cached by the normal Python module system. If you're using something like vim, you can set up a hotkey to run your file in a tmux window.

.vimrc

autocmd FileType python map <buffer> <S-r> :w<CR>:!tmux send-keys -t ipython "\%time \%run %:p" enter<CR><CR>

This runs the file in a tmux session named ipython.
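
For illustration, a minimal sketch of the data.py pattern described above (not code from the comment; the file names, URL, and variable names are placeholders):

# data.py -- the heavy fetch runs once per IPython kernel; later %run
# invocations re-import this module from sys.modules instead of re-downloading
import pandas as pd

URL = "https://example.com/tables.html"  # placeholder
tables = pd.read_html(URL)               # executed only on first import

# script.py -- iterate on this file freely and re-%run it; `import data`
# is nearly free after the first run because Python caches imported modules
import data

df = data.tables[0]
print(df.head())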

@c0indev3l
Author

Thanks for this tip, but you have to admit that adding a single line like requests_cache.install_cache('req_cache', backend='sqlite', expire_after=None) (or something similar) is easier (especially when you don't know the script being worked on very well).

@cpcloud
Member

cpcloud commented Feb 24, 2014

@c0indev3l I think your solution above is great. pandas plays well with others. We don't want to add any more deps (even optional ones) right now.

@jreback
Contributor

jreback commented Feb 24, 2014

Closing as not in pandas' purview, though @c0indev3l, if you'd like to document your solution, we'd be more than happy to put it in the cookbook.

@femtotrader

A fairly clean solution would be to add another parameter (a requests.Session or a requests_cache.CachedSession) to the _read function.

So we could have

def _read(obj, session=None):

instead of

def _read(obj):

If no session is passed, the classic urlopen is used; if session is not None, the session (from either requests or requests_cache) is used instead:

if _is_url(obj):
    if session is None:
        with urlopen(obj) as url:
            text = url.read()
    else:
        text = session.get(obj).content  # use the passed-in (possibly caching) session

read_html should also expose this session parameter.
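
From the caller's side, the proposal would look something like this (a sketch only: CachedSession is real requests_cache API, but the session= keyword on read_html is the proposed parameter and does not exist in pandas):

import pandas as pd
from requests_cache import CachedSession

# CachedSession transparently caches HTTP responses in a local SQLite file
session = CachedSession('demo_cache', backend='sqlite', expire_after=3600)

# hypothetical call: `session=` is the parameter proposed in this comment,
# not an actual pandas argument
frames = pd.read_html('https://example.com/options.html', session=session)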

That's just my own experiment with https://github.com/femtotrader/pandas_datareaders

I added a cache mechanism to nearly every pandas DataReader (I'm still having some issues with FamaFrench). I ran into this read_html issue with Yahoo Finance options (frames = pd.read_html(url)).

I think this issue should be linked to #8961 .

@edvardm

edvardm commented May 16, 2022

I see this issue is closed, but for anyone else looking for a solution:

Better to separate concerns: requests is a really nice library for fetching resources over HTTP, and requests-cache provides the desired caching functionality. So you can just do:

import requests
import pandas as pd

try:
    import requests_cache
    # cache any request for 1 week (7 days * 86400 seconds)
    requests_cache.install_cache("my_cache", expire_after=7 * 86400)
except ImportError:
    print("Warning: requests-cache not installed, NOT using cache")

def get_text(url: str) -> str:
    # use the requests library explicitly, so the call automatically goes
    # through `requests_cache` if it was set up above
    response = requests.get(url)
    response.raise_for_status()
    return response.text

# read_html returns a list of DataFrames, one per table found on the page
dfs = pd.read_html(get_text(my_url))  # my_url: the page to fetch
