read_html and read_json should provide a cache #6456

Closed
c0indev3l opened this issue Feb 23, 2014 · 8 comments
Labels
Docs, IO HTML, IO JSON

Comments

@c0indev3l

Hello,

pandas.io.html.read_html should provide a cache, such as the one requests_cache provides:
https://requests-cache.readthedocs.org/en/latest/

Persistence to SQLite, MongoDB, Redis, etc. is very convenient.

Maybe some other pandas functions could also use this cache mechanism; pandas.io.json.read_json, for example.

This is what I do:

import requests
import requests_cache

expire_after = 15 * 60  # expire cached entries after 15 minutes (use None to never expire)
requests_cache.install_cache('req_cache', backend='sqlite', expire_after=expire_after)

import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO

req = requests.get(url)  # url: the page to fetch, defined elsewhere
io = StringIO(req.text)  # .text decodes the response; .content is raw bytes
df = pd.read_html(io)
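
For read_json, a minimal sketch of the same pattern (not in the original comment; json_url is a placeholder):

# the installed cache intercepts any requests.get call, so a JSON endpoint
# benefits from the same requests_cache setup
req = requests.get(json_url)  # json_url: a JSON API endpoint (placeholder)
df = pd.read_json(StringIO(req.text))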

Kind regards

@jreback
Contributor

jreback commented Feb 23, 2014

This is a nice feature, but I don't think it should be part of core pandas. requests is not currently a dependency of pandas.

A cookbook recipe or a small section in the io.rst docs would be acceptable.

@jreback jreback added the Docs label Feb 23, 2014
@jreback jreback added this to the Someday milestone Feb 23, 2014
@c0indev3l
Author

Lots of people use pandas inside the IPython notebook (which is a great tool), so they don't have to worry about an API's request limits (the number of times you can send a request to a server). But when you are working on a script that you keep modifying and re-running, it's quite different, and in that case a cache can be very useful. Sometimes a cache can be dangerous because you may get stale data, but that's the risk.

As for requests: it's a great library, and I like how it simplifies things, but I also understand that adding a dependency must be a deliberate decision.

I also wonder if you know of other pandas functions that could benefit from such a feature.

@dalejung
Contributor

Instead of executing the script repeatedly in a new kernel, I would suggest using IPython and %run. You can then split your heavy data calls out into a data.py module, which gets cached by the normal Python module system. If you're using something like vim, you can set up a hotkey to run your file in a tmux window.

.vimrc

autocmd FileType python map <buffer> <S-r> :w<CR>:!tmux send-keys -t ipython "\%time \%run %:p" enter<CR><CR>

This runs the file in a tmux session named ipython.
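
For illustration, a minimal sketch of the data.py pattern described above (not code from the comment; the file names, URL, and variable names are placeholders):

# data.py -- the heavy fetch runs once per IPython kernel; later %run
# invocations re-import this module from sys.modules instead of re-downloading
import pandas as pd

URL = "https://example.com/tables.html"  # placeholder
tables = pd.read_html(URL)               # executed only on first import

# script.py -- iterate on this file freely and re-%run it; `import data`
# is nearly free after the first run because Python caches imported modules
import data

df = data.tables[0]
print(df.head())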

@c0indev3l
Author

Thanks for this tip, but you have to admit that adding a single line like requests_cache.install_cache('req_cache', backend='sqlite', expire_after=None) (or something similar) is easier (especially when you don't know the script being worked on very well).

@cpcloud
Member

cpcloud commented Feb 24, 2014

@c0indev3l I think your solution above is great. pandas plays well with others. We don't want to add any more deps (even optional ones) right now.

@jreback
Contributor

jreback commented Feb 24, 2014

Closing as not in pandas' purview, though @c0indev3l, if you'd like to document your solution, we'd be more than happy to put it in the cookbook.

@femtotrader

A fairly clean solution would be to add another parameter (a requests.Session or a requests_cache.CachedSession) to the _read function.

So we could have

def _read(obj, session=None):

instead of

def _read(obj):

If no session is passed, the classic urlopen is used; if session is not None, the session (from either requests or requests_cache) is used instead:

if _is_url(obj):
    if session is None:
        with urlopen(obj) as url:
            text = url.read()
    else:
        text = session.get(obj).content  # use the passed-in (possibly caching) session

read_html should also expose this session parameter.
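
From the caller's side, the proposal would look something like this (a sketch only: CachedSession is real requests_cache API, but the session= keyword on read_html is the proposed parameter and does not exist in pandas):

import pandas as pd
from requests_cache import CachedSession

# CachedSession transparently caches HTTP responses in a local SQLite file
session = CachedSession('demo_cache', backend='sqlite', expire_after=3600)

# hypothetical call: `session=` is the parameter proposed in this comment,
# not an actual pandas argument
frames = pd.read_html('https://example.com/options.html', session=session)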

That's just my own experiment with https://github.com/femtotrader/pandas_datareaders

I added a cache mechanism to nearly every pandas DataReader (I'm still having some issues with FamaFrench). I ran into this read_html issue with Yahoo Finance options (frames = pd.read_html(url)).

I think this issue should be linked to #8961 .

@edvardm

edvardm commented May 16, 2022

I see this issue is closed, but for anyone else looking for a solution:

Better to separate concerns: requests is a really nice library for fetching resources over HTTP, and requests-cache provides the desired caching functionality. So you can just do:

import requests
import pandas as pd

try:
    import requests_cache
    # cache any request for 1 week (7 days * 86400 seconds)
    requests_cache.install_cache("my_cache", expire_after=7 * 86400)
except ImportError:
    print("Warning: requests-cache not installed, NOT using cache")

def get_text(url: str) -> str:
    # use the requests library explicitly, so the call automatically goes
    # through `requests_cache` if it was set up above
    response = requests.get(url)
    response.raise_for_status()
    return response.text

# read_html returns a list of DataFrames, one per table found on the page
dfs = pd.read_html(get_text(my_url))  # my_url: the page to fetch
