Commit 81a5c98

CNL: remove the io.data and io.wb modules in favor of pandas-datareader (GH13724) (#13735)
1 parent 2c047d4 commit 81a5c98

9 files changed (+22, -17311 lines)


doc/source/remote_data.rst (+8, -325)
@@ -2,34 +2,21 @@
 
 .. currentmodule:: pandas
 
-.. ipython:: python
-:suppress:
-
-import os
-import csv
-import pandas as pd
-
-import numpy as np
-np.random.seed(123456)
-randn = np.random.randn
-np.set_printoptions(precision=4, suppress=True)
-
-import matplotlib.pyplot as plt
-plt.close('all')
-
-from pandas import *
-options.display.max_rows=15
-import pandas.util.testing as tm
-
 ******************
 Remote Data Access
 ******************
 
 .. _remote_data.pandas_datareader:
 
-.. warning::
+DataReader
+----------
 
-In pandas 0.17.0, the sub-package ``pandas.io.data`` will be removed in favor of a separately installable `pandas-datareader package <https://github.com/pydata/pandas-datareader>`_. This will allow the data modules to be independently updated to your pandas installation. The API for ``pandas-datareader v0.1.1`` is the same as in ``pandas v0.16.1``. (:issue:`8961`)
+The sub-package ``pandas.io.data`` is removed in favor of a separately
+installable `pandas-datareader package
+<https://github.com/pydata/pandas-datareader>`_. This will allow the data
+modules to be independently updated to your pandas installation. The API for
+``pandas-datareader v0.1.1`` is the same as in ``pandas v0.16.1``.
+(:issue:`8961`)
 
 You should replace the imports of the following:
 
@@ -43,310 +30,6 @@ Remote Data Access
 
 from pandas_datareader import data, wb
 
-.. _remote_data.data_reader:
-
-Functions from :mod:`pandas.io.data` and :mod:`pandas.io.ga` extract data from various Internet sources into a DataFrame. Currently the following sources are supported:
-
-- :ref:`Yahoo! Finance<remote_data.yahoo>`
-- :ref:`Google Finance<remote_data.google>`
-- :ref:`St.Louis FED (FRED)<remote_data.fred>`
-- :ref:`Kenneth French's data library<remote_data.ff>`
-- :ref:`World Bank<remote_data.wb>`
-- :ref:`Google Analytics<remote_data.ga>`
-
-It should be noted, that various sources support different kinds of data, so not all sources implement the same methods and the data elements returned might also differ.
-
-.. _remote_data.yahoo:
-
-Yahoo! Finance
---------------
-
-.. ipython:: python
-:okwarning:
-
-import pandas.io.data as web
-import datetime
-start = datetime.datetime(2010, 1, 1)
-end = datetime.datetime(2013, 1, 27)
-f = web.DataReader("F", 'yahoo', start, end)
-f.ix['2010-01-04']
-
-.. _remote_data.yahoo_options:
-
-Yahoo! Finance Options
-----------------------
-***Experimental***
-
-The ``Options`` class allows the download of options data from Yahoo! Finance.
-
-The ``get_all_data`` method downloads and caches option data for all expiry months
-and provides a formatted ``DataFrame`` with a hierarchical index, so it is easy to get
-to the specific option you want.
-
-.. ipython:: python
-
-from pandas.io.data import Options
-aapl = Options('aapl', 'yahoo')
-data = aapl.get_all_data()
-data.iloc[0:5, 0:5]
-
-# Show the $100 strike puts at all expiry dates:
-data.loc[(100, slice(None), 'put'),:].iloc[0:5, 0:5]
-
-# Show the volume traded of $100 strike puts at all expiry dates:
-data.loc[(100, slice(None), 'put'),'Vol'].head()
-
-If you don't want to download all the data, more specific requests can be made.
-
-.. ipython:: python
-
-import datetime
-expiry = datetime.date(2016, 1, 1)
-data = aapl.get_call_data(expiry=expiry)
-data.iloc[0:5:, 0:5]
-
-Note that if you call ``get_all_data`` first, this second call will happen much faster,
-as the data is cached.
-
-If a given expiry date is not available, data for the next available expiry will be
-returned (January 15, 2015 in the above example).
-
-Available expiry dates can be accessed from the ``expiry_dates`` property.
-
-.. ipython:: python
-
-aapl.expiry_dates
-data = aapl.get_call_data(expiry=aapl.expiry_dates[0])
-data.iloc[0:5:, 0:5]
-
-A list-like object containing dates can also be passed to the expiry parameter,
-returning options data for all expiry dates in the list.
-
-.. ipython:: python
-
-data = aapl.get_near_stock_price(expiry=aapl.expiry_dates[0:3])
-data.iloc[0:5:, 0:5]
-
-The ``month`` and ``year`` parameters can be used to get all options data for a given month.
-
-.. _remote_data.google:
-
-Google Finance
---------------
-
-.. ipython:: python
-
-import pandas.io.data as web
-import datetime
-start = datetime.datetime(2010, 1, 1)
-end = datetime.datetime(2013, 1, 27)
-f = web.DataReader("F", 'google', start, end)
-f.ix['2010-01-04']
-
-.. _remote_data.fred:
-
-FRED
-----
-
-.. ipython:: python
-
-import pandas.io.data as web
-import datetime
-start = datetime.datetime(2010, 1, 1)
-end = datetime.datetime(2013, 1, 27)
-gdp=web.DataReader("GDP", "fred", start, end)
-gdp.ix['2013-01-01']
-
-# Multiple series:
-inflation = web.DataReader(["CPIAUCSL", "CPILFESL"], "fred", start, end)
-inflation.head()
-.. _remote_data.ff:
-
-Fama/French
------------
-
-Dataset names are listed at `Fama/French Data Library
-<http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html>`__.
-
-.. ipython:: python
-
-import pandas.io.data as web
-ip = web.DataReader("5_Industry_Portfolios", "famafrench")
-ip[4].ix[192607]
-
-.. _remote_data.wb:
-
-World Bank
-----------
-
-``pandas`` users can easily access thousands of panel data series from the
-`World Bank's World Development Indicators <http://data.worldbank.org>`__
-by using the ``wb`` I/O functions.
-
-Indicators
-~~~~~~~~~~
-
-Either from exploring the World Bank site, or using the search function included,
-every world bank indicator is accessible.
-
-For example, if you wanted to compare the Gross Domestic Products per capita in
-constant dollars in North America, you would use the ``search`` function:
-
-.. code-block:: ipython
-
-In [1]: from pandas.io import wb
-
-In [2]: wb.search('gdp.*capita.*const').iloc[:,:2]
-Out[2]:
-id name
-3242 GDPPCKD GDP per Capita, constant US$, millions
-5143 NY.GDP.PCAP.KD GDP per capita (constant 2005 US$)
-5145 NY.GDP.PCAP.KN GDP per capita (constant LCU)
-5147 NY.GDP.PCAP.PP.KD GDP per capita, PPP (constant 2005 internation...
-
-Then you would use the ``download`` function to acquire the data from the World
-Bank's servers:
-
-.. code-block:: ipython
-
-In [3]: dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX'], start=2005, end=2008)
-
-In [4]: print(dat)
-NY.GDP.PCAP.KD
-country year
-Canada 2008 36005.5004978584
-2007 36182.9138439757
-2006 35785.9698172849
-2005 35087.8925933298
-Mexico 2008 8113.10219480083
-2007 8119.21298908649
-2006 7961.96818458178
-2005 7666.69796097264
-United States 2008 43069.5819857208
-2007 43635.5852068142
-2006 43228.111147107
-2005 42516.3934699993
-
-The resulting dataset is a properly formatted ``DataFrame`` with a hierarchical
-index, so it is easy to apply ``.groupby`` transformations to it:
-
-.. code-block:: ipython
-
-In [6]: dat['NY.GDP.PCAP.KD'].groupby(level=0).mean()
-Out[6]:
-country
-Canada 35765.569188
-Mexico 7965.245332
-United States 43112.417952
-dtype: float64
-
-Now imagine you want to compare GDP to the share of people with cellphone
-contracts around the world.
-
-.. code-block:: ipython
-
-In [7]: wb.search('cell.*%').iloc[:,:2]
-Out[7]:
-id name
-3990 IT.CEL.SETS.FE.ZS Mobile cellular telephone users, female (% of ...
-3991 IT.CEL.SETS.MA.ZS Mobile cellular telephone users, male (% of po...
-4027 IT.MOB.COV.ZS Population coverage of mobile cellular telepho...
-
-Notice that this second search was much faster than the first one because
-``pandas`` now has a cached list of available data series.
-
-.. code-block:: ipython
-
-In [13]: ind = ['NY.GDP.PCAP.KD', 'IT.MOB.COV.ZS']
-In [14]: dat = wb.download(indicator=ind, country='all', start=2011, end=2011).dropna()
-In [15]: dat.columns = ['gdp', 'cellphone']
-In [16]: print(dat.tail())
-gdp cellphone
-country year
-Swaziland 2011 2413.952853 94.9
-Tunisia 2011 3687.340170 100.0
-Uganda 2011 405.332501 100.0
-Zambia 2011 767.911290 62.0
-Zimbabwe 2011 419.236086 72.4
-
-Finally, we use the ``statsmodels`` package to assess the relationship between
-our two variables using ordinary least squares regression. Unsurprisingly,
-populations in rich countries tend to use cellphones at a higher rate:
-
-.. code-block:: ipython
-
-In [17]: import numpy as np
-In [18]: import statsmodels.formula.api as smf
-In [19]: mod = smf.ols("cellphone ~ np.log(gdp)", dat).fit()
-In [20]: print(mod.summary())
-OLS Regression Results
-==============================================================================
-Dep. Variable: cellphone R-squared: 0.297
-Model: OLS Adj. R-squared: 0.274
-Method: Least Squares F-statistic: 13.08
-Date: Thu, 25 Jul 2013 Prob (F-statistic): 0.00105
-Time: 15:24:42 Log-Likelihood: -139.16
-No. Observations: 33 AIC: 282.3
-Df Residuals: 31 BIC: 285.3
-Df Model: 1
-===============================================================================
-coef std err t P>|t| [95.0% Conf. Int.]
--------------------------------------------------------------------------------
-Intercept 16.5110 19.071 0.866 0.393 -22.384 55.406
-np.log(gdp) 9.9333 2.747 3.616 0.001 4.331 15.535
-==============================================================================
-Omnibus: 36.054 Durbin-Watson: 2.071
-Prob(Omnibus): 0.000 Jarque-Bera (JB): 119.133
-Skew: -2.314 Prob(JB): 1.35e-26
-Kurtosis: 11.077 Cond. No. 45.8
-==============================================================================
-
-Country Codes
-~~~~~~~~~~~~~
-
-.. versionadded:: 0.15.1
-
-The ``country`` argument accepts a string or list of mixed
-`two <http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2>`__ or `three <http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3>`__ character
-ISO country codes, as well as dynamic `World Bank exceptions <http://data.worldbank.org/node/18>`__ to the ISO standards.
-
-For a list of the the hard-coded country codes (used solely for error handling logic) see ``pandas.io.wb.country_codes``.
-
-Problematic Country Codes & Indicators
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. note::
-
-The World Bank's country list and indicators are dynamic. As of 0.15.1,
-:func:`wb.download()` is more flexible. To achieve this, the warning
-and exception logic changed.
-
-The world bank converts some country codes in their response, which makes error
-checking by pandas difficult. Retired indicators still persist in the search.
-
-Given the new flexibility of 0.15.1, improved error handling by the user
-may be necessary for fringe cases.
-
-To help identify issues:
-
-There are at least 4 kinds of country codes:
-
-1. Standard (2/3 digit ISO) - returns data, will warn and error properly.
-2. Non-standard (WB Exceptions) - returns data, but will falsely warn.
-3. Blank - silently missing from the response.
-4. Bad - causes the entire response from WB to fail, always exception inducing.
-
-There are at least 3 kinds of indicators:
-
-1. Current - Returns data.
-2. Retired - Appears in search results, yet won't return data.
-3. Bad - Will not return data.
-
-Use the ``errors`` argument to control warnings and exceptions. Setting
-errors to ignore or warn, won't stop failed responses. (ie, 100% bad
-indicators, or a single "bad" (#4 above) country code).
-
-See docstrings for more info.
 
 .. _remote_data.ga:
 

doc/source/whatsnew/v0.19.0.txt (+2)
@@ -616,6 +616,8 @@ Deprecations
 Removal of prior version deprecations/changes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 - The ``pd.sandbox`` module has been removed in favor of the external library ``pandas-qt`` (:issue:`13670`)
+- The ``pandas.io.data`` and ``pandas.io.wb`` modules are removed in favor of
+the `pandas-datareader package <https://github.com/pydata/pandas-datareader>`__ (:issue:`13724`).
 - ``DataFrame.to_csv()`` has dropped the ``engine`` parameter, as was deprecated in 0.17.1 (:issue:`11274`, :issue:`13419`)
 - ``DataFrame.to_dict()`` has dropped the ``outtype`` parameter in favor of ``orient`` (:issue:`13627`, :issue:`8486`)
 - ``pd.Categorical`` has dropped setting of the ``ordered`` attribute directly in favor of the ``set_ordered`` method (:issue:`13671`)
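
Migration sketch (not part of this commit): the documentation change above tells users to replace their ``pandas.io.data`` and ``pandas.io.wb`` imports with ``pandas_datareader`` equivalents. The helper below is one possible way to apply that replacement to source text. The function name ``rewrite_imports`` and the module mapping are assumptions made for illustration; in particular, it assumes ``pandas_datareader.data`` and ``pandas_datareader.wb`` expose the names the old modules did, which should be checked against the installed pandas-datareader version.

```python
import re

# Removed module path -> assumed replacement in pandas-datareader.
MODULE_MAP = {
    "pandas.io.data": "pandas_datareader.data",
    "pandas.io.wb": "pandas_datareader.wb",
}

# "from pandas.io import <names>", capturing the text around the package name.
FROM_IO = re.compile(r"^(\s*from\s+)pandas\.io(\s+import\s+)(.+)$")

# Any dotted reference to one of the removed modules.
DOTTED = re.compile(r"\bpandas\.io\.(data|wb)\b")


def rewrite_imports(source: str) -> str:
    """Return ``source`` with removed pandas.io imports pointed at pandas_datareader."""
    rewritten = []
    for line in source.splitlines(keepends=True):
        match = FROM_IO.match(line)
        if match:
            # Only rewrite when every imported name is one of the removed modules;
            # other pandas.io submodules (for example sql) must stay untouched.
            names = {part.split(" as ")[0].strip() for part in match.group(3).split(",")}
            if names <= {"data", "wb"}:
                line = match.group(1) + "pandas_datareader" + line[match.start(2):]
        else:
            line = DOTTED.sub(lambda m: MODULE_MAP[m.group(0)], line)
        rewritten.append(line)
    return "".join(rewritten)


if __name__ == "__main__":
    old = "import pandas.io.data as web\nfrom pandas.io import wb\n"
    print(rewrite_imports(old), end="")
```

This is a line-based text rewrite, so it does not handle parenthesized or multi-line import lists, and it will also rewrite matching dotted names that appear in comments or strings. Review the result before committing it.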
