Commit 9209224

DOC: io.wb example

Vincent Arel-Bundock committed
1 parent 29a709c commit 9209224

2 files changed: +121 -0 lines changed

doc/source/io.rst

+120
@@ -2584,3 +2584,123 @@ Tthe dataset names are listed at `Fama/French Data Library
     import pandas.io.data as web
     ip=web.DataReader("5_Industry_Portfolios", "famafrench")
     ip[4].ix[192607]
+
+
+World Bank panel data in Pandas
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``Pandas`` users can easily access thousands of panel data series from the
+`World Bank's World Development Indicators <http://data.worldbank.org>`_
+by using the ``wb`` I/O functions.
+
+For example, if you wanted to compare the Gross Domestic Products per capita in
+constant dollars in North America, you would use the ``search`` function:
+
+.. code:: python
+
+    In [1]: from pandas.io.wb import search, download
+
+    In [2]: search('gdp.*capita.*const').iloc[:,:2]
+    Out[2]:
+                         id                                               name
+    3242            GDPPCKD             GDP per Capita, constant US$, millions
+    5143     NY.GDP.PCAP.KD                 GDP per capita (constant 2005 US$)
+    5145     NY.GDP.PCAP.KN                      GDP per capita (constant LCU)
+    5147  NY.GDP.PCAP.PP.KD  GDP per capita, PPP (constant 2005 internation...
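Reviewer note: to make it easier to see what a pattern such as ``'gdp.*capita.*const'`` selects, the sketch below applies the same regular expression to a few indicator names copied from the outputs in this example. It assumes that matching is a regular-expression search on the ``name`` field and that it ignores case by default; that is an assumption about ``search``, not something this diff shows. The ``INDICATORS`` sample and the ``match_indicators`` helper are hypothetical stand-ins and are not part of ``pandas.io.wb``.

```python
import re

# Small sample of (id, name) pairs copied from the outputs shown in this
# example; the real indicator list has thousands of entries.
INDICATORS = [
    ("GDPPCKD", "GDP per Capita, constant US$, millions"),
    ("NY.GDP.PCAP.KD", "GDP per capita (constant 2005 US$)"),
    ("NY.GDP.PCAP.KN", "GDP per capita (constant LCU)"),
    ("IT.CEL.SETS.FE.ZS", "Mobile cellular telephone users, female (% of ..."),
]


def match_indicators(pattern, indicators=INDICATORS, ignore_case=True):
    """Return the ids whose name matches ``pattern`` (hypothetical helper).

    The pattern may match anywhere in the name. Case-insensitive matching
    by default is an assumption made for this sketch.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [ind_id for ind_id, name in indicators if regex.search(name)]


gdp_ids = match_indicators("gdp.*capita.*const")
strict_ids = match_indicators("gdp.*capita.*const", ignore_case=False)
print(gdp_ids)
print(strict_ids)
```

With case-sensitive matching the lowercase ``gdp`` no longer matches the uppercase ``GDP`` in the names, so the second list is empty. This is why the case-handling assumption matters when writing patterns.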
+
+Then you would use the ``download`` function to acquire the data from the World
+Bank's servers:
+
+.. code:: python
+
+    In [3]: dat = download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX'], start=2005, end=2008)
+
+    In [4]: print dat
+                          NY.GDP.PCAP.KD
+    country       year
+    Canada        2008  36005.5004978584
+                  2007  36182.9138439757
+                  2006  35785.9698172849
+                  2005  35087.8925933298
+    Mexico        2008  8113.10219480083
+                  2007  8119.21298908649
+                  2006  7961.96818458178
+                  2005  7666.69796097264
+    United States 2008  43069.5819857208
+                  2007  43635.5852068142
+                  2006   43228.111147107
+                  2005  42516.3934699993
+
+The resulting dataset is a properly formatted ``DataFrame`` with a hierarchical
+index, so it is easy to apply ``.groupby`` transformations to it:
+
+.. code:: python
+
+    In [6]: dat['NY.GDP.PCAP.KD'].groupby(level=0).mean()
+    Out[6]:
+    country
+    Canada           35765.569188
+    Mexico            7965.245332
+    United States    43112.417952
+    dtype: float64
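Reviewer note: ``groupby(level=0).mean()`` averages over the first level of the hierarchical index, which here is ``country``. The dependency-free sketch below repeats that arithmetic in plain Python on the values copied from the ``print dat`` output, so the result can be checked without network access. It is an illustration of the grouping logic only, not how pandas implements it; ``ROWS`` and ``mean_by_country`` are hypothetical names.

```python
from collections import defaultdict

# (country, year) -> GDP per capita, copied from the ``print dat`` output.
ROWS = {
    ("Canada", 2008): 36005.5004978584,
    ("Canada", 2007): 36182.9138439757,
    ("Canada", 2006): 35785.9698172849,
    ("Canada", 2005): 35087.8925933298,
    ("Mexico", 2008): 8113.10219480083,
    ("Mexico", 2007): 8119.21298908649,
    ("Mexico", 2006): 7961.96818458178,
    ("Mexico", 2005): 7666.69796097264,
    ("United States", 2008): 43069.5819857208,
    ("United States", 2007): 43635.5852068142,
    ("United States", 2006): 43228.111147107,
    ("United States", 2005): 42516.3934699993,
}


def mean_by_country(rows):
    """Average the values over years for each country (first index level)."""
    groups = defaultdict(list)
    for (country, _year), value in rows.items():
        groups[country].append(value)
    return {country: sum(vals) / len(vals) for country, vals in groups.items()}


means = mean_by_country(ROWS)
for country, value in sorted(means.items()):
    print("%-14s %.6f" % (country, value))
```

The three averages agree with the ``Out[6]`` values to the six decimals that are printed there.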
+
+Now imagine you want to compare GDP to the share of people with cellphone
+contracts around the world.
+
+.. code:: python
+
+    In [7]: search('cell.*%').iloc[:,:2]
+    Out[7]:
+                         id                                               name
+    3990  IT.CEL.SETS.FE.ZS  Mobile cellular telephone users, female (% of ...
+    3991  IT.CEL.SETS.MA.ZS  Mobile cellular telephone users, male (% of po...
+    4027      IT.MOB.COV.ZS  Population coverage of mobile cellular telepho...
+
+Notice that this second search was much faster than the first one because
+``Pandas`` now has a cached list of available data series.
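Reviewer note: the speed-up described here is the usual effect of fetching a list once and reusing it. The sketch below shows that pattern with a module-level cache. How ``pandas.io.wb`` actually stores its cached list is not visible in this diff, so the names ``_cached_series``, ``get_indicators`` and ``fake_fetch`` are hypothetical and the code only illustrates the general idea.

```python
_cached_series = None  # module-level cache, empty until the first lookup


def get_indicators(fetch):
    """Return the indicator list, calling ``fetch`` only on the first use."""
    global _cached_series
    if _cached_series is None:
        _cached_series = fetch()
    return _cached_series


calls = []  # records how many times the slow path was taken


def fake_fetch():
    """Stand-in for the slow network request (hypothetical)."""
    calls.append(1)
    return ["NY.GDP.PCAP.KD", "IT.MOB.COV.ZS"]


first = get_indicators(fake_fetch)
second = get_indicators(fake_fetch)
print(len(calls))
```

Only the first call takes the slow path; the second returns the stored list. A cache of this kind lasts for the lifetime of the Python session, so a fresh session pays the cost of the first lookup again.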
+
+.. code:: python
+
+    In [13]: ind = ['NY.GDP.PCAP.KD', 'IT.MOB.COV.ZS']
+    In [14]: dat = download(indicator=ind, country='all', start=2011, end=2011).dropna()
+    In [15]: dat.columns = ['gdp', 'cellphone']
+    In [16]: print dat.tail()
+                            gdp  cellphone
+    country   year
+    Swaziland 2011  2413.952853       94.9
+    Tunisia   2011  3687.340170      100.0
+    Uganda    2011   405.332501      100.0
+    Zambia    2011   767.911290       62.0
+    Zimbabwe  2011   419.236086       72.4
+
+Finally, we use the ``statsmodels`` package to assess the relationship between
+our two variables using ordinary least squares regression. Unsurprisingly,
+populations in rich countries tend to use cellphones at a higher rate:
+
+.. code:: python
+
+    In [17]: import numpy as np
+    In [18]: import statsmodels.formula.api as smf
+    In [19]: mod = smf.ols("cellphone ~ np.log(gdp)", dat).fit()
+    In [20]: print mod.summary()
+                                OLS Regression Results
+    ==============================================================================
+    Dep. Variable:              cellphone   R-squared:                       0.297
+    Model:                            OLS   Adj. R-squared:                  0.274
+    Method:                 Least Squares   F-statistic:                     13.08
+    Date:                Thu, 25 Jul 2013   Prob (F-statistic):            0.00105
+    Time:                        15:24:42   Log-Likelihood:                -139.16
+    No. Observations:                  33   AIC:                             282.3
+    Df Residuals:                      31   BIC:                             285.3
+    Df Model:                           1
+    ===============================================================================
+                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
+    -------------------------------------------------------------------------------
+    Intercept      16.5110     19.071      0.866      0.393       -22.384    55.406
+    np.log(gdp)     9.9333      2.747      3.616      0.001         4.331    15.535
+    ==============================================================================
+    Omnibus:                       36.054   Durbin-Watson:                   2.071
+    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              119.133
+    Skew:                          -2.314   Prob(JB):                     1.35e-26
+    Kurtosis:                      11.077   Cond. No.                         45.8
+    ==============================================================================
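Reviewer note: the formula ``cellphone ~ np.log(gdp)`` fits a straight line of ``cellphone`` against the natural logarithm of ``gdp``, with an intercept. The sketch below computes the same kind of fit with the closed-form least-squares formulas in plain Python. It uses only the five rows visible in the ``dat.tail()`` output, so its coefficients are not expected to match the summary above, which is based on 33 observations. ``SAMPLE`` and ``fit_log_ols`` are hypothetical names.

```python
import math

# (gdp, cellphone) pairs copied from the ``dat.tail()`` output; this is a
# five-row excerpt, not the full dataset used for the regression summary.
SAMPLE = [
    (2413.952853, 94.9),
    (3687.340170, 100.0),
    (405.332501, 100.0),
    (767.911290, 62.0),
    (419.236086, 72.4),
]


def fit_log_ols(pairs):
    """Fit y = intercept + slope * log(x) by ordinary least squares."""
    xs = [math.log(x) for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope is the ratio of the x-y co-variation to the variation in x.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_log_ols(SAMPLE)
residuals = [y - (intercept + slope * math.log(x)) for x, y in SAMPLE]
print("intercept=%.4f slope=%.4f" % (intercept, slope))
```

Because the regressor is ``log(gdp)``, the slope is the change in the cellphone indicator, in percentage points, for a one-unit increase in log GDP per capita, that is, when GDP per capita is multiplied by about 2.72.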

pandas/io/wb.py

+1
@@ -75,6 +75,7 @@ def download(country=['MX', 'CA', 'US'], indicator=['GDPPCKD', 'GDPPCKN'],
     # Clean
     out = out.drop('iso2c', axis=1)
     out = out.set_index(['country', 'year'])
+    out = out.convert_objects(convert_numeric=True)
     return out
 
 
