
ENH: Add ISO3 ctry codes and error arg. Fix tests, warn/exception logic #8482 #8551

Merged
merged 1 commit into from
Oct 28, 2014
59 changes: 59 additions & 0 deletions doc/source/remote_data.rst
@@ -143,6 +143,12 @@ World Bank
`World Bank's World Development Indicators <http://data.worldbank.org>`__
by using the ``wb`` I/O functions.

Indicators
~~~~~~~~~~

Every World Bank indicator is accessible, either by exploring the World Bank
site or by using the included ``search`` function.

For example, if you wanted to compare gross domestic product per capita in
constant dollars across North America, you would use the ``search`` function:

@@ -254,3 +260,56 @@ populations in rich countries tend to use cellphones at a higher rate:
Skew: -2.314 Prob(JB): 1.35e-26
Kurtosis: 11.077 Cond. No. 45.8
==============================================================================

Country Codes
~~~~~~~~~~~~~

.. versionadded:: 0.15.1

The ``country`` argument accepts a string or a list that mixes
`two <http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2>`__ and `three <http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3>`__ character
ISO country codes, as well as dynamic `World Bank exceptions <http://data.worldbank.org/node/18>`__ to the ISO standards.

For a list of the hard-coded country codes (used solely for error handling logic), see ``pandas.io.wb.country_codes``.
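As a rough illustration of the kind of input the ``country`` argument accepts, the sketch below prepares a mixed list of two- and three-character codes. ``normalize_country_arg`` is a hypothetical helper written for this example and is not part of ``pandas.io.wb``; the commented-out ``download`` call mirrors the keyword arguments used in this PR's tests and would need network access.

```python
def normalize_country_arg(country):
    """Hypothetical helper: wrap a lone string in a list and drop exact duplicates."""
    if isinstance(country, str):
        country = [country]
    seen = set()
    unique = []
    for code in country:
        # Keep the first occurrence of each code and preserve the input order.
        if code not in seen:
            seen.add(code)
            unique.append(code)
    return unique


# Mixed two-character (CA, MX, US), three-character (USA) and
# World Bank specific (KSV) codes, with one exact duplicate.
codes = normalize_country_arg(['CA', 'MX', 'USA', 'US', 'US', 'KSV'])
print(codes)

# Assumed usage with pandas 0.15.1 (not executed here):
# from pandas.io import wb
# wb.download(country=codes, indicator='NY.GDP.PCAP.CD', start=2003, end=2004)
```

Note that ``'US'`` and ``'USA'`` are different strings, so this helper keeps both; whether they resolve to the same country is decided by the World Bank's response.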

Problematic Country Codes & Indicators
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   The World Bank's country list and indicators are dynamic. As of 0.15.1,
   :func:`wb.download()` is more flexible. To achieve this, the warning
   and exception logic changed.

The World Bank converts some country codes in its response, which makes
error checking by pandas difficult.
Retired indicators still persist in the search results.

Given the new flexibility of 0.15.1, improved error handling by the user
may be necessary for fringe cases.
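
One way a caller might guard against such cases is to wrap the request and handle ``ValueError`` explicitly. This is only a minimal sketch: ``fake_download`` is a stand-in invented so the example runs offline, and it merely mimics the ``No indicators returned data.`` message that the tests in this PR check for. ``safe_download`` is likewise a hypothetical name.

```python
def fake_download(country, indicator, errors='warn'):
    """Stand-in for wb.download (hypothetical): fails if any code is 'XXX'."""
    if 'XXX' in country:
        raise ValueError("No indicators returned data.")
    return {"countries": list(country), "indicator": indicator}


def safe_download(fetch, **kwargs):
    """Call fetch(**kwargs); return (result, None) or (None, error message)."""
    try:
        return fetch(**kwargs), None
    except ValueError as exc:
        return None, str(exc)


result, problem = safe_download(fake_download,
                                country=['CA', 'MX', 'US', 'XXX'],
                                indicator='NY.GDP.PCAP.CD',
                                errors='ignore')
if problem is not None:
    print("Request failed:", problem)
```

Returning the message alongside the result lets the caller decide whether to retry without the offending code or to give up.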

To help identify issues:

There are at least 4 kinds of country codes:

1. Standard (2/3 character ISO) - returns data, and warns and raises errors correctly.
2. Non-standard (WB Exceptions) - returns data, but will falsely warn.
Contributor

@jorisvandenbossche how is this doc formatting?

Contributor Author

I made a rendering, after some rst syntax fixes. Posted the PNG below.

Member

this is fine

3. Blank - silently missing from the response.
4. Bad - causes the entire response from WB to fail, and always raises an exception.

There are at least 3 kinds of indicators:

1. Current - Returns data.
2. Retired - Appears in search results, yet won't return data.
3. Bad - Will not return data.

Use the ``errors`` argument to control warnings and exceptions. Setting
``errors`` to ``'ignore'`` or ``'warn'`` won't stop failed responses (i.e.,
when every indicator is bad, or when a single "bad" (#4 above) country code is included).
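
The following sketch shows how an ``errors`` switch of this kind could behave for a check made before the request. It is not the pandas implementation: ``check_country_codes`` and the tiny ``known`` set are hypothetical, and the ``'raise'`` value is an assumption alongside the ``'warn'`` and ``'ignore'`` values mentioned above.

```python
import warnings


def check_country_codes(requested, known_codes, errors='warn'):
    """Hypothetical pre-request check; not the pandas implementation."""
    if errors not in ('raise', 'warn', 'ignore'):
        raise ValueError("errors must be 'raise', 'warn' or 'ignore'")
    unknown = [code for code in requested if code not in known_codes]
    if unknown:
        message = "Unrecognized country code(s): " + ", ".join(unknown)
        if errors == 'raise':
            raise ValueError(message)
        if errors == 'warn':
            warnings.warn(message)
    # With 'ignore' (or after warning) the request would still be sent,
    # so a failed World Bank response could still raise later on.
    return unknown


known = {'CA', 'MX', 'US', 'USA'}  # tiny illustrative set, not the real list
flagged = check_country_codes(['CA', 'USA', 'BLA'], known, errors='ignore')
print(flagged)
```

The point of the sketch is that ``'ignore'`` and ``'warn'`` only affect this early check; they cannot rescue a request that the World Bank rejects outright.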

See docstrings for more info.





12 changes: 5 additions & 7 deletions doc/source/whatsnew/v0.15.1.txt
@@ -19,18 +19,17 @@ users upgrade to this version.

API changes
~~~~~~~~~~~



.. _whatsnew_0151.enhancements:

Enhancements
~~~~~~~~~~~~

- Added option to select columns when importing Stata files (:issue:`7935`)

- Qualify memory usage in ``DataFrame.info()`` by adding ``+`` if it is a lower bound (:issue:`8578`)


- Added support for 3-character ISO and non-standard country codes in :func:`io.wb.download()` (:issue:`8482`)
- :ref:`World Bank data requests <remote_data.wb>` now raise warnings and ``ValueError`` exceptions based on an ``errors`` argument, a list of hard-coded country codes, and the World Bank's JSON response. In prior versions, the error messages didn't look at the World Bank's JSON response; problem-inducing inputs were simply dropped prior to the request. The issue was that the hard-coded approach dropped many valid countries. All countries will work now, but some bad country codes will raise exceptions because some edge cases break the entire response.

.. _whatsnew_0151.performance:

Performance
@@ -41,8 +40,7 @@ Performance

Experimental
~~~~~~~~~~~~



.. _whatsnew_0151.bug_fixes:

Bug Fixes
103 changes: 85 additions & 18 deletions pandas/io/tests/test_wb.py
@@ -14,42 +14,109 @@ class TestWB(tm.TestCase):
@slow
@network
def test_wdi_search(self):
raise nose.SkipTest

expected = {u('id'): {2634: u('GDPPCKD'),
4649: u('NY.GDP.PCAP.KD'),
4651: u('NY.GDP.PCAP.KN'),
4653: u('NY.GDP.PCAP.PP.KD')},
u('name'): {2634: u('GDP per Capita, constant US$, '
'millions'),
4649: u('GDP per capita (constant 2000 US$)'),
4651: u('GDP per capita (constant LCU)'),
4653: u('GDP per capita, PPP (constant 2005 '

expected = {u('id'): {6716: u('NY.GDP.PCAP.KD'),
6718: u('NY.GDP.PCAP.KN'),
6720: u('NY.GDP.PCAP.PP.KD')},
u('name'): {6716: u('GDP per capita (constant 2005 US$)'),
6718: u('GDP per capita (constant LCU)'),
6720: u('GDP per capita, PPP (constant 2011 '
'international $)')}}
result = search('gdp.*capita.*constant').ix[:, :2]
result = search('gdp.*capita.*constant').loc[6716:,['id','name']]
expected = pandas.DataFrame(expected)
expected.index = result.index
assert_frame_equal(result, expected)

@slow
@network
def test_wdi_download(self):
raise nose.SkipTest

expected = {'GDPPCKN': {(u('United States'), u('2003')): u('40800.0735367688'), (u('Canada'), u('2004')): u('37857.1261134552'), (u('United States'), u('2005')): u('42714.8594790102'), (u('Canada'), u('2003')): u('37081.4575704003'), (u('United States'), u('2004')): u('41826.1728310667'), (u('Mexico'), u('2003')): u('72720.0691255285'), (u('Mexico'), u('2004')): u('74751.6003347038'), (u('Mexico'), u('2005')): u('76200.2154469437'), (u('Canada'), u('2005')): u('38617.4563629611')}, 'GDPPCKD': {(u('United States'), u('2003')): u('40800.0735367688'), (u('Canada'), u('2004')): u('34397.055116118'), (u('United States'), u('2005')): u('42714.8594790102'), (u('Canada'), u('2003')): u('33692.2812368928'), (u('United States'), u('2004')): u('41826.1728310667'), (u('Mexico'), u('2003')): u('7608.43848670658'), (u('Mexico'), u('2004')): u('7820.99026814334'), (u('Mexico'), u('2005')): u('7972.55364129367'), (u('Canada'), u('2005')): u('35087.8925933298')}}
# Test a bad indicator with double (US), triple (USA),
# standard (CA, MX), non standard (KSV),
# duplicated (US, US, USA), and unknown (BLA) country codes

# ...but NOT a crash inducing country code (World bank strips pandas
# users of the luxury of laziness, because they create their
# own exceptions, and don't clean up legacy country codes.
# ...but NOT a retired indicator (User should want it to error.)

cntry_codes = ['CA', 'MX', 'USA', 'US', 'US', 'KSV', 'BLA']
inds = ['NY.GDP.PCAP.CD','BAD.INDICATOR']

expected = {'NY.GDP.PCAP.CD': {('Canada', '2003'): 28026.006013044702, ('Mexico', '2003'): 6601.0420648056606, ('Canada', '2004'): 31829.522562759001, ('Kosovo', '2003'): 1969.56271307405, ('Mexico', '2004'): 7042.0247834044303, ('United States', '2004'): 41928.886136479705, ('United States', '2003'): 39682.472247320402, ('Kosovo', '2004'): 2135.3328465238301}}
expected = pandas.DataFrame(expected)
result = download(country=['CA', 'MX', 'US', 'junk'], indicator=['GDPPCKD',
'GDPPCKN', 'junk'], start=2003, end=2005)
expected.sort(inplace=True)
result = download(country=cntry_codes, indicator=inds,
start=2003, end=2004, errors='ignore')
result.sort(inplace=True)
expected.index = result.index
assert_frame_equal(result, pandas.DataFrame(expected))

@slow
@network
def test_wdi_download_w_retired_indicator(self):

cntry_codes = ['CA', 'MX', 'US']
# Despite showing up in the search feature, and being listed online,
# the api calls to GDPPCKD don't work in their own query builder, nor
# pandas module. GDPPCKD used to be a common symbol.
# This test is written to ensure that error messages to pandas users
# continue to make sense, rather than a user getting some missing
# key error, cause their JSON message format changed. If
# World bank ever finishes the deprecation of this symbol,
# this nose test should still pass.

inds = ['GDPPCKD']

try:
result = download(country=cntry_codes, indicator=inds,
start=2003, end=2004, errors='ignore')
# If for some reason result actually ever has data, it's cause WB
Contributor

so let's maybe drop this test (or does it work 'sometimes'?)

If it usually works but fails occasionally, then raise nose.SkipTest on the failure (rather than asserting anything).

Contributor Author

I think I understand your logic. But because I think I do, and cause of my note below, I did 37f90b7 instead.

It will fail consistently, it's just a question of which exception catches it. Since the messages to both contain the same string, and are both ValueErrors, this test should always "work".

The only thing that would cause the test to stop working would be if WB unretired the indicator or their API changed.

More Info:

The indicator is a retired one. If World Bank removed it, from their indicator list, it would get caught at line 161 (all indicators failed), rather than where it gets caught now at 201 (single indicator query failure).

# fixed the issue with this ticker. Find another bad one.
except ValueError as e:
error_raised = True
error_msg = e.args[0]

self.assertTrue(error_raised)
self.assertTrue("No indicators returned data." in error_msg)

# if it ever gets here, it means WB unretired the indicator.
# even if they dropped it completely, it would still get caught above
# or the WB API changed somehow in a really unexpected way.
if len(result) > 0:
raise nose.SkipTest



@slow
@network
def test_wdi_download_w_crash_inducing_countrycode(self):

cntry_codes = ['CA', 'MX', 'US', 'XXX']
inds = ['NY.GDP.PCAP.CD']

try:
result = download(country=cntry_codes, indicator=inds,
start=2003, end=2004, errors='ignore')
except ValueError as e:
error_raised = True
error_msg = e.args[0]

self.assertTrue(error_raised)
self.assertTrue("No indicators returned data." in error_msg)

# if it ever gets here, it means the country code XXX got used by WB
# or the WB API changed somehow in a really unexpected way.
if len(result) > 0:
raise nose.SkipTest

@slow
@network
def test_wdi_get_countries(self):
result = get_countries()
self.assertTrue('Zimbabwe' in list(result['name']))

self.assertTrue(len(result) > 100)

if __name__ == '__main__':
nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
exit=False)