Skip to content

ENH: Add ISO3 ctry codes and error arg. Fix tests, warn/exception logic #8482 #8551

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 28, 2014

Conversation

jnmclarty
Copy link
Contributor

This PR adds ISO 3-digit country codes, support for World Bank Country Code exceptions, by changing the error handling and warning logic and introducing an error argument which toggles between ignore/warn/raise . Validation no longer filters bad country codes, prior to submitting the API request. Instead, it attempts to catch errors in the WB response - which is a very hairy thing. See notes added in remote_data.tst for issues.

During this PR, we attempted to clean up the JSON response, and make the module as simple as possible, without restricting the user. In the end, it proved somewhat awkward, but the solution is a good one given the constraints of the WB API.

closes #8482 .

Updated right before merging to remove noise, and provide a summary.

@jreback jreback added IO Data IO issues that don't fit into a more specific label Data Reader labels Oct 15, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 15, 2014
country_codes = {}

country_codes['ISO 3166-1 alpha-3'] = \
{'ABW' : 'Aruba',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this map actually necessary, aren't you just validating against the codes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The map, is basically just there for reference. Improve maintainability.

The issue is that World Bank says one thing, then does another. They say "ISO", but then add/adapt that, as they see fit sporadically. Then, even the ISO standards, don't do a great job of creating a static standard. Because countries squabble and struggle to find a common ground, they create exceptions, borders change, it's dynamic.

This way, we hardcode the "latest" World Bank Standard, but we give the option to the users to work around World Bank's flakey logic if changes.

See "KSV", for example. Not in the ISO standard, but its the only way to get data for Kosovo. The user would set safe_cc_only to False, and be able to get data, without the available warnings.

Also, users could very easily adjust the list, and easily look up country codes in wb.io.country_codes.

@jreback
Copy link
Contributor

jreback commented Oct 15, 2014

I think this should be much simpler. Just take the input codes and validate against the 2 or 3 code map. Prob best to simply raise if their are codes that are not valid for the specified map (and make sure its only against the 2 OR 3 map).

@jnmclarty
Copy link
Contributor Author

Question 1

its only against the 2 OR 3 map

Did you mean, that we should remove the 2-digit support? Wouldn't that break backwards compatibility? Surely, this isn't what you meant.

Question 2

Should I create a 0.15.1.txt? Confusion? Duplicated effort? Separate PR?

Comment 1

b0ed056 removes the maps, and safe_cc_only, as requested.

I'll leave it as a separate commit, until you or somebody else acknowledges the dynamism that World Bank introduces to the ISO "standard". If you acknowledge the example of KSV, or other unknowable exceptions, and want to still stick to the more hard coded list, that's okay by me.

if len(bad_countries) > 0:
print('Invalid ISO-2 codes: %s' % ' '.join(bad_countries))
bad_countries = '; '.join(bad_countries)
print('Invalid ISO country codes: %s' % bad_countries)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise as a ValueError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a mistake, should be a warning at most. If I raise an exception, the user will have no way of accessing world-bank custom non-ISO codes. Eg. KSV.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine, then really no reason to even validate the codes AT ALL.

If the returned values are not there, just make them NaN

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except, you want to validate, because if you request something like, say, UK, instead of GB (No idea what actually happens, or which one is right), how does one check that they got "the right" country? World bank does tricky things. For instance, request a stale country code, and they'll they send a response back with the updated new one. A user could manually look at the response from World Bank, but automating that check is very tricky, if you don't start with a known list of country codes.

The static, hard coded, map, works/worked really really well. Exceptionally, explicitly, well.

For the sake of a few hundred, really, really, easy to understand lines of code (the map I already deleted), you're pushing to make pandas, and any user's code, less maintainable, as well as the response from World Bank, and the notifications Pandas provides with it, less explicit than I managed to create.

I'm a rookie at git, and the etiquette on github, but not logic. I've reviewed the problem of adding 3 digit country codes, very thoroughly. Both reading World Bank, looking at their API, and referencing Wikipedia.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, not concerned about the length of the code :)

I get that you want to validate the user input, and it is fine. However what do you want to do with an error is the question? I really don't like warnings, they don't really make someone pay attention, a nan is the generally accepted soln. However, since validation is a good thing, I would then raise ValueError if the code does not pass muster.

Another possibility is to add a keywork argument, e.g. errors='raise' (and allow 'warn','nan') for the other cases) if you are ambituous, this makes it pretty general (lots of pandas top-level functions do this, e.g. pd.to_datetime)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the errors argument. I can work with that. (But 'nan' can't work. We can't know now, what World Bank might return tomorrow.)

I didn't like the messages as they were either. They are amateurish.

If I have permission to change the warning/message logic, then

this..

Old Default Behaviour (v0.14)
  • Validate [against list]
  • Drop Invalid From Request
  • Dumb Print Warnings

...could become (maintain 0.14 logic):

New Default Behaviour (errors='warn') ['KSV' dropped w/Warning]
  • Validate Input [against map]
  • Drop Invalid From Request
  • Warning('Dropped Country Code(s)')
Force Query (errors='ignore') ['KSV' works]
  • Blind Include Every Input Code in Request
Real Validity Check (errors='raise') ['KSV' throws exception]
  • Validate [against map]
  • Exception('Invalid Country Code(s)')

...but this would be even better:

New Default Behaviour (errors='warn') ['KSV' works, but throws Warning]
  • Blind Include Every Input Code in Request
  • Warning('Non-Standard Country Code(s)')
Force Query (errors='ignore') ['KSV' works]
  • Blind Include Every Input Code in Request
Real Validity Check (errors='raise') ['KSV' throws exception]
  • Validate [against map]
  • Exception('Invalid Country Code(s)')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your last case looks fine (see what happens with the existing tests) - eg plug it in and see what if anything fails

@jreback
Copy link
Contributor

jreback commented Oct 16, 2014

@jnmclarty I meant you can support both 2 and 3 digit country codes (though was sort of -1 on allowing the mixing, but on 2nd thought that is unambiguous)

will be for 0.15.1 (the file doesn't exist yet), check back this weekend

@jreback
Copy link
Contributor

jreback commented Oct 18, 2014

that sounds fine then

@jnmclarty
Copy link
Contributor Author

Wait, what? What "sounds fine"?

a) leave it the old way. using json.loads + Dataframe + hacks

or

b) using json.loads + json_normalize + hacks?

Edit: I'm indifferent. Technically, B, is probably more robust, cause the hacks are better, but harder to read.

@jreback
Copy link
Contributor

jreback commented Oct 18, 2014

oh the old way is fine - least hacks as possible

if u would like to clean that part up ok! can do in another pr if u would like

@jnmclarty
Copy link
Contributor Author

So, there was one more surprise I stumbled on during testing. In all of my previous posts, and up until and including 80dd8fd, I thought there were three kinds of country codes:

  1. Standard ones
  2. Unknowable Non-standard ones
  3. Bad ones

Turns out, there are 4.

  1. Standard ones (Eg. USA)
  2. Unknowable Non-standard ones with data (Eg. KSV)
  3. Unknowable Bad ones - ignored by World Bank (Eg. BLA)
  4. Unknowable Bad ones - which cause API requests to fail completely (Eg. XXX)

My code, and logic, now handles all of it, IMO, extremely well and passes things appropriately to the user based on ignore/warn/raise at both the country code and indicator level.

bad_ind_msgs = "\n\n".join(bad_ind_msgs)
bad_ind_msgs = "\n\nInvalid Indicators:\n\n%s" % bad_ind_msgs
if errors == 'raise':
raise Exception(bad_ind_msgs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise ValueError rather than an Exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise ValueError

Done. On this one, and the other spots where relevant. cfd1baa

@jreback
Copy link
Contributor

jreback commented Oct 21, 2014

you'll need to squash the commits to 1-2 https://github.com/pydata/pandas/wiki/Using-Git

add a doc note in v0.15.1 (enhancements section), referencing the issue

@jreback
Copy link
Contributor

jreback commented Oct 22, 2014

I suspect some tests were failing and the prior author didn't figure it out - or they were do an odd data comparison that then was failing

whatever u can fix would be gr8

u can delete them (rather than comment out) as u fix

@jnmclarty
Copy link
Contributor Author

whatever u can fix would be gr8

I already had fixed the issues with the tests.

Obviously, I rolled in things that test this ENH.

@jnmclarty
Copy link
Contributor Author

u can delete them (rather than comment out) as u fix

Done. 7d11d12

@jnmclarty
Copy link
Contributor Author

Doc note and squash is last thing I need to do...stay tuned.

try:
result = download(country=cntry_codes, indicator=inds,
start=2003, end=2004, errors='ignore')
# If for some reason result actually ever has data, it's cause WB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so let's maybe drop this test (or does it work 'sometimes'?)

if it usually works but fails ocassionaly, the raise nose.SkipTest on the failure (rather than asserting anything)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand your logic. But because I think I do, and cause of my note below, I did 37f90b7 instead.

It will fail consistently, it's just a question of which exception catches it. Since the messages to both contain the same string, and are both ValueErrors, this test should always "work".

The only thing that would cause the test to stop working would be if WB unretired the indicator or their API changed.

More Info:

The indicator is a retired one. If World Bank removed it, from their indicator list, it would get caught at line 161 (all indicators failed), rather than where it gets caught now at 201 (single indicator query failure).

@jreback
Copy link
Contributor

jreback commented Oct 22, 2014

@jnmclarty looking good. just some minor changes, doc note, and squash

@jnmclarty
Copy link
Contributor Author

@jreback I commented on two now "outdated" diffs. Do you see those? Or are they easy to miss?

@jreback
Copy link
Contributor

jreback commented Oct 25, 2014

ok, pls rebase / squash and let me have a final look

@jnmclarty
Copy link
Contributor Author

You can merge now (?)

There are at least 4 kinds of country codes:

1. Standard (2/3 digit ISO) - returns data, will warn and error properly.
2. Non-standard (WB Exceptions) - returns data, but will falsely warn.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche how is this doc formatting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a rendering, after some rst syntax fixes. Posted the PNG below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine

@jreback
Copy link
Contributor

jreback commented Oct 27, 2014

@jnmclarty you are doing gr8 actually!

I pretend that I am a user reading your comments and would be 1) no nothing, but this looks cool, 2) use this feature and care about changes. So have to appeal to both, with the 2nd being more important. You basically want to tell them hey if you are doing this, it changed, so now do this.

@jnmclarty
Copy link
Contributor Author

I can't get sphinx to build the entire docs. Sphinx/python crashes. I can do just remote_data. But, sphinx still crashes right before it completes. It builds, as below. Maybe the crash, or the single page build, is the reason it doesn't pick up the :func: ?

screencapture-file-c-users-jeffrey-desktop-projects-pandas34-doc-build-html-remote_data-html

@jnmclarty jnmclarty force-pushed the wbISO3 branch 2 times, most recently from a3fb999 to 2811141 Compare October 27, 2014 02:27
@jnmclarty
Copy link
Contributor Author

Squashed everything. Only changed things identified by @jreback , and some syntax stuff related to rst. Nothing beyond that changed since the last time I squashed.

I'm having issues with my stack/sphinx, I can't render the entire docs.

If somebody ( @jorisvandenbossche ?) could render the entire docs, and take a look at whatsnew and remote_data, as well as the docstring to io.wb.download() prior to this being merged, that would be helpful. Or, even a detail-oriented rst-pro would be able to take one glance at the diffs and OK it. Either way, I'll review the output rendered by the server.

.. note::

The World Bank's country list and indicators are dynamic. As of 0.15.1,
:func:``wb.download()`` is more flexible. To achieve this, the warning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to be single backtick around the function: :func:io.wb.download()``

@jorisvandenbossche
Copy link
Member

Formatting of the docs is looking OK, apart from the one comment I gave.

But another thing: you added a note in the API changes section, but normally we shouldn't do behaviour changing API changes for 0.15.1.
But the api changes note is not fully clear to me: does it just now raise warnings/errors where previously happened nothing (empty result), or will you really get a different returned result in some cases?

@jreback
Copy link
Contributor

jreback commented Oct 27, 2014

this is one of those
adding arguments cases

so it's an enhancement (but I usually put these in api change section)

@jnmclarty
Copy link
Contributor Author

I tried to improve clarity in whatsnew, and fixed the single/double quote thing, then re-squashed.



Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor comments:

  • try not to add trailing whitespace as in this line (normally you should be able to configure your editor to remove that automatically on saving, or should there be a menu action to remove all trailing whitespace in the file)
  • try to hard wrap the lines (in principle to 80 chars, but 80-100 is also OK here I think), otherwise this is very difficult to review on github (as the line above)

@jorisvandenbossche
Copy link
Member

OK on the API change, it sounds indeed more like a bug fix/enhancement.

@jreback
Copy link
Contributor

jreback commented Oct 27, 2014

@jorisvandenbossche yep, should be in the Enhancements section (we put adding args stuff their right?)

@jnmclarty can you move to Enhancements then

@jnmclarty
Copy link
Contributor Author

Confused, I had split it, and put a note in each. (ie, I had a note for the errors in API, and functional change in Enhancements)

...but, I did what you asked.

@jnmclarty
Copy link
Contributor Author

...and resquashed.

@jreback jreback merged commit 6a7ff40 into pandas-dev:master Oct 28, 2014
@jreback
Copy link
Contributor

jreback commented Oct 28, 2014

@jnmclarty thanks for this!

pls take a look at the built docs: http://pandas-docs.github.io/pandas-docs-travis/ (could be a little while before they are actually built); if necessary pls do a follow-up PR if docs not clear.

@jreback jreback modified the milestones: 0.15.2, 0.15.1 Oct 30, 2014
@jreback jreback mentioned this pull request Nov 10, 2014
@jnmclarty jnmclarty deleted the wbISO3 branch February 9, 2015 00:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remote data access: World Bank
3 participants