Skip to content

Fix errors='ignore' being ignored in astype #30324 #30670

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

naomi172839
Copy link
Contributor

@naomi172839 naomi172839 commented Jan 4, 2020

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add test(s)? Should be the first part of any PR

@pep8speaks
Copy link

pep8speaks commented Jan 4, 2020

Hello @naomi172839! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-05 03:46:10 UTC

@naomi172839
Copy link
Contributor Author

@WillAyd I added unit tests to test_generic.py. I have never used pytest before and am having trouble building the c components to test it myself. Hopefully it was done right.

@alimcmaster1 alimcmaster1 added the Error Reporting Incorrect or improved errors from pandas label Jan 4, 2020
@jreback jreback changed the title Fix errors='ignore' being ignored #30324 Fix errors='ignore' being ignored in astype #30324 Jan 4, 2020
@naomi172839 naomi172839 closed this Jan 4, 2020
@naomi172839 naomi172839 reopened this Jan 4, 2020
@naomi172839
Copy link
Contributor Author

I seemed to have messed up the GIT... Let me know if I need to start over:/

@alimcmaster1
Copy link
Member

I seemed to have messed up the GIT... Let me know if I need to start over:/

You can still use this PR/branch - fix it up locally and use —force if needed

@naomi172839
Copy link
Contributor Author

OK so I fixed the git. The tests actually seemed to already exist to make sure that the error was thrown. I am not sure why it did not fail before.

@jreback
Copy link
Contributor

jreback commented Jan 9, 2020

can you add a whatsnew note

@naomi172839
Copy link
Contributor Author

@jreback When I add my own whatsnew entry it seems to create a conflict that I am unsure how to resolve. I already updated my local whatsnew file to the latest version.

@TomAugspurger
Copy link
Contributor

Hmm, looking back at the original issue, I'm not sure if this is what we want to do.

The errors keyword is used to handle raising when there's in issue with the conversion of values. This PR overloads errors to handle raising when there are non-conversion issues (incorrect keys in dtype dict for example). I don't think we want to overload the meaning of errors like that.

@TomAugspurger
Copy link
Contributor

Apologies for not catching that earlier @naomi172839.

@naomi172839
Copy link
Contributor Author

naomi172839 commented Jan 9, 2020

@TomAugspurger would you then suggest adding an additional parameter to the astype method to allow the user to ignore the KeyErrors?

@TomAugspurger
Copy link
Contributor

That's one option, though I'm not sure I'd find it useful enough. Others may though. Perhaps @climatebrad can share his original usecase.

@climatebrad
Copy link

My usecase is that I was joining several dataframes generated from different years of NYPD stop-and-frisk datasets into one multiyear dataframe. A lot of the work to keep the dataframes manageable was making sure that categorical data was saved as such. So I had a list of all of the columns that needed to be categorical data and was setting them. (Technically I converted to object type, did more processing, then converted to category type.) This code failed:

cat_set = set(CLEAN_CAT_VALUES) | set(CAT_FILL_NA_VALUES) 
data = data.astype({cat : 'object' for cat in cat_set}, errors='ignore')

Here's the diff where I dealt with what I consider the incorrect errors='ignore' behavior. (I had to explicitly exclude columns that didn't exist in the dataframe I was editing.) Not the hardest workaround, but I do still think that errors='ignore' should, you know, ignore errors.

climatebrad/stop-and-frisk@d755084#diff-a0ebe5d9008a8b884ea03629345f9735R178

cat_set = set(CLEAN_CAT_VALUES) | set(CAT_FILL_NA_VALUES) 
cat_set = cat_set.intersection(set(data.columns))
data = data.astype({cat : 'object' for cat in cat_set}, errors='ignore')

@naomi172839
Copy link
Contributor Author

@climatebrad I do agree that errors='ignore' at least implies that errors will be ignored but @TomAugspurger does have a point. It would overload the method and that's probably not the best solution.

@WillAyd
Copy link
Member

WillAyd commented Feb 12, 2020

Looks like the 1.0 whatsnew got mangled in this - can you check that out from master?

@naomi172839
Copy link
Contributor Author

Sorry that it took so long, I had to figure out how to check out a single file.

@WillAyd
Copy link
Member

WillAyd commented Mar 5, 2020

@naomi172839 not sure why the whatsnew files are collected here, but can you try the following on your branch?

git fetch upstream/master
git checkout upstream/master pandas/doc/source/whatsnew/v1*

Then re-push? I think should fix the issues here (can add a whatsnew note after we fix these up)

@naomi172839
Copy link
Contributor Author

@WillAyd So I tried that with no real luck. I tried deleting the files and rechecking them out which got me to where we are now. I dont have high hopes.

@WillAyd
Copy link
Member

WillAyd commented Mar 5, 2020

Your last push looks a lot better. The only outstanding issue is that the file permissions have changed for the v1.0.0 whatsnew.

If you do git checkout upstream/master pandas/doc/source/whatsnew/v1.0.0.rst now does git status show anything as staged?

@naomi172839
Copy link
Contributor Author

Output is below:
nbonnin@Naomis-MacBook-Pro pandas % git checkout upstream/master doc/source/whatsnew/v1.0.0.rst
Updated 0 paths from 247718065
nbonnin@Naomis-MacBook-Pro pandas % git status
On branch #30324
Your branch is up to date with 'origin/#30324'.

nothing to commit, working tree clean

So I don't think that there is anything staged

@WillAyd
Copy link
Member

WillAyd commented Mar 5, 2020

Hmm OK. I guess git checkout might not grab file permissions.

Since you are on macOS you can just do chmod 1644 pandas/doc/source/whatsnew/v1.0.0.rst instead to restore back to where it was

@naomi172839
Copy link
Contributor Author

Done, though its not really showing any changes and zero differences from upstream/master for that file

@WillAyd
Copy link
Member

WillAyd commented Mar 5, 2020

Is your upstream pointing to pandas? If not sure, can you post the output of git remote -v show?

@naomi172839
Copy link
Contributor Author

Output
nbonnin@Naomis-MacBook-Pro pandas % git remote -v show
origin https://github.com/naomi172839/pandas.git (fetch)
origin https://github.com/naomi172839/pandas.git (push)
upstream https://github.com/pandas-dev/pandas.git (fetch)
upstream https://github.com/pandas-dev/pandas.git (push)

@WillAyd
Copy link
Member

WillAyd commented Mar 5, 2020

Just for posterity try this one more time - it worked for the other whatsnew files so not sure what makes the 1.0.0 any different but maybe the changed permissions prevented this from working

git fetch upstream/master
git checkout upstream/master pandas/doc/source/whatsnew/v1.0.0.rst

@naomi172839
Copy link
Contributor Author

Doing that, git thought that there were changes so I went ahead and pushed it.

@WillAyd
Copy link
Member

WillAyd commented Mar 5, 2020

Hmm still didn't work. If you run git diff upstream/master --name-only locally none of the whatsnew files should show, but they seem to here on GitHub

If all else fails maybe just download the file directly from GitHub

@naomi172839
Copy link
Contributor Author

Output:
...
doc/source/whatsnew/v0.21.0.rst
doc/source/whatsnew/v0.23.0.rst
doc/source/whatsnew/v0.25.0.rst
doc/source/whatsnew/v0.25.3.rst
doc/source/whatsnew/v1.0.1.rst
doc/source/whatsnew/v1.0.2.rst
doc/source/whatsnew/v1.1.0.rst
...

v1.0.0.rst isn't listed. Nevertheless I tried downloading straight from github. It still seems to have made no difference.

@WillAyd
Copy link
Member

WillAyd commented Mar 5, 2020

Hmm OK. Actually just needed to do a git merge upstream/master and repush

I did that to fix things up here, so just git pull from your local branch to grab the changes and should be able to move forward

@naomi172839
Copy link
Contributor Author

Ok great! that did alot. At this point it should be just adding the what's new message correct? I'm not super great at git (obviously). Just want to make sure before I push another commit.

@WillAyd
Copy link
Member

WillAyd commented Mar 5, 2020

Hmm just reading back through the comment history I do agree with @TomAugspurger that this blurs the purpose of the errors keyword; any chance you thought about alternative implementations?

@naomi172839
Copy link
Contributor Author

The only real alternative that I came up with was to add another optional argument to the method, but I am not sure that that is really the best solution.

Looking into it now, it could also be an option like mode.sim_interactive is. I don't really know if that is any better though.

@TomAugspurger
Copy link
Contributor

I think I'm still against adding this change. I think the errors keyword should just be for controlling the values conversion.

If we want to allow unused keys to be provided, we can add a keyword dedicated for that. There are other context like .rename where people have expressed a desire for extra / unused keys to raise (they're currently ignored in rename).

df = pd.DataFrame({"A": [1, 2]})
df.astype({"A": "int", "B": "float"}, ignore_unused_keys=True)

df.rename(columns={"A": "a", "B", "b"}, ignore_unused_keys=False)  # raises

@climatebrad
Copy link

Although I am in the camp of "errors='ignore'" should mean "ignore errors" I'd be perfectly happy with a separate keyword to allow for this functionality.

I'd make it consistent with the errors syntax though. Here is language consistent with the df.rename documentation.

missing_cols : {'raise', 'ignore'}, default 'raise'
If ‘raise’, raise a KeyError when a dict of columns contains labels that are not present. If ‘ignore’, existing keys will be retyped and extra keys will be ignored.

@TomAugspurger
Copy link
Contributor

Your proposal should be extra_columns: {'raise', 'ignore'}, or unused_columns, right? Not "missing"?

@naomi172839
Copy link
Contributor Author

Personally, I feel that ignore_unused_keys is a more descriptive argument. If everyone thinks this is a good/acceptable route to go, I would be happy to work on it.

In addition, if we change it here, it should probably be changed everywhere. I.e. the rename method and anyothers

@WillAyd
Copy link
Member

WillAyd commented Mar 17, 2020

Personally, I feel that ignore_unused_keys is a more descriptive argument. If everyone thinks this is a good/acceptable route to go, I would be happy to work on it.

Yea I think that would be clearer

@WillAyd
Copy link
Member

WillAyd commented Apr 7, 2020

@naomi172839 still active? Want to try to incorporate latest feedback?

@simonjayhawkins
Copy link
Member

@naomi172839 closing as stale. ping if you want to continue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

errors='ignore' does not work on df.replace({col : type}) if col not found
8 participants