Don't make dropping missing rows a default behavior for HDF append()? #9382
I'm a little new to the open-source world -- should I be doing something more than waiting for input at this point, and if none comes, should I do nothing, or make changes? Thanks! |
Well, you can go ahead and make a pull request if you would like. |
OK -- do you have a position, John? I know you did the hard work of creating this, so I don't want to adjust it without your input! |
I think changing the default is OK; you will have to adjust some tests. |
OK, great. This will be my first edit on a big project, so it will likely take a few days to figure out how to do it right, but I'm on it! |
Submitted as Pull Request #9484. Where do I add notes for the API change? |
You would need to add a mini section in the whatsnew for 0.16.0 under API changes. |
Great, done! Thanks for the hand-holding! |
Hi All,
At the moment, the default behavior of the HDF append() function (docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.HDFStore.append.html?highlight=append#pandas.HDFStore.append ) is to silently drop any row in which every value other than the index is NaN.
As I understand it from a PyData exchange with Jeff, the reason is that people working with panels often have sparse datasets, so this is a very reasonable default.
However, while I appreciate its appeal for time-series analysis, I think this is a dangerous default. It rests on the assumption that if an index entry has a value but the columns do not, the row carries no meaningful data. That holds in a time-series context, where dropped index values are easy to reconstruct, but if indexes contain information like user IDs, sensor codes, or place names, the index itself is meaningful and not easy to reconstruct. The default behavior is therefore potentially deleting user data without a warning.
Given the trade-off between a default that may lead to inefficient storage (dropna=False) and one that potentially erases user data (dropna=True), I think we should err on the side of data preservation.
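To make the trade-off concrete, here is a minimal sketch of the behavior in question, assuming a pandas version from this era with PyTables installed; the file names and sensor labels are illustrative, not from the original report:

```python
import numpy as np
import pandas as pd

# A single-column frame where the index itself carries meaning
# (sensor codes), and one row's only value is NaN.
df = pd.DataFrame(
    {"reading": [1.0, np.nan, 3.0]},
    index=["sensor_a", "sensor_b", "sensor_c"],
)

# With dropna=True (the default at the time of this issue), the
# all-NaN 'sensor_b' row is silently dropped on append, and its
# index label is lost from the store.
with pd.HDFStore("demo_dropna.h5", mode="w") as store:
    store.append("data", df, dropna=True)
    print(store.select("data"))  # only sensor_a and sensor_c remain

# With dropna=False, all three rows (and all three index labels)
# are preserved, at the cost of storing the NaN row.
with pd.HDFStore("demo_keepna.h5", mode="w") as store:
    store.append("data", df, dropna=False)
    print(store.select("data"))  # all three rows remain
```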