Skip to content

Default values for dropna to "False" (issue 9382) #9484

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

nickeubank
Copy link
Contributor

PLEASE REVIEW: This is my commit to a major project, and would appreciate a quick once over!

As per discussion in Issue 9382, changes all HDF functions from having default of dropping all rows with NA in all non-index rows.

closes #9382

PLEASE REVIEW: This is my commit to a major project, and would appreciate a quick once over! 

As per discussion in Issue 9382, changes all HDF functions from having default of dropping all rows with NA in all non-index rows.
@jreback
Copy link
Contributor

jreback commented Feb 13, 2015

all updates should be in this PR.

@jreback
Copy link
Contributor

jreback commented Feb 13, 2015

odd that no tests fail. pls come up with a tests that fails w/o your change and succeeds with it. I guess this was not tested.

@nickeubank
Copy link
Contributor Author

OK, both on this branch.

Will look into testing -- I'm not really sure what that means, sorry.

@nickeubank
Copy link
Contributor Author

Sorry Jeff, I think testing may be beyond my pay-grade. I thought I knew enough to contribute, but the testing wiki ( https://github.com/pydata/pandas/wiki/Testing ) is all greek to me. This is obviously not a big commit, so please feel free to just close the pull request and issue and ignore all this. Sorry. :/

@bashtage
Copy link
Contributor

@nickeubank

Just write a function that matches the patters

def test_all_missing_values

That saves an reloads an array that would drop under the previous and won't drop now.

Test should be here:

https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_pytables.py#L4680

@nickeubank
Copy link
Contributor Author

Thanks @bashtage -- and where would I place this function?

And sounds like it needs to raise an AssertionError with descriptions for source of problem if it the calls lead to the rows being dropped when they shouldn't be?

@bashtage
Copy link
Contributor

The test should have the followign properties:

  1. The df should satisfy the condition that some rows woudl be dropped if saved with master
  2. You should save the df with the code in your PR, using the default options for saving.
  3. Verify that all rows are present using assert_frame_equal(loaded_df, original_df)

Item 3 would fail on master since the loaded df will be missing the all nan row in the original df. This will, however, not fail with your PR.

See my link about where your function goes. If you want to check that it passes, you can run

nosetests pandas.io.tests.test_pytables

Assuming you are using python setup.py develop to install your local pandas.

@nickeubank
Copy link
Contributor Author

Ah, ok great. And do I just stick this into pytables.py, or is there a special place for these?

(Again, so sorry to need the hand holding!)

@bashtage
Copy link
Contributor

Ah, ok great. And do I just stick this into pytables.py, or is there a special place for these?

pandas/io/tests/test_pytables.py

There is almost always a test folder and a corresponding file names test_xxx for every .py file.

@nickeubank
Copy link
Contributor Author

Excellent, thanks. Will do now.

Previously,

.. ipython:: python
In [1]: myFile = HDFStore('file.hdf')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • show these examples with a DataFrame with a row that has all-nans, and one that doesnt.
  • don't use camel case.
  • use to_hdf/read_hdf (because for example you are not closing the store)
  • this needs to be a code-block

@nickeubank
Copy link
Contributor Author

OK -- I've made all the suggested updates, and I have a build failure. I'm not sure how to check if that's due to the test I added, but hopefully that's what we were looking for?

@bashtage
Copy link
Contributor

Your code. pd is not imported. Just use read_hdf directly.

ERROR: test_all_missing_values (pandas.io.tests.test_pytables.TestHDFStore)
----------------------------------------------------------------------

Traceback (most recent call last):

File "/home/travis/build/pydata/pandas/pandas/io/tests/test_pytables.py", line 4682, in test_all_missing_values

df_with_missing = pd.DataFrame({'col1':[np.nan]})

NameError: global name 'pd' is not defined

@bashtage
Copy link
Contributor

You should probably also include a test with a slightly more complicated layout, e.g. with a full row, a row with some mulls and a row with all null.

@nickeubank
Copy link
Contributor Author

@bashtage - thanks. Just found the traceback in Travis -- will check next build myself!

df_without_missing = pd.DataFrame({'col1':[0, -1, 2], 'col2':[1, -1, 3]})
df_without_missing.to_hdf('file.h5', 'df_without_missing')

print(pd.read_hdf('file.h5', 'df_with_missing'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need print statements, paste the actual ipython output (which will have numbered In/Outs)

@jreback
Copy link
Contributor

jreback commented May 9, 2015

closing pls reopen if/when updated

@jreback jreback closed this May 9, 2015
@nickeubank
Copy link
Contributor Author

Sorry, forgot about this. Now updated and rebased. How do I reopen? Or do I need to start a new PR?

@bashtage
Copy link
Contributor

bashtage commented Jun 2, 2015

Seems straightforward and ready.

@nickeubank
Copy link
Contributor Author

@bashtage moved to #10097 -- updating for a @jreback comment now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't make dropping missing rows a default behavior for HDF append()?
3 participants