-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: fix categories in HDFStore not filtering correctly (#13322) #13792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
always start with the tests |
Sorry Jeff, I'm not sure what you mean. Are you saying I should add a unit test for this? |
ANY change requires a test. |
at a minimum, tests that reproduce the original issue. You FIRST right the tests, make sure they replicate the failure, THEN write code to fix. You have fixed it when your tests then pass and nothing else breaks. |
Thanks for filling me in Jeff. Sorry for having to spell it out for me, but this is the first non-documentation PR I've submitted for pandas. I couldn't find any tests related to pytables but it seems like tests/frame/test_misc_api might be a good spot. Any insight would be appreciated. Thanks. This is the code I wrote to test the fix before I submitted the pull: obsids = ['ESP_012345_6789', 'ESP_987654_3210']
imgids = ['APF00006np', 'APF0001imm']
data = [4.3, 9.8]
df = pd.DataFrame(dict(obsids=obsids, imgids=imgids, data=data))
df.to_hdf('testdf_no_cats.hdf', 'df',format='t', data_columns=True)
df.obsids = df.obsids.astype('category')
df.imgids = df.imgids.astype('category')
df.to_hdf('testdf_with_cats.hdf', 'df',format='t', data_columns=True)
# df without categories
df2 = pd.read_hdf('testdf_no_cats.hdf', 'df', where='obsids=B')
assert(len(df2) == 0)
# df with categories
df3 = pd.read_hdf('testdf_with_cats.hdf', 'df', where='obsids=B')
assert(len(df3) == 0) |
Thanks Jeff, I completely missed that, I incorrectly assumed everything was in the one tests folder. Just submitted a new commit after confirming the test passed with my changes. |
Current coverage is 85.23% (diff: 100%)@@ master #13792 diff @@
==========================================
Files 140 140
Lines 50420 50422 +2
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 42975 42977 +2
Misses 7445 7445
Partials 0 0
|
@@ -237,6 +237,30 @@ def roundtrip(key, obj, **kwargs): | |||
finally: | |||
safe_remove(path) | |||
|
|||
def test_conv_categorical(self): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this right after test_categorical
, and rename to test_categorical_conversion
pls add a whatsnew in bug fix section. |
@@ -197,7 +197,10 @@ def stringify(value): | |||
return TermValue(int(v), v, kind) | |||
elif meta == u('category'): | |||
metadata = com._values_from_object(self.metadata) | |||
result = metadata.searchsorted(v, side='left') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do this like:
result = metadata.searchsorted(v, side='left')
if not result and v not in metadata:
result = -1
then the 'in' is only done if the result is not found (as opposed to always)
@jreback Thanks for all of the feedback. I really appreciate you showing me how to do things the right way. I made the changes you requested and rebased the commit. |
@@ -786,3 +786,4 @@ Bug Fixes | |||
- Bugs in ``Index.difference`` and ``DataFrame.join`` raise in Python3 when using mixed-integer indexes (:issue:`13432`, :issue:`12814`) | |||
|
|||
- Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`) | |||
- Bug in ``pd.read_hdf()`` returns incorrect result when HDF Store contains a DataFrame with a categorical column (:issue:`13792`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
be more specific; this only doesn't work when the query has NO values that are found in the categorical.
thanks @shawnheide |
git diff upstream/master | flake8 --diff
Fixed bug in pytables.py where series with categories were not being filtered properly.