-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: to_hdf and HDFStore for subclasses #38262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3690de4
to
a98e2c2
Compare
a98e2c2
to
e38fae8
Compare
c831cab
to
6633145
Compare
with HDFStore(path) as store: | ||
store.put("ser", sser) | ||
result = read_hdf(path, "ser").values | ||
tm.assert_numpy_array_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth leaving TODOs here about switching to assert_series_equal later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured out that the problem with the assertion was that actually read_hdf
would return DataFrame
, not SubDataFrame
. As a result assertion failed at the stage of comparing types.
Probably need to figure out how one can extract the same subclass of the DataFrame by using read_hdf
...
But I guess, this is slightly beyond the scope of the present PR.
So, now I expect that read_hdf
would return DataFrame
with the given values and datatypes and use it for the assertion via assert_frame_equal
.
with ensure_clean_path("temp.h5") as path: | ||
with HDFStore(path) as store: | ||
store.put("ser", sser) | ||
result = read_hdf(path, "ser").values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i remember something about preferring to_numpy
over values
but obv both work
|
||
expected = np.array([[1, 3], [2, 4]], dtype=np.intp) | ||
|
||
with ensure_clean_path("temp.h5") as path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as for series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't actually recreate the subclass correct?
@@ -4888,6 +4888,50 @@ def test_unsuppored_hdf_file_error(self, datapath): | |||
with pytest.raises(ValueError, match=message): | |||
pd.read_hdf(data_path) | |||
|
|||
def test_supported_for_subclasses_dataframe(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put in a new file test_subclass.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to a separate file.
Right, read_hdf
does not recreate the subclass. It seems that it would return the DataFrame
with the same content.
47bc2f5
to
6088f4c
Compare
from pandas.io.pytables import HDFStore, read_hdf | ||
|
||
|
||
class TestHDFStoreSubclass: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in pandas._testing
we have SubclassedSeries
and SubclassedDataFrame
. Should we be using those here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing this out!
Changed that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -1646,8 +1646,10 @@ def error(t): | |||
"nor a value are passed" | |||
) | |||
else: | |||
_TYPE_MAP = {Series: "series", DataFrame: "frame"} | |||
pt = _TYPE_MAP[type(value)] | |||
if isinstance(value, Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
total nit but black
permitting we could do
pt = "series" if isinstance(value, Series) else "frame"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this is fine, can you do 3 things
- add a whatsnew 1.3.0 bug fixes in I/O
- add a warning section in the io.rst HDF docs about this (saying you can store a subclass but the subclass type is lost upon storing)
- add similar to the doc-string in to_hdf.
thanks @ivanovmg |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Fix the bug that the subclasses of a dataframe and a series cannot be saved
to_hdf
.The tests were added.
When testing I found that
assert_frame_equal
does not work on subclasses as well - probably need to fix it in a separate PR.For now in the tests I compare values only.