
BUG: also use Float Nullable dtypes in read_parquet when use_nullable_dtypes=True #45700

Closed
wants to merge 2 commits into from

Conversation

@Jaafarben2 Jaafarben2 commented Jan 29, 2022

fixes #45694
Add float types in the mapping dictionary

…_dtypes=True in pd.read_parquet pandas-dev#45694

Add float types in mapping dictionary
@rhshadrach
Member

rhshadrach commented Jan 29, 2022

Thanks for the PR! Please add tests, whatsnew line, and link this PR to the original issue.

@Jaafarben2 Jaafarben2 changed the title BUG: Float Nullable dtypes not being returned when use_nullable_dtypes=True in pd.read_parquet #45694 Fix BUG: Float Nullable dtypes not being returned when use_nullable_dtypes=True in pd.read_parquet Jan 29, 2022
@Jaafarben2
Author

Before the fix: float64 is returned instead of Float64 (#45700)

>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1, 2, 3], 'b':[1.0, 2.0, 3]})
>>> df['c'] = pd.Series([1, 2, 3], dtype='uint8')
>>> df.to_parquet('a')
>>> pd.read_parquet('a').dtypes
a      int64
b    float64
c      uint8
dtype: object
>>> pd.read_parquet('a', use_nullable_dtypes=True).dtypes
a      Int64
b    float64
c      UInt8
dtype: object

After the fix: Float64 is returned instead of float64, and the bug is fixed.

>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1, 2, 3], 'b':[1.0, 2.0, 3]})
>>> df['c'] = pd.Series([1, 2, 3], dtype='uint8')
>>> df.to_parquet('a')
>>> pd.read_parquet('a').dtypes
a      int64
b    float64
c      uint8
dtype: object
>>> pd.read_parquet('a', use_nullable_dtypes=True).dtypes
a      Int64
b    Float64
c      UInt8
dtype: object

@rhshadrach
Member

@Jaafarben2 - the tests need to be added to pandas.tests, see the dev docs for more details.

@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 30, 2022
Contributor

@jreback jreback left a comment

yup, w/o tests we can't even tell if this works

@Jaafarben2 Jaafarben2 requested a review from jreback January 30, 2022 19:07
@jorisvandenbossche jorisvandenbossche changed the title Fix BUG: Float Nullable dtypes not being returned when use_nullable_dtypes=True in pd.read_parquet BUG: also use Float Nullable dtypes in read_parquet when use_nullable_dtypes=True Feb 3, 2022
@jorisvandenbossche
Member

Thanks for the PR!

To be explicit about the required test: there is an existing test for this functionality:

def test_use_nullable_dtypes(self, engine, request):
    import pyarrow.parquet as pq

    if engine == "fastparquet":
        # We are manually disabling fastparquet's
        # nullable dtype support pending discussion
        mark = pytest.mark.xfail(
            reason="Fastparquet nullable dtype support is disabled"
        )
        request.node.add_marker(mark)

    table = pyarrow.table(
        {
            "a": pyarrow.array([1, 2, 3, None], "int64"),
            "b": pyarrow.array([1, 2, 3, None], "uint8"),
            "c": pyarrow.array(["a", "b", "c", None]),
            "d": pyarrow.array([True, False, True, None]),
            # Test that nullable dtypes used even in absence of nulls
            "e": pyarrow.array([1, 2, 3, 4], "int64"),
        }
    )
    with tm.ensure_clean() as path:
        # write manually with pyarrow to write integers
        pq.write_table(table, path)
        result1 = read_parquet(path, engine=engine)
        result2 = read_parquet(path, engine=engine, use_nullable_dtypes=True)

    assert result1["a"].dtype == np.dtype("float64")
    expected = pd.DataFrame(
        {
            "a": pd.array([1, 2, 3, None], dtype="Int64"),
            "b": pd.array([1, 2, 3, None], dtype="UInt8"),
            "c": pd.array(["a", "b", "c", None], dtype="string"),
            "d": pd.array([True, False, True, None], dtype="boolean"),
            "e": pd.array([1, 2, 3, 4], dtype="Int64"),
        }
    )
    if engine == "fastparquet":
        # Fastparquet doesn't support string columns yet
        # Only int and boolean
        result2 = result2.drop("c", axis=1)
        expected = expected.drop("c", axis=1)
    tm.assert_frame_equal(result2, expected)

This tests the roundtrip of a dataframe with some columns that have a nullable dtype variant in pandas. I think it should be sufficient to add two extra columns to the test dataframe (float32 and float64) in that existing test.

@jorisvandenbossche jorisvandenbossche added IO Parquet parquet, feather NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed ExtensionArray Extending pandas with custom dtypes or arrays. labels Feb 3, 2022
@github-actions
Contributor

github-actions bot commented Mar 6, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 6, 2022
@Jaafarben2 Jaafarben2 closed this Mar 6, 2022
Labels
IO Parquet parquet, feather NA - MaskedArrays Related to pd.NA and nullable extension arrays Stale
Development

Successfully merging this pull request may close these issues.

BUG: Float Nullable dtypes not being returned when use_nullable_dtypes=True in pd.read_parquet
4 participants