Skip to content

ENH: Add nullable keyword to read_sql #50048

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Dec 13, 2022
Merged

ENH: Add nullable keyword to read_sql #50048

merged 28 commits into from
Dec 13, 2022

Conversation

phofl
Copy link
Member

@phofl phofl commented Dec 3, 2022

Sits on top of #50047

Functionality wise, this should work now.

More broadly, I am not sure that this is the best approach we could take here. Since the convert_to_nullable_type in lib.maybe_convert_objects is not used right now except here, we could also make this strict and return the appropriate Array from the Cython code, not only when nulls are present. This would avoid the re-cast in the non-cython code part.

@phofl phofl marked this pull request as draft December 3, 2022 22:01
@phofl phofl changed the title ENH: Add nullable keyword to read_sql/ DRAFT ENH: Add nullable keyword to read_sql Dec 5, 2022
@phofl phofl marked this pull request as ready for review December 5, 2022 14:40
@phofl phofl added the IO SQL to_sql, read_sql, read_sql_query label Dec 5, 2022
@phofl phofl added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Dec 5, 2022
@WillAyd
Copy link
Member

WillAyd commented Dec 10, 2022

I think makes sense and like your vision on the interaction with Cython. Is there any performance difference here moving away from the .from_records construction of the DataFrame?

@phofl
Copy link
Member Author

phofl commented Dec 10, 2022

Also discussed this with @mroeschke in another pr, he leaned more towards keeping maybe_convert_objects as is.

Maybe a bit faster? The previous implementation was running through the same function that we are calling directly now. Definitively no slowdown

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this as is - @mroeschke

@mroeschke
Copy link
Member

Since the convert_to_nullable_type in lib.maybe_convert_objects is not used right now except here, we could also make this strict and return the appropriate Array from the Cython code, not only when nulls are present

I think in a future PR it might make sense to do this; I didn't realized this was the only location where this is used. I guess the only slight downside is run time importing the Arrays in the Cython code?

@mroeschke mroeschke added this to the 2.0 milestone Dec 13, 2022
@mroeschke mroeschke merged commit 7c5017c into pandas-dev:main Dec 13, 2022
@mroeschke
Copy link
Member

Thanks @phofl

@phofl phofl deleted the sql branch December 13, 2022 22:32
@asishm asishm mentioned this pull request May 2, 2023
3 tasks

if dtype is None:
if arr.dtype == np.dtype("O"):
# i.e. maybe_convert_objects didn't convert
arr = maybe_infer_to_datetimelike(arr)
if use_nullable_dtypes and arr.dtype == np.dtype("O"):
arr = StringDtype().construct_array_type()._from_sequence(arr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems weird to me. why are we casting potentially non-strings to strings?

arr = IntegerArray(arr, np.zeros(arr.shape, dtype=np.bool_))
elif is_bool_dtype(arr.dtype):
arr = BooleanArray(arr, np.zeros(arr.shape, dtype=np.bool_))
elif is_float_dtype(arr.dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could re-use pd.array for L1011-L1016?

dtype=None,
coerce_float=coerce_float,
use_nullable_dtypes=use_nullable_dtypes,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this has gotten some Arrow-specific logic added here in the interim. can we move the backend-specific stuff from core.internals here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants