Skip to content

REF: sql insert_data operate column-wise to avoid internals #33229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 7, 2020
40 changes: 14 additions & 26 deletions pandas/io/sql.py
Original file line number Diff line number Diff line change
Expand Up @@ -692,37 +692,25 @@ def insert_data(self):
column_names = list(map(str, temp.columns))
ncols = len(column_names)
data_list = [None] * ncols
blocks = temp._mgr.blocks

for b in blocks:
if b.is_datetime:
# return datetime.datetime objects
if b.is_datetimetz:
# GH 9086: Ensure we return datetimes with timezone info
# Need to return 2-D data; DatetimeIndex is 1D
d = b.values.to_pydatetime()
d = np.atleast_2d(d)
else:
# convert to microsecond resolution for datetime.datetime
d = b.values.astype("M8[us]").astype(object)
elif b.is_timedelta:
# numpy converts this to an object array of integers,
# whereas b.astype(object).values would convert to
# object array of Timedeltas
d = b.values.astype(object)

for i, (_, ser) in enumerate(temp.items()):
vals = ser._values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we start with not using iloc for this pattern (that gives a lot of overhead).
Not that it will matter very much in this case, though, I suppose, since we are converting to object dtype below which will be more costly.

See eg the helper function you asked me to remove in 07372e3

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, _ixs works here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although _ixs helps, the big win actually comes from avoiding creating a series if all you need are the values (as I did in the linked snippet)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Until #33252 goes through, any objection to _ixs here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with jeff on this one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's exactly the goal of the helper function I am adding in #33252 to be used in situations where only the array values are needed, like here is the case.

So if we are adding such helper method, why not use it? (the reason I did the PR is because I want to see it used in several places, and many of those places will be outside of frame.py, eg in the ops code)

If the privateness of the method is a problem, let's discuss making it public. Although I personally think that is not needed right now (and we are using plenty of semi-private methods on DataFrame/Series outside of frame.py already)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets revisit that once #33252 is merged

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR certainly does not need to wait on #33252 being merged, but I still first want to have the discussion whether we find this an appropriate place to use that function here, once it exists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be using internal functions except in the internals. This is not internal by any stretch (would say that pandas.io is offlimits entirely)

if vals.dtype.kind == "M":
d = vals.to_pydatetime()
elif vals.dtype.kind == "m":
# store as integers, see GH#6921, GH#7076
d = vals.view("i8").astype(object)
else:
# TODO(2DEA): astype-first can be avoided with 2D EAs
# astype on the block instead of values to ensure we
# get the right shape
d = b.astype(object).values
d = vals.astype(object)

assert isinstance(d, np.ndarray), type(d)

# replace NaN with None
if b._can_hold_na:
if ser._can_hold_na:
# Note: this will miss timedeltas since they are converted to int
mask = isna(d)
d[mask] = None

for col_loc, col in zip(b.mgr_locs, d):
data_list[col_loc] = col
data_list[i] = d

return column_names, data_list

Expand Down