-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: sql insert_data operate column-wise to avoid internals #33229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
930a6ec
b69f466
8834ac1
7328cec
a480ddd
6ce360b
eae5b1e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -692,37 +692,26 @@ def insert_data(self): | |
column_names = list(map(str, temp.columns)) | ||
ncols = len(column_names) | ||
data_list = [None] * ncols | ||
blocks = temp._data.blocks | ||
|
||
for b in blocks: | ||
if b.is_datetime: | ||
# return datetime.datetime objects | ||
if b.is_datetimetz: | ||
# GH 9086: Ensure we return datetimes with timezone info | ||
# Need to return 2-D data; DatetimeIndex is 1D | ||
d = b.values.to_pydatetime() | ||
d = np.atleast_2d(d) | ||
else: | ||
# convert to microsecond resolution for datetime.datetime | ||
d = b.values.astype("M8[us]").astype(object) | ||
elif b.is_timedelta: | ||
# numpy converts this to an object array of integers, | ||
# whereas b.astype(object).values would convert to | ||
# object array of Timedeltas | ||
d = b.values.astype(object) | ||
|
||
for i in range(len(temp.columns)): | ||
ser = temp.iloc[:, i] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you should just use .items() here no? which yields the name and column There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we're going to need There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hinted at this in my first comment but I do also agree it would be nice to use If we didn't do that maybe we could just enumerate over There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Oh! you meant DataFrame.items! I thought you were referring to BlockManager.items, which matches DataFrame.columns There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes (with an enumerate on top if you need the i) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. updated to use enumerate and .items There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks good. Only comment I have is that this probably will fail with duplicate column labels if not already tested. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. DataFrame.items has a check for columns.is_unique; is that check not enough? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah thanks; misunderstanding on my end |
||
vals = ser._values | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we start with not using iloc for this pattern (that gives a lot of overhead). See eg the helper function you asked me to remove in 07372e3 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sure, _ixs works here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Although _ixs helps, the big win actually comes from avoiding creating a series if all you need are the values (as I did in the linked snippet) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK. Until #33252 goes through, any objection to _ixs here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i agree with jeff on this one There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's exactly the goal of the helper function I am adding in #33252 to be used in situations where only the array values are needed, like here is the case. So if we are adding such helper method, why not use it? (the reason I did the PR is because I want to see it used in several places, and many of those places will be outside of frame.py, eg in the ops code) If the privateness of the method is a problem, let's discuss making it public. Although I personally think that is not needed right now (and we are using plenty of semi-private methods on DataFrame/Series outside of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. lets revisit that once #33252 is merged There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This PR certainly does not need to wait on #33252 being merged, but I still first want to have the discussion whether we find this an appropriate place to use that function here, once it exists. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we should be using internal functions except in the internals. This is not internal by any stretch (would say that pandas.io is offlimits entirely) |
||
if vals.dtype.kind == "M": | ||
d = vals.to_pydatetime() | ||
elif vals.dtype.kind == "m": | ||
# store as integers, see GH#6921, GH#7076 | ||
d = vals.view("i8").astype(object) | ||
else: | ||
# TODO(2DEA): astype-first can be avoided with 2D EAs | ||
# astype on the block instead of values to ensure we | ||
# get the right shape | ||
d = b.astype(object).values | ||
d = vals.astype(object) | ||
|
||
assert isinstance(d, np.ndarray), type(d) | ||
|
||
# replace NaN with None | ||
if b._can_hold_na: | ||
if ser._can_hold_na: | ||
# Note: this will miss timedeltas since they are converted to int | ||
mask = isna(d) | ||
d[mask] = None | ||
|
||
for col_loc, col in zip(b.mgr_locs, d): | ||
data_list[col_loc] = col | ||
data_list[i] = d | ||
|
||
return column_names, data_list | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the issue with using
.items
that it can provide a frame back in case of duplicate labels? If so wonder if we shouldn't just update.items
to do what you have done hereThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean BlockManager.items? i dont see how thats relevant here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temp
is a DataFrame, no?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. this is just doing this column-wise instead of block-wise to avoid accessing
._data