Skip to content

REF: sql insert_data operate column-wise to avoid internals #33229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 7, 2020

Conversation

jbrockmendel
Copy link
Member

No description provided.

pandas/io/sql.py Outdated
# object array of Timedeltas
d = b.values.astype(object)

for i in range(len(temp.columns)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the issue with using .items that it can provide a frame back in case of duplicate labels? If so wonder if we shouldn't just update .items to do what you have done here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean BlockManager.items? i dont see how thats relevant here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temp is a DataFrame, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. this is just doing this column-wise instead of block-wise to avoid accessing ._data

@WillAyd WillAyd added Clean IO SQL to_sql, read_sql, read_sql_query labels Apr 2, 2020

for i in range(len(temp.columns)):
ser = temp.iloc[:, i]
vals = ser._values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we start with not using iloc for this pattern (that gives a lot of overhead).
Not that it will matter very much in this case, though, I suppose, since we are converting to object dtype below which will be more costly.

See eg the helper function you asked me to remove in 07372e3

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, _ixs works here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although _ixs helps, the big win actually comes from avoiding creating a series if all you need are the values (as I did in the linked snippet)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Until #33252 goes through, any objection to _ixs here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with jeff on this one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's exactly the goal of the helper function I am adding in #33252 to be used in situations where only the array values are needed, like here is the case.

So if we are adding such helper method, why not use it? (the reason I did the PR is because I want to see it used in several places, and many of those places will be outside of frame.py, eg in the ops code)

If the privateness of the method is a problem, let's discuss making it public. Although I personally think that is not needed right now (and we are using plenty of semi-private methods on DataFrame/Series outside of frame.py already)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets revisit that once #33252 is merged

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR certainly does not need to wait on #33252 being merged, but I still first want to have the discussion whether we find this an appropriate place to use that function here, once it exists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be using internal functions except in the internals. This is not internal by any stretch (would say that pandas.io is offlimits entirely)


for i in range(len(temp.columns)):
ser = temp.iloc[:, i]
vals = ser._values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like using private methods like this, what is wrong with .iloc here?


for i in range(len(temp.columns)):
ser = temp.iloc[:, i]
vals = ser._values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR certainly does not need to wait on #33252 being merged, but I still first want to have the discussion whether we find this an appropriate place to use that function here, once it exists.

pandas/io/sql.py Outdated
d = b.values.astype(object)

for i in range(len(temp.columns)):
ser = temp.iloc[:, i]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should just use .items() here no? which yields the name and column

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're going to need i below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hinted at this in my first comment but I do also agree it would be nice to use items - we seem to special case it a lot (I've done so in groupby) I think just to work around the fact that it can return a DataFrame for duplicate labels

If we didn't do that maybe we could just enumerate over df.items() and have one way of doing things

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hinted at this in my first comment

Oh! you meant DataFrame.items! I thought you were referring to BlockManager.items, which matches DataFrame.columns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes (with an enumerate on top if you need the i)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to use enumerate and .items

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Only comment I have is that this probably will fail with duplicate column labels if not already tested.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame.items has a check for columns.is_unique; is that check not enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks; misunderstanding on my end

@jorisvandenbossche
Copy link
Member

I don't think we should be using internal functions except in the internals. This is not internal by any stretch (would say that pandas.io is offlimits entirely)

Can you please explain a bit more why you don't want a method like DataFrame._iter_column_arrays to be used outside of frame.py ?

Yes, we want to use as much public functions in the other pandas modules. But we do have a set of "private" methods that we use ourselves, that I would see as "internal developer APIs". For example, we also use Series._values, Series._can_hold_na, Index._get_level_values (this are a few examples currently being used in sql.py), and many others, outside core/series.py or core/indexes/.
Or do you find that those should also not be used?

Or, do you want to limit the use here because it is outside pandas/core? (initially, you commented you didn't want private DataFrame methods to be used outside of frame.py, but maybe you meant outside pandas/core, and within all of /core it is fine to use those? In which case the argument would only be about pandas.io, pandas.plotting, pandas.tseries, more or less)
Just trying to understand your argument.

@jbrockmendel
Copy link
Member Author

Can you please explain a bit more why you don't want a method like DataFrame._iter_column_arrays to be used outside of frame.py ?

Can we please bike-shed this elsewhere?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm I think a nice cleanup

@jorisvandenbossche jorisvandenbossche merged commit b195a67 into pandas-dev:master Apr 7, 2020
@jbrockmendel jbrockmendel deleted the no-data-sql branch April 7, 2020 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants