Skip to content

Status of io.sql.get_schema as public function? #9960

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jorisvandenbossche opened this issue Apr 21, 2015 · 25 comments
Open

Status of io.sql.get_schema as public function? #9960

jorisvandenbossche opened this issue Apr 21, 2015 · 25 comments
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query Needs Discussion Requires discussion from core team before further action

Comments

@jorisvandenbossche
Copy link
Member

At this moment, pd.io.sql.get_schema is not documented (not in the API docs, and not in the io docs). But, it is a potential useful function, so I think it would be good to be more explicit about its status (by mentioning it in the docs).

However, there are some quirks about the function:

  • The signature: pd.io.sql.get_schema(frame, name, flavor='sqlite', keys=None, con=Non e, dtype=None) -> flavor keyword in the third place, while we want to deprecate it (and this means you cannot do get_schema(df, 'name', engine), but always have to do get_schema(df, 'name', con=engine).
    Ideally this should have the same arguments (order) as to_sql (pd.io.sql.to_sql(frame, name, con, flavor='sqlite', schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None)) (only chunksize is not relevant)
  • It should have all options to modify the resulting sql table as to_sql
  • Maybe also the option to return the sqlalchemy Table instead of the string itself?

That we maybe should first solve before making it more explicitely public?

Triggered by http://stackoverflow.com/questions/29749356/python-pandas-export-structure-only-no-rows-of-a-dataframe-to-sql/

@jorisvandenbossche
Copy link
Member Author

@jorisvandenbossche jorisvandenbossche added IO SQL to_sql, read_sql, read_sql_query API Design labels Apr 21, 2015
@TomAugspurger
Copy link
Contributor

I've used this (or something similar) in my to_redshift(https://gist.github.com/TomAugspurger/1b6ed51edfbc17c26d85) method; so I think it would be good to document.

As far as returning a Table vs the schema as a string, the function name indicates a string. I haven't had any cases where having the SQLA Table would have been useful, especially if it isn't reflecting the destination (I'm not sure if we'd reflect or not).

@artemyk
Copy link
Contributor

artemyk commented Apr 21, 2015

It seems like the more fundamental issue --- at least in regards to the stackoverflow question --- is that to_sql is not setting the right data types for an empty dataframe. As far as I can tell, this comes from the fact that:

>>> df = pd.DataFrame({'a':[1,2,3],'b':['b','b2','b3']})
>>> df.ix[[0,],:].ix[:,0].dtype
dtype('int64')

while

>>> df = pd.DataFrame({'a':[1,2,3],'b':['b','b2','b3']})
>>> df.ix[[],:].ix[:,0].dtype
dtype('O')

Is there a reason for this discrepancy?

@artemyk
Copy link
Contributor

artemyk commented Apr 21, 2015

So confused.

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a':[1,2,3],'b':['b','b2','b3']})

In [3]: df.ix[[],:].loc[:,'a']
Out[3]: Series([], name: a, dtype: int64)

In [4]: df.ix[[],:].iloc[:,0]
Out[4]: Series([], name: a, dtype: object)

Why the difference in dtypes?

@jorisvandenbossche
Copy link
Member Author

@artemyk Can you open a new issue for that? (to not hijack this thread)

@jorisvandenbossche
Copy link
Member Author

I think the point here is that there should be a way to create the database table from a DataFrame, without having to do a non-selection to obtain an empty dataframe to write.
And therefore, get_schema could possibly be helpful (as you can execute the output).

@artemyk
Copy link
Contributor

artemyk commented Apr 22, 2015

@jorisvandenbossche OK, i will open a new issue. I did come across this because I think its at the root of the dtypes being lost on an empty dataframe. And I do think that inserting an empty dataframe should insert the dtypes correctly. Though get_schema could still be useful, I imagine.

@jorisvandenbossche
Copy link
Member Author

@artemyk yes, I agree writing an empty frame should also work (but more an issue with indexing, than with the sql code)

@TomAugspurger I see you didn't use get_schema in the code you linked to, but the SQLTable.sql_schema() method? (but that uses the same code under the hood)

On returning an sqlalchemy Table, I was thinking this could be potentially useful if you want to adjust some things (eg constraints on columns), that are now not yet possible through our interface.

@TomAugspurger
Copy link
Contributor

@jorisvandenbossche Yeah, I didn't know about the get_schema method, hence the need for docs :)

That's a fair point about making adjustments to the generated schema. And getting the actual schema should just be a str(Table) then right?

The order of arguments should be aligned with to_sql (maybe warn for a cycle).

@artemyk
Copy link
Contributor

artemyk commented Apr 22, 2015

@jorisvandenbossche I would consider making a separate function to return a Table rather than the schema string. Seems a bit cleaner...

@jorisvandenbossche
Copy link
Member Author

yes, I was also think that using a new name for get_schema could also solve the backwards-compatibility problem with its signature.

Name suggestions?

@jorisvandenbossche
Copy link
Member Author

maybe something alluding on 'create statement', as 'schema' is actually a confusing name (in some databases this has another meaning, eg see also the schema kwarg in to_sql)

Some possibilities (but I am not yet convinced ..)

to_create_statement
get_create_statement

@jorisvandenbossche jorisvandenbossche added this to the 0.17.0 milestone Apr 22, 2015
@artemyk
Copy link
Contributor

artemyk commented Apr 22, 2015

other ideas:
get_ddl_schema
get_table_ddl

for returning the Table object:
get_sqlalchemy_table
get_table_sqlalchemy

@artemyk
Copy link
Contributor

artemyk commented Apr 30, 2015

Another inconsistency to think about is that get_schema takes a keys parameter (to specify primary keys), but to_sql doesn't. So maybe this is a chance to improve some of the DDL creation code. Ideally, pandas should be able to create a table with unique/non-unique indexes, primary keys, foreign keys, etc. without the user having to execute raw SQL themselves.

What about the following set of arguments to both to_sql and get_schema (or whatever that method is eventually called):

  • frame - DataFrame
  • name - table name
  • con - connection object
  • schema - database schema (yes, definitely confusing viz-a-viz get_schema --- let's rename for sure)
  • if_exists - Not sure if this should be a parameter to get_schema also, or if that should just always return the creation code (even if table exists)
  • index - Whether to add index as a column or not. Default True
  • primary_keys - Which column(s) to make the primary key. Can be a single string or a list of strings. Defaults to no primary keys.
  • sql_indexes - Which columns to index. Can be a single string (e.g. 'name' -> name column is indexed), a list of strings (e.g. ['name','title'] -> name column is indexed, title column is indexed) or a list of lists of strings (e.g. [ ['name','title'], ] -> (name, title) are indexed jointly). If index=True, defaults to making the DataFrame index column. If DataFrame has multi-index, all multi-index series are indexed jointly (notice that this is different from current behavior: IIRC, currently the columns of a multi-index would be indexed separately)
  • unique_sql_indexes - Same as sql_indexes but for creating indexes w/ UNIQUE constraint. Defaults to no unique indexes.
  • foreign_keys - Dict of form {'column': ('OtherTable','id'), 'column2':('OtherTable2','id'), ...}
  • dtype - Dictionary of column to dtype mappings.
  • flavor - Is fallback mysql still going to be around in 0.17? If so, this should be included also.

to_sql will also have the chunksize parameter.

I propose eliminating/depreciating the index_label parameter (see currently to_sql implementation). The description says "Column label for index column(s). If None is given (default) and index is True, then the index names are used". It seems like the user could simply rename the DataFrame indexes in order to achieve this.

BTW, this could also close #9084 and #7984 .

@jreback
Copy link
Contributor

jreback commented Aug 15, 2015

@jorisvandenbossche status?

@jreback
Copy link
Contributor

jreback commented Aug 20, 2015

@jorisvandenbossche satus?

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.17.0 Aug 20, 2015
@jorisvandenbossche
Copy link
Member Author

Will try to look at it next week, but no blocker for 0.17.0

@qytz
Copy link

qytz commented Nov 15, 2019

status of this issue ?

@totalhack
Copy link

Are there any plans to implement @artemyk's suggestion of unifying the APIs of to_sql and get_schema? In particular I'm wondering why the keys param isn't made available in to_sql such that created tables end up with a primary key.

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed API Design labels Apr 18, 2021
@henryh9n
Copy link

henryh9n commented Mar 8, 2022

@mroeschke @jorisvandenbossche @artemyk Any updates on this? Looks like there's been no progress with this for a while now.
Came across this a few days ago. get_schema is really useful, but seems like it's not complete and not documented yet.
If there are no blockers and you did not change your mind, I'd love to work on this.

@jreback
Copy link
Contributor

jreback commented Mar 8, 2022

someone would need to present a concrete proposal

this has not been discussed in years

@henryh9n
Copy link

@jreback If that's possible, I'd like to take this issue and present a proposal.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@wtfzambo
Copy link

Semi related question: is this issue the reason for which Pylance can't see this function in the sql module, but Jedi can?

E.g: if I do

import pandas.io.sql as sql

foo = sql.get_schema(...)

Pylance throws an error saying: "get_schema" is not a known member of module.
The only members that it's able to see are: PandasSQL, read_sql, read_sql_query, read_sql_table.

Why is that?

@Shinnnyshinshin
Copy link

Also asking the same question as @wtfzambo. get_schema() is still very useful to me, especially for creating a temp_table that will be appended to using df.to_sql(). Is there any update on how this method should properly be used? Thank you in advance

import pandas.io.sql as sql

bar = sql.get_schema(...)

@jlynchMicron
Copy link

+1 for public function to get SQLalchemy table schema from existing dataframe (without trying to inserting rows to literal table).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests