Skip to content

ENH: The need to re-design to_sql() method #49246

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
1 of 3 tasks
redreamality opened this issue Oct 22, 2022 · 4 comments
Closed
1 of 3 tasks

ENH: The need to re-design to_sql() method #49246

redreamality opened this issue Oct 22, 2022 · 4 comments
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@redreamality
Copy link

redreamality commented Oct 22, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

There exists too much issues about this functionality, some even not close for 5 years. The original design of to_sql didn't fully take different db_dialects into consideration. Overly high level design of this function also prohibits the testing and functionality changing of this function, some PR #48331 was even being blocked due to unrelated test issues, which largely discourages the community.

To this extent, I would like to suggest breaking it down to some more specific and lower-level functions, like _to_sql_oracle, etc
making this less.
Another suggestion is that: It is acceptable if only partial functionality is implemented, say, only for a certain kind of db is designed. If some funcitonality is not officially supported, presenting example gist/code snippets in the document is also a good choice.

Refered issues:
#15988
#48331
#15988
#40647
#35347
#41335

Feature Description

Breaking down the to_sql() into lower level dbs, according to the connection type.

Alternative Solutions

see description.

Additional Context

No response

@redreamality redreamality added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 22, 2022
@jreback
Copy link
Contributor

jreback commented Oct 22, 2022

-1 this is not pandas responsibility - most of this is deferred to sqlalchemy already

@redreamality
Copy link
Author

redreamality commented Oct 22, 2022

There will exist some operations that is very different among different dbs, for example the upsert operation.
If the pandas community regard this as unrelated issues, I suggest removing this functionality at all.
After all, it is not difficult to iter over all the lines and insert into a db and it is better to tell the user that the functionality is not available than telling them a nice picture along with an awful experience of searching around and finding out the functionality is not ready for production at all!

@ParfaitG
Copy link
Contributor

While not an author of this particular method, the philosophy of most pandas IO methods (to_csv, to_excel, to_json, to_xml, etc.) underscores general use cases to migrate data from tabular, two-dimension DataFrames (rows by columns) to external formats in similar flat two-dimension structures. However, these formats can have many nested qualities, hierarchies, and dimensions of which pandas' to_* methods usually serves one type.

Likewise, DataFrame.to_sql is meant to be database-agnostic and generalizable to any SQL Alchemy connected backend. For the base case of migrating data from a pandas DataFrame to a database table, this IO method does work and works well across RDBMSs. However, database tables can have many constraints like primary keys, special data types, validation rules, and other properties.

The issues you link are nuanced needs for particular schema. The to_sql method cannot solve every particular need of every RDBMS connection. Users need to develop tailored solutions with their DB-APIs to handle their specific use cases such as duplicate keys and temp table staging and work with their dialects that support non-industry standard SQL such as UPSERT, INSERT IGNORE, or CREATE TEMP TABLE. Again, pandas to_* methods are really for general use cases of tabular data migration.

@mroeschke
Copy link
Member

Thanks for the suggestion, but I would be -1 on refactoring or removing to_sql in its current state given the reasoning above. Closing as unlikely to be pursued.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

No branches or pull requests

4 participants