
API to_sql method doesn't provide the option to specify unique indexes #7984


Open
manistal opened this issue Aug 11, 2014 · 11 comments
Labels
Enhancement, IO SQL (to_sql, read_sql, read_sql_query)

Comments

@manistal

This could very well be something that just isn't clear to me from reading the docs, but I would like my index to be unique when I dump the DataFrame to a SQLite table, so that when I append more data I don't get duplicate rows. Here's what the resulting schema looks like after using the to_sql function:

sqlite> .schema
CREATE TABLE "Sniffs" (
"ID" TEXT,
"Day" TEXT,
"ModelName" TEXT,
"PassRate" FLOAT,
"RtxName" TEXT,
"Shadow" TEXT,
"Time" TEXT,
"Unit" TEXT
);
CREATE INDEX "ix_Sniffs_ID" ON "Sniffs" ("ID");

Looking at these docs:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql

My resulting call looks something like this:
sniff_df.to_sql('Sniffs', sqleng, if_exists='append', index_label='ID')

Am I reading the docs wrong, or is there currently a way to get unique indexes in the output SQL to avoid overlapping data? Could we get a kwarg like unique_index=True?
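
For reference, one workaround that appears to work today is to let to_sql create the table and then add the unique index with raw DDL through the same engine. A minimal sketch, assuming SQLAlchemy 1.4+ and the Sniffs/ID names from the schema above (the one-row sniff_df here is just a stand-in for the real frame):

import pandas as pd
from sqlalchemy import create_engine, text

sqleng = create_engine("sqlite:///sniffs.db")
sniff_df = pd.DataFrame({"PassRate": [0.97]}, index=pd.Index(["unit-1"], name="ID"))  # stand-in frame

sniff_df.to_sql("Sniffs", sqleng, if_exists="append", index_label="ID")
with sqleng.begin() as conn:
    # From here on, appends that reuse an existing ID raise IntegrityError.
    conn.execute(text('CREATE UNIQUE INDEX IF NOT EXISTS "ux_Sniffs_ID" ON "Sniffs" ("ID")'))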

@ja-b

ja-b commented Dec 19, 2014

+1

@artemyk
Contributor

artemyk commented Dec 19, 2014

So, I believe pandas already provides df.index.is_unique, which could be used to automatically set UNIQUE for indices. Perhaps a kwarg to suppress such behavior would be more helpful.

Or perhaps there should be a more powerful mechanism (with a sensible set of defaults) to specify in a unified manner which indexes to create, whether they should be unique, and if so whether they should be PRIMARY KEYs. Perhaps an argument that specifies the set of constraints/indexes/primary keys/foreign keys to pass to the SQLAlchemy Table() constructor? (We could also have an argument taking a dict that, for different columns, specifies the kwargs to pass to the SQLAlchemy Column() constructor; this would generalize what we do with dtype now.)

@jorisvandenbossche Thoughts?
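
For what it's worth, something close to this can already be assembled by hand: define the table (with whatever constraints you want) through SQLAlchemy yourself, then let to_sql append into it. A rough sketch, assuming SQLAlchemy 1.4+; the column set is only illustrative:

import pandas as pd
from sqlalchemy import (Column, Float, MetaData, String, Table,
                        UniqueConstraint, create_engine)

engine = create_engine("sqlite:///example.db")
metadata = MetaData()

Table(
    "Sniffs", metadata,
    Column("ID", String, primary_key=True),
    Column("ModelName", String),
    Column("PassRate", Float),
    UniqueConstraint("ModelName", name="ux_sniffs_modelname"),
)
metadata.create_all(engine)

df = pd.DataFrame({"ID": ["a", "b"], "ModelName": ["m1", "m2"], "PassRate": [0.9, 0.8]})
df.to_sql("Sniffs", engine, if_exists="append", index=False)  # appends into the pre-built table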

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@femtotrader

+1, this feature would be great.

@xingren23

+1

@icyblade

Please add support for primary keys and indexes.
I'm using pandas for data processing and Tableau for visualization, but Tableau handles data without an index poorly.

@jefferythewind

jefferythewind commented Nov 8, 2016

This issue just came up in a project. I thought a logical default would be to translate the DataFrame's index into the primary key of the DB table, so to_sql would always create tables with a primary key defined by the DataFrame's index. That way, to_sql could simply pass the DB error back to the client on an attempt to insert duplicate primary keys.

Currently the to_sql method always creates tables without primary keys, which could certainly be improved upon.
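
By hand, that behaviour can be approximated today by creating the table with the primary key first and only then appending; the database error then surfaces exactly as described. A sketch, assuming SQLite and SQLAlchemy 1.4+ (the events table and its columns are hypothetical):

import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.exc import IntegrityError

engine = create_engine("sqlite:///events.db")
with engine.begin() as conn:
    conn.execute(text('CREATE TABLE IF NOT EXISTS "events" ("id" INTEGER PRIMARY KEY, "value" TEXT)'))

df = pd.DataFrame({"value": ["a", "b"]}, index=pd.Index([1, 2], name="id"))
df.to_sql("events", engine, if_exists="append")  # the index is written as the "id" column

try:
    df.to_sql("events", engine, if_exists="append")  # same ids again
except IntegrityError:
    print("duplicate primary key rejected by the database")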

@TomAugspurger
Contributor

I don't think we can do this automatically since pd.Index doesn't have to be unique. It would need to be a keyword.
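
For illustration, the check any automatic behaviour would need first (a tiny sketch):

import pandas as pd

df = pd.DataFrame({"value": ["a", "b", "c"]}, index=[1, 2, 2])  # duplicate index label
print(df.index.is_unique)  # False -> this index could not back a UNIQUE constraint as-is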

@manistal
Author

manistal commented Nov 8, 2016

My original thought on this issue was that SQLAlchemy has something like this built in, where you can basically say: I want an arbitrary column going from 0 to N in sequence as the primary key for the table.

project_id = db.Column(
    db.Integer, db.Sequence('project_id'), primary_key=True
)

Pandas achieves something pretty similar by just resetting the index; the result is a new index from 0 to N:

table = old_table.reset_index()

So my thinking is that, because we'd be creating a new index-like object for the primary key, we could have a kwarg to turn this functionality on (resetting the index by default seems unintuitive to me), so that you could tell pandas you want just a vanilla 0-to-N sequence as the primary key.

So basically what @jefferythewind suggested, but with the added requirement that before using the index for the primary key you either reset the index or construct a new sequential index internally.

So maybe something like primary_key='string' as a kwarg on the to_sql function, where the string can be:

- a column name: use that column as the primary key; raise an error if it is not unique
- 'sequential': generate a new sequential primary key
- None: current behaviour of no primary key

As I write this I realize 'sequential' could collide with a real column name in some odd scenario, so someone more clever than me might have to work out the exact argument usage, but what do you guys think of the general functionality?
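
Until something like that exists, the 'sequential' flavour can be approximated by letting the database generate the key. A sketch, assuming SQLite and SQLAlchemy 1.4+ (the projects table is made up, and the primary_key= kwarg above is only the proposal, it does not exist in to_sql today):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///projects.db")
with engine.begin() as conn:
    conn.execute(text(
        'CREATE TABLE IF NOT EXISTS "projects" ('
        '"pk" INTEGER PRIMARY KEY AUTOINCREMENT, "name" TEXT)'
    ))

df = pd.DataFrame({"name": ["alpha", "beta"]})
# index=False: only "name" is inserted, SQLite fills "pk" with 1..N.
df.to_sql("projects", engine, if_exists="append", index=False)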

@jefferythewind

jefferythewind commented Nov 8, 2016

I like the primary_key="column_name" argument, but require that column_name is an actual column name, and raise an error otherwise.

Add another boolean keyword argument (e.g. auto_key=True) which, when set to True, assumes an auto-incrementing column named "index".

These two arguments could not be used together.
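
The argument checking this implies could look roughly like the sketch below (primary_key and auto_key are the proposed, not-yet-existing kwargs, and _resolve_key is just a hypothetical helper):

import pandas as pd

def _resolve_key(df: pd.DataFrame, primary_key=None, auto_key=False):
    if primary_key is not None and auto_key:
        raise ValueError("primary_key and auto_key cannot be used together")
    if auto_key:
        return "index"  # auto-incrementing column named "index"
    if primary_key is not None:
        if primary_key not in df.columns:
            raise ValueError(f"{primary_key!r} is not a column of the frame")
        if not df[primary_key].is_unique:
            raise ValueError(f"column {primary_key!r} is not unique")
        return primary_key
    return None  # current behaviour: no primary key

df = pd.DataFrame({"id": [1, 2, 3], "value": list("abc")})
print(_resolve_key(df, primary_key="id"))  # -> 'id'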

@arita37

arita37 commented Nov 19, 2016

This is a very fundamental issue, since a lot of data needs a primary key.
We need to retrieve the key from the DB and update the pandas frame accordingly before updating, either directly or through SQLAlchemy.
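
That round trip can be done by hand today with an anti-join against the keys already in the table. A sketch, assuming SQLAlchemy 1.4+ and an events table with an id primary key that already exists (names are illustrative):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///events.db")
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

existing = pd.read_sql('SELECT "id" FROM "events"', engine)["id"]
new_rows = df[~df["id"].isin(existing)]  # keep only keys not yet in the DB
new_rows.to_sql("events", engine, if_exists="append", index=False)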

@gbrault

gbrault commented Aug 18, 2020

What is the status of this "Primary key" creation?

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022