
Write "to_sql" in chunks so that the database doesn't timeout #7347


Closed
krishangupta opened this issue Jun 5, 2014 · 2 comments · Fixed by #8062
Labels
IO SQL to_sql, read_sql, read_sql_query

Comments

@krishangupta

When I write a frame with 20,000+ records to the database, I get a timeout from MySQL. I've written some code that writes the data 20,000 records at a time. I think this would be a useful function to have built into pandas.

The code works but needs some cleanup. If others agree this is a useful enhancement request, I can work on it.

import pandas as pd
import sqlalchemy

def write_to_db(engine, frame, table_name, chunk_size):
    start_index = 0
    end_index = chunk_size if chunk_size < len(frame) else len(frame)

    # Replace NaN with None so the database receives proper NULLs
    frame = frame.where(pd.notnull(frame), None)
    if_exists_param = 'replace'

    while start_index != end_index:
        print("Writing rows %s through %s" % (start_index, end_index))
        frame.iloc[start_index:end_index, :].to_sql(con=engine, name=table_name, if_exists=if_exists_param)
        # After the first chunk, append instead of replacing the table
        if_exists_param = 'append'

        start_index = min(start_index + chunk_size, len(frame))
        end_index = min(end_index + chunk_size, len(frame))

engine = sqlalchemy.create_engine('mysql://...')  # database details omitted
write_to_db(engine, frame, 'retail_pendingcustomers', 20000)
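
For comparison, a minimal sketch of how built-in support might be used, assuming to_sql grows a chunksize keyword mirroring to_csv (the keyword is the one suggested in the discussion below, not an API that existed when this was filed):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql://...')  # database details omitted

# Hypothetical built-in chunking: pandas would split the frame into
# batches of 20,000 rows and issue a separate INSERT for each batch,
# so no single statement runs long enough to hit the server timeout.
frame.to_sql('retail_pendingcustomers', engine, if_exists='replace', chunksize=20000)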
@hayd added the SQL label Jun 5, 2014
@jorisvandenbossche
Member

This is coming from a Stack Overflow question: http://stackoverflow.com/questions/24007762/python-pandas-using-to-sql-with-large-data-frames

With PostgreSQL I have personally never encountered this issue (even with more records), but if it is a problem, then we should solve it!

For the API, to_csv also has this functionality via the chunksize keyword (as do read_csv and read_hdf for reading), so it is best to use the same keyword.
The implementation of to_csv is here: https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1401, and for to_sql your approach above is OK I think, but it can happen at a lower level, namely inside the insert function (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L579).
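
If it helps to picture that lower-level variant, here is a rough sketch of chunking inside the insert step, using SQLAlchemy directly; the function and argument names are illustrative, not the actual pandas internals:

# Rough sketch only: chunk the rows before executing the INSERTs, so each
# executemany stays small enough to avoid server-side timeouts.
# "table" is a sqlalchemy.Table, "keys" the column names, "rows" the row data;
# in pandas these would be derived from the DataFrame inside the insert
# function linked above.
def insert_in_chunks(connection, table, keys, rows, chunksize=None):
    chunksize = chunksize or len(rows)  # no chunksize -> single batch, current behaviour
    insert_stmt = table.insert()
    for start in range(0, len(rows), chunksize):
        batch = rows[start:start + chunksize]
        connection.execute(insert_stmt, [dict(zip(keys, row)) for row in batch])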

If you want to work on this, certainly do!

@artemyk
Contributor

artemyk commented Aug 18, 2014

I created an initial pull request trying to address this issue.
