
Write "to_sql" in chunks so that the database doesn't timeout #7347


Closed
krishangupta opened this issue Jun 5, 2014 · 2 comments · Fixed by #8062
Labels
IO SQL to_sql, read_sql, read_sql_query

Comments

@krishangupta

When I write a frame with 20,000+ records to the database, I get a timeout from MySQL. I've written some code that writes the data 20,000 records at a time. I think this would be a useful function to have built into pandas.

The code works but needs some cleanup. If others agree this is a useful enhancement request, I can work on it.

import pandas as pd
import sqlalchemy

def write_to_db(engine, frame, table_name, chunk_size):
    start_index = 0
    end_index = chunk_size if chunk_size < len(frame) else len(frame)

    # Replace NaN with None so the database receives proper NULLs
    frame = frame.where(pd.notnull(frame), None)
    if_exists_param = 'replace'

    while start_index != end_index:
        print("Writing rows %s through %s" % (start_index, end_index))
        frame.iloc[start_index:end_index, :].to_sql(con=engine, name=table_name, if_exists=if_exists_param)
        # After the first chunk, append instead of replacing the table
        if_exists_param = 'append'

        start_index = min(start_index + chunk_size, len(frame))
        end_index = min(end_index + chunk_size, len(frame))

engine = sqlalchemy.create_engine('mysql://...')  # database details omitted
write_to_db(engine, frame, 'retail_pendingcustomers', 20000)
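
For comparison, a minimal sketch of how built-in support might be used, assuming to_sql grows a chunksize keyword mirroring to_csv (the keyword is the one suggested in the discussion below, not an API that existed when this was filed):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql://...')  # database details omitted

# Hypothetical built-in chunking: pandas would split the frame into
# batches of 20,000 rows and issue a separate INSERT for each batch,
# so no single statement runs long enough to hit the server timeout.
frame.to_sql('retail_pendingcustomers', engine, if_exists='replace', chunksize=20000)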
@hayd added the SQL label Jun 5, 2014
@jorisvandenbossche
Member

This is coming from a Stack Overflow question: http://stackoverflow.com/questions/24007762/python-pandas-using-to-sql-with-large-data-frames

With PostgreSQL I have personally never encountered this issue (even with more records), but if it is a problem, then we should solve it!

For the API, to_csv also has this functionality via the chunksize keyword (as do read_csv and read_hdf for reading), so it is best to use the same keyword.
The implementation of to_csv is here: https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1401, and for to_sql your approach above is OK I think, but it can happen at a lower level, namely inside the insert function (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L579).
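
If it helps to picture that lower-level variant, here is a rough sketch of chunking inside the insert step, using SQLAlchemy directly; the function and argument names are illustrative, not the actual pandas internals:

# Rough sketch only: chunk the rows before executing the INSERTs, so each
# executemany stays small enough to avoid server-side timeouts.
# "table" is a sqlalchemy.Table, "keys" the column names, "rows" the row data;
# in pandas these would be derived from the DataFrame inside the insert
# function linked above.
def insert_in_chunks(connection, table, keys, rows, chunksize=None):
    chunksize = chunksize or len(rows)  # no chunksize -> single batch, current behaviour
    insert_stmt = table.insert()
    for start in range(0, len(rows), chunksize):
        batch = rows[start:start + chunksize]
        connection.execute(insert_stmt, [dict(zip(keys, row)) for row in batch])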

If you want to work on this, certainly do!

@artemyk
Contributor

artemyk commented Aug 18, 2014

I created an initial pull request trying to address this issue.
