DOC: df.to_sql chunksize seems to be ignored by default. #35891

Closed
gbrova opened this issue Aug 25, 2020 · 7 comments · Fixed by #59727
Labels: Docs, IO SQL (to_sql, read_sql, read_sql_query)

Comments

@gbrova

gbrova commented Aug 25, 2020

Location of the documentation

https://dev.pandas.io/docs/reference/api/pandas.DataFrame.to_sql.html

Documentation problem

Docs for chunksize state that "By default, all rows will be written at once". However, this seems to only be true if method="multi", even though method=None by default.

Experimentally, I notice a substantial speedup if I set method="multi" and chunksize is either unset or large, but chunksize seems to have no effect if I don't set method.
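Roughly the kind of comparison I ran (the connection URL, table name, and sizes here are placeholders, not my exact setup):

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection URL and table name; adjust for your database.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/testdb")
df = pd.DataFrame(np.random.randint(0, 100, size=(10_000, 4)), columns=list("ABCD"))

# Default: method=None; chunksize seems to make no difference here.
df.to_sql("some_table", engine, if_exists="replace")

# Noticeably faster in my tests: multi-values INSERTs, batched by chunksize.
df.to_sql("some_table", engine, if_exists="replace", method="multi", chunksize=1_000)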

Suggested fix for documentation

Documentation for chunksize should also reference the method argument. For example, we could revise it to:

chunksize : int, optional
Specify the number of rows in each batch to be written at a time, if batching is enabled via method. By default, all rows will be written at once.

Also, are there many use cases where users should avoid method="multi"? If not, would it make sense to change the default?

gbrova added the Docs and Needs Triage labels Aug 25, 2020
@gbrova
Author

gbrova commented Aug 25, 2020

FWIW, I tested the above on pandas 1.1.1 and Python 3.7.5.

@TomAugspurger
Contributor

TomAugspurger commented Sep 4, 2020

Yeah, can you make a PR with your suggested change to the docs? Not sure about avoiding method="multi". It was added later, and we likely kept the default for backwards compatibility.

TomAugspurger added the IO SQL label and removed the Needs Triage label Sep 4, 2020
@gbrova
Author

gbrova commented Sep 6, 2020

Thanks! Just opened #36172 to change the docs. I agree it's best to leave the default as it is for backwards compatibility.

@jorisvandenbossche
Member

So the code where chunksize is handled for to_sql is here:

pandas/pandas/io/sql.py

Lines 797 to 831 in 6aa311d

def insert(self, chunksize=None, method=None):
    # set insert method
    if method is None:
        exec_insert = self._execute_insert
    elif method == "multi":
        exec_insert = self._execute_insert_multi
    elif callable(method):
        exec_insert = partial(method, self)
    else:
        raise ValueError(f"Invalid parameter `method`: {method}")

    keys, data_list = self.insert_data()

    nrows = len(self.frame)

    if nrows == 0:
        return

    if chunksize is None:
        chunksize = nrows
    elif chunksize == 0:
        raise ValueError("chunksize argument should be non-zero")

    chunks = int(nrows / chunksize) + 1

    with self.pd_sql.run_transaction() as conn:
        for i in range(chunks):
            start_i = i * chunksize
            end_i = min((i + 1) * chunksize, nrows)
            if start_i >= end_i:
                break

            chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
            exec_insert(conn, keys, chunk_iter)

From that code snippet, it doesn't appear that chunksize is handled differently depending on method. So I think the docstring is correct.

But it may well be that the effect of the chunksize keyword depends on which insert method is being used, and probably also on which database / database driver is involved, etc.

@gbrova
Author

gbrova commented Sep 10, 2020

Thanks for looking into this!

If I understand correctly from this snippet, the exec_insert function is called chunks times on line 831. exec_insert is self._execute_insert by default (line 801), and only becomes self._execute_insert_multi if method="multi" is passed (line 803).

So if method is not set, we call self._execute_insert (not its _multi cousin) once per chunk, but the chunking is irrelevant if that function writes one row at a time. (I confess I'm not an SQLAlchemy expert, but these functions seem analogous to the ones for SQLite, where it's explicit that num_rows=1.)
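To make the distinction concrete, here is a toy sketch using the stdlib sqlite3 module (purely illustrative, not the pandas internals) of the two statement shapes I have in mind:

import sqlite3

rows = [(i, i * 2) for i in range(10)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")

# Row-at-a-time style: one parameter set per statement execution,
# which is how I read the default (method=None) path.
con.executemany("INSERT INTO t (a, b) VALUES (?, ?)", rows)

# Multi-values style: all rows inlined into a single INSERT statement,
# which is how I read the method="multi" path.
placeholders = ", ".join(["(?, ?)"] * len(rows))
flat = [v for row in rows for v in row]
con.execute(f"INSERT INTO t (a, b) VALUES {placeholders}", flat)

con.commit()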

Please correct me if I'm misunderstanding.

@gbrova
Author

gbrova commented Oct 15, 2020

@jorisvandenbossche I just came across this problem again today and remembered this thread.

You're absolutely right that my confusion is about the effect. As you point out, the chunksize keyword controls the number of rows in each batch written to the database connection object. That is different from the number of rows written per database operation, which is how I initially (mis)read the docs.

Specifically, I was surprised that this snippet:

url = f"postgresql+psycopg2://postgres:password@localhost:5432/testdb"
engine = create_engine(url)

df_size = 10_000
df = pd.DataFrame(np.random.randint(0,100,size=(df_size, 4)), columns=list('ABCD'))
df.to_sql("some_table", engine)

sends ~10k short TCP packets to the database, even though "By default, all rows will be written at once".

I think the docs can be improved by guiding the user to also look at the method keyword, which is IMO the more practical starting point for someone looking to improve query performance by batching inserts.
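For completeness, this is the pairing of the two keywords I have in mind when suggesting the cross-reference (same df and engine as in the snippet above; the exact chunk size is arbitrary):

# Rows are split into batches of `chunksize`; with method="multi" each batch
# becomes one multi-values INSERT instead of thousands of single-row statements.
df.to_sql("some_table", engine, if_exists="replace", method="multi", chunksize=500)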

What do you think about the following docs change?

chunksize : int, optional
Specify the number of rows in each batch to be written to the database connection at a time. By default, all rows will be written at once. Also see the method keyword.

@deekapila
Contributor

take
