DOC: df.to_sql chunksize seems to be ignored by default. #35891

Closed
gbrova opened this issue Aug 25, 2020 · 7 comments · Fixed by #59727
Labels: Docs, IO SQL (to_sql, read_sql, read_sql_query)

Comments

@gbrova

gbrova commented Aug 25, 2020

Location of the documentation

https://dev.pandas.io/docs/reference/api/pandas.DataFrame.to_sql.html

Documentation problem

Docs for chunksize state that "By default, all rows will be written at once". However, this seems to only be true if method="multi", even though method=None by default.

Experimentally, I notice a substantial speedup if I set method="multi" and chunksize is either unset or large, but chunksize seems to have no effect if I don't set method.
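Roughly the kind of comparison I ran (the connection URL, table name, and sizes here are placeholders, not my exact setup):

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection URL and table name; adjust for your database.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/testdb")
df = pd.DataFrame(np.random.randint(0, 100, size=(10_000, 4)), columns=list("ABCD"))

# Default: method=None; chunksize seems to make no difference here.
df.to_sql("some_table", engine, if_exists="replace")

# Noticeably faster in my tests: multi-values INSERTs, batched by chunksize.
df.to_sql("some_table", engine, if_exists="replace", method="multi", chunksize=1_000)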

Suggested fix for documentation

Documentation for chunksize should also reference the method argument. For example, we could revise it to:

chunksize : int, optional
Specify the number of rows in each batch to be written at a time, if batching is enabled via method. By default, all rows will be written at once.

Also, are there many use cases where users should avoid method="multi"? If not, would it make sense to change the default?

gbrova added the Docs and Needs Triage labels Aug 25, 2020
@gbrova
Author

gbrova commented Aug 25, 2020

FWIW, I tested the above on pandas 1.1.1 and Python 3.7.5.

@TomAugspurger
Contributor

TomAugspurger commented Sep 4, 2020

Yeah, can you make a PR with your suggested change to the docs? Not sure about avoiding method="multi". It was added later, and we likely kept the default for backwards compatibility.

TomAugspurger added the IO SQL label and removed the Needs Triage label Sep 4, 2020
@gbrova
Author

gbrova commented Sep 6, 2020

Thanks! Just opened #36172 to change the docs. I agree it's best to leave the default as it is for backwards compatibility.

@jorisvandenbossche
Member

So the code where chunksize is handled for to_sql is here:

pandas/pandas/io/sql.py

Lines 797 to 831 in 6aa311d

def insert(self, chunksize=None, method=None):
    # set insert method
    if method is None:
        exec_insert = self._execute_insert
    elif method == "multi":
        exec_insert = self._execute_insert_multi
    elif callable(method):
        exec_insert = partial(method, self)
    else:
        raise ValueError(f"Invalid parameter `method`: {method}")

    keys, data_list = self.insert_data()

    nrows = len(self.frame)

    if nrows == 0:
        return

    if chunksize is None:
        chunksize = nrows
    elif chunksize == 0:
        raise ValueError("chunksize argument should be non-zero")

    chunks = int(nrows / chunksize) + 1

    with self.pd_sql.run_transaction() as conn:
        for i in range(chunks):
            start_i = i * chunksize
            end_i = min((i + 1) * chunksize, nrows)
            if start_i >= end_i:
                break

            chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
            exec_insert(conn, keys, chunk_iter)

From that code snippet, it doesn't appear that chunksize is handled differently depending on method. So I think the docstring is correct.

But it may well be that the effect of the chunksize keyword depends on which insert method is being used, and probably also on which database / database driver is involved, etc.

@gbrova
Author

gbrova commented Sep 10, 2020

Thanks for looking into this!

If I understand correctly from this snippet, the exec_insert function is called chunks times on line 831. exec_insert is self._execute_insert by default (line 801), and only becomes self._execute_insert_multi if method="multi" is passed (line 803).

So if method is not set, we call self._execute_insert (not its _multi cousin) once per chunk, but the chunking is irrelevant if that function writes one row at a time. (I confess I'm not an SQLAlchemy expert, but these functions seem analogous to the ones for SQLite, where it's explicit that num_rows=1.)
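To make the distinction concrete, here is a toy sketch using the stdlib sqlite3 module (purely illustrative, not the pandas internals) of the two statement shapes I have in mind:

import sqlite3

rows = [(i, i * 2) for i in range(10)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")

# Row-at-a-time style: one parameter set per statement execution,
# which is how I read the default (method=None) path.
con.executemany("INSERT INTO t (a, b) VALUES (?, ?)", rows)

# Multi-values style: all rows inlined into a single INSERT statement,
# which is how I read the method="multi" path.
placeholders = ", ".join(["(?, ?)"] * len(rows))
flat = [v for row in rows for v in row]
con.execute(f"INSERT INTO t (a, b) VALUES {placeholders}", flat)

con.commit()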

Please correct me if I'm misunderstanding.

@gbrova
Author

gbrova commented Oct 15, 2020

@jorisvandenbossche I just came across this problem again today and remembered this thread.

You're absolutely right that my confusion is about the effect. As you point out, the chunksize keyword controls the number of rows in each batch written to the database connection object. That is different from the number of rows written per database operation, which is how I initially (mis)read the docs.

Specifically, I was surprised that this snippet:

url = f"postgresql+psycopg2://postgres:password@localhost:5432/testdb"
engine = create_engine(url)

df_size = 10_000
df = pd.DataFrame(np.random.randint(0,100,size=(df_size, 4)), columns=list('ABCD'))
df.to_sql("some_table", engine)

sends ~10k short TCP packets to the database, even though "By default, all rows will be written at once".

I think the docs can be improved by guiding the user to also look at the method keyword, which is IMO the more practical starting point for someone looking to improve query performance by batching inserts.
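For completeness, this is the pairing of the two keywords I have in mind when suggesting the cross-reference (same df and engine as in the snippet above; the exact chunk size is arbitrary):

# Rows are split into batches of `chunksize`; with method="multi" each batch
# becomes one multi-values INSERT instead of thousands of single-row statements.
df.to_sql("some_table", engine, if_exists="replace", method="multi", chunksize=500)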

What do you think about the following docs change?

chunksize : int, optional
Specify the number of rows in each batch to be written to the database connection at a time. By default, all rows will be written at once. Also see the method keyword.

@deekapila
Contributor

take
