I can't export a very large table #181

Closed
M4R4G0N opened this issue Jul 25, 2023 · 16 comments
Labels
bug Something isn't working

Comments

@M4R4G0N

M4R4G0N commented Jul 25, 2023

I have a very large table with more than 100 thousand rows, and when I try to download it using this method, it gives an error. I replaced the table name because it doesn't matter. Apparently the code is not prepared to handle a result set this large.

However, when I use select * from table LIMIT 5000 it works.

AttributeError                            Traceback (most recent call last)
<ipython-input-39-f3c65c38f1a7> in <module>
      9     print(datetime.datetime.now())
     10     with connection.cursor() as cursor:
---> 11         cursor.execute('SELECT * FROM table')
     12         result = cursor.fetchall()
     13         lista = []

c:\users\\appdata\local\programs\python\python37\lib\site-packages\databricks\sql\client.py in execute(self, operation, parameters)
    508             self.thrift_backend,
    509             self.buffer_size_bytes,
--> 510             self.arraysize,
    511         )
    512 

c:\users\\appdata\local\programs\python\python37\lib\site-packages\databricks\sql\client.py in __init__(self, connection, execute_response, thrift_backend, result_buffer_size_bytes, arraysize)
    816         else:
    817             # In this case, there are results waiting on the server so we fetch now for simplicity
--> 818             self._fill_results_buffer()
    819 
    820     def __iter__(self):

c:\users\\appdata\local\programs\python\python37\lib\site-packages\databricks\sql\client.py in _fill_results_buffer(self)
    835             lz4_compressed=self.lz4_compressed,
    836             arrow_schema_bytes=self._arrow_schema_bytes,
--> 837             description=self.description,
    838         )
    839         self.results = results

c:\users\\appdata\local\programs\python\python37\lib\site-packages\databricks\sql\thrift_backend.py in fetch_results(self, op_handle, max_rows, max_bytes, expected_row_start_offset, lz4_compressed, arrow_schema_bytes, description)
    932 
    933         queue = ResultSetQueueFactory.build_queue(
--> 934             row_set_type=resp.resultSetMetadata.resultFormat,
    935             t_row_set=resp.results,
    936             arrow_schema_bytes=arrow_schema_bytes,
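
For reference, a minimal repro sketch reconstructed from the partial code visible in the traceback above (server_hostname, http_path, and access_token are placeholders, and the real table name was redacted):

import datetime
from databricks import sql

connection = sql.connect(
    server_hostname=server_hostname,  # placeholder connection details
    http_path=http_path,
    access_token=access_token,
)

print(datetime.datetime.now())
with connection.cursor() as cursor:
    cursor.execute('SELECT * FROM table')  # fails on the full table
    result = cursor.fetchall()             # works with LIMIT 5000
    lista = []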
@susodapop
Contributor

Your issue description doesn't include the text of any Python exception. Can you post the actual traceback?

@M4R4G0N
Author

M4R4G0N commented Jul 26, 2023

Good morning.

The traceback is:

AttributeError: 'NoneType' object has no attribute 'resultFormat'

I have over 20,000 rows.

@susodapop
Contributor

susodapop commented Jul 26, 2023

Thanks. What version of databricks-sql-connector are you using? If you downgrade to a previous release, does the download work?

@M4R4G0N
Author

M4R4G0N commented Jul 26, 2023

The version is 2.8.0.

@williamjacksn

I am also having the same problem with version 2.8.0:

Traceback (most recent call last):
  ...
  File "/home/python/databricks-sql-scripts/dbx/cnx.py", line 15, in yield_rows
    cur.execute(sql, params)
  File "/home/python/venv/lib/python3.11/site-packages/databricks/sql/client.py", line 505, in execute
    self.active_result_set = ResultSet(
                             ^^^^^^^^^^
  File "/home/python/venv/lib/python3.11/site-packages/databricks/sql/client.py", line 818, in __init__
    self._fill_results_buffer()
  File "/home/python/venv/lib/python3.11/site-packages/databricks/sql/client.py", line 830, in _fill_results_buffer
    results, has_more_rows = self.thrift_backend.fetch_results(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/python/venv/lib/python3.11/site-packages/databricks/sql/thrift_backend.py", line 934, in fetch_results
    row_set_type=resp.resultSetMetadata.resultFormat,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'resultFormat'

I rolled back to 2.7.0 and the problem went away.
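
(For anyone else hitting this, pinning the earlier release is just:)

pip install databricks-sql-connector==2.7.0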

@M4R4G0N
Author

M4R4G0N commented Jul 27, 2023

Thank you, William.

@M4R4G0N M4R4G0N closed this as completed Jul 27, 2023
@noctuid

noctuid commented Aug 2, 2023

@maragondestiny Could you reopen this issue, since there is a problem in the latest 2.8.0? I also see this when trying to create a table with to_sql on 2.8.0 but not on 2.7.0.

@susodapop susodapop reopened this Aug 2, 2023
@susodapop
Contributor

I've reopened. Downgrading is a decent workaround but if this is a regression then we need to solve it going forward.

@noctuid

noctuid commented Aug 4, 2023

I thought I was still seeing this issue in 2.7.0, but never mind: I had switched back to 2.8.0. It's probably worth noting that it does not always happen for me, though, and it happens for many different types of queries.

@noctuid

noctuid commented Aug 18, 2023

What's the priority of this? My team would like to use some functionality that does not work in 2.7.0, but this issue makes the newer versions unusable. Let me know if there is any other information that would be helpful.

@ayush-shah

Is this still an issue in 2.9.3?

@euanmacinnes

I am still seeing this issue in 3.1.1.

@kravets-levko kravets-levko added the bug Something isn't working label Apr 17, 2024
@castelo-software

castelo-software commented Jun 12, 2024

The issue is indeed still present in v3.1.1. It seems to be caused by the Cloud Fetch download manager (#146), which was added in v2.8.0.

As far as I can see, the download timeout is hard-coded to 60 seconds and cannot be configured from outside the library, so larger datasets will always result in the is_file_download_successful method returning False.

The ResultFileDownloadManager then returns None as the result of the query, assuming that the download will be retried. It never is, though; instead the caller simply receives an empty list with no indication that anything went wrong.

A retry would not fix the issue either, since some downloads might always take longer than 60 seconds. Instead, callers should be able to set the accepted timeout for their queries, and if the timeout is hit, an exception should be raised instead of simply logging to debug and returning an empty list.
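
As an illustration of that proposal (all names here are hypothetical, not the connector's actual internals), the download manager would surface the failure instead of swallowing it:

import logging

logger = logging.getLogger(__name__)

class CloudFetchDownloadTimeout(Exception):
    """Hypothetical error type for a result file that never finished downloading."""

def get_downloaded_file(handler, timeout_seconds):
    # Sketch only: propagate the timeout instead of silently returning None,
    # so callers never see a bogus empty result set.
    if not handler.is_file_download_successful():
        logger.error("cloud fetch download exceeded %s seconds", timeout_seconds)
        raise CloudFetchDownloadTimeout(
            f"result file download did not complete within {timeout_seconds}s"
        )
    return handler.result_file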

Temporary workaround

Downgrading to v2.7.0 is not necessary! Instead, it's possible to simply disable Cloud Fetch when creating the client:

from databricks import sql

connection = sql.connect(
    server_hostname=server_hostname,
    http_path=http_path,
    access_token=access_token,
    use_cloud_fetch=False,
)
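
With use_cloud_fetch=False, the connector fetches result batches inline over the Thrift connection instead of downloading pre-signed result files from cloud storage, so the download manager (and its 60-second timeout) is never involved. The tradeoff is likely lower throughput on very large result sets, but correct results beat fast empty ones.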

@susodapop
Contributor

A retry would not fix the issue either, since some downloads might always take longer than 60 seconds. Instead, callers should be able to set the accepted timeout for their queries, and if the timeout is hit, an exception should be raised instead of simply logging to debug and returning an empty list.

Bingo. I'd strongly advise putting this into a PR and tagging @benc-db or @kravets-levko for review.

It's strange that a cloudfetch download could exceed sixty seconds. I wonder if there is a low-bandwidth connection to the cloud provider where the cloudfetch chunk is stored? These file sizes are quite small. But either way, the connector should give more actionable feedback rather than failing silently.

Currently databricks-sql-connector doesn't unify the retry behaviour between Thrift requests (like ExecuteStatement) and Cloud Fetch requests (standard HTTP GET requests to a pre-signed cloud storage provider URL). The two features were developed independently and handle their own retry behaviour. Ultimately it makes sense to give the same level of retry configurability to cloud fetch that Thrift requests receive. Whether they should be independently configurable is more of a design decision I'm not prepared to speculate about.
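
For illustration, here is a standalone sketch (not the connector's API; the function name and defaults are assumptions) of the kind of caller-configurable retry policy a cloud fetch GET could adopt, built on urllib3's Retry:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def fetch_presigned_url(url, timeout_seconds=60.0, max_retries=3):
    # Sketch only: a retrying GET against a pre-signed cloud storage URL,
    # with the timeout and retry budget exposed to the caller.
    retry = Retry(
        total=max_retries,
        backoff_factor=1.0,  # sleeps ~1s, 2s, 4s between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    response = session.get(url, timeout=timeout_seconds)
    response.raise_for_status()  # surface failures instead of returning None
    return response.content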

@castelo-software

castelo-software commented Jun 15, 2024

Bingo. I'd strongly advise putting this into a PR and tagging @benc-db or @kravets-levko for review.

I can see that there has been an open PR since February which should fix Cloud Fetch and also allow the timeout to be configured via an environment variable.

@andrefurlan-db, are you still working on it?

@kravets-levko
Contributor

Hello here 👋 We just released v3.3.0, which includes a refactoring of the CloudFetch-related code. Please give it a try and let me know whether it helped with your issues (remember to enable CloudFetch via use_cloud_fetch=True). If you still see any issues, please enable debug logging (see #383 (comment)) and share the log output. Thank you!
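
If the linked comment isn't handy, standard Python logging is enough to capture the connector's debug output (the logger namespace is assumed to follow the package path):

import logging
import sys

logging.basicConfig(
    stream=sys.stderr,
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
# Assumed logger namespace for the connector's modules.
logging.getLogger("databricks.sql").setLevel(logging.DEBUG)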
