
Unpin pandas #342


Closed
dhirschfeld opened this issue Feb 1, 2024 · 12 comments · Fixed by #416

Comments

@dhirschfeld
Contributor

I would like to be able to use this library with the latest pandas version. Currently pandas is pinned to <2.2.0:

pandas = [
{ version = ">=1.2.5,<2.2.0", python = ">=3.8" }
]

It would be good to remove this restriction.

@dhirschfeld
Contributor Author

The pin was added in:

To fix the issue described in:

...but that just avoids one problem whilst causing another: this library can't be used with the latest pandas :/

@dhirschfeld
Contributor Author

I'm opening this issue to track any progress towards compatibility with the latest pandas version.

@dhirschfeld
Contributor Author

Bump! I would like to upgrade to the latest version but am stuck on 3.0.1 because of this pin 😔

@benc-db
Collaborator

benc-db commented Mar 27, 2024

Does 3.0.1 work with latest pandas? That would be an interesting data point.

@dhirschfeld
Contributor Author

Does 3.0.1 work with latest pandas? That would be an interesting data point.

I've been using 3.0.1 in combination with pandas 2.2.2 with no issues:

❯ pip list | rg 'pandas|databricks'
databricks-connect              14.3.1
databricks-sdk                  0.20.0
databricks-sql-connector        3.0.1
pandas                          2.2.2

...but that's apparently only because I don't query any all-integer data sources.
Running:

with engine.connect() as conn:
    res = conn.execute(sa.text("select 1")).scalar_one()

gives:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

@dhirschfeld
Contributor Author

It seems like it doesn't like assigning a None into an integer array:

> /opt/python/envs/dev310/lib/python3.10/site-packages/pandas/core/internals/managers.py(1703)as_array()
   1701             pass
   1702         else:
-> 1703             arr[isna(arr)] = na_value
   1704 
   1705         return arr.transpose()

ipdb>  arr
array([[1]], dtype=int32)

ipdb>  isna(arr)
array([[False]])

ipdb>  na_value

ipdb>  na_value is None
True
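The failure can be reproduced with plain numpy, independent of pandas: even when the boolean mask selects no elements, numpy still tries to cast the assigned value to the array's dtype before applying the mask. This is a minimal sketch of what `arr[isna(arr)] = na_value` does with the values seen in the debugger above:

```python
import numpy as np

arr = np.array([[1]], dtype=np.int32)   # same as the arr seen in the debugger
mask = np.array([[False]])              # isna(arr): no missing values at all

# numpy coerces the RHS to int32 before applying the (empty) mask,
# so assigning None raises even though nothing would actually change
try:
    arr[mask] = None
except TypeError as exc:
    print(type(exc).__name__, exc)
```
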

If we go up the stack we can see we get type errors if we try to assign anything other than an integer:

> /opt/python/envs/dev310/lib/python3.10/site-packages/databricks/sql/client.py(1149)_convert_arrow_table()
   1147         )
   1148 
-> 1149         res = df.to_numpy(na_value=None)
   1150         return [ResultRow(*v) for v in res]
   1151 

ipdb>  df.to_numpy(na_value=None)
*** TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

ipdb>  df.to_numpy(na_value=float('NaN'))
*** ValueError: cannot convert float NaN to integer

ipdb>  df.to_numpy(na_value=-99)
array([[1]], dtype=int32)

Casting to object before assigning does seem to work:

ipdb>  df.astype(object).to_numpy(na_value=None)
array([[1]], dtype=object)
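A minimal sketch of that workaround, assuming a nullable integer column that actually contains a missing value: once the frame is cast to object dtype, the `na_value=None` assignment lands safely because an object array can hold `None`.

```python
import pandas as pd

# Assumed example frame: nullable Int32 column with one missing value
df = pd.DataFrame({"a": pd.array([1, None], dtype="Int32")})

# object dtype can hold None, so the na_value assignment succeeds
out = df.astype(object).to_numpy(na_value=None)
print(out.dtype)   # object
print(out[1, 0])   # None
```
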

@dhirschfeld
Contributor Author

The problematic function:

def _convert_arrow_table(self, table):
    column_names = [c[0] for c in self.description]
    ResultRow = Row(*column_names)
    if self.connection.disable_pandas is True:
        return [
            ResultRow(*[v.as_py() for v in r]) for r in zip(*table.itercolumns())
        ]

    # Need to use nullable types, as otherwise type can change when there are missing values.
    # See https://arrow.apache.org/docs/python/pandas.html#nullable-types
    # NOTE: This api is experimental https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
    dtype_mapping = {
        pyarrow.int8(): pandas.Int8Dtype(),
        pyarrow.int16(): pandas.Int16Dtype(),
        pyarrow.int32(): pandas.Int32Dtype(),
        pyarrow.int64(): pandas.Int64Dtype(),
        pyarrow.uint8(): pandas.UInt8Dtype(),
        pyarrow.uint16(): pandas.UInt16Dtype(),
        pyarrow.uint32(): pandas.UInt32Dtype(),
        pyarrow.uint64(): pandas.UInt64Dtype(),
        pyarrow.bool_(): pandas.BooleanDtype(),
        pyarrow.float32(): pandas.Float32Dtype(),
        pyarrow.float64(): pandas.Float64Dtype(),
        pyarrow.string(): pandas.StringDtype(),
    }

    # Need to rename columns, as the to_pandas function cannot handle duplicate column names
    table_renamed = table.rename_columns([str(c) for c in range(table.num_columns)])
    df = table_renamed.to_pandas(
        types_mapper=dtype_mapping.get,
        date_as_object=True,
        timestamp_as_object=True,
    )

    res = df.to_numpy(na_value=None)
    return [ResultRow(*v) for v in res]

@dhirschfeld
Contributor Author

I can work around the issue by disabling pandas:

with engine.connect() as conn:
    cursor = conn.connection.cursor()
    cursor.connection.disable_pandas = True
    res = cursor.execute("select 1").fetchall()
>>> res
[Row(1=1)]

...but obviously the casting to numpy needs to be fixed.

@dhirschfeld
Contributor Author

Probably casting to object before assigning a None value is the right fix.
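A hypothetical sketch of what that fix could look like at the tail of `_convert_arrow_table` (the helper name `rows_from_frame` is made up for illustration; it stands in for the `res = df.to_numpy(na_value=None)` step):

```python
import pandas as pd

def rows_from_frame(df: pd.DataFrame) -> list:
    # Cast to object first so na_value=None can be assigned into any
    # column, including all-integer ones that would otherwise reject None
    res = df.astype(object).to_numpy(na_value=None)
    return [tuple(v) for v in res]

print(rows_from_frame(pd.DataFrame({"a": pd.array([1, None], dtype="Int32")})))
```
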

@diego-jd

I second this. I cannot use pd.read_sql_query() because of this requirement.

Also, it would be good to remove the distutils dependency.

@Aryik

Aryik commented Jul 15, 2024

@dhirschfeld any idea when this is going to make it to a release? Looks like it didn't go into 3.2.0 as I am unable to poetry install databricks-sql-connector in a project that includes pandas 2.2.2

@dhirschfeld
Contributor Author

I'm not a maintainer here so I couldn't say.

I was hoping to do some more testing at some point, but haven't found the time.
