
Unpin pandas #342


Closed
dhirschfeld opened this issue Feb 1, 2024 · 12 comments · Fixed by #416

Comments

@dhirschfeld
Contributor

I would like to be able to use this library with the latest pandas version. Currently pandas is pinned to <2.2.0:

pandas = [
{ version = ">=1.2.5,<2.2.0", python = ">=3.8" }
]

It would be good to remove this restriction.

@dhirschfeld
Contributor Author

The pin was added in:

To fix the issue described in:

...but that just avoids one problem whilst causing another: this library can't be used with the latest pandas :/

@dhirschfeld
Contributor Author

I'm opening this issue to track any progress towards compatibility with the latest pandas version.

@dhirschfeld
Contributor Author

Bump! I would like to upgrade to the latest version but am stuck on 3.0.1 because of this pin 😔

@benc-db
Collaborator

benc-db commented Mar 27, 2024

Does 3.0.1 work with latest pandas? That would be an interesting data point.

@dhirschfeld
Contributor Author

Does 3.0.1 work with latest pandas? That would be an interesting data point.

I've been using 3.0.1 in combination with pandas 2.2.2 with no issues:

❯ pip list | rg 'pandas|databricks'
databricks-connect              14.3.1
databricks-sdk                  0.20.0
databricks-sql-connector        3.0.1
pandas                          2.2.2

...but that's apparently only because I don't query any all-integer data sources.
Running:

with engine.connect() as conn:
    res = conn.execute(sa.text("select 1")).scalar_one()

gives:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

@dhirschfeld
Contributor Author

It seems like it doesn't like assigning a None into an integer array:

> /opt/python/envs/dev310/lib/python3.10/site-packages/pandas/core/internals/managers.py(1703)as_array()
   1701             pass
   1702         else:
-> 1703             arr[isna(arr)] = na_value
   1704 
   1705         return arr.transpose()

ipdb>  arr
array([[1]], dtype=int32)

ipdb>  isna(arr)
array([[False]])

ipdb>  na_value

ipdb>  na_value is None
True
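The failure can be reproduced with plain numpy, independent of pandas: even when the boolean mask selects no elements, numpy still tries to cast the assigned value to the array's dtype before applying the mask. This is a minimal sketch of what `arr[isna(arr)] = na_value` does with the values seen in the debugger above:

```python
import numpy as np

arr = np.array([[1]], dtype=np.int32)   # same as the arr seen in the debugger
mask = np.array([[False]])              # isna(arr): no missing values at all

# numpy coerces the RHS to int32 before applying the (empty) mask,
# so assigning None raises even though nothing would actually change
try:
    arr[mask] = None
except TypeError as exc:
    print(type(exc).__name__, exc)
```
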

If we go up the stack we can see we get type errors if we try to assign anything other than an integer:

> /opt/python/envs/dev310/lib/python3.10/site-packages/databricks/sql/client.py(1149)_convert_arrow_table()
   1147         )
   1148 
-> 1149         res = df.to_numpy(na_value=None)
   1150         return [ResultRow(*v) for v in res]
   1151 

ipdb>  df.to_numpy(na_value=None)
*** TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

ipdb>  df.to_numpy(na_value=float('NaN'))
*** ValueError: cannot convert float NaN to integer

ipdb>  df.to_numpy(na_value=-99)
array([[1]], dtype=int32)

Casting to object before assigning does seem to work:

ipdb>  df.astype(object).to_numpy(na_value=None)
array([[1]], dtype=object)
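A minimal sketch of that workaround, assuming a nullable integer column that actually contains a missing value: once the frame is cast to object dtype, the `na_value=None` assignment lands safely because an object array can hold `None`.

```python
import pandas as pd

# Assumed example frame: nullable Int32 column with one missing value
df = pd.DataFrame({"a": pd.array([1, None], dtype="Int32")})

# object dtype can hold None, so the na_value assignment succeeds
out = df.astype(object).to_numpy(na_value=None)
print(out.dtype)   # object
print(out[1, 0])   # None
```
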

@dhirschfeld
Contributor Author

The problematic function:

def _convert_arrow_table(self, table):
    column_names = [c[0] for c in self.description]
    ResultRow = Row(*column_names)
    if self.connection.disable_pandas is True:
        return [
            ResultRow(*[v.as_py() for v in r]) for r in zip(*table.itercolumns())
        ]

    # Need to use nullable types, as otherwise type can change when there are missing values.
    # See https://arrow.apache.org/docs/python/pandas.html#nullable-types
    # NOTE: This api is experimental https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
    dtype_mapping = {
        pyarrow.int8(): pandas.Int8Dtype(),
        pyarrow.int16(): pandas.Int16Dtype(),
        pyarrow.int32(): pandas.Int32Dtype(),
        pyarrow.int64(): pandas.Int64Dtype(),
        pyarrow.uint8(): pandas.UInt8Dtype(),
        pyarrow.uint16(): pandas.UInt16Dtype(),
        pyarrow.uint32(): pandas.UInt32Dtype(),
        pyarrow.uint64(): pandas.UInt64Dtype(),
        pyarrow.bool_(): pandas.BooleanDtype(),
        pyarrow.float32(): pandas.Float32Dtype(),
        pyarrow.float64(): pandas.Float64Dtype(),
        pyarrow.string(): pandas.StringDtype(),
    }

    # Need to rename columns, as the to_pandas function cannot handle duplicate column names
    table_renamed = table.rename_columns([str(c) for c in range(table.num_columns)])
    df = table_renamed.to_pandas(
        types_mapper=dtype_mapping.get,
        date_as_object=True,
        timestamp_as_object=True,
    )

    res = df.to_numpy(na_value=None)
    return [ResultRow(*v) for v in res]

@dhirschfeld
Contributor Author

I can work around the issue by disabling pandas:

with engine.connect() as conn:
    cursor = conn.connection.cursor()
    cursor.connection.disable_pandas = True
    res = cursor.execute("select 1").fetchall()
>>> res
[Row(1=1)]

...but obviously the casting to numpy needs to be fixed.

@dhirschfeld
Contributor Author

Probably casting to object before assigning a None value is the right fix.
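A hypothetical sketch of what that fix could look like at the tail of `_convert_arrow_table` (the helper name `rows_from_frame` is made up for illustration; it stands in for the `res = df.to_numpy(na_value=None)` step):

```python
import pandas as pd

def rows_from_frame(df: pd.DataFrame) -> list:
    # Cast to object first so na_value=None can be assigned into any
    # column, including all-integer ones that would otherwise reject None
    res = df.astype(object).to_numpy(na_value=None)
    return [tuple(v) for v in res]

print(rows_from_frame(pd.DataFrame({"a": pd.array([1, None], dtype="Int32")})))
```
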

@diego-jd

I second this. I cannot use pd.read_sql_query() because of this requirement.

Also, it would be good to remove the distutils dependency.

@Aryik

Aryik commented Jul 15, 2024

@dhirschfeld any idea when this is going to make it to a release? Looks like it didn't go into 3.2.0 as I am unable to poetry install databricks-sql-connector in a project that includes pandas 2.2.2

@dhirschfeld
Contributor Author

I'm not a maintainer here so I couldn't say.

I was hoping to do some more testing at some point, but haven't found the time.
