REGR?: read_sql no longer supports duplicate column names #53117

jorisvandenbossche · 2023-05-06T09:14:03Z

Probably caused by #50048, and probably related to #52437 (another change in behaviour that might have been caused by the same PR). In essence the change is from using DataFrame.from_records to processing the records into a list of column arrays and then using DataFrame(dict(zip(columns, arrays))). That has slightly different behaviour.

Dummy reproducer for the read_sql change:

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})

from sqlalchemy import create_engine
eng = create_engine("sqlite://")
df.to_sql("test_table", eng, index=False)

pd.read_sql("SELECT a, b, a +1 as a FROM test_table;", eng)

With pandas 1.5 this returns

   a    b  a
0  1  0.1  2
1  2  0.2  3
2  3  0.3  4

with pandas 2.0 this returns

I don't know how much we want to support duplicate column names in read_sql, but it is a change in behaviour, and the new behaviour of just silently ignoring it / dropping some data also isn't ideal IMO.

cc @phofl

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2023-05-06T09:22:44Z

To illustrate with just the construction of the DataFrame:

In [1]: data = [(1, 0.1, 2), (2, 0.2, 3), (3, 0.3, 4)]
   ...: columns = ['a', 'b', 'a']

In [2]: pd.DataFrame.from_records(data, columns)
Out[2]: 
   0    1  2
a  1  0.1  2
b  2  0.2  3
a  3  0.3  4

In [10]: arrays = [np.array([1, 2, 3]), np.array([0.1, 0.2, 0.3]), np.array([2, 3, 4])]

In [11]: pd.DataFrame(dict(zip(columns, arrays)))
Out[11]: 
   a    b
0  2  0.1
1  3  0.2
2  4  0.3

I assume we could relatively easily fix this by using a different constructor for the arrays, like:

In [15]: pd.DataFrame._from_arrays(arrays, columns, None)
Out[15]: 
   a    b  a
0  1  0.1  2
1  2  0.2  3
2  3  0.3  4

Further, the functionality to specify the dtype backend is probably something that could be moved into from_records? (and then the sql code can keep using from_records)

MarcoGorelli · 2023-05-06T10:29:28Z

would it possible/acceptable to just raise on duplicate column names? I'd make the case disallowing that wherever possible (if people have a dataframe with duplicates rows labels and take a tranpose, then OK, they'll get duplicates, but in IO functions I'd have thought it acceptable to prohibit)

phofl · 2023-05-06T10:51:59Z

Can discuss this, but should fix for 2.0.2

z--m-n · 2023-05-11T18:16:35Z

Thanks for solving this.

I don't know how much we want to support duplicate column names in read_sql, but it is a change in behaviour,

would it possible/acceptable to just raise on duplicate column names?

Relevant discussion. But read_sql would then deviate in behaviour from DataFrame.

Meaning, it would be best if read_sql outputs the same number of columns as when replacing

...
pd.read_sql("SELECT a, b, a +1 as a FROM test_table;", eng)

in the dummy reproducer, with

...
exe = eng.execute("SELECT a, b, a +1 as a FROM test_table;")
pd.DataFrame(exe)

which in pandas 2.0.1 returns

   a    b  a
0  1  0.1  2
1  2  0.2  3
2  3  0.3  4

with the duplicate columns intact.

jorisvandenbossche added Regression Functionality that used to work in a prior pandas version IO SQL to_sql, read_sql, read_sql_query labels May 6, 2023

jorisvandenbossche added this to the 2.0.2 milestone May 6, 2023

jorisvandenbossche mentioned this issue May 6, 2023

Made read_postgis raise ValueError if the geom_col is specified twice (current error is cryptic) geopandas/geopandas#2849

Merged

phofl mentioned this issue May 6, 2023

REGR: read_sql dropping duplicated columns #53118

Merged

5 tasks

jorisvandenbossche closed this as completed in #53118 May 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REGR?: read_sql no longer supports duplicate column names #53117

REGR?: read_sql no longer supports duplicate column names #53117

jorisvandenbossche commented May 6, 2023

jorisvandenbossche commented May 6, 2023

MarcoGorelli commented May 6, 2023

phofl commented May 6, 2023

z--m-n commented May 11, 2023

REGR?: read_sql no longer supports duplicate column names #53117

REGR?: read_sql no longer supports duplicate column names #53117

Comments

jorisvandenbossche commented May 6, 2023

jorisvandenbossche commented May 6, 2023

MarcoGorelli commented May 6, 2023

phofl commented May 6, 2023

z--m-n commented May 11, 2023