Inconsistent index between plain DataFrame and read_sql DataFrame #48193

weikhor · 2022-08-22T12:48:05Z

closes BUG: inconsistent index between plain DataFrame and read_sql DataFrame #47608
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

mroeschke

Would be good to have the sql smoke test from the original issue as well.

phofl

Do we want this? This seems to be deliberate.

I think the if check isn't needed at all if we want a RangeIndex?

phofl

I am -1 on this change without deprecation and -0 on this change in general.

data = {1: ["foo"], 2: ["bar"]}
exp = DataFrame(data, columns=["a", "b"])

DataFrame()

These operations create regular Index objects, not a RangeIndex, so this change would cause inconsistencies.

You can also see this in the number of tests you had to change for this PR.

weikhor · 2022-08-25T12:04:57Z

OK, I will look for another approach.

weikhor · 2022-08-25T14:58:25Z

Hi @phofl, I tried the following two cases:

data = {"a": [], "b": []}
exp = pd.DataFrame(data, columns=["a", "b"])
print(exp)
print(exp.index)

Empty DataFrame
Columns: [a, b]
Index: []
RangeIndex(start=0, stop=0, step=1)

data = {1: ["foo"], 2: ["bar"]}
exp = pd.DataFrame(data, columns=["a", "b"])
print(exp)
print(exp.index)

Empty DataFrame
Columns: [a, b]
Index: []
Index([], dtype='object')

On the current main branch, the two cases produce different index types. Is this expected?
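
The comparison can be condensed into a small check. This is a minimal sketch and not part of the PR: the helper name index_kind is hypothetical, and the class it reports depends on the installed pandas version, so no particular result is assumed here.

```python
import pandas as pd


def index_kind(df: pd.DataFrame) -> str:
    """Return the class name of the frame's index, e.g. 'Index' or 'RangeIndex'."""
    return type(df.index).__name__


# Case 1: dict keys match the requested columns, values are empty lists.
matching = pd.DataFrame({"a": [], "b": []}, columns=["a", "b"])

# Case 2: dict keys do not match the requested columns, so no data is selected.
non_matching = pd.DataFrame({1: ["foo"], 2: ["bar"]}, columns=["a", "b"])

print(index_kind(matching), index_kind(non_matching))
print("same index class:", index_kind(matching) == index_kind(non_matching))
```

Both frames are empty with the same columns, so any difference in the printed class names comes only from the construction path.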

phofl · 2022-08-25T20:37:26Z

Can you create an overview of all possible cases and then add it to the issue? We should aim for consistency if we make a change

For example, reading an empty CSV should be consistent too.

weikhor · 2022-08-26T17:18:27Z

Hi @phofl, the possible ways of creating an empty DataFrame are listed below. I based them on the documentation:

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

Parameters
    data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
           Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order.
           If a dict contains Series which have an index defined, it is aligned by its index.

Iterable

data = []
df = pd.DataFrame(data, columns=["A", "B"])
print(df.index)

Index([], dtype='object')

Dict

a) using series

data = {'a': pd.Series([]), 'B': pd.Series([])}
df = pd.DataFrame(data, columns=["A", "B"])
print(df.index)

RangeIndex(start=0, stop=0, step=1)

b) using list

data = {"A": [], "B": []}
df = pd.DataFrame(data)
print(df.index)

RangeIndex(start=0, stop=0, step=1)

c) using array

data = {'A': np.array([]), 'B': np.array([])}
df = pd.DataFrame(data)
print(df.index)

d) based on columns

data = {1: ["foo"], 2: ["bar"]}
df = pd.DataFrame(data, columns=["A", "B"])
print(df.index)

Index([], dtype='object')

Read csv

df = pd.read_csv("empty.csv")
print(df.index)

Index([], dtype='object')
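
The cases above can be gathered into one script so that the index classes are compared side by side. This is a rough sketch under some assumptions: empty.csv is taken to contain only a header row (an in-memory buffer stands in for the file), the Series case uses matching keys and an explicit dtype to avoid a dtype warning, and the labels are made up for illustration. The printed classes depend on the pandas version in use.

```python
import io

import numpy as np
import pandas as pd

# Assumption: "empty.csv" holds only a header row; a buffer stands in for the file.
header_only_csv = io.StringIO("A,B\n")

cases = {
    "empty list + columns": pd.DataFrame([], columns=["A", "B"]),
    "dict of empty Series": pd.DataFrame(
        {"A": pd.Series([], dtype="float64"), "B": pd.Series([], dtype="float64")}
    ),
    "dict of empty lists": pd.DataFrame({"A": [], "B": []}),
    "dict of empty arrays": pd.DataFrame({"A": np.array([]), "B": np.array([])}),
    "non-matching dict keys + columns": pd.DataFrame(
        {1: ["foo"], 2: ["bar"]}, columns=["A", "B"]
    ),
    "header-only CSV": pd.read_csv(header_only_csv),
}

# Map each construction path to the class name of the resulting index.
summary = {label: type(df.index).__name__ for label, df in cases.items()}
for label, kind in summary.items():
    print(f"{label:35s} {kind}")
```

If every row prints the same class, the construction paths are consistent for that pandas version; mixed output reproduces the inconsistency discussed here.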

weikhor · 2022-08-26T17:36:25Z

Based on https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html,

index  : Index or array-like
         Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index 
         provided.

So, according to the documentation, the default is a RangeIndex when no index is provided.
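
To show why the distinction matters in practice, here is a small sketch (not taken from the PR) of a class-sensitive comparison between an empty object Index and an empty RangeIndex. It uses pd.testing.assert_index_equal with exact=True; the variable names are illustrative only.

```python
import pandas as pd

object_index = pd.Index([], dtype="object")
range_index = pd.RangeIndex(0)

# Both are empty, but they are instances of different classes.
print(type(object_index).__name__, type(range_index).__name__)

# exact=True makes the comparison class-sensitive, so it raises for this pair.
try:
    pd.testing.assert_index_equal(object_index, range_index, exact=True)
    strict_match = True
except AssertionError:
    strict_match = False

print("strict match:", strict_match)
```

A test that compares indexes this strictly would need updating whenever the default index class of an empty frame changes.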

mroeschke · 2022-08-26T18:51:51Z

Thanks for your investigation so far @weikhor! I think your findings also intersect with #47959 so would be good to document your findings there as well.

I think this topic needs more discussion in the issues before moving forward with a fix, so closing. But thanks for your effort here so far.

weikhor added 2 commits August 22, 2022 20:46

add test

e85a4f8

add test

29c7c12

mroeschke added the DataFrame label (DataFrame data structure) Aug 22, 2022

mroeschke requested changes Aug 22, 2022

phofl reviewed Aug 22, 2022

Khor Chean Wei added 4 commits August 23, 2022 20:21

Merge branch 'pandas-dev:main' into main

cb013a6

add

285e9e3

add gh

3cd0f6a

Merge branch 'main' into 47608_inconsistent_index

191936c

weikhor requested a review from phofl August 25, 2022 04:54

phofl requested changes Aug 25, 2022

weikhor added 2 commits August 25, 2022 20:14

revert

55f8960

test

6339cf6

mroeschke closed this Aug 26, 2022

weikhor deleted the 47608_inconsistent_index branch August 28, 2022 03:39
