Inconsistent index between plain DataFrame and read_sql DataFrame #48193

Closed
wants to merge 8 commits into from

Conversation

weikhor
Contributor

@weikhor weikhor commented Aug 22, 2022

@mroeschke mroeschke added the DataFrame DataFrame data structure label Aug 22, 2022
Member

@mroeschke mroeschke left a comment

Would be good to have the sql smoke test from the original issue as well.
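
A minimal sketch of what such a smoke test could look like, assuming an in-memory SQLite database. The table name `t` and the columns `a` and `b` are hypothetical, since the original issue's exact query is not reproduced in this thread. The sketch only prints the index types side by side rather than asserting one, because the result depends on the installed pandas version.

```python
import sqlite3

import pandas as pd

# Hypothetical table and column names; the original issue's query is not shown here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT)")

# Path 1: an empty result set read through read_sql.
sql_df = pd.read_sql("SELECT a, b FROM t", conn)
conn.close()

# Path 2: a plain empty DataFrame with the same columns.
plain_df = pd.DataFrame(columns=["a", "b"])

# Report the index type produced by each path; which type appears
# depends on the pandas version in use.
print("read_sql:", type(sql_df.index).__name__)
print("plain:   ", type(plain_df.index).__name__)
```

In a real test, the two `print` calls would presumably become an assertion that both frames compare equal, once the expected index type is agreed on.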

Member

@phofl phofl left a comment

Do we want this? This seems to be deliberate.

If we do want a RangeIndex, I think the "if" isn't needed at all.

@weikhor weikhor requested a review from phofl August 25, 2022 04:54
Member

@phofl phofl left a comment

I am -1 on this change without deprecation and -0 on this change in general.

data = {1: ["foo"], 2: ["bar"]}
exp = DataFrame(data, columns=["a", "b"])
DataFrame()

These operations create regular Index objects, not a RangeIndex, so this change would cause inconsistencies.

You can also see this from the number of tests you had to change.

@weikhor
Contributor Author

weikhor commented Aug 25, 2022

OK, I will look for another approach.

weikhor added 2 commits August 25, 2022 20:14
@weikhor
Contributor Author

weikhor commented Aug 25, 2022

Hi @phofl, here is what I get when I try the following two cases.

1)

data = {"a": [], "b": []}
exp = pd.DataFrame(data, columns=["a", "b"])
print(exp)
print(exp.index)

Empty DataFrame
Columns: [a, b]
Index: []
RangeIndex(start=0, stop=0, step=1)

2)

data = {1: ["foo"], 2: ["bar"]}
exp = pd.DataFrame(data, columns=["a", "b"])
print(exp)
print(exp.index)

Empty DataFrame
Columns: [a, b]
Index: []
Index([], dtype='object')

On the current main branch, the index types created in the two cases are not the same. Is this expected?

@phofl
Member

phofl commented Aug 25, 2022

Can you create an overview of all possible cases and then add it to the issue? We should aim for consistency if we make a change.

For example, reading an empty CSV should be consistent too.

@weikhor
Contributor Author

weikhor commented Aug 26, 2022

Hi @phofl, the possible cases for an empty DataFrame are as follows, based on the documentation:

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

Parameters
    data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
           Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows
           insertion-order. If a dict contains Series which have an index defined, it is aligned by its index.

1. Iterable

data = []
df = pd.DataFrame(data, columns=["A", "B"])
print(df.index)

Index([], dtype='object')

2. Dict

a) using series

data = {'a': pd.Series([]), 'B': pd.Series([])}
df = pd.DataFrame(data, columns=["A", "B"])
print(df.index)

RangeIndex(start=0, stop=0, step=1)

b) using list

data = {"A": [], "B": []}
df = pd.DataFrame(data)
print(df.index)

RangeIndex(start=0, stop=0, step=1)

c) using array

data = {'A': np.array([]), 'B': np.array([])}
df = pd.DataFrame(data)
print(df.index)

d) based on columns

data = {1: ["foo"], 2: ["bar"]}
df = pd.DataFrame(data, columns=["A", "B"])
print(df.index)

Index([], dtype='object')

3. Read csv

df = pd.read_csv("empty.csv")
print(df.index)

Index([], dtype='object')
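
The cases above can also be collected into one script, so the overview is easy to regenerate on any branch. This is a sketch under a few assumptions: the header-only CSV text stands in for `empty.csv`, whose contents are not shown; the Series are given an explicit `dtype=object`, which differs from the `pd.Series([])` call above; and `cases` and `index_overview` are hypothetical names. No expected output is listed, because the index types depend on the pandas version.

```python
import io

import numpy as np
import pandas as pd

# Each entry builds an empty DataFrame through a different construction path.
# The header-only CSV text is an assumption about what "empty.csv" contained.
cases = {
    "iterable": lambda: pd.DataFrame([], columns=["A", "B"]),
    "dict of Series": lambda: pd.DataFrame(
        {"A": pd.Series([], dtype=object), "B": pd.Series([], dtype=object)}
    ),
    "dict of lists": lambda: pd.DataFrame({"A": [], "B": []}),
    "dict of arrays": lambda: pd.DataFrame({"A": np.array([]), "B": np.array([])}),
    "columns not in dict": lambda: pd.DataFrame(
        {1: ["foo"], 2: ["bar"]}, columns=["A", "B"]
    ),
    "read_csv": lambda: pd.read_csv(io.StringIO("A,B\n")),
    "no arguments": lambda: pd.DataFrame(),
}


def index_overview(builders):
    """Map each case label to the class name of the resulting row index."""
    return {label: type(build().index).__name__ for label, build in builders.items()}


overview = index_overview(cases)
for label, kind in overview.items():
    print(f"{label:<20} {kind}")
```

Running the same script before and after a change gives a quick diff of which construction paths are affected.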

@weikhor
Contributor Author

weikhor commented Aug 26, 2022

Based on https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html,

index  : Index or array-like
         Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.

According to this, the default is a RangeIndex when no index is provided.
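
Until the constructors agree, calling code that needs the documented default can normalize the index itself. The helper below is a hypothetical caller-side sketch, not part of pandas and not the change proposed in this pull request. It is only appropriate when the existing row index carries no information.

```python
import pandas as pd


def with_range_index(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose row index is a RangeIndex of the same length.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    out = df.copy()
    # Replace whatever index the constructor produced with a default RangeIndex.
    out.index = pd.RangeIndex(len(out))
    return out


# The "columns not in dict" case from the discussion, normalized.
empty = with_range_index(pd.DataFrame({1: ["foo"], 2: ["bar"]}, columns=["a", "b"]))
print(type(empty.index).__name__)  # RangeIndex
```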

@mroeschke
Member

Thanks for your investigation so far @weikhor! I think your findings also intersect with #47959, so it would be good to document them there as well.

I think this topic needs more discussion in the issues before moving forward with a fix, so closing. But thanks for your effort here so far.

@mroeschke mroeschke closed this Aug 26, 2022
@weikhor weikhor deleted the 47608_inconsistent_index branch August 28, 2022 03:39

Successfully merging this pull request may close these issues.

BUG: inconsistent index between plain DataFrame and read_sql DataFrame
3 participants