Skip to content

add __len__ and __getitem__ to Column #140

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
May 4, 2023
24 changes: 23 additions & 1 deletion spec/API_specification/dataframe_api/column_object.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,24 @@
from typing import NoReturn

class Column:
pass
def __len__(self) -> int:
"""
Return the number of rows.
"""

def __getitem__(self, row) -> object:
"""
Get the element at row index `row`.
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the expected type of "row" here? An integer?

I wonder if we should have more explicit methods than __getitem__ since intuitively people will try to feed both scalars and Sequences / Columns into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, forgot to annotate that one, thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should have more explicit methods than getitem since intuitively people will try to feed both scalars and Sequences / Columns into it.

could you expand please?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically, Pandas and other dataframe libraries have allowed passing scalar-like objects here and getting a scalar-like object in return or passing Sequence-like / Column-like objects here and getting Column-like objects in return. I.E. for Pandas:

import pandas as pd

pds = pd.Series(['a', 'b', 'c', 'd', 'e'])

print(pds[0])  # Scalar-like input --> Scalar-like output
print(pds[[0]])  # Sequence-like input --> Column-like output

For our DataFrame object we have the get_rows method that takes a Sequence-like (should probably take a Column-like) where we could presumably have the same function for our Column object, and then instead of using __getitem__ we could have something like get_element or get_scalar or something like that.

If we wanted to keep the __getitem__ method I would suggest we handle both the Scalar-like and Sequence-like / Column-like cases as opposed to just the Scalar-like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have the get_rows method that takes a Sequence-like (should probably take a Column-like)

yeah, I came across this - I'll open a separate issue

Copy link
Contributor Author

@MarcoGorelli MarcoGorelli Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wanted to keep the getitem method I would suggest we handle both the Scalar-like and Sequence-like / Column-like cases as opposed to just the Scalar-like.

I've pushed a commit but am not overly keen on it, how about Column.get_rows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added another commit with get_rows, better to avoid overloads if we can help it


def __iter__(self) -> NoReturn:
"""
Iterate over elements.

This is intentionally "poisoned" to discourage inefficient code patterns.

Raises
------
NotImplementedError
"""
raise NotImplementedError("'__iter__' is intentionally not implemented.")