ENH: add Series.info #31796

Closed
wants to merge 71 commits into from
71 commits
2b1e5fc
first draft
Feb 7, 2020
a4ad077
add whatsnew
Feb 7, 2020
c7bfb94
docstring sharing
Feb 7, 2020
01fd802
wip
Feb 7, 2020
abbae9a
add series tests
Feb 19, 2020
1a474fe
formatting
Feb 19, 2020
b30ce1b
formatting
Feb 19, 2020
6d8c765
remove old file
Feb 19, 2020
99411e4
clean
Feb 19, 2020
4651bd7
add test
Feb 19, 2020
7de4703
add test
Feb 19, 2020
99472fd
isort
Feb 19, 2020
2902fe7
remove test
Feb 19, 2020
8b8adfa
Merge remote-tracking branch 'upstream/master' into series-info
MarcoGorelli Feb 23, 2020
c6d8a76
use isinstance abcdataframe, disallow max_cols for series.info
MarcoGorelli Feb 23, 2020
d0b2e1f
refactor
MarcoGorelli Feb 23, 2020
2225810
aint no autoformatter gonna unnecessarily split my strings
MarcoGorelli Feb 23, 2020
8c6c6f5
isort
MarcoGorelli Feb 23, 2020
8afcb82
Merge branch 'master' into series-info
MarcoGorelli Feb 26, 2020
71260f3
Merge remote-tracking branch 'upstream/master' into series-info
MarcoGorelli Apr 18, 2020
127f84f
fix failing tests due to refactoring, merge conflicts
MarcoGorelli Apr 18, 2020
acae58f
Merge branch 'master' into series-info
MarcoGorelli Apr 20, 2020
9654198
resolve conflicts
MarcoGorelli Apr 20, 2020
c1006a7
replace appender with doc
MarcoGorelli Apr 20, 2020
27e45e1
Merge remote-tracking branch 'upstream/master' into series-info
MarcoGorelli Apr 22, 2020
3592e8e
indent series.info subs
MarcoGorelli Apr 22, 2020
af771e6
revert deleted line
MarcoGorelli Apr 22, 2020
317a148
fix indentation in doctests
MarcoGorelli Apr 22, 2020
5082bc5
Merge remote-tracking branch 'upstream/master' into series-info
MarcoGorelli May 12, 2020
ae0065b
reuse col_count
MarcoGorelli May 12, 2020
c36d4c4
reorder to reduce diff size
MarcoGorelli May 12, 2020
751d346
help mypy
MarcoGorelli May 16, 2020
631d914
aftermentioned 'help' should only be applied for DataFrame case
MarcoGorelli May 16, 2020
23bd173
add docstring to _get_ids_and_dtypes
MarcoGorelli May 16, 2020
304f445
correct return type of _get_ids_and_dtypes, as in Series case dtypes …
MarcoGorelli May 19, 2020
a2d6e43
return Series for dtypes in all cases
MarcoGorelli May 19, 2020
f33f0df
black bug
MarcoGorelli May 19, 2020
05c9091
Merge remote-tracking branch 'upstream/master' into series-info
MarcoGorelli May 20, 2020
22de3c5
reduce if/then
MarcoGorelli May 30, 2020
8a58bd6
simplify diff
MarcoGorelli May 30, 2020
9568d03
Merge remote-tracking branch 'origin/series-info' into series-info
MarcoGorelli May 30, 2020
21d263c
factor out memory usage
MarcoGorelli May 30, 2020
cfa8039
clarify docstring
MarcoGorelli May 30, 2020
3811545
initial OOP approach
MarcoGorelli Jun 6, 2020
6bcbef7
space method
MarcoGorelli Jun 6, 2020
a245484
add _verbose_repr method
MarcoGorelli Jun 6, 2020
c04dabf
wip
MarcoGorelli Jun 7, 2020
d9993ee
some typing / removing unnecessary methods
MarcoGorelli Jun 13, 2020
700801b
Merge remote-tracking branch 'upstream/master' into series-info
MarcoGorelli Jun 30, 2020
ad39d85
resolve better
MarcoGorelli Jun 30, 2020
cad1391
remove docstrings from inherited class
MarcoGorelli Jun 30, 2020
a53033b
fix typing
MarcoGorelli Jun 30, 2020
f0e2290
:art:
MarcoGorelli Jun 30, 2020
53e8c20
:art:, fix doctests
MarcoGorelli Jul 1, 2020
ee717c8
factor out _get_count_configs
MarcoGorelli Jul 1, 2020
4d7a211
factor _get_count_configs out of Series._verbose_repr as well
MarcoGorelli Jul 1, 2020
6eccf00
fix typing
MarcoGorelli Jul 2, 2020
81d22eb
factor out _display_counts_and_dtypes
MarcoGorelli Jul 2, 2020
f2ca520
fix typing, factor out _get_header_and_spaces
MarcoGorelli Jul 2, 2020
669ff38
document _get_count_configs
MarcoGorelli Jul 4, 2020
6f8f8b1
document _display_counts_and_dtypes and _get_header_and_spaces
MarcoGorelli Jul 4, 2020
97dc73c
remove breakpoints
MarcoGorelli Jul 4, 2020
0707f32
fix docstring substitution
MarcoGorelli Jul 4, 2020
2a2324b
Merge remote-tracking branch 'upstream/master' into series-info
MarcoGorelli Jul 5, 2020
0c08335
Merge remote-tracking branch 'upstream/master' into series-info
MarcoGorelli Sep 13, 2020
c93f1ad
Merge remote-tracking branch 'upstream/master' into series-info
MarcoGorelli Sep 19, 2020
ddf9efc
fix failing doctest
MarcoGorelli Sep 19, 2020
21d94b2
use CountConfigs namedtuple
MarcoGorelli Sep 19, 2020
a213d9c
:fire:
MarcoGorelli Sep 19, 2020
089ce24
remove trailing comma
MarcoGorelli Sep 19, 2020
4581385
fix failing doctest
MarcoGorelli Sep 19, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -118,6 +118,7 @@ Other enhancements
- :class:`Index` with object dtype supports division and multiplication (:issue:`34160`)
- :meth:`DataFrame.explode` and :meth:`Series.explode` now support exploding of sets (:issue:`35614`)
- `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`)
- :meth:`Series.info` has been added, for compatibility with :meth:`DataFrame.info` (:issue:`5167`)
- :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`)
- :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`)

89 changes: 89 additions & 0 deletions pandas/core/series.py
@@ -96,6 +96,7 @@
from pandas.core.tools.datetimes import to_datetime

import pandas.io.formats.format as fmt
from pandas.io.formats.info import SeriesInfo
import pandas.plotting

if TYPE_CHECKING:
@@ -4551,6 +4552,94 @@ def replace(
method=method,
)

    @Substitution(
        klass="Series",
        type_sub="",
        max_cols_sub="",
        examples_sub=(
            """
            >>> int_values = [1, 2, 3, 4, 5]
            >>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
            >>> s = pd.Series(text_values, index=int_values)
            >>> s.info()
            <class 'pandas.core.series.Series'>
            Int64Index: 5 entries, 1 to 5
            Series name: None
            Non-Null Count  Dtype
            --------------  -----
            5 non-null      object
            dtypes: object(1)
            memory usage: 80.0+ bytes

            Prints a summary excluding information about its values:

            >>> s.info(verbose=False)
            <class 'pandas.core.series.Series'>
            Int64Index: 5 entries, 1 to 5
            dtypes: object(1)
            memory usage: 80.0+ bytes

            Pipe output of Series.info to a buffer instead of sys.stdout, get
            the buffer content, and write it to a text file:

            >>> import io
            >>> buffer = io.StringIO()
            >>> s.info(buf=buffer)
            >>> s = buffer.getvalue()
            >>> with open("df_info.txt", "w",
            ...           encoding="utf-8") as f:  # doctest: +SKIP
            ...     f.write(s)
            260

            The `memory_usage` parameter allows deep introspection mode, especially
            useful for big Series and for fine-tuning memory optimization:

            >>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6)
            >>> s = pd.Series(random_strings_array)
            >>> s.info()
            <class 'pandas.core.series.Series'>
            RangeIndex: 1000000 entries, 0 to 999999
            Series name: None
            Non-Null Count    Dtype
            --------------    -----
            1000000 non-null  object
            dtypes: object(1)
            memory usage: 7.6+ MB

            >>> s.info(memory_usage='deep')
            <class 'pandas.core.series.Series'>
            RangeIndex: 1000000 entries, 0 to 999999
            Series name: None
            Non-Null Count    Dtype
            --------------    -----
            1000000 non-null  object
            dtypes: object(1)
            memory usage: 55.3 MB"""
        ),
        see_also_sub=(
            """
            Series.describe: Generate descriptive statistics of Series.
            Series.memory_usage: Memory usage of Series."""
        ),
    )
    @doc(SeriesInfo.info)
    def info(
        self,
        verbose: Optional[bool] = None,
        buf: Optional[IO[str]] = None,
        max_cols: Optional[int] = None,
        memory_usage: Optional[Union[bool, str]] = None,
        null_counts: Optional[bool] = None,
    ) -> None:
        if max_cols is not None:
            raise ValueError(
                "Argument `max_cols` can only be passed "
                "in DataFrame.info, not Series.info"
            )
        return SeriesInfo(
            self, verbose, buf, max_cols, memory_usage, null_counts
        ).info()

@doc(NDFrame.shift, klass=_shared_doc_kwargs["klass"])
def shift(self, periods=1, freq=None, axis=0, fill_value=None) -> "Series":
return super().shift(
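The `Series.info` wrapper in the series.py hunk above does two things before delegating to `SeriesInfo`: it rejects `max_cols`, and it accepts an optional `buf` to write to instead of `sys.stdout`. The sketch below imitates both behaviours with the standard library only. `info_sketch` and its two-line output are hypothetical stand-ins, not the pandas implementation.

```python
import io
import sys
from typing import IO, List, Optional


def info_sketch(
    values: List[object],
    buf: Optional[IO[str]] = None,
    max_cols: Optional[int] = None,
) -> None:
    """Hypothetical stand-in for Series.info; not the pandas implementation."""
    # Same guard as the method above: max_cols only applies to DataFrames.
    if max_cols is not None:
        raise ValueError(
            "Argument `max_cols` can only be passed "
            "in DataFrame.info, not Series.info"
        )
    # Fall back to stdout when no buffer is supplied.
    out = sys.stdout if buf is None else buf
    non_null = sum(v is not None for v in values)
    out.write(f"{len(values)} entries\n")
    out.write(f"{non_null} non-null\n")


# Capture the summary in memory instead of printing it.
buffer = io.StringIO()
info_sketch(["alpha", None, "gamma"], buf=buffer)
summary = buffer.getvalue()

# Passing max_cols is an error for the Series case.
try:
    info_sketch(["alpha"], max_cols=1)
except ValueError as err:
    guard_message = str(err)
```

Raising early keeps the shared signature between the DataFrame and Series methods while making the unsupported argument fail loudly rather than being silently ignored.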
247 changes: 207 additions & 40 deletions pandas/io/formats/info.py
@@ -1,6 +1,6 @@
from abc import ABCMeta, abstractmethod
import sys
from typing import IO, TYPE_CHECKING, List, Optional, Tuple, Union
from typing import IO, TYPE_CHECKING, List, NamedTuple, Optional, Tuple, Union, cast

from pandas._config import get_option

@@ -15,6 +15,32 @@
from pandas.core.series import Series # noqa: F401


class CountConfigs(NamedTuple):
Review comment (Member): Can you generate this using collections.namedtuple instead? We typically don't subclass things from typing.

Reply (Member Author, @MarcoGorelli, Sep 22, 2020): Sure, can do, although I think this is the newer syntax; it's taken directly from https://docs.python.org/3/library/typing.html#typing.NamedTuple

They also recommend it in the mypy docs: https://mypy.readthedocs.io/en/stable/kinds_of_types.html#named-tuples

    """
    Configs with which to display counts.

    Attributes
    ----------
    counts : Series
        Non-null count of Series (or of each column of DataFrame).
    count_header : str
        Header that will be printed out above non-null counts in output.
    space_count : int
        Number of spaces that count_header should occupy
        (including space before `dtypes` column).
    len_count : int
        Length of count header.
    count_temp : str
        String that can be formatted to include non-null count.
    """

    counts: "Series"
    count_header: str
    space_count: int
    len_count: int
    count_temp: str
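The review thread on `CountConfigs` weighs the class-based `typing.NamedTuple` against the functional `collections.namedtuple`. A minimal side-by-side sketch follows. The two classes here are hypothetical and trimmed to two fields for brevity; they are not part of the PR.

```python
from collections import namedtuple
from typing import NamedTuple


class CountConfigsTyped(NamedTuple):
    """Class-based form, in the style used in the diff (fields trimmed)."""

    count_header: str
    space_count: int


# Functional form suggested in the review; field types are not recorded.
CountConfigsPlain = namedtuple("CountConfigsPlain", ["count_header", "space_count"])

typed = CountConfigsTyped("Non-Null Count", 16)
plain = CountConfigsPlain("Non-Null Count", 16)

# Both are ordinary tuples with the same named, positional fields.
same_values = tuple(typed) == tuple(plain)

# The class-based form carries per-field annotations for type checkers.
typed_field_names = list(CountConfigsTyped.__annotations__)
```

At runtime the two behave alike; the difference is that the class-based form gives mypy the field types and leaves room for a docstring.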


def _put_str(s: Union[str, Dtype], space: int) -> str:
"""
Make string of specified length, padding to the right if necessary.
@@ -72,6 +98,134 @@ def _sizeof_fmt(num: Union[int, float], size_qualifier: str) -> str:
return f"{num:3.1f}{size_qualifier} PB"


def _get_count_configs(
    counts: "Series", col_space: int, show_counts: bool, col_count: Optional[int] = None
) -> CountConfigs:
    """
    Get configs for displaying counts, depending on the value of `show_counts`.

    Parameters
    ----------
    counts : Series
        Non-null count of Series (or of each column of DataFrame).
    col_space : int
        How many spaces to leave between non-null count and dtype columns.
    show_counts : bool
        Whether to display non-null counts.
    col_count : int, optional
        Number of columns in DataFrame.

    Returns
    -------
    CountConfigs
    """
    if show_counts:
        if col_count is not None and col_count != len(counts):  # pragma: no cover
            raise AssertionError(
                f"Columns must equal counts ({col_count} != {len(counts)})"
            )
        count_header = "Non-Null Count"
        len_count = len(count_header)
        non_null = " non-null"
        max_count = max(len(pprint_thing(k)) for k in counts) + len(non_null)
        space_count = max(len_count, max_count) + col_space
        count_temp = "{count}" + non_null
    else:
        count_header = ""
        space_count = len(count_header)
        len_count = space_count
        count_temp = "{count}"
    return CountConfigs(counts, count_header, space_count, len_count, count_temp)
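To make the width arithmetic in `_get_count_configs` concrete, here is a stdlib-only sketch that mirrors its two branches with a plain list of ints instead of a Series. `Widths` and `count_widths` are hypothetical names, and `str` is used where the original calls `pprint_thing`, which is an assumption that holds for plain integers.

```python
from typing import List, NamedTuple


class Widths(NamedTuple):
    # Hypothetical, trimmed-down counterpart of CountConfigs.
    count_header: str
    space_count: int
    len_count: int
    count_temp: str


def count_widths(counts: List[int], col_space: int, show_counts: bool) -> Widths:
    """Mirror the branch logic above with plain ints instead of a Series."""
    if show_counts:
        count_header = "Non-Null Count"
        len_count = len(count_header)
        non_null = " non-null"
        # Widest rendered cell, for example "1000000 non-null".
        max_count = max(len(str(c)) for c in counts) + len(non_null)
        space_count = max(len_count, max_count) + col_space
        count_temp = "{count}" + non_null
    else:
        # No counts column: zero width, and the template renders the bare value.
        count_header = ""
        space_count = len_count = 0
        count_temp = "{count}"
    return Widths(count_header, space_count, len_count, count_temp)


small = count_widths([5], col_space=2, show_counts=True)
large = count_widths([1_000_000], col_space=2, show_counts=True)
hidden = count_widths([5], col_space=2, show_counts=False)
```

With a two-space gap, the column is 16 characters wide when the header is the widest item (`small`) and 18 when the cell `1000000 non-null` is wider than the header (`large`).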


def _display_counts_and_dtypes(
    lines: List[str],
    ids: "Index",
    dtypes: "Series",
    show_counts: bool,
    count_configs: CountConfigs,
    space_dtype: int,
    space: int = 0,
    space_num: int = 0,
) -> None:
    """
    Append count and dtype of Series (or of each column of Frame) to `lines`.

    Parameters
    ----------
    lines : List[str]
        At this stage, this contains the main header and the info table headers.
    ids : Index
        Series name (or names of DataFrame columns).
    dtypes : Series
        Series dtype (or dtypes of DataFrame columns).
    show_counts : bool
        Whether to show non-null counts.
    count_configs : CountConfigs
        Configs with which to display counts.
    space_dtype : int
        Number of spaces that `dtypes` column should occupy.
    space : int = 0
        Number of spaces that `Column` header should occupy
        (including space before `non-null count` column).
    space_num : int = 0
        Number of spaces that ` # ` header should occupy (including space
        before `Column` column), only applicable for `DataFrame.info`.
    """
    for i, col in enumerate(ids):
        dtype = dtypes[i]
        col = pprint_thing(col)

        line_no = _put_str(f" {i}", space_num)
        count = ""
        if show_counts:
            count = count_configs.counts[i]

        lines.append(
            line_no
            + _put_str(col, space)
            + _put_str(
                count_configs.count_temp.format(count=count), count_configs.space_count
            )
            + _put_str(dtype, space_dtype)
        )


def _get_header_and_spaces(
    dtypes: "Series", space_count: int, count_header: str, header: str = ""
) -> Tuple[int, str, int]:
    """
    Append extra columns (count and type) to header, if applicable.

    Parameters
    ----------
    dtypes : Series
        Series dtype (or dtypes of DataFrame columns).
    space_count : int
        Number of spaces that count_header should occupy
        (including space before `dtypes` column).
    count_header : str
        Header that will be printed out above non-null counts in output.
    header : str
        Current header.

    Returns
    -------
    space_dtype : int
        Number of spaces that `dtypes` column should occupy.
    header : str
        Header with extra columns (count and type) appended.
    len_dtype : int
        Length of dtype header.
    """
    dtype_header = "Dtype"
    len_dtype = len(dtype_header)
    max_dtypes = max(len(pprint_thing(k)) for k in dtypes)
    space_dtype = max(len_dtype, max_dtypes)
    header += _put_str(count_header, space_count) + _put_str(dtype_header, space_dtype)
    return space_dtype, header, len_dtype
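The three helpers above combine into the header, rule, and data row of the Series table. The sketch below reproduces that assembly for a single Series using only the standard library. `put_str` is an assumed stand-in for `_put_str` (truncate to the width, then pad on the right, per its docstring), and `render_series_table` is a hypothetical helper, not a function from the PR.

```python
from typing import List


def put_str(s: object, space: int) -> str:
    # Assumed behaviour of _put_str: truncate to `space`, then right-pad.
    return str(s)[:space].ljust(space)


def render_series_table(count: int, dtype: str, col_space: int = 2) -> List[str]:
    """Hypothetical helper: build the three table lines for one Series."""
    count_header, dtype_header = "Non-Null Count", "Dtype"
    cell = f"{count} non-null"
    # Count column: widest of header and cell, plus the gap before Dtype.
    space_count = max(len(count_header), len(cell)) + col_space
    # Dtype column: widest of header and dtype name, no trailing gap.
    space_dtype = max(len(dtype_header), len(dtype))
    header = put_str(count_header, space_count) + put_str(dtype_header, space_dtype)
    rule = put_str("-" * len(count_header), space_count) + put_str(
        "-" * len(dtype_header), space_dtype
    )
    row = put_str(cell, space_count) + put_str(dtype, space_dtype)
    return [header, rule, row]


table = render_series_table(5, "object")
```

Because every cell is padded to its column width, the header and rule lines carry one trailing space here (the `Dtype` column is sized to fit `object`).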


class BaseInfo(metaclass=ABCMeta):
def __init__(
self,
@@ -297,55 +451,68 @@ def _verbose_repr(
space_num = max(max_id, len_id) + col_space

header = _put_str(id_head, space_num) + _put_str(column_head, space)
if show_counts:
counts = self.data.count()
if col_count != len(counts): # pragma: no cover
raise AssertionError(
f"Columns must equal counts ({col_count} != {len(counts)})"
)
count_header = "Non-Null Count"
len_count = len(count_header)
non_null = " non-null"
max_count = max(len(pprint_thing(k)) for k in counts) + len(non_null)
space_count = max(len_count, max_count) + col_space
count_temp = "{count}" + non_null
else:
count_header = ""
space_count = len(count_header)
len_count = space_count
count_temp = "{count}"

dtype_header = "Dtype"
len_dtype = len(dtype_header)
max_dtypes = max(len(pprint_thing(k)) for k in dtypes)
space_dtype = max(len_dtype, max_dtypes)
header += _put_str(count_header, space_count) + _put_str(
dtype_header, space_dtype
counts = self.data.count()
count_configs = _get_count_configs(counts, col_space, show_counts, col_count)

space_dtype, header, len_dtype = _get_header_and_spaces(
dtypes, count_configs.space_count, count_configs.count_header, header
)

lines.append(header)
lines.append(
_put_str("-" * len_id, space_num)
+ _put_str("-" * len_column, space)
+ _put_str("-" * len_count, space_count)
+ _put_str("-" * count_configs.len_count, count_configs.space_count)
+ _put_str("-" * len_dtype, space_dtype)
)

for i, col in enumerate(ids):
dtype = dtypes[i]
col = pprint_thing(col)
_display_counts_and_dtypes(
lines,
ids,
dtypes,
show_counts,
count_configs,
space_dtype,
space,
space_num,
)

line_no = _put_str(f" {i}", space_num)
count = ""
if show_counts:
count = counts[i]
def _non_verbose_repr(self, lines: List[str], ids: "Index") -> None:
lines.append(ids._summary(name="Columns"))

lines.append(
line_no
+ _put_str(col, space)
+ _put_str(count_temp.format(count=count), space_count)
+ _put_str(dtype, space_dtype)
)

class SeriesInfo(BaseInfo):
    def _get_mem_usage(self, deep: bool) -> int:
        return self.data.memory_usage(index=True, deep=deep)

    def _get_ids_and_dtypes(self) -> Tuple["Index", "Series"]:
        ids = Index([self.data.name])
        dtypes = cast("Series", self.data._constructor(self.data.dtypes))
        return ids, dtypes

    def _verbose_repr(
        self, lines: List[str], ids: "Index", dtypes: "Series", show_counts: bool
    ) -> None:
        lines.append(f"Series name: {self.data.name}")

        id_space = 2

        counts = cast("Series", self.data._constructor(self.data.count()))
        count_configs = _get_count_configs(counts, id_space, show_counts)

        space_dtype, header, len_dtype = _get_header_and_spaces(
            dtypes, count_configs.space_count, count_configs.count_header
        )

        lines.append(header)
        lines.append(
            _put_str("-" * count_configs.len_count, count_configs.space_count)
            + _put_str("-" * len_dtype, space_dtype)
        )

        _display_counts_and_dtypes(
            lines, ids, dtypes, show_counts, count_configs, space_dtype,
        )

    def _non_verbose_repr(self, lines: List[str], ids: "Index") -> None:
        lines.append(ids._summary(name="Columns"))
        pass
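The diff splits the work between a shared `BaseInfo` driver and subclass hooks such as `_verbose_repr` and `_non_verbose_repr`. A minimal sketch of that arrangement follows, using hypothetical classes (`InfoSketch`, `SeriesInfoSketch`) over a plain list. It assumes the Series non-verbose hook is meant to add nothing, which matches the `verbose=False` doctest output in the series.py hunk.

```python
from abc import ABC, abstractmethod
from typing import List


class InfoSketch(ABC):
    """Hypothetical base class: shared driver, subclass-specific hooks."""

    def __init__(self, data: List[object]) -> None:
        self.data = data

    @abstractmethod
    def _verbose_repr(self, lines: List[str]) -> None:
        ...

    @abstractmethod
    def _non_verbose_repr(self, lines: List[str]) -> None:
        ...

    def render(self, verbose: bool = True) -> List[str]:
        # Common header first, then whichever hook the caller asked for.
        lines = [f"{len(self.data)} entries"]
        if verbose:
            self._verbose_repr(lines)
        else:
            self._non_verbose_repr(lines)
        return lines


class SeriesInfoSketch(InfoSketch):
    def _verbose_repr(self, lines: List[str]) -> None:
        non_null = sum(v is not None for v in self.data)
        lines.append(f"{non_null} non-null")

    def _non_verbose_repr(self, lines: List[str]) -> None:
        # Assumed no-op: a Series has no column summary to add.
        pass


verbose_lines = SeriesInfoSketch([1, None, 3]).render(verbose=True)
brief_lines = SeriesInfoSketch([1, None, 3]).render(verbose=False)
```

Keeping the driver in the base class means a DataFrame variant only needs to supply its own two hooks.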