{{ header }}

pandas arrays, scalars, and data types

Objects

.. currentmodule:: pandas

For most data types, pandas uses NumPy arrays as the concrete objects contained within an :class:`Index`, :class:`Series`, or :class:`DataFrame`.

For some data types, pandas extends NumPy's type system. String aliases for these types can be found at :ref:`basics.dtypes`.

==================== =========================== ============================== ==============================
Kind of Data         pandas Data Type            Scalar                         Array
==================== =========================== ============================== ==============================
TZ-aware datetime    :class:`DatetimeTZDtype`    :class:`Timestamp`             :ref:`api.arrays.datetime`
Timedeltas           (none)                      :class:`Timedelta`             :ref:`api.arrays.timedelta`
Period (time spans)  :class:`PeriodDtype`        :class:`Period`                :ref:`api.arrays.period`
Intervals            :class:`IntervalDtype`      :class:`Interval`              :ref:`api.arrays.interval`
Nullable Integer     :class:`Int64Dtype`, ...    (none)                         :ref:`api.arrays.integer_na`
Nullable Float       :class:`Float64Dtype`, ...  (none)                         :ref:`api.arrays.float_na`
Categorical          :class:`CategoricalDtype`   (none)                         :ref:`api.arrays.categorical`
Sparse               :class:`SparseDtype`        (none)                         :ref:`api.arrays.sparse`
Strings              :class:`StringDtype`        :class:`str`                   :ref:`api.arrays.string`
Nullable Boolean     :class:`BooleanDtype`       :class:`bool`                  :ref:`api.arrays.bool`
PyArrow              :class:`ArrowDtype`         Python Scalars or :class:`NA`  :ref:`api.arrays.arrow`
==================== =========================== ============================== ==============================

pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`). The top-level :func:`array` function can be used to create a new array, which may be stored in a :class:`Series` or :class:`Index`, or as a column in a :class:`DataFrame`.

.. autosummary::
   :toctree: api/

   array
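As a minimal sketch of the constructor described above (the variable names are illustrative only), the following creates extension arrays with ``pd.array`` and stores them in a :class:`Series` and a :class:`DataFrame`:

```python
import pandas as pd

# With no dtype given, pandas infers a nullable extension dtype for
# integer data that contains a missing value.
ints = pd.array([1, 2, None])

# An explicit dtype can be requested by alias.
texts = pd.array(["a", None], dtype="string")

# The resulting arrays can back a Series or a DataFrame column.
ser = pd.Series(ints)
frame = pd.DataFrame({"n": ints, "flag": pd.array([True, None, False])})

print(ints.dtype)
print(texts.dtype)
print(frame.dtypes)
```

The dtype of the array carries over to the container, so ``ser.dtype`` and ``frame["n"].dtype`` match ``ints.dtype``.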

PyArrow

.. warning::

   This feature is experimental, and the API can change in a future release without warning.

The :class:`arrays.ArrowExtensionArray` is backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray` with a :external+pyarrow:py:class:`pyarrow.DataType` instead of a NumPy array and data type. The ``.dtype`` of an :class:`arrays.ArrowExtensionArray` is an :class:`ArrowDtype`.

PyArrow provides array and data type support similar to NumPy's, including first-class nullability support for all data types, immutability, and more.

The table below shows the equivalent PyArrow-backed (``pa``), pandas extension, and NumPy (``np``) types that are recognized by pandas. The PyArrow-backed types below need to be passed into :class:`ArrowDtype` to be recognized by pandas, e.g. ``pd.ArrowDtype(pa.bool_())``.

================================================ =========================== ====================
PyArrow type                                     pandas extension type       NumPy type
================================================ =========================== ====================
:external+pyarrow:py:func:`pyarrow.bool_`        :class:`BooleanDtype`       ``np.bool_``
:external+pyarrow:py:func:`pyarrow.int8`         :class:`Int8Dtype`          ``np.int8``
:external+pyarrow:py:func:`pyarrow.int16`        :class:`Int16Dtype`         ``np.int16``
:external+pyarrow:py:func:`pyarrow.int32`        :class:`Int32Dtype`         ``np.int32``
:external+pyarrow:py:func:`pyarrow.int64`        :class:`Int64Dtype`         ``np.int64``
:external+pyarrow:py:func:`pyarrow.uint8`        :class:`UInt8Dtype`         ``np.uint8``
:external+pyarrow:py:func:`pyarrow.uint16`       :class:`UInt16Dtype`        ``np.uint16``
:external+pyarrow:py:func:`pyarrow.uint32`       :class:`UInt32Dtype`        ``np.uint32``
:external+pyarrow:py:func:`pyarrow.uint64`       :class:`UInt64Dtype`        ``np.uint64``
:external+pyarrow:py:func:`pyarrow.float32`      :class:`Float32Dtype`       ``np.float32``
:external+pyarrow:py:func:`pyarrow.float64`      :class:`Float64Dtype`       ``np.float64``
:external+pyarrow:py:func:`pyarrow.time32`       (none)                      (none)
:external+pyarrow:py:func:`pyarrow.time64`       (none)                      (none)
:external+pyarrow:py:func:`pyarrow.timestamp`    :class:`DatetimeTZDtype`    ``np.datetime64``
:external+pyarrow:py:func:`pyarrow.date32`       (none)                      (none)
:external+pyarrow:py:func:`pyarrow.date64`       (none)                      (none)
:external+pyarrow:py:func:`pyarrow.duration`     (none)                      ``np.timedelta64``
:external+pyarrow:py:func:`pyarrow.binary`       (none)                      (none)
:external+pyarrow:py:func:`pyarrow.string`       :class:`StringDtype`        ``np.str_``
:external+pyarrow:py:func:`pyarrow.decimal128`   (none)                      (none)
:external+pyarrow:py:func:`pyarrow.list_`        (none)                      (none)
:external+pyarrow:py:func:`pyarrow.map_`         (none)                      (none)
:external+pyarrow:py:func:`pyarrow.dictionary`   :class:`CategoricalDtype`   (none)
================================================ =========================== ====================

.. note::

   For string types (``pyarrow.string()``, ``string[pyarrow]``), PyArrow support is still facilitated by :class:`arrays.ArrowStringArray` and ``StringDtype("pyarrow")``. See the :ref:`string section <api.arrays.string>` below.

While individual values in an :class:`arrays.ArrowExtensionArray` are stored as PyArrow objects, scalars are returned as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as a Python int, or :class:`NA` for missing values.
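The scalar behaviour described above can be checked with a short sketch. It assumes that the optional ``pyarrow`` dependency is installed, and does nothing otherwise. The variable names are illustrative.

```python
import pandas as pd

try:
    import pyarrow as pa
except ImportError:  # pyarrow is an optional dependency of pandas
    pa = None

if pa is not None:
    # Wrap the PyArrow type in ArrowDtype so pandas recognizes it.
    arr = pd.array([1, 2, None], dtype=pd.ArrowDtype(pa.int64()))

    first = arr[0]    # a plain Python int, not a pyarrow scalar
    missing = arr[2]  # missing values come back as pd.NA

    print(type(arr).__name__)
    print(arr.dtype)
    print(type(first), missing is pd.NA)
```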

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.ArrowExtensionArray

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   ArrowDtype

Datetimes

NumPy cannot natively represent timezone-aware datetimes. pandas supports this with the :class:`arrays.DatetimeArray` extension array, which can hold timezone-naive or timezone-aware values.

:class:`Timestamp`, a subclass of :class:`datetime.datetime`, is pandas' scalar type for timezone-naive or timezone-aware datetime data. :class:`NaT` is the missing value for datetime data.

.. autosummary::
   :toctree: api/

   Timestamp

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   NaT
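A brief sketch of the scalar types above, using illustrative variable names: a :class:`Timestamp` can be timezone-naive or timezone-aware, is usable wherever a :class:`datetime.datetime` is expected, and uses :class:`NaT` for missing values.

```python
import datetime

import pandas as pd

naive = pd.Timestamp("2024-03-10 12:30")
aware = pd.Timestamp("2024-03-10 12:30", tz="UTC")

# Timestamp subclasses datetime.datetime.
is_datetime = isinstance(aware, datetime.datetime)

# NaT plays the role of the missing value for datetime data.
missing = pd.NaT

print(naive.tz, aware.tz)
print(is_datetime, pd.isna(missing))
```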

Properties

.. autosummary::
   :toctree: api/

   Timestamp.asm8
   Timestamp.day
   Timestamp.dayofweek
   Timestamp.day_of_week
   Timestamp.dayofyear
   Timestamp.day_of_year
   Timestamp.days_in_month
   Timestamp.daysinmonth
   Timestamp.fold
   Timestamp.hour
   Timestamp.is_leap_year
   Timestamp.is_month_end
   Timestamp.is_month_start
   Timestamp.is_quarter_end
   Timestamp.is_quarter_start
   Timestamp.is_year_end
   Timestamp.is_year_start
   Timestamp.max
   Timestamp.microsecond
   Timestamp.min
   Timestamp.minute
   Timestamp.month
   Timestamp.nanosecond
   Timestamp.quarter
   Timestamp.resolution
   Timestamp.second
   Timestamp.tz
   Timestamp.tzinfo
   Timestamp.unit
   Timestamp.value
   Timestamp.week
   Timestamp.weekofyear
   Timestamp.year

Methods

.. autosummary::
   :toctree: api/

   Timestamp.as_unit
   Timestamp.astimezone
   Timestamp.ceil
   Timestamp.combine
   Timestamp.ctime
   Timestamp.date
   Timestamp.day_name
   Timestamp.dst
   Timestamp.floor
   Timestamp.fromordinal
   Timestamp.fromtimestamp
   Timestamp.isocalendar
   Timestamp.isoformat
   Timestamp.isoweekday
   Timestamp.month_name
   Timestamp.normalize
   Timestamp.now
   Timestamp.replace
   Timestamp.round
   Timestamp.strftime
   Timestamp.strptime
   Timestamp.time
   Timestamp.timestamp
   Timestamp.timetuple
   Timestamp.timetz
   Timestamp.to_datetime64
   Timestamp.to_numpy
   Timestamp.to_julian_date
   Timestamp.to_period
   Timestamp.to_pydatetime
   Timestamp.today
   Timestamp.toordinal
   Timestamp.tz_convert
   Timestamp.tz_localize
   Timestamp.tzname
   Timestamp.utcfromtimestamp
   Timestamp.utcnow
   Timestamp.utcoffset
   Timestamp.utctimetuple
   Timestamp.weekday

A collection of timestamps may be stored in an :class:`arrays.DatetimeArray`. For timezone-aware data, the ``.dtype`` of an :class:`arrays.DatetimeArray` is a :class:`DatetimeTZDtype`. For timezone-naive data, ``np.dtype("datetime64[ns]")`` is used.

If the data are timezone-aware, then every value in the array must have the same timezone.

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.DatetimeArray

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   DatetimeTZDtype
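The difference between the two dtypes can be seen in a small sketch (variable names are illustrative). The exact resolution of the naive NumPy dtype is not asserted here, since it can depend on the pandas version.

```python
import numpy as np
import pandas as pd

aware = pd.Series(pd.date_range("2024-01-01", periods=3, tz="UTC"))
naive = pd.Series(pd.date_range("2024-01-01", periods=3))

# Both are backed by a DatetimeArray; only the dtype differs.
print(type(aware.array).__name__, aware.dtype)
print(type(naive.array).__name__, naive.dtype)

# Timezone-aware data uses the DatetimeTZDtype extension dtype,
# timezone-naive data uses a NumPy datetime64 dtype.
aware_is_extension = isinstance(aware.dtype, pd.DatetimeTZDtype)
naive_is_numpy = isinstance(naive.dtype, np.dtype)
```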

Timedeltas

NumPy can natively represent timedeltas. pandas provides :class:`Timedelta` for symmetry with :class:`Timestamp`. :class:`NaT` is the missing value for timedelta data.

.. autosummary::
   :toctree: api/

   Timedelta

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   NaT

Properties

.. autosummary::
   :toctree: api/

   Timedelta.asm8
   Timedelta.components
   Timedelta.days
   Timedelta.max
   Timedelta.microseconds
   Timedelta.min
   Timedelta.nanoseconds
   Timedelta.resolution
   Timedelta.seconds
   Timedelta.unit
   Timedelta.value
   Timedelta.view

Methods

.. autosummary::
   :toctree: api/

   Timedelta.as_unit
   Timedelta.ceil
   Timedelta.floor
   Timedelta.isoformat
   Timedelta.round
   Timedelta.to_pytimedelta
   Timedelta.to_timedelta64
   Timedelta.to_numpy
   Timedelta.total_seconds

A collection of :class:`Timedelta` objects may be stored in an :class:`arrays.TimedeltaArray`.

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.TimedeltaArray
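For illustration, a sketch with made-up values: it builds a single :class:`Timedelta` and then a collection backed by a ``TimedeltaArray``, with :class:`NaT` marking the missing entry.

```python
import pandas as pd

td = pd.Timedelta(days=1, hours=2)
seconds = td.total_seconds()  # 26 hours expressed in seconds

# to_timedelta returns a TimedeltaIndex; its .array is a TimedeltaArray.
tds = pd.to_timedelta(["1 day", "2 hours", None])
arr = tds.array

print(seconds)
print(type(arr).__name__, arr.dtype)
print(arr[2])
```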

Periods

pandas represents spans of time as :class:`Period` objects.

Period

.. autosummary::
   :toctree: api/

   Period

Properties

.. autosummary::
   :toctree: api/

   Period.day
   Period.dayofweek
   Period.day_of_week
   Period.dayofyear
   Period.day_of_year
   Period.days_in_month
   Period.daysinmonth
   Period.end_time
   Period.freq
   Period.freqstr
   Period.hour
   Period.is_leap_year
   Period.minute
   Period.month
   Period.ordinal
   Period.quarter
   Period.qyear
   Period.second
   Period.start_time
   Period.week
   Period.weekday
   Period.weekofyear
   Period.year

Methods

.. autosummary::
   :toctree: api/

   Period.asfreq
   Period.now
   Period.strftime
   Period.to_timestamp

A collection of :class:`Period` objects may be stored in an :class:`arrays.PeriodArray`. Every period in an :class:`arrays.PeriodArray` must have the same ``freq``.

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.PeriodArray

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   PeriodDtype
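A short sketch of the objects in this section, assuming a monthly frequency purely for illustration: a :class:`Period` covers a span of time with a start and an end, and a ``PeriodArray`` holds periods that share one frequency.

```python
import pandas as pd

p = pd.Period("2024-01", freq="M")
start, end = p.start_time, p.end_time  # first and last instant of January

# period_range returns a PeriodIndex; its .array is a PeriodArray.
periods = pd.period_range("2024-01", periods=3, freq="M")
arr = periods.array

print(start, end)
print(type(arr).__name__, arr.dtype)
```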

Intervals

Arbitrary intervals can be represented as :class:`Interval` objects.

.. autosummary::
   :toctree: api/

    Interval

Properties

.. autosummary::
   :toctree: api/

   Interval.closed
   Interval.closed_left
   Interval.closed_right
   Interval.is_empty
   Interval.left
   Interval.length
   Interval.mid
   Interval.open_left
   Interval.open_right
   Interval.overlaps
   Interval.right

A collection of intervals may be stored in an :class:`arrays.IntervalArray`.

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.IntervalArray

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   IntervalDtype
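As an illustrative sketch (the bounds are arbitrary), the following builds a single :class:`Interval`, tests membership against its open and closed ends, and creates an ``IntervalArray`` from a list of break points:

```python
import pandas as pd

iv = pd.Interval(0, 5, closed="right")  # the interval (0, 5]

contains_three = 3 in iv
contains_zero = 0 in iv  # the left end is open, so 0 is excluded

# Consecutive break points define adjacent intervals.
arr = pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3])

print(iv.length, iv.mid)
print(contains_three, contains_zero)
print(arr)
```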


Nullable integer

:class:`numpy.ndarray` cannot natively represent integer data with missing values. pandas provides this through :class:`arrays.IntegerArray`.

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.IntegerArray

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   Int8Dtype
   Int16Dtype
   Int32Dtype
   Int64Dtype
   UInt8Dtype
   UInt16Dtype
   UInt32Dtype
   UInt64Dtype
   NA
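The limitation can be seen in a small comparison, sketched here with arbitrary values: NumPy falls back to floating point to hold ``nan``, while the nullable ``"Int64"`` dtype keeps integers and marks the gap with :class:`NA`.

```python
import numpy as np
import pandas as pd

# NumPy has no integer missing value, so the array becomes float64.
as_numpy = np.array([1, 2, np.nan])

# The nullable integer dtype keeps integer values alongside pd.NA.
as_pandas = pd.array([1, 2, None], dtype="Int64")
total = as_pandas.sum()  # missing values are skipped by default

print(as_numpy.dtype)
print(as_pandas.dtype, total)
```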

Nullable float

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.FloatingArray

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   Float32Dtype
   Float64Dtype
   NA

Categoricals

pandas defines a custom data type for representing data that can take only a limited, fixed set of values. The dtype of a :class:`Categorical` can be described by a :class:`CategoricalDtype`.

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   CategoricalDtype

.. autosummary::
   :toctree: api/

   CategoricalDtype.categories
   CategoricalDtype.ordered

Categorical data can be stored in a :class:`pandas.Categorical`:

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   Categorical

The alternative :meth:`Categorical.from_codes` constructor can be used when you have the categories and integer codes already:

.. autosummary::
   :toctree: api/

   Categorical.from_codes
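A minimal sketch of this constructor, with made-up categories: each integer code is a position in ``categories``.

```python
import pandas as pd

codes = [0, 1, 0, 2]
categories = ["low", "mid", "high"]

# Each code indexes into the categories list.
cat = pd.Categorical.from_codes(codes, categories=categories, ordered=True)

print(cat)
print(cat.codes)
```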

The dtype information is available on the :class:`Categorical`:

.. autosummary::
   :toctree: api/

   Categorical.dtype
   Categorical.categories
   Categorical.ordered
   Categorical.codes

``np.asarray(categorical)`` works by implementing the array interface. Be aware that this converts the :class:`Categorical` back to a NumPy array, so the categories and order information are not preserved!

.. autosummary::
   :toctree: api/

   Categorical.__array__
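The loss of category information can be illustrated with a short sketch (values chosen arbitrarily): the converted array holds only the values, including no trace of the unused category.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["b", "a", "b"], categories=["a", "b", "c"], ordered=True)

# Conversion goes through Categorical.__array__.
plain = np.asarray(cat)

print(type(plain).__name__, plain.dtype)
print(plain.tolist())

# The ndarray has no notion of categories or ordering.
has_categories = hasattr(plain, "categories")
```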

A :class:`Categorical` can be stored in a :class:`Series` or :class:`DataFrame`. To create a Series of dtype ``category``, use ``cat = s.astype(dtype)`` or ``Series(..., dtype=dtype)``, where ``dtype`` is either the string ``"category"`` or an instance of :class:`CategoricalDtype`.

If the :class:`Series` is of dtype :class:`CategoricalDtype`, ``Series.cat`` can be used to change the categorical data. See :ref:`api.series.cat` for more.
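A sketch of both ways of creating a categorical :class:`Series`, followed by one use of the ``Series.cat`` accessor. The data and category names are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"])

# Option 1: the string alias infers the categories from the data.
by_name = s.astype("category")

# Option 2: a CategoricalDtype fixes the categories and their order.
dtype = pd.CategoricalDtype(categories=["b", "a"], ordered=True)
by_dtype = s.astype(dtype)

# The .cat accessor exposes and modifies the categorical data.
renamed = by_dtype.cat.rename_categories({"a": "A", "b": "B"})

print(by_name.dtype)
print(by_dtype.cat.categories, by_dtype.cat.ordered)
print(renamed.tolist())
```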

Sparse

Data where a single value is repeated many times (e.g. ``0`` or ``NaN``) may be stored efficiently as an :class:`arrays.SparseArray`.

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.SparseArray

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   SparseDtype

The ``Series.sparse`` accessor may be used to access sparse-specific attributes and methods if the :class:`Series` contains sparse values. See :ref:`api.series.sparse` and :ref:`the user guide <sparse>` for more.
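As a rough sketch with arbitrary data, only the values that differ from the fill value are stored, and the accessor reports how dense the data is:

```python
import pandas as pd

# Four of the six values equal the fill value and are not stored.
sp = pd.arrays.SparseArray([0, 0, 1, 0, 0, 2], fill_value=0)
ser = pd.Series(sp)

print(sp.sp_values)        # the explicitly stored values
print(ser.sparse.npoints)  # number of stored values
print(ser.sparse.density)  # fraction of values that are stored
```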

Strings

When working with text data, where each valid element is a string or missing, we recommend using :class:`StringDtype` (with the alias ``"string"``).

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.StringArray
   arrays.ArrowStringArray

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   StringDtype

The ``Series.str`` accessor is available for :class:`Series` backed by an :class:`arrays.StringArray`. See :ref:`api.series.str` for more.
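A minimal sketch using the ``"string"`` alias with made-up data: missing entries stay missing when the ``Series.str`` methods are applied.

```python
import pandas as pd

s = pd.Series(["apple", None, "cherry"], dtype="string")

upper = s.str.upper()  # the missing entry is propagated
lengths = s.str.len()

print(s.dtype)
print(upper.tolist())
print(lengths.tolist())
```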

Nullable Boolean

The boolean dtype (with the alias ``"boolean"``) provides support for storing boolean data (``True``, ``False``) with missing values, which is not possible with a ``bool`` :class:`numpy.ndarray`.

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.BooleanArray

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   BooleanDtype
   NA
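A small sketch with illustrative values: the array stores ``True``, ``False`` and a missing value together. In logical operations, a missing operand gives a missing result only when the outcome actually depends on it.

```python
import pandas as pd

arr = pd.array([True, False, None], dtype="boolean")

# "anything OR True" is True, so the missing value does not matter here.
either = arr | True

# "missing AND True" depends on the missing value, so it stays missing.
both = arr & True

print(arr.dtype)
print(either)
print(both)
```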


Utilities

Constructors

.. autosummary::
   :toctree: api/

   api.types.union_categoricals
   api.types.infer_dtype
   api.types.pandas_dtype

Data type introspection

.. autosummary::
   :toctree: api/

    api.types.is_bool_dtype
    api.types.is_categorical_dtype
    api.types.is_complex_dtype
    api.types.is_datetime64_any_dtype
    api.types.is_datetime64_dtype
    api.types.is_datetime64_ns_dtype
    api.types.is_datetime64tz_dtype
    api.types.is_extension_array_dtype
    api.types.is_float_dtype
    api.types.is_int64_dtype
    api.types.is_integer_dtype
    api.types.is_interval_dtype
    api.types.is_numeric_dtype
    api.types.is_any_real_numeric_dtype
    api.types.is_object_dtype
    api.types.is_period_dtype
    api.types.is_signed_integer_dtype
    api.types.is_string_dtype
    api.types.is_timedelta64_dtype
    api.types.is_timedelta64_ns_dtype
    api.types.is_unsigned_integer_dtype
    api.types.is_sparse

Iterable introspection

.. autosummary::
   :toctree: api/

    api.types.is_dict_like
    api.types.is_file_like
    api.types.is_list_like
    api.types.is_named_tuple
    api.types.is_iterator

Scalar introspection

.. autosummary::
   :toctree: api/

    api.types.is_bool
    api.types.is_complex
    api.types.is_float
    api.types.is_hashable
    api.types.is_integer
    api.types.is_interval
    api.types.is_number
    api.types.is_re
    api.types.is_re_compilable
    api.types.is_scalar