DOC/WIP: doc page for the layout of the internals of pandas #4082

cpcloud · 2013-06-29T15:05:00Z

this would be useful for reference purposes and also so that @jreback doesn't have to fix almost every non-trivial bug that pops up :). i would be happy to start writing this (with the help of others who know these things better than i do), i think it would be an excellent way to gain a deeper understanding of the internals.

jreback · 2013-06-29T15:07:57Z

sure...all for that!

jtratner · 2013-06-29T15:15:00Z

👍 though maybe it would be better to add the documentation into the code (e.g., module level docstrings or comments at the top of modules) as opposed to putting it into documentation elsewhere -- might make it easier to keep them updated as changes occur.

cpcloud · 2013-06-29T15:15:38Z

yeah i think a doc page might be better than the wiki for this

jreback · 2013-06-29T15:19:29Z

maybe in this case a description at the top of core/internals would be useful......

clham · 2014-06-26T23:34:26Z

Is this still alive and kicking? Perhaps add a subhead to "contributing to pandas" titled "code layout"? I'm envisioning a paragraph/bulleted style doc with what calls what when you (for example) make a DataFrame, and how the major parts and pieces interact.

jreback · 2014-06-26T23:38:01Z

I think some progress has been made in groupby, internals, index to document more with some top level comments

not sure this is really for public consumption and better documented in the modules themselves

jreback · 2014-06-26T23:54:37Z

on second thought

this might be nice if it's included in the docs
so can be updated when the code is updated (and as an rst might be easier)

maybe you want to take a stab at some things that might be useful in this page? (and I can fill them in a bit)

jreback · 2014-06-26T23:58:46Z

and there is a section on index internals at the end of indexing.rst which should be moved to internals as well

clham · 2014-06-27T00:01:34Z

Sure! I'll put together a PR with a TOC and some headings, then start muddling through the code.

jreback · 2014-07-17T15:11:03Z

document internal attributes of DataFrameGroupby and friends: http://stackoverflow.com/questions/24806601/convert-groupby-to-dataframe-join-the-groups-again/24807309#24807309

immerrr · 2014-07-17T15:35:41Z

After reinventing several cythonized routines and hitting my head against the wall of pytables io code I was thinking along the lines of actually generating a separate developer doc (with its own conf.py): separation would help keep the scope and build time of the public doc down, and one could use cross-references where necessary.

clham · 2014-07-17T22:24:47Z

That is a much cleaner solution than the disaster I've been trying to cook up.

jreback · 2014-07-18T17:37:48Z

little tidbits that need docs (see end of this): #7790
e.g. how to compare tz with 'UTC'

sinhrks · 2015-06-29T13:10:04Z

I think the guide is really useful for contributors (including me). I prepared a rough summary for internal docs for discussion.

Data Layers

Explanation of the internal data layers. They consist of the following five levels.

Series, DataFrame and Panel: Contain their internal data in a BlockManager.
BlockManager: Handles multiple Blocks.
Block: Represents the data for one internal data type.
pandas raw data: Represents internal data types which don't exist in numpy, currently Categorical and Sparse. Dtypes that already exist in numpy don't have this layer.
numpy.array: All internal data is ultimately mapped to numpy.array.

ToDo: Explain what ops are (basically) defined in what layers, such as slicing and numeric ops.
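
As a rough illustration of the layers listed above, the sketch below uses only public accessors (`Series.array` and `Series.to_numpy()`, which exist in more recent pandas releases than the ones discussed in this thread) instead of the private BlockManager attributes. The variable names are made up for the example, and treating these accessors as stand-ins for the "pandas raw data" and "numpy.array" layers is an assumption, not something this issue defines.

```python
import numpy as np
import pandas as pd

# One numpy-native dtype and one pandas-only dtype (names are illustrative).
int_col = pd.Series([1, 2])
cat_col = pd.Series(pd.Categorical(["A", "B"]))

# "pandas raw data" layer: for categorical data this is a Categorical,
# a container type that numpy itself does not provide.
raw_cat = cat_col.array
print(type(raw_cat).__name__)

# "numpy.array" layer: every column can be materialized as an ndarray.
np_int = int_col.to_numpy()
np_cat = cat_col.to_numpy()
print(type(np_int).__name__, np_int.dtype.kind)
print(type(np_cat).__name__, list(np_cat))
```

The integer column has no separate pandas-only container worth inspecting here, which matches the note above that numpy-native dtypes skip the raw-data layer.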

Internal Data Access

Assume the following DataFrame.

import pandas as pd
df = pd.DataFrame({'int': [1, 2],
                   'float': [1.1, 2.1],
                   'complex': [1+1j, 1+2j],
                   'bool': [True, False],
                   'object': ['A', 'B'],
                   'category (object)': pd.Categorical(['A', 'B']),
                   'datetime': [pd.Timestamp('2015-01-01'), pd.Timestamp('2015-02-01')],
                   'timedelta': [pd.Timedelta('1 day'), pd.Timedelta('2 day')],
                   'sparse': pd.SparseSeries([1, 0], fill_value=0),
                  }, columns=['int', 'float', 'complex', 'bool', 'object',
                              'category (object)', 'datetime', 'timedelta', 'sparse'])
df
#    int  float  complex   bool object category (object)   datetime  timedelta  \
# 0    1    1.1   (1+1j)   True      A                 A 2015-01-01     1 days   
# 1    2    2.1   (1+2j)  False      B                 B 2015-02-01     2 days   
# 
#    sparse  
# 0       1  
# 1       0

Access to `BlockManager` and `Block`

DataFrame._data contains its internal BlockManager. The BlockManager has a blocks attribute which stores its internal Blocks. Blocks are separated by type.

for c, s in df.iteritems():
    for block in s._data.blocks:
        print(c, type(block), block.dtype, block.dtype.type)
# ('int', <class 'pandas.core.internals.IntBlock'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <class 'pandas.core.internals.FloatBlock'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <class 'pandas.core.internals.ComplexBlock'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <class 'pandas.core.internals.BoolBlock'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <class 'pandas.core.internals.ObjectBlock'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <class 'pandas.core.internals.CategoricalBlock'>, category, <class 'pandas.core.common.CategoricalDtypeType'>)
# ('datetime', <class 'pandas.core.internals.DatetimeBlock'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <class 'pandas.core.internals.TimeDeltaBlock'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <class 'pandas.core.internals.SparseBlock'>, dtype('float64'), <type 'numpy.float64'>)

DataFrame.values or Block.values returns pandas raw data.

# values
for c, s in df.iteritems():
    v = s.values
    print(c, type(v), v.dtype, v.dtype.type)
# ('int', <type 'numpy.ndarray'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <type 'numpy.ndarray'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <type 'numpy.ndarray'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <class 'pandas.core.categorical.Categorical'>, category, <class 'pandas.core.common.CategoricalDtypeType'>)
# ('datetime', <type 'numpy.ndarray'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <type 'numpy.ndarray'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <class 'pandas.sparse.array.SparseArray'>, dtype('float64'), <type 'numpy.float64'>)

DataFrame.get_values() or Block.get_values() returns a numpy.array. All data, including Categorical and Sparse, is mapped to numpy.array based on its internal data type.

# get_values
for c, s in df.iteritems():
    v = s.get_values()
    print(c, type(v), v.dtype, v.dtype.type)
# ('int', <type 'numpy.ndarray'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <type 'numpy.ndarray'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <type 'numpy.ndarray'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('datetime', <type 'numpy.ndarray'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <type 'numpy.ndarray'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)

ToDo: It may be useful to draw conversion maps between the layers.
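
One possible starting point for such a conversion map is a small table built from public accessors. This is only a sketch: `layer_map` and `frame` are hypothetical names invented here, the example frame is a reduced version of the one above, and the code targets recent pandas releases (where `Series.array` exists) rather than the version used for the outputs in this comment.

```python
import numpy as np
import pandas as pd


def layer_map(frame):
    """Map each column name to (dtype, raw container type, ndarray dtype kind)."""
    table = {}
    for name in frame.columns:
        column = frame[name]
        raw = column.array             # pandas-level container for the column
        as_numpy = np.asarray(column)  # what the column becomes as an ndarray
        table[name] = (str(column.dtype), type(raw).__name__, as_numpy.dtype.kind)
    return table


# Hypothetical example frame covering a few of the dtypes discussed above.
frame = pd.DataFrame({
    "int": [1, 2],
    "float": [1.1, 2.1],
    "category": pd.Categorical(["A", "B"]),
    "datetime": pd.to_datetime(["2015-01-01", "2015-02-01"]),
})

for name, row in layer_map(frame).items():
    print(name, row)
```

Each row pairs a column's pandas dtype with the container that holds it and the kind of numpy dtype it converts to, which is roughly the information a drawn conversion map would need.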

mroeschke · 2023-03-31T18:36:55Z

Looks like we have https://pandas.pydata.org/docs/development/internals.html as a start so I think we can close in favor of issues noting what aspects we are missing

sinhrks mentioned this issue Jul 15, 2015

Expose the blocks API and disable automatic consolidation #10556

Closed

mroeschke added Internals Related to non-user accessible pandas implementation and removed Ideas Long-Term Enhancement Discussions labels Apr 4, 2020

mroeschke removed this from the Someday milestone Oct 13, 2022

mroeschke closed this as completed Mar 31, 2023
