DOC/WIP: doc page for the layout of the internals of pandas #4082

Closed
cpcloud opened this issue Jun 29, 2013 · 15 comments
Labels
Docs Internals Related to non-user accessible pandas implementation

Comments

@cpcloud
Member

cpcloud commented Jun 29, 2013

this would be useful for reference purposes and also so that @jreback doesn't have to fix almost every non-trivial bug that pops up :). i would be happy to start writing this (with the help of others who know these things better than i do), i think it would be an excellent way to gain a deeper understanding of the internals.

@jreback
Contributor

jreback commented Jun 29, 2013

sure...all for that!

@jtratner
Contributor

👍 though maybe it would be better to add the documentation into the code (e.g., module level docstrings or comments at the top of modules) as opposed to putting it into documentation elsewhere -- might make it easier to keep them updated as changes occur.

@cpcloud
Member Author

cpcloud commented Jun 29, 2013

yeah i think a doc page might be better than the wiki for this

@jreback
Contributor

jreback commented Jun 29, 2013

maybe in this case a description at the top of core/internals would be useful......

@clham
Contributor

clham commented Jun 26, 2014

Is this still alive and kicking? Perhaps add a subhead to "Contributing to pandas" titled "Code layout"? I'm envisioning a paragraph/bulleted style doc covering what calls what when you (for example) make a DataFrame, and how the major parts and pieces interact.

@jreback
Contributor

jreback commented Jun 26, 2014

I think some progress has been made in groupby, internals, and index to document more with some top-level comments

not sure this is really for public consumption and better documented in the modules themselves

@jreback
Contributor

jreback commented Jun 26, 2014

on second thought

this might be nice if it's included in the docs
so can be updated when the code is updated (and as an rst might be easier)

Maybe you want to take a stab at some things that might be useful in this page? (and I can fill them in a bit)

@jreback
Contributor

jreback commented Jun 26, 2014

and there is a section on index internals at the end of indexing.rst which should be moved to internals as well

@clham
Contributor

clham commented Jun 27, 2014

Sure! I'll put together a PR with a TOC and some headings, then start muddling through the code.

@jreback
Contributor

jreback commented Jul 17, 2014

document internal attributes of DataFrameGroupby and friends: http://stackoverflow.com/questions/24806601/convert-groupby-to-dataframe-join-the-groups-again/24807309#24807309

@immerrr
Contributor

immerrr commented Jul 17, 2014

After reinventing several cythonized routines and hitting my head against the wall of pytables io code, I was thinking along the lines of actually generating a separate developer doc (with its own conf.py): separation would help keep the scope and build time of the public doc down, and one could use cross-references where necessary.

@clham
Contributor

clham commented Jul 17, 2014

That is a much cleaner solution than the disaster I've been trying to cook up.

@jreback
Contributor

jreback commented Jul 18, 2014

little tidbits that need docs (see end of this): #7790
e.g. how to compare tz with 'UTC'

@sinhrks
Member

sinhrks commented Jun 29, 2015

I think the guide is really useful for contributors (including me). I prepared a rough summary for internal docs for discussion.

Data Layers

Explanation of the internal data layers, listed from the outermost to the innermost:

  • Series, DataFrame and Panel: Contain their internal data in a BlockManager.
  • BlockManager: Handles multiple Blocks.
  • Block: Represents data of a single internal data type.
  • pandas raw data: Represents internal data types that don't exist in numpy, currently Categorical and Sparse. Dtypes that exist in numpy don't have this layer.
  • numpy.array: All internal data is finally mapped to numpy.array.

ToDo: Explain what ops are (basically) defined in what layers, such as slicing and numeric ops.
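
The layering above can be pictured with a toy model in plain Python. This is only a sketch to show how the layers nest: `ToyFrame`, `ToyBlockManager` and `ToyBlock` are hypothetical names invented for illustration, they are not pandas classes, and plain lists stand in for the numpy arrays at the bottom layer.

```python
from collections import defaultdict


class ToyBlock:
    """Holds the values of all columns that share one element type."""

    def __init__(self, dtype, columns, values):
        self.dtype = dtype        # shared element type, e.g. int
        self.columns = columns    # names of the columns stored in this block
        self.values = values      # one list of values per column


class ToyBlockManager:
    """Groups columns into one ToyBlock per element type."""

    def __init__(self, data):
        grouped = defaultdict(list)
        for name, column in data.items():
            # Assumption: every value in a column has the same Python type.
            grouped[type(column[0])].append(name)
        self.blocks = [
            ToyBlock(dtype, names, [data[n] for n in names])
            for dtype, names in grouped.items()
        ]


class ToyFrame:
    """Top layer: owns a manager and delegates storage to it."""

    def __init__(self, data):
        self._data = ToyBlockManager(data)


frame = ToyFrame({"a": [1, 2], "b": [1.5, 2.5], "c": [3, 4]})
for block in frame._data.blocks:
    print(block.dtype.__name__, block.columns)
```

Here the two integer columns end up in one block and the float column in another, which is the idea behind "one Block per internal data type". The real implementation differs in many ways.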

Internal Data Access

Assume the following DataFrame.

import pandas as pd
df = pd.DataFrame({'int': [1, 2],
                   'float': [1.1, 2.1],
                   'complex': [1+1j, 1+2j],
                   'bool': [True, False],
                   'object': ['A', 'B'],
                   'category (object)': pd.Categorical(['A', 'B']),
                   'datetime': [pd.Timestamp('2015-01-01'), pd.Timestamp('2015-02-01')],
                   'timedelta': [pd.Timedelta('1 day'), pd.Timedelta('2 day')],
                   'sparse': pd.SparseSeries([1, 0], fill_value=0),
                  }, columns=['int', 'float', 'complex', 'bool', 'object',
                              'category (object)', 'datetime', 'timedelta', 'sparse'])
df
#    int  float  complex   bool object category (object)   datetime  timedelta  \
# 0    1    1.1   (1+1j)   True      A                 A 2015-01-01     1 days   
# 1    2    2.1   (1+2j)  False      B                 B 2015-02-01     2 days   
# 
#    sparse  
# 0       1  
# 1       0  

Access to BlockManager and Block

DataFrame._data contains its internal BlockManager. BlockManager has a blocks attribute, which stores its internal Blocks. Blocks are separated by dtype.

for c, s in df.iteritems():
    for block in s._data.blocks:
        print(c, type(block), block.dtype, block.dtype.type)
# ('int', <class 'pandas.core.internals.IntBlock'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <class 'pandas.core.internals.FloatBlock'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <class 'pandas.core.internals.ComplexBlock'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <class 'pandas.core.internals.BoolBlock'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <class 'pandas.core.internals.ObjectBlock'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <class 'pandas.core.internals.CategoricalBlock'>, category, <class 'pandas.core.common.CategoricalDtypeType'>)
# ('datetime', <class 'pandas.core.internals.DatetimeBlock'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <class 'pandas.core.internals.TimeDeltaBlock'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <class 'pandas.core.internals.SparseBlock'>, dtype('float64'), <type 'numpy.float64'>)
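
The snippet above reflects pandas as of mid-2015; `iteritems()` and `pd.SparseSeries` have since been removed. A rough equivalent for newer releases might look like the sketch below. It assumes the private manager attribute is named `_mgr` (falling back to `_data` on older versions); both are internal and may change without notice, and the concrete Block class names printed will vary by version.

```python
import pandas as pd

df = pd.DataFrame({
    "int": [1, 2],
    "float": [1.1, 2.1],
    "category": pd.Categorical(["A", "B"]),
    "datetime": pd.to_datetime(["2015-01-01", "2015-02-01"]),
})

block_dtypes = {}
for name, column in df.items():
    # Assumption: `_mgr` on recent releases, `_data` on older ones.
    # Neither attribute is public API.
    manager = getattr(column, "_mgr", None)
    if manager is None:
        manager = column._data
    for block in manager.blocks:
        print(name, type(block).__name__, block.dtype)
    block_dtypes[name] = [str(block.dtype) for block in manager.blocks]
```

Each column Series is backed by a single block, so every entry of `block_dtypes` holds one dtype string.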

Series.values (for each column of the DataFrame) or Block.values returns the pandas raw data.

# values
for c, s in df.iteritems():
    v = s.values
    print(c, type(v), v.dtype, v.dtype.type)
# ('int', <type 'numpy.ndarray'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <type 'numpy.ndarray'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <type 'numpy.ndarray'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <class 'pandas.core.categorical.Categorical'>, category, <class 'pandas.core.common.CategoricalDtypeType'>)
# ('datetime', <type 'numpy.ndarray'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <type 'numpy.ndarray'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <class 'pandas.sparse.array.SparseArray'>, dtype('float64'), <type 'numpy.float64'>)
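
On newer pandas the same distinction can be seen through public attributes. The sketch below is illustrative only: it builds the columns as separate Series (an assumption made to keep it short) and uses `pd.arrays.SparseArray` in place of the removed `pd.SparseSeries`. `Series.values` still returns the pandas-specific container for categorical and sparse data, and `Series.array` is the public accessor for the backing array.

```python
import pandas as pd

columns = {
    "int": pd.Series([1, 2]),
    "category": pd.Series(pd.Categorical(["A", "B"])),
    "sparse": pd.Series(pd.arrays.SparseArray([1, 0], fill_value=0)),
}

for name, s in columns.items():
    # .values: ndarray for numpy dtypes, pandas container otherwise
    # .array:  always a pandas array object wrapping the stored data
    print(name, type(s.values).__name__, type(s.array).__name__)
```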

Series.get_values() (for each column of the DataFrame) or Block.get_values() returns a numpy.array. All data, including Categorical and Sparse, is mapped to numpy.array based on its internal data type.

# get_values
for c, s in df.iteritems():
    v = s.get_values()
    print(c, type(v), v.dtype, v.dtype.type)
# ('int', <type 'numpy.ndarray'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <type 'numpy.ndarray'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <type 'numpy.ndarray'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('datetime', <type 'numpy.ndarray'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <type 'numpy.ndarray'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
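
`get_values()` no longer exists in current pandas. A comparable way to reach the numpy layer today is `Series.to_numpy()`, sketched below under the assumption that a plain dense ndarray is what is wanted. The resulting dtype for the categorical column depends on the pandas version, so it is printed rather than stated.

```python
import numpy as np
import pandas as pd

cat = pd.Series(pd.Categorical(["A", "B"]))
sparse = pd.Series(pd.arrays.SparseArray([1, 0], fill_value=0))

for name, s in [("category", cat), ("sparse", sparse)]:
    dense = s.to_numpy()  # pandas-specific containers are converted to ndarray
    print(name, type(dense).__name__, dense.dtype)
```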

ToDo: It may be useful to draw conversion maps between the layers.

@mroeschke mroeschke added Internals Related to non-user accessible pandas implementation and removed Ideas Long-Term Enhancement Discussions labels Apr 4, 2020
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@mroeschke
Member

Looks like we have https://pandas.pydata.org/docs/development/internals.html as a start so I think we can close in favor of issues noting what aspects we are missing

7 participants