BRANCH: libpandas native core experiments #11960

Closed

wesm
Member

@wesm wesm commented Jan 5, 2016

Summary of implementation approach

  • Base pandas::Array class
  • Implement dtype-specific Array subclasses and use dynamic dispatch to invoke generic array methods
  • Hold NumPy arrays internally
  • For data types that don't have a well-defined NA value, use a bitmask to store null/not-null
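
The four points above can be sketched in a few lines of pure Python. This is only an illustration of the idea, not the C++ code in this branch: the stdlib `array` module stands in for the wrapped NumPy array, and the names `NAType`, `Int64Array`, `_locate`, and `null_count` are hypothetical.

```python
from array import array


class NAType:
    """Marker for a missing value (hypothetical stand-in for pandas.native.NA)."""

    def __repr__(self):
        return "NA"


NA = NAType()


class Array:
    """Base class: generic methods are written once against this interface."""

    def __len__(self):
        raise NotImplementedError

    def __getitem__(self, i):
        raise NotImplementedError

    def __setitem__(self, i, value):
        raise NotImplementedError

    def null_count(self):
        # Generic method; dynamic dispatch picks the subclass __getitem__.
        return sum(1 for i in range(len(self)) if self[i] is NA)


class Int64Array(Array):
    """int64 has no natural NA value, so validity is kept in a bitmask."""

    def __init__(self, values):
        self._data = array("q", values)  # assumption: stands in for a NumPy array
        nbytes = (len(self._data) + 7) // 8
        self._valid = bytearray(b"\xff" * nbytes)  # bit set means "not null"

    def __len__(self):
        return len(self._data)

    def _locate(self, i):
        # Normalize negative indices, then return (index, byte offset, bit mask).
        n = len(self._data)
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError("index out of range")
        return i, i // 8, 1 << (i % 8)

    def __getitem__(self, i):
        i, byte, bit = self._locate(i)
        return self._data[i] if self._valid[byte] & bit else NA

    def __setitem__(self, i, value):
        i, byte, bit = self._locate(i)
        if value is NA:
            self._valid[byte] &= ~bit & 0xFF  # clear the validity bit
        else:
            self._data[i] = value
            self._valid[byte] |= bit  # mark the slot valid again
```

Setting a slot to `NA` only clears its validity bit; the stored integer is left in place and is overwritten the next time a real value is assigned.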

I adapted several bits of code from github.com/libdynd/dynd-python to save me time sorting out issues with NumPy's PyArray_API global variable and other fun stuff. We'll need to add a copyright notice to the LICENSES directory.

Demo:

In [1]: import numpy as np; import pandas.native as nt

In [2]: arr = np.arange(10)

In [3]: parr = nt.to_array(arr)

In [4]: parr
Out[4]: <pandas.native.Array at 0x7f51dfce3e40>

In [5]: parr.dtype
Out[5]: PandasType(int64)

In [6]: parr[3]
Out[6]: 3

In [7]: from pandas.native import NA

In [8]: parr[3] = NA

In [10]: parr[3]
Out[10]: NA

In [11]: parr[3] = 12

In [13]: parr[3]
Out[13]: 12

@jreback
Contributor

jreback commented Jan 5, 2016

A couple of thoughts, mainly about dtypes.

  • the extension dtypes in main types are parameterized, which is quite a good thing. This would require some factory functions to generate these at run-time.
    • Categorical really should be parameterized with the actual categories, then the impl of a Categorical is quite easy. (and we don't have to do the is_dtype_equal thingy all over the place)
    • datetime w/tz & timedelta similarly (if we want to support other units)
    • strings can also use this for having an encoding
    • ideally these will be subclasses of the base dtype
  • how do these dtypes interact with Python/numpy dtypes? e.g. comparisons / convertibility
  • should these be actual types (e.g. inherit from PyTypeObject)?
  • how 'separable' is this dtype meant to be? e.g. in theory this should be a separate library that, say, numpy/python could actually build upon (or maybe it is just a working impl that someone else can build on).
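
To make the parameterized-dtype idea concrete, here is a minimal Python sketch, assuming frozen dataclasses as the dtype representation. All names (`DataType`, `CategoricalType`, `TimestampType`, and the factories `categorical` and `timestamp`) are hypothetical and not part of this branch. The point is that once the parameters live on the dtype, equality falls out of a plain `==`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataType:
    """Base dtype; subclass parameters take part in equality and hashing."""

    name: str


@dataclass(frozen=True)
class CategoricalType(DataType):
    categories: Tuple[str, ...] = ()


@dataclass(frozen=True)
class TimestampType(DataType):
    unit: str = "ns"
    tz: str = ""  # assumption: empty string means timezone-naive


def categorical(categories):
    """Factory: build a categorical dtype for these categories at run-time."""
    return CategoricalType("category", tuple(categories))


def timestamp(unit="ns", tz=""):
    """Factory: build a timestamp dtype for a given unit and timezone."""
    return TimestampType("timestamp", unit, tz)
```

With this shape, `categorical(["low", "high"]) == categorical(["low", "high"])` is true and two dtypes with different categories compare unequal, so no separate is_dtype_equal helper is needed.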

@jreback
Contributor

jreback commented Jan 5, 2016

OT: what is your editor / debug setup for c++?

@jreback
Contributor

jreback commented Jan 5, 2016

Darwin 069-jreback.local 14.5.0 Darwin Kernel Version 14.5.0: Wed Jul 29 02:26:53 PDT 2015; root:xnu-2782.40.9~1/RELEASE_X86_64 x86_64

[jreback-~/pandas] python setup.py build_ext --inplace
running build_ext
cmake  -DPYTHON_EXECUTABLE=/Users/jreback/miniconda/bin/python   /Users/jreback/pandas
-- Toolchain build.
-- Downloading and extracting dependencies.
CMake Error at cmake_modules/toolchain.cmake:42 (message):
  Toolchain bootstrap failed.
Call Stack (most recent call first):
  CMakeLists.txt:24 (include)


-- Configuring incomplete, errors occurred!
error: command 'cmake' failed with exit status 1

[jreback-~/pandas] cmake --version
cmake version 3.3.1

CMake suite maintained and supported by Kitware (kitware.com/cmake).

what am I missing?

@wesm
Member Author

wesm commented Jan 5, 2016

Sorry, I haven't begun to address the build toolchain problem, so for the moment I've been developing with the portable C++ toolchain that several projects at Cloudera use: https://github.com/cloudera/native-toolchain

I'm not sure what to do here. We'll want to make it easy to add new C++ libraries that dynamically link to libpandas, but they'll need to get installed on the user's system somehow. conda? homebrew? What about Windows?

Let me come up with some instructions that will work for you without necessarily requiring you to install the native-toolchain I linked above (though that may be the easiest thing).

@wesm
Member Author

wesm commented Jan 5, 2016

I need to get Travis CI passing anyway

@wesm
Member Author

wesm commented Jan 5, 2016

to your other comments

> how do these dtypes interact with Python/numpy dtypes? e.g. comparisons / convertibility
> should these be actual types (e.g. inherit from PyTypeObject)?

From pandas's point of view, there will only be pandas::DataType instances. If you want to interact with some other data, it will need to be properly marshalled / wrapped to a pandas data type and object in libpandas. So comparison will happen after mapping NumPy dtypes to pandas types (which is already implicitly happening, but because of the fragmentation between NumPy and non-NumPy types it's a bit messy).
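
A rough Python sketch of that "map first, compare afterwards" flow follows. It is an illustration under stated assumptions, not the libpandas API: NumPy dtypes are represented only by their name strings (as returned by `np.dtype(...).name`), and `PandasType`, `from_numpy_name`, and the set of supported names are hypothetical.

```python
class PandasType:
    """Minimal stand-in for a pandas::DataType instance."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, PandasType) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

    def __repr__(self):
        return f"PandasType({self.name})"


# Assumption: an illustrative subset of primitive NumPy dtype names.
_NUMPY_NAME_TO_PANDAS = {
    name: PandasType(name)
    for name in ("int8", "int16", "int32", "int64", "float32", "float64", "bool")
}


def from_numpy_name(name):
    """Marshal a NumPy dtype name to a pandas type, or fail loudly."""
    try:
        return _NUMPY_NAME_TO_PANDAS[name]
    except KeyError:
        raise TypeError(f"no pandas type for NumPy dtype {name!r}") from None
```

Comparisons then only ever see `PandasType` objects, for example `from_numpy_name("int64") == PandasType("int64")`; anything that cannot be mapped raises a `TypeError` at the boundary instead of leaking into the comparison logic.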

> how 'separable' is this dtype meant to be? e.g. in theory this should be a separate library that, say, numpy/python could actually build upon (or maybe it is just a working impl that someone else can build on).

At least at the outset not very separable (making it so would be a lot of work with little benefit), but with some engineering effort the code could be made reusable.

@wesm
Member Author

wesm commented Jan 5, 2016

> should these be actual types (e.g. inherit from PyTypeObject)?

No — I think it's better to have a clean and flexible C++ API without the burdens of Python's C API object construction protocol and only wrap the objects in Python types when the user is interacting with them.

@wesm
Member Author

wesm commented Jan 5, 2016

@jreback I removed some of the currently-unneeded Google third-party libraries. I'll add a thirdparty directory with googletest and get that building so Travis CI will work and we don't need to rely on an external toolchain for the moment. I'd like to add libraries like RE2, snappy, and lz4 to libpandas, so we will need to figure out a dependency-hell avoidance strategy.

@wesm
Member Author

wesm commented Jan 6, 2016

Phew OK I got the build working on OS X. Please let me know if

python setup.py build_ext --inplace

does not work for you so I can investigate further. I'll fix up the Travis CI build when I can and set it to only test the native core stuff for now

@jreback
Contributor

jreback commented Jan 6, 2016

ok, build procedure works for me (on osx)! (and your example above).

[jreback-~/pandas] nosetests pandas/internals/tests/ -v
test_convert_numpy_int_dtypes (pandas.internals.tests.test_proto.TestLibPandas) ... ok
test_convert_primitive_numpy_arrays_basics (pandas.internals.tests.test_proto.TestLibPandas) ... ok
test_integer_get_set_unset_na (pandas.internals.tests.test_proto.TestLibPandas) ... ok
test_isnull_natype_interop (pandas.internals.tests.test_proto.TestLibPandas) ... ok
test_libpandas_decrefs (pandas.internals.tests.test_proto.TestLibPandas) ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.018s

OK

@jreback
Contributor

jreback commented Jan 6, 2016

where do you want to put a list of API wishlist things for these structures?

e.g.

isnull on the Array, and a cached-property on ImmutableArray
is_monotonic_increasing/is_monotonic_decreasing as well
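
As a rough illustration of these wishlist items, here is a pure-Python sketch that uses `functools.cached_property` (Python 3.8+). It assumes that `None` marks a null, that "monotonic" is non-strict, and that any null makes both monotonic checks false; the class body is hypothetical and says nothing about how libpandas would implement it.

```python
from functools import cached_property


class ImmutableArray:
    """Values never change, so derived properties can be computed once."""

    def __init__(self, values):
        self._values = tuple(values)  # assumption: None marks a null

    @cached_property
    def isnull(self):
        # Computed on first access, then stored on the instance.
        return tuple(v is None for v in self._values)

    @cached_property
    def is_monotonic_increasing(self):
        if any(self.isnull):
            return False
        vals = self._values
        return all(a <= b for a, b in zip(vals, vals[1:]))

    @cached_property
    def is_monotonic_decreasing(self):
        if any(self.isnull):
            return False
        vals = self._values
        return all(a >= b for a, b in zip(vals, vals[1:]))
```

Caching is only safe because the array is immutable; a mutable Array would have to recompute `isnull` on each call or invalidate the cache on every write.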

@jreback
Contributor

jreback commented Jan 6, 2016

I just started things on #11970

@wesm
Member Author

wesm commented Jan 6, 2016

I'll start making the (long) list in a Google document that I'll link on that issue. I'll see about getting Travis passing on this branch and merge this into the integration branch.

If you start hacking on this with me, this may be useful (also for my own reference since I make plenty of mistakes): https://google.github.io/styleguide/cppguide.html

@jreback
Contributor

jreback commented Jan 6, 2016

@wesm gr8. mainly acting as a 'user'/feature requestor for the moment

@wesm
Member Author

wesm commented Jan 6, 2016

OMG the build passed. @jreback have any idea why I had to twiddle $APT_ARGS to get Travis to work?

I'm going to squash and merge this into the integration branch and move onto the next experiments later this week

@jreback
Contributor

jreback commented Jan 6, 2016

I guess that makes the confirmation go away on the apt-get installs

we don't really use the included stuff too much except for very basics and use default compiler

@wesm
Member Author

wesm commented Jan 6, 2016

Gotcha. I just squashed and merged this PR into libpandas-native-core branch

@wesm wesm closed this Jan 6, 2016
@wesm wesm deleted the libpandas-prototyping branch January 6, 2016 22:08
@wesm
Member Author

wesm commented Jan 6, 2016

See afc53f0

@jreback jreback added this to the 2.0 milestone Sep 15, 2016