Skip to content

WIP: prototype for unit support #10349 #17153

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

Bernhard10
Copy link
Contributor

@Bernhard10 Bernhard10 commented Aug 2, 2017

Trying to follow @jreback idea of adding a dtype for holding the unit.

Related to issue #10349

  • closes #xxxx
  • tests added / passed (incomplete tests)
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff (except for one warning)
  • whatsnew entry

@pep8speaks
Copy link

pep8speaks commented Aug 2, 2017

Hello @Bernhard10! Thanks for updating the PR.

Line 10:1: E302 expected 2 blank lines, found 1
Line 10:40: E261 at least two spaces before inline comment
Line 16:1: E302 expected 2 blank lines, found 1
Line 42:5: E303 too many blank lines (2)
Line 46:80: E501 line too long (93 > 79 characters)
Line 48:80: E501 line too long (82 > 79 characters)
Line 56:80: E501 line too long (80 > 79 characters)
Line 67:80: E501 line too long (83 > 79 characters)
Line 90:38: E251 unexpected spaces around keyword / parameter equals
Line 90:40: E251 unexpected spaces around keyword / parameter equals
Line 92:1: E302 expected 2 blank lines, found 1
Line 98:18: E201 whitespace after '['
Line 99:5: E301 expected 1 blank line, found 0
Line 101:13: E265 block comment should start with '# '
Line 108:80: E501 line too long (81 > 79 characters)
Line 109:50: E261 at least two spaces before inline comment
Line 109:51: E262 inline comment should start with '# '
Line 109:80: E501 line too long (97 > 79 characters)
Line 139:20: E226 missing whitespace around arithmetic operator
Line 145:80: E501 line too long (83 > 79 characters)
Line 155:13: E306 expected 1 blank line before a nested definition, found 0
Line 162:80: E501 line too long (95 > 79 characters)
Line 167:13: E306 expected 1 blank line before a nested definition, found 0
Line 170:80: E501 line too long (93 > 79 characters)
Line 185:42: E261 at least two spaces before inline comment
Line 185:43: E262 inline comment should start with '# '

Line 335:80: E501 line too long (98 > 79 characters)
Line 345:80: E501 line too long (83 > 79 characters)
Line 347:80: E501 line too long (89 > 79 characters)
Line 349:80: E501 line too long (85 > 79 characters)

Line 154:80: E501 line too long (80 > 79 characters)

Line 14:1: E302 expected 2 blank lines, found 1
Line 15:21: E231 missing whitespace after ','
Line 15:23: E231 missing whitespace after ','
Line 15:25: E231 missing whitespace after ','
Line 15:27: E231 missing whitespace after ','
Line 15:29: E231 missing whitespace after ','
Line 15:39: E251 unexpected spaces around keyword / parameter equals
Line 15:41: E251 unexpected spaces around keyword / parameter equals
Line 17:21: E231 missing whitespace after ','
Line 18:20: E225 missing whitespace around operator
Line 20:1: E302 expected 2 blank lines, found 1
Line 21:21: E231 missing whitespace after ','
Line 21:23: E231 missing whitespace after ','
Line 21:25: E231 missing whitespace after ','
Line 21:27: E231 missing whitespace after ','
Line 21:29: E231 missing whitespace after ','
Line 21:39: E251 unexpected spaces around keyword / parameter equals
Line 21:41: E251 unexpected spaces around keyword / parameter equals
Line 23:37: E231 missing whitespace after ','
Line 23:40: E231 missing whitespace after ','
Line 23:43: E231 missing whitespace after ','
Line 23:46: E231 missing whitespace after ','
Line 23:49: E231 missing whitespace after ','
Line 23:60: E251 unexpected spaces around keyword / parameter equals
Line 23:62: E251 unexpected spaces around keyword / parameter equals
Line 24:36: E231 missing whitespace after ','
Line 24:38: E231 missing whitespace after ','
Line 24:40: E231 missing whitespace after ','
Line 24:42: E231 missing whitespace after ','
Line 24:44: E231 missing whitespace after ','
Line 24:54: E251 unexpected spaces around keyword / parameter equals
Line 24:56: E251 unexpected spaces around keyword / parameter equals
Line 25:36: E231 missing whitespace after ','
Line 25:38: E231 missing whitespace after ','
Line 25:40: E231 missing whitespace after ','
Line 25:42: E231 missing whitespace after ','
Line 25:44: E231 missing whitespace after ','
Line 25:54: E251 unexpected spaces around keyword / parameter equals
Line 25:56: E251 unexpected spaces around keyword / parameter equals
Line 29:21: E231 missing whitespace after ','
Line 29:23: E231 missing whitespace after ','
Line 29:25: E231 missing whitespace after ','
Line 29:27: E231 missing whitespace after ','
Line 29:29: E231 missing whitespace after ','
Line 29:39: E251 unexpected spaces around keyword / parameter equals
Line 29:41: E251 unexpected spaces around keyword / parameter equals
Line 34:21: E231 missing whitespace after ','
Line 34:23: E231 missing whitespace after ','
Line 34:25: E231 missing whitespace after ','
Line 34:27: E231 missing whitespace after ','
Line 34:29: E231 missing whitespace after ','
Line 34:39: E251 unexpected spaces around keyword / parameter equals
Line 34:41: E251 unexpected spaces around keyword / parameter equals
Line 35:21: E231 missing whitespace after ','
Line 35:23: E231 missing whitespace after ','
Line 35:25: E231 missing whitespace after ','
Line 35:27: E231 missing whitespace after ','
Line 35:29: E231 missing whitespace after ','
Line 35:39: E251 unexpected spaces around keyword / parameter equals
Line 35:41: E251 unexpected spaces around keyword / parameter equals
Line 36:21: E231 missing whitespace after ','
Line 36:23: E231 missing whitespace after ','
Line 36:25: E231 missing whitespace after ','
Line 36:27: E231 missing whitespace after ','
Line 36:29: E231 missing whitespace after ','
Line 36:39: E251 unexpected spaces around keyword / parameter equals
Line 36:41: E251 unexpected spaces around keyword / parameter equals

Comment last updated on August 03, 2017 at 10:49 Hours UTC



class Dimensional(PandasObject):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is is meant to be a dimensioned scalar?

Copy link
Contributor Author

@Bernhard10 Bernhard10 Aug 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is ment to be an array with units. (I did not find a better word for "quantity with unit" than dimensional - see https://english.stackexchange.com/a/48069)
It stores the data as float array, but has its own dtype. (Modelled after the Categorical class)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

units is a property of the dtype, NOT the array itself. You don't need another array-like class. Just a proper dtype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that. But I didn't get it to work properly. I guess I don't know enough about the internals of pandas to get it to work.

Copy link
Contributor Author

@Bernhard10 Bernhard10 Aug 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numpy did not allow me to create an array with my costum dtype. Maybe I missed something in the dtype class construction.
Numpy gives me TypeError: data type not understood, which is desired according to test_dtypes.Base.test_numpy_informed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we barely use numpy except to hold plain vanilla things. The way to handle this is to simply make a regular float Series with a parameterized dtype of the unit. These would be constructed similar to DatetimeTZDtype, maybe float64['ergs'] or whatever The machinery is already in place to do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. What you said is exactly what I planned to do, but I somehow don't get it how to do it.

The problem is that Series does not seem to habe a dtype other than the numpy array's dtype:
If I understood the code correctly, then the following happens:

If I do Series([1,2,3], dtype=DimensionedFloatDtype("meter")), then the code
hits data = _sanitize_array(data, index, dtype, copy, raise_cast_failure=True) which returns an np.array.
This array gets passed into the SingleBlockManager which is passed to NDFrame.__init__, where it is stored as _data.

Now Series.dtype is a property that just retrieves the dtype from the SingleBlockManager (which was constructed with a np.array, and according to _interleaved_dtype(blocks) disallows ExtensionDtypes).

The way DatetimeTZDtype does it is by creating a pandas.core.indexes.DatetimeIndex, categorical uses Categorical to mimic an array.

I'm sorry if I missed something obvious, but I cannot see how to do this without the need for a wrapper class that holds my float data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only other option I see would be to explicitly store the dtype in the Series object, but that would be a change that potentially affects things unrelated to numbers with units.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to step thru code where we use an existing dtype and see how its done. its not easy but all is there. start with simple tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't get it.
I looked at DatetimeTZDtype, but it has its own block type DatetimeTZBlock.
I looked at categorical data, it has its own class Categorical(PandasObject).
I looked at the PeriodDtype, but it is only used for indexes and I cannot instantiate a Series like pd.Series([1,2,3], dtype="period[S]").

I'm sorry if I overlooked it, but if it is all there, I just don't see where it is.
I can create the Dtype easily enough, I just can't get it to integrate well with pd.Series.

@jorisvandenbossche jorisvandenbossche changed the title An (incomplete) prototype for ENH issue #10349 WIP: prototype for unit support #10349 Aug 2, 2017
@Bernhard10
Copy link
Contributor Author

If I am allowed to store the dtype in the Series class (for some reason I assumed I was not allowed to do this), then everything is indeed there already in pandas.

The latest commit stored the dtype directly in Series and works straight forward now.
Obviously more tests for edge-cases have to be added.

@Bernhard10
Copy link
Contributor Author

The next step would be to remove the direct dependency on pint and make it generic.

@gfyoung gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Enhancement labels Aug 2, 2017
@Bernhard10
Copy link
Contributor Author

With Series storing the dtype, everything else seems to perfectly fall in place.

The current state of the code has (currently) nothing to do with units, but provides a framework how
unit libraries could subclass pandas ExtensionDtype. Since ExtentionDtype subclasses like DatetimeTZDtype have a lot of special handling all over pandas, I had to subclass ExtensionType creating the "abstract" dtype NumpyDtypeWithMetadata.

The idea of NumpyDtypeWithMetadata is to define an interface for dtypes that store the data in a numpy-compatible dtype but where the dtype holds additional information and defines its own operations.

There is still some work to do:

*) this works only with Series, not with SparseSeries etc.
*) Thus, not all tests pass
*) getting the dtype from a string does not work. It probably will never work, because the actual subclasses of NumpyDtypeWithMetadata would be provided by libraries like pint.
*) More thorough testing for edge cases is needed
*) How about indices.

Unfortunately I currently have no time to complete this pull-request, but I will continue to work on it in September, if the basic direction of where this is going is deemed fitting for pandas. Then I could also provide a sample implementation for the use with Pint.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

@Bernhard10 welcome to have this, though this would require extensive effort / testing, e.g. see for example #16015

@jreback
Copy link
Contributor

jreback commented Oct 29, 2017

@Bernhard10 good effort here. going to close. but if you would like to continue working you certainly can. This would need extensive testing in order to be accepted I think.

@jreback jreback closed this Oct 29, 2017
@Bernhard10
Copy link
Contributor Author

Thanks for the feedback and sorry for the delayed response.

I hope I will find time to continue working on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants