Add support for IP Address and MAC Address data #18767
Wow, detailed proposal! First question that comes to my mind: why does it need to be included in pandas (from a technical point of view)? Or to put it differently: what is currently in E.g. in geopandas the GeometryBlock can be stored in a Series as well; the main reason we have the subclasses GeoSeries and GeoDataFrame is to add a bunch of additional methods (which could also be solved with an accessor). |
For example, I see you list concat and indexing in the notebook as things that don't work. However, if you define the correct method on your block, concatting Series objects should work, and basic indexing should work as well. |
Unless I'm missing something, there isn't a good way to stuff an arbitrary "thing" into the regular pandas containers:
In [1]: import pandas as pd
In [2]: import pandas_ip as ip
In [3]: arr = ip.IPAddress.from_pyints([1, 2])
In [4]: arr
Out[4]: <IPAddress(['0.0.0.1', '0.0.0.2'])>
In [5]: pd.Series(arr)
Out[5]:
0 <IPAddress(['0.0.0.1', '0.0.0.2'])>
dtype: object
AFAICT, the only way to do this from outside pandas is to construct blocks directly and use fastpath:
In [8]: pd.Series(ip.IPBlock(arr, slice(0, 1)), pd.RangeIndex(2), fastpath=True)
Out[8]:
0 0.0.0.1
1 0.0.0.2
dtype: ip
So an alternative to my proposal would be to make something like (edited a bug in my example). |
I could imagine coming up with an interface where if an object passed to the interface satisfies it, we dispatch some of the
Then pandas can (maybe) figure out the right thing to do. To be clear, I'd be more than satisfied if we can make this solution work. |
I was actually thinking about this yesterday, but in the context of Obviously a bit of work would need to be done on IP Addresses and Additionally, the PostgreSQL docs might be useful as an additional reference/another perspective in general: |
Updated the original with some information on why doing this outside pandas is (currently) difficult, but I'd be happy to work on making that smoother. @jschendel, yes I was just reading through https://docs.python.org/3/howto/ipaddress.html#defining-networks on this. I'm not especially familiar with the network side of things, so I'm not sure what that would look like. And good call on using Postgres for design inspiration. |
I'm not opposed to having an IP type in pandas, but it does seem like it could be an interesting case to try to develop an "extension block API" around, i.e., you do something like subclass Block and ExtensionDtype and through metaclass registration or whatever, everything works! That said, I really don't know our own internal interfaces well enough to know if this is feasible without massive refactoring or even a good idea. |
FWIW, I plan to experiment with defining an interface through ABCs next
week. I'll update with how that turns out.
On Thu, Dec 14, 2017 at 3:27 PM, chris-b1 wrote:
I'm not opposed to having an IP type in pandas, but does seem like it
could be an interesting case to try develop an "extension block API"
around, i.e., you do something like subclass Block and ExtensionDtype and
through metaclass registration or whatever, everything works!
That said, I really don't know our own internal interfaces well enough to
know if this is feasible without massive refactoring or even a good idea.
|
Yes, that is correct. That is also something with which I have struggled in geopandas. For the short term, you could provide functional constructors like BTW, the fact that it doesn't see your ip array-like as an array-like and unwraps it in a series (so getting series of length 2) feels like a bug in pandas (in
An alternative interface could be pandas checking for a
Can you explain this a bit in more detail? |
I haven't (yet) implemented the methods to make that IP array an iterable.
A class (ABC or otherwise) that contains enough information for the pandas constructors to do the right thing (the |
Adds new methods for registering custom accessors to pandas objects. This will be helpful for implementing pandas-dev#18767 outside of pandas. Closes pandas-dev#14781
* ENH: Added public accessor registrar
  Adds new methods for registering custom accessors to pandas objects. This will be helpful for implementing #18767 outside of pandas. Closes #14781
* PEP8
* Moved to extensions
* More docs
* Fix see also
* DOC: Added whatsnew
* Move to api
* Update post review
* flake8
* Raise the underlying error instead of a RuntimeError
* str validate
* DOC: Moved to developer
* REF: Use public registrars for accessors
* cleanup
* Implemented optional caching
* Document cache
* Tests passing
* Use for plot
* Fix autodoc
* Fix the class instantiation
* Refactor again.
  1. Removed optional caching
  2. Refactored `Properties` to create the indexes it uses on demand
  3. Moved accessor definitions to classes for clarity
* Fix API files
* Remove stale comment
* Tests pass
* DOC: some cleanup
* No need to assign doc
* Rename, shared docs
* Doc __new__
* Use UserWarning
* Update test
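For readers who want to see what the public accessor registrar makes possible from outside pandas, here is a minimal sketch. It assumes addresses are held as plain strings in an ordinary object-dtype Series; the accessor name `ipdemo` and the class `IPDemoAccessor` are hypothetical and are not part of pandas, pandas-ip, or cyberpandas.

```python
import ipaddress

import pandas as pd


# "ipdemo" is a made-up accessor name used only for this illustration.
@pd.api.extensions.register_series_accessor("ipdemo")
class IPDemoAccessor:
    def __init__(self, series):
        # pandas passes the Series the accessor is attached to.
        self._series = series

    @property
    def is_private(self):
        # Element-wise check delegated to the standard library.
        return self._series.map(lambda v: ipaddress.ip_address(v).is_private)


s = pd.Series(["192.168.0.1", "8.8.8.8"])
flags = s.ipdemo.is_private.tolist()
print(flags)
```

Because every element goes through a Python-level call, this approach is convenient but slow, which is the performance drawback the proposal raises for object-dtype storage.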
Closing this. It's implemented in https://cyberpandas.readthedocs.io/. |
@TomAugspurger the title of this issue mentions mac-addresses; I see that cyberpandas groks IPs now, but is there a solution for mac addresses? If so, can you elaborate? |
Yes, cyberpandas has a MACArray type.
https://cyberpandas.readthedocs.io/en/latest/api.html#macarray
Feel free to open an issue at https://github.com/ContinuumIO/cyberpandas if
you have questions / issues.
On Tue, Jun 19, 2018 at 5:32 PM, Mike Pennington wrote:
@TomAugspurger the title of this issue
mentions mac-addresses; I see that cyberpandas groks IPs now, but is there
a solution for mac addresses? If so, can you elaborate?
|
Hi all, this is a proposal to add a new block and type for representing IP Addresses.
There are still some details that need ironing out, but I wanted to gauge reactions to
including this in pandas before spending too much more time on it.
Here's a notebook demonstrating the basics: http://nbviewer.jupyter.org/gist/TomAugspurger/3ba2bc273edfec809b61b5030fd278b9
Abstract
Proposal to add support for storing and operating on IP Address data.
Adds a new block type for IP address data and an ip accessor to Series and Index.
Rationale
For some communities, IP and MAC addresses are a common data format. The data
format was deemed important enough to add the ipaddress module to the standard
library (see PEP 3144). At Anaconda, we hear from customers who would use a
first-class IP address array container if it existed in pandas.
I turned to StackOverflow to gauge interest in this topic. A search for "IP" on
the pandas StackOverflow tag turns up 300 results.
Under the NumPy tag there are another 80. For comparison, I ran a few other
searches to see what interest there is in other "specialized" data types (this
is a very rough, probably incorrect, way of estimating interest):
Categorical, which is already in pandas, turned up 1,089 items.
Overall, I think there's enough interest relative to the implementation /
maintenance burden to warrant adding the support for IP Addresses. I don't
anticipate this causing any issues for the arrow transition, once ARROW-1587 is
in place. We can be careful which parts of the storage layer are implementation
details.
Specification
The proposal is to add
* A new dtype and array container for IP addresses (similar to CategoricalDtype and Categorical).
* A new block type (similar to CategoricalBlock).
* A new accessor for Series and Index, .ip, for operating on IP addresses and MAC addresses (similar to .cat).
The type and block should be generic IP address blocks, with no
distinction between IPv4 and IPv6 addresses. In our experience, it's
common to work with data from multiple sources, some of which may be
IPv4, and some of which may be IPv6. This also matches the semantics
of the default ipaddress.ip_address factory function, which returns
an IPv4Address or IPv6Address as needed. Being able to deal with
IP addresses in an IPv4 vs. IPv6 agnostic fashion is useful.
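The version-agnostic behaviour of the standard-library factory mentioned above can be seen in a short sketch. The sample addresses are illustrative documentation-range values, not data from the proposal.

```python
import ipaddress

# Mixed input, as might arrive from several data sources.
mixed = ["192.0.2.1", "2001:db8::1"]

# ip_address inspects each value and returns the matching class.
parsed = [ipaddress.ip_address(value) for value in mixed]
versions = [address.version for address in parsed]

for address in parsed:
    print(type(address).__name__, address)
```

A single generic container can hold both kinds of result because each parsed address is ultimately just an integer of at most 128 bits.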
Data Layout
Since IPv6 addresses are 128 bits, they do not fit into a standard NumPy uint64
space. This complicates the implementation (but gives weight to accepting the
proposal, since doing this on your own can be tricky).
Each record will be composed of two uint64s. The first element
contains the first 64 bits, and the second element contains the second 64
bits. This can be expressed as a NumPy structured dtype with two uint64 fields.
This is a common format for handling IPv4 and IPv6 data.
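The two-uint64 layout described above could look roughly like the following. This is a sketch under stated assumptions: the field names `hi` and `lo`, the big-endian byte order, and the helpers `pack` and `unpack` are all hypothetical, since the proposal only specifies "two uint64s".

```python
import ipaddress

import numpy as np

# Assumed field names and byte order; not taken from the proposal.
IP_DTYPE = np.dtype([("hi", ">u8"), ("lo", ">u8")])
MASK64 = (1 << 64) - 1


def pack(addresses):
    """Store each address as a (high 64 bits, low 64 bits) record."""
    out = np.zeros(len(addresses), dtype=IP_DTYPE)
    for i, addr in enumerate(addresses):
        value = int(ipaddress.ip_address(addr))
        out["hi"][i] = value >> 64
        out["lo"][i] = value & MASK64
    return out


def unpack(records):
    """Rebuild address objects from the stored integer halves."""
    # Values below 2**32 come back as IPv4Address, since the layout
    # itself does not record which version an address came from.
    return [
        ipaddress.ip_address((int(r["hi"]) << 64) | int(r["lo"]))
        for r in records
    ]


packed = pack(["192.0.2.1", "2001:db8::1"])
restored = unpack(packed)
print(packed.dtype.itemsize, restored)
```

An IPv4 address occupies only the low field, so its high field is zero; that is what lets one record type cover both versions.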
Missing Data
Use the lowest possible IP address as a missing-value marker. RFC 2373 defines the
all-zeros address as the unspecified address, which must never be assigned to any node.
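A sentinel-based missing-value check along these lines might be sketched as follows, using only the standard library. The names `NA_VALUE` and `isna` are hypothetical; the proposal does not show how the check would be implemented.

```python
import ipaddress

# Assumed sentinel: the integer value of the all-zeros address.
NA_VALUE = 0


def isna(addresses):
    """Return True for every entry equal to the all-zeros sentinel."""
    return [int(ipaddress.ip_address(a)) == NA_VALUE for a in addresses]


mask = isna(["0.0.0.0", "192.0.2.1", "::"])
print(mask)
```

Both `0.0.0.0` and `::` map to integer zero, so either spelling is treated as missing under this scheme.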
Methods
The new user-facing IPAddress (analogous to a Categorical) will have
a few methods for easily constructing arrays of IP addresses.
The methods in the new .ip namespace should follow the standard
library's design.
Properties
is_multicast
is_private
is_global
is_unspecified
is_reserved
is_loopback
is_link_local
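Each of the properties listed above already exists on standard-library address objects, so the semantics the .ip namespace would follow can be previewed without pandas. The helper `describe` below is hypothetical and exists only for this illustration.

```python
import ipaddress

PROPERTIES = [
    "is_multicast",
    "is_private",
    "is_global",
    "is_unspecified",
    "is_reserved",
    "is_loopback",
    "is_link_local",
]


def describe(addr):
    """Map each property name to its value for a single address."""
    ip = ipaddress.ip_address(addr)
    return {name: getattr(ip, name) for name in PROPERTIES}


loopback = describe("127.0.0.1")
multicast = describe("224.0.0.1")
link_local = describe("fe80::1")
print(loopback)
```

A vectorized accessor would return one boolean per element for each property instead of a dictionary per address.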
Reference Implementation
An implementation of the types and block is available at pandas-ip
(at the moment it's a proof of concept).
Alternatives
Adding a new block type to pandas is a major change. Downstream libraries may
have special-cased handling for pandas' extension types, so this shouldn't be
adopted without careful consideration.
Some alternatives to this that exist outside of pandas:
1. Store ipaddress.IPv4Address or ipaddress.IPv6Address objects in
an object dtype array. The .ip namespace could still be included
with an extension decorator. The drawback here is the poor
performance, as every operation would be done element-wise.
2. Implement the custom blocks and types in a separate library. The
downside here is that the library would need to subclass Series,
DataFrame, and Index so that the custom blocks and types are
interpreted correctly. Users would need to use the custom IPSeries,
IPDataFrame, etc., which increases friction when working
with other libraries that may expect / coerce to pandas objects.
To expand a bit on the (current) downside of alternative 2, when the pandas constructors
see an "unknown" object, they fall back to object dtype and stuff the actual Python object
into whatever container is being created.
I'd rather not have to make a subclass of Series, just to stick an array-like thing into a Series.
If pandas could provide an interface such that objects satisfying that interface
are treated as array-like, and not a simple python object, then I'll gladly close
this issue and develop the IP-address specific functionality in another package.
That might be the best possible outcome to all this.