Serialize/deserialize a Categorical whose values are taken from an enum #25448

teto · 2019-02-26T13:57:04Z

Code Sample, a copy-pastable example if possible

should run as standalone

import pandas as pd
from enum import Enum, auto

class ConnectionRoles(Enum):
    Client = auto()
    Server = auto()

csv_filename = "test.csv"

dtype_role = pd.api.types.CategoricalDtype(categories=list(ConnectionRoles), ordered=True)


df = pd.DataFrame({"tcpdest": [ConnectionRoles.Server]}, dtype=dtype_role)
print(df.info())
print(df)
df.to_csv(csv_filename)

loaded = pd.read_csv(csv_filename, dtype={"tcpdest": dtype_role})
print(loaded.info())
print(loaded)

which outputs

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
tcpdest    1 non-null category
dtypes: category(1)
memory usage: 177.0 bytes
None
                  tcpdest
0  ConnectionRoles.Server
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
Unnamed: 0    1 non-null int64
tcpdest       0 non-null category
dtypes: category(1), int64(1)
memory usage: 185.0 bytes
None
   Unnamed: 0 tcpdest
0           0     NaN

The value ConnectionRoles.Server became NaN through the serialization/deserialization process.
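A minimal sketch of where the value is probably lost, assuming read_csv matches the parsed strings against the categories of the requested dtype: the CSV cell holds str(ConnectionRoles.Server), and that string is not equal to any enum member, so it is treated as missing. The names written, as_category and is_missing are illustrative only.

```python
from enum import Enum, auto

import pandas as pd


class ConnectionRoles(Enum):
    Client = auto()
    Server = auto()


dtype_role = pd.api.types.CategoricalDtype(categories=list(ConnectionRoles), ordered=True)

# to_csv writes the string form of the member, not the member itself.
written = str(ConnectionRoles.Server)

# A string is never equal to an enum member, so it falls outside the categories.
as_category = pd.Categorical([written], dtype=dtype_role)
is_missing = bool(as_category.isna()[0])
print(written, is_missing)
```

Values outside the categories of a Categorical are stored as missing, which matches the NaN seen in the loaded frame.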

Problem description

I want to be able to serialize (to_csv) then read (read_csv) a CategoricalDtype that takes its values from a Python Enum (or IntEnum).

Actually, the dtype and enum I use in my project (unlike the toy example) are:

dtype_role = pd.api.types.CategoricalDtype(categories=list(ConnectionRoles), ordered=True)


class ConnectionRoles(Enum):
    """
    Used to filter datasets and keep packets flowing in only one direction!
    The parser should accept --destination Client --destination Server if you want both.
    """
    Client = auto()
    Server = auto()

    def __str__(self):
        # Note that defining __str__ is required to get ArgumentParser's help output to include
        # the human-readable member names of ConnectionRoles
        return self.name

    @staticmethod
    def from_string(s):
        try:
            return ConnectionRoles[s]
        except KeyError:
            raise ValueError()

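    def __next__(self):
        # auto() numbers members starting at 1, so compare members instead of testing value == 0
        if self is ConnectionRoles.Client:
            return ConnectionRoles.Server
        else:
            return ConnectionRoles.Client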

I've searched the tracker, and the most relevant issues (though still different) might be:

Expected Output

Output of `pd.show_versions()`

I am using v0.23.4 with a patch from master to fix some bug.


INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.0
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


gfyoung · 2019-02-27T06:15:12Z

Interesting proposal...couldn't you use our converters parameter though to deserialize the string yourself when calling read_csv?

Likewise, when serializing, you could do a transform on the column before calling to_csv.
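The two suggestions above could be combined roughly as follows. This is only a sketch, not an existing pandas feature: it assumes the enum member names are what gets written to the CSV, and it restores the categorical dtype with an explicit astype after reading, since converters alone do not set it. The in-memory buffer and the lambda converter are illustrative choices.

```python
import io
from enum import Enum, auto

import pandas as pd


class ConnectionRoles(Enum):
    Client = auto()
    Server = auto()


dtype_role = pd.api.types.CategoricalDtype(categories=list(ConnectionRoles), ordered=True)

df = pd.DataFrame({"tcpdest": [ConnectionRoles.Server, ConnectionRoles.Client]})

# Serialize: transform the column to member names before calling to_csv.
buffer = io.StringIO()
df.assign(tcpdest=df["tcpdest"].map(lambda role: role.name)).to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Deserialize: rebuild the members with a converter, then restore the dtype.
loaded = pd.read_csv(
    io.StringIO(csv_text),
    converters={"tcpdest": lambda name: ConnectionRoles[name]},
)
loaded["tcpdest"] = loaded["tcpdest"].astype(dtype_role)
print(loaded.dtypes)
```

The extra astype step is the price of this approach: it has to be repeated for every enum-backed column, which is the kind of bookkeeping the issue asks pandas to handle automatically.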

teto · 2019-02-27T08:17:31Z

My serialization/deserialization cycle is not so simple: it covers different types of dataframes with many fields, so complexity increases quickly. It would be nice to have this handled automatically.

I tried using a converter, read_csv(converters={"tcpdest": _convert_role}), with

def _convert_role(x):
    return ConnectionRoles(x)

but then I lose the dtype for the column, as pandas warns that it will use only the converter:

tcpdest             float64

and this is bad.
Wouldn't it be possible to properly rebuild enum values when the categories of the CategoricalDtype belong to an enum?
With an IntEnum, my first example works a bit better, yet the values are numpy.int64 when I would expect enum instances. Thus my checks fail:
assert isinstance(destination, ConnectionRoles), "destination is %r" % destination

If you tell me where to look, I can even give it a try myself. I tried having a look, but the logic seems pretty complex between real_values, inferred ones, etc.

gfyoung · 2019-03-14T21:56:42Z

@teto : Sorry for taking so long to respond here. Could you provide a complete code sample for what you're describing? That would be very helpful!

Also, if you are interested, you can search for parsers.pyx on GitHub, and that will take you to the file where we handle the converters for read_csv.

astrojuanlu · 2020-02-10T15:31:37Z

This also affects to_parquet, which doesn't have a converters parameter.

buhtz · 2021-08-12T08:32:41Z

I am not sure if I fully understand this issue here. Is it the same problem described here?
https://stackoverflow.com/q/68591255/4865723

Do you want to specify a column's dtype as an ordered Categorical while doing read_csv()?

teto changed the title ~~Categorical~~ Serialize/deserialize a Categorical whose values are taken from an enum Feb 26, 2019

teto mentioned this issue Feb 26, 2019

Tracking of dependency related bugs (pandas mostly): teto/pymptcpanalyzer#14

Open

gfyoung added Categorical Categorical Data Type Enhancement IO Data IO issues that don't fit into a more specific label labels Feb 27, 2019

mroeschke added IO CSV read_csv, to_csv and removed IO Data IO issues that don't fit into a more specific label labels May 2, 2020

kiaradlf mentioned this issue May 21, 2024

ENH: support parquet's enum type using Categorical when (de)serializing #58799

Closed

3 tasks
