Serialize/deserialize a Categorical whose values are taken from an enum #25448

Open
teto opened this issue Feb 26, 2019 · 5 comments
Labels
Categorical (Categorical Data Type), Enhancement, IO CSV (read_csv, to_csv)

Comments

@teto
teto commented Feb 26, 2019

Code Sample, a copy-pastable example if possible

The sample below should run standalone:

import pandas as pd
from enum import Enum, auto

class ConnectionRoles(Enum):
    Client = auto()
    Server = auto()

csv_filename = "test.csv"

dtype_role = pd.api.types.CategoricalDtype(categories=list(ConnectionRoles), ordered=True)


df = pd.DataFrame({"tcpdest": [ConnectionRoles.Server]}, dtype=dtype_role)
print(df.info())
print(df)
df.to_csv(csv_filename)

loaded = pd.read_csv(csv_filename, dtype={"tcpdest": dtype_role})
print(loaded.info())
print(loaded)

which outputs

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
tcpdest    1 non-null category
dtypes: category(1)
memory usage: 177.0 bytes
None
                  tcpdest
0  ConnectionRoles.Server
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
Unnamed: 0    1 non-null int64
tcpdest       0 non-null category
dtypes: category(1), int64(1)
memory usage: 185.0 bytes
None
   Unnamed: 0 tcpdest
0           0     NaN

The value ConnectionRoles.Server became NaN during the serialization/deserialization round trip.
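One likely explanation, offered as an assumption rather than a confirmed diagnosis: the CSV cell holds the text form of the member, and that text is not equal to any enum member, so no category matches on the way back in. The standard-library sketch below shows only that mismatch; it assumes the file contains `str(member)` and does not call pandas at all.

```python
from enum import Enum, auto


class ConnectionRoles(Enum):
    Client = auto()
    Server = auto()


# Assumption: this is the text that ends up in the CSV cell
written = str(ConnectionRoles.Server)

# Compare the text against every category the dtype was built from
matches = [member for member in ConnectionRoles if member == written]

print(repr(written))  # the string form, not an enum member
print(matches)        # empty: the string equals none of the members
```

Under that reading, the reader sees a value outside the declared categories, which is consistent with the NaN shown in the output.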

Problem description

I want to be able to serialize (to_csv) and then read back (read_csv) a CategoricalDtype that takes its values from a Python Enum (or IntEnum).

The enum and dtype I actually use in my project (unlike the toy example) are:

class ConnectionRoles(Enum):
    """
    Used to filter datasets and keep packets flowing in only one direction.
    The parser should accept --destination Client --destination Server if you want both.
    """
    Client = auto()
    Server = auto()

    def __str__(self):
        # Defining __str__ is required for ArgumentParser's help output to show
        # the human-readable member names
        return self.name

    @staticmethod
    def from_string(s):
        try:
            return ConnectionRoles[s]
        except KeyError:
            raise ValueError("unknown role: %r" % s)

    def __next__(self):
        # Return the opposite role; auto() numbers members from 1, so compare
        # against the member itself rather than the value 0
        if self is ConnectionRoles.Client:
            return ConnectionRoles.Server
        else:
            return ConnectionRoles.Client

dtype_role = pd.api.types.CategoricalDtype(categories=list(ConnectionRoles), ordered=True)

I've searched the tracker, and the most relevant issues (though still different) might be:

Expected Output

Output of pd.show_versions()

I am using v0.23.4 with a patch from master to fix some bug.


INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.0
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

teto changed the title from "Categorical" to "Serialize/deserialize a Categorical whose values are taken from an enum" Feb 26, 2019
gfyoung added the Categorical (Categorical Data Type), Enhancement, and IO Data (IO issues that don't fit into a more specific label) labels Feb 27, 2019
@gfyoung
Member

gfyoung commented Feb 27, 2019

Interesting proposal...couldn't you use our converters parameter though to deserialize the string yourself when calling read_csv?

Likewise, when serializing, you could do a transform on the column before calling to_csv.
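A minimal sketch of that suggestion, assuming the member name is an acceptable on-disk form. `io.StringIO` stands in for the CSV file, and the lambda passed to `converters` is illustrative; neither comes from the original report.

```python
import io
from enum import Enum, auto

import pandas as pd


class ConnectionRoles(Enum):
    Client = auto()
    Server = auto()


df = pd.DataFrame({"tcpdest": [ConnectionRoles.Server, ConnectionRoles.Client]})

# Serialize: replace each enum member with its name before writing
buffer = io.StringIO()
df.assign(tcpdest=df["tcpdest"].map(lambda role: role.name)).to_csv(buffer, index=False)

# Deserialize: look each name up in the enum while parsing
buffer.seek(0)
loaded = pd.read_csv(buffer, converters={"tcpdest": lambda name: ConnectionRoles[name]})

print(loaded["tcpdest"].tolist())
```

Writing with `index=False` also avoids the extra `Unnamed: 0` column on the way back in. The column comes back with object dtype, not as a categorical.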

@teto
Author

teto commented Feb 27, 2019

I have a fairly involved cycle of serializing and deserializing different kinds of DataFrames with many fields, so the complexity grows quickly. It would be nice to have this handled automatically.

I tried using a converter, read_csv(converters={"tcpdest": _convert_role}), with

def _convert_role(x):
    return ConnectionRoles(x)

but then I lose the dtype for the column, because pandas warns that it will use only the converter. The column ends up as

tcpdest             float64

and this is bad.
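One possible workaround, sketched under the assumption that the file stores member names: parse with a converter only, then cast the column to the categorical dtype in a second step. The helper name `role_from_name` and the inline CSV text are hypothetical.

```python
import io
from enum import Enum, auto

import pandas as pd


class ConnectionRoles(Enum):
    Client = auto()
    Server = auto()


dtype_role = pd.api.types.CategoricalDtype(categories=list(ConnectionRoles), ordered=True)


def role_from_name(name):
    """Hypothetical converter: map a member name from the file to the enum member."""
    return ConnectionRoles[name]


# Stand-in for a CSV file that stores member names
csv_text = "tcpdest\nServer\nClient\n"

# Step 1: the converter rebuilds enum members (the column has object dtype here)
loaded = pd.read_csv(io.StringIO(csv_text), converters={"tcpdest": role_from_name})

# Step 2: cast to the ordered categorical dtype afterwards
loaded["tcpdest"] = loaded["tcpdest"].astype(dtype_role)

print(loaded.dtypes)
```

This keeps the converter and the dtype from competing inside a single `read_csv` call, at the cost of one extra line per enum column.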
Wouldn't it be possible to properly rebuild enum values when the categories of the CategoricalDtype belong to an enum?
With an IntEnum, my first example works a bit better, yet the values are numpy.int64 when I would expect enum instances, so my checks fail:
assert isinstance(destination, ConnectionRoles), "destination is %r" % destination
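Until the reader can rebuild enum members itself, one stopgap is to coerce the raw value at the point of use. The helper `as_role` below is hypothetical; the sketch assumes an IntEnum variant of ConnectionRoles and an integer-like scalar (such as numpy.int64) coming out of the DataFrame, and it uses a plain `int` to stay dependency-free.

```python
from enum import IntEnum, auto


class ConnectionRoles(IntEnum):
    Client = auto()
    Server = auto()


def as_role(raw):
    """Hypothetical helper: turn an integer-like scalar back into an enum member."""
    # int() normalizes integer-like scalars before the lookup by value
    return ConnectionRoles(int(raw))


# Stand-in for a value read back from the DataFrame
destination = as_role(2)
assert isinstance(destination, ConnectionRoles), "destination is %r" % destination
print(destination.name)
```

This only restores the type for individual values; the column itself still holds integers.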

If you tell me where to look, I can even try it myself. I had a look, but the logic seems pretty complex between real_values, inferred ones, etc.

@gfyoung
Member

gfyoung commented Mar 14, 2019

@teto : Sorry for taking so long to respond here. Could you provide a complete code sample for what you're describing? That would be very helpful!

Also, if you are interested, you can search for parsers.pyx on GitHub, and that will take you to the file where we handle the converters for read_csv.

@astrojuanlu

This also affects to_parquet, which doesn't have a converters parameter.

mroeschke added the IO CSV (read_csv, to_csv) label and removed the IO Data (IO issues that don't fit into a more specific label) label May 2, 2020
@buhtz

buhtz commented Aug 12, 2021

I am not sure if I fully understand this issue here. Is it the same problem described here?
https://stackoverflow.com/q/68591255/4865723

Do you want to specify a column's dtype as an ordered Categorical while doing read_csv()?

5 participants