FeatureGroup _INTEGER_TYPES attribute incompatible with pandas 1.1.5 dtypes. #2868

Closed
tomasosorio opened this issue Jan 25, 2022 · 2 comments · Fixed by #3740

tomasosorio commented Jan 25, 2022

Describe the bug
I was trying to define a feature group from a sample pandas DataFrame through the load_feature_definitions method.
The method failed because of the way the accepted pandas dtypes are hardcoded on the FeatureGroup class.
The _INTEGER_TYPES attribute lists integer dtypes as 'int...', while pandas reports the nullable integer dtypes produced by convert_dtypes as 'Int...'.

To reproduce
SageMaker notebook on an ml.t2.medium instance with the conda_python3 kernel.
SageMaker version 2.72.1
pandas version 1.1.5
Build a pandas DataFrame with integer columns.
Apply the convert_dtypes method (from pandas.DataFrame) to that DataFrame.
Attempt to build feature definitions with the load_feature_definitions method, passing the converted DataFrame as the parameter.
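
The mismatch behind these steps can be sketched without the SDK, assuming only that pandas is installed. The list of names is copied from the _INTEGER_TYPES attribute quoted in this issue; the DataFrame and its column names are hypothetical stand-ins for the reporter's data.

```python
import pandas as pd

# Copied from the issue: the integer dtype names FeatureGroup recognizes.
INTEGER_TYPES = [
    "int_", "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
]

# Hypothetical one-row frame standing in for the reporter's data.
df = pd.DataFrame({"customer": ["c-001"], "transaction_count": [3]})

# Name of the default NumPy-backed integer dtype (lowercase).
plain_name = str(df["transaction_count"].dtype)

# Name of the nullable dtype that convert_dtypes() produces (capitalized).
nullable_name = str(df.convert_dtypes()["transaction_count"].dtype)

print(plain_name, plain_name in INTEGER_TYPES)
print(nullable_name, nullable_name in INTEGER_TYPES)
```

The second membership check is the one that fails, which is why the lookup in load_feature_definitions finds no feature type for the converted column.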

Expected behavior
Success on loading feature definitions.

Screenshots or logs

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 62 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   customer                                 1 non-null      string
 1   customerCountTotalTransactions0min10min  1 non-null      Int64
 2   customerCountTotalTransactions0min45min  1 non-null      Int64
 3   customerCountTotalTransactions0h2h       1 non-null      Int64

The _INTEGER_TYPES attribute on the FeatureGroup class:

    _INTEGER_TYPES = [
        "int_",
        "int8",
        "int16",
        "int32",
        "int64",
        "uint8",
        "uint16",
        "uint32",
        "uint64",
    ]

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.72.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Pandas
  • Framework version: 1.1.5
  • Python version: 3.9
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Additional context
In the load_feature_definitions method, changing

    feature_type = self._DTYPE_TO_FEATURE_DEFINITION_CLS_MAP.get(
        str(data_frame[column].dtype), None
    )

to

    feature_type = self._DTYPE_TO_FEATURE_DEFINITION_CLS_MAP.get(
        str(data_frame[column].dtype).lower(), None
    )

might solve the issue.
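
A minimal sketch of why lowercasing the dtype name helps, using plain Python only. The dictionary below is a hypothetical stand-in for _DTYPE_TO_FEATURE_DEFINITION_CLS_MAP: its keys are the _INTEGER_TYPES names quoted above, and its values are placeholder strings, not the SDK's real feature definition classes.

```python
from typing import Optional

# Copied from the issue: the integer dtype names FeatureGroup recognizes.
INTEGER_TYPES = [
    "int_", "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
]

# Hypothetical stand-in for the SDK's dtype-to-feature-definition map.
DTYPE_TO_FEATURE_TYPE = {name: "Integral" for name in INTEGER_TYPES}


def lookup_exact(dtype_name: str) -> Optional[str]:
    """Mimic the original lookup: the dtype name is used as-is."""
    return DTYPE_TO_FEATURE_TYPE.get(dtype_name, None)


def lookup_lowercased(dtype_name: str) -> Optional[str]:
    """Mimic the proposed lookup: the dtype name is lowercased first."""
    return DTYPE_TO_FEATURE_TYPE.get(dtype_name.lower(), None)


for name in ("int64", "Int64", "UInt8"):
    print(name, lookup_exact(name), lookup_lowercased(name))
```

Lowercasing maps the nullable names onto the existing keys, so the hardcoded lists do not need new entries. It does not change how missing values in those columns are handled.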

@brifordwylie
Contributor

Ran into this today. Thankfully, @tomasosorio did a great job of capturing the issue and the fix. I've tried the fix out locally and it works great. I submitted a PR (#3740).

brifordwylie commented Apr 1, 2023

For folks looking to code around this (until #3740 gets merged), here's a sloppy workaround:

    @staticmethod
    def convert_nullable_types(df: pd.DataFrame) -> pd.DataFrame:
        """Convert the pandas nullable dtypes (Int64, Float64) to NumPy dtypes,
           since the AWS SageMaker code doesn't currently support them.
           See: https://github.com/aws/sagemaker-python-sdk/pull/3740"""
        # Note: astype('int64') raises if the column contains missing values (pd.NA)
        for column in list(df.select_dtypes(include=[pd.Int64Dtype]).columns):
            df[column] = df[column].astype('int64')
        # Note: pd.Float64Dtype requires pandas 1.2 or newer
        for column in list(df.select_dtypes(include=[pd.Float64Dtype]).columns):
            df[column] = df[column].astype('float64')
        return df

Simply call this right before you send the DataFrame to load_feature_definitions():

        # Convert Int64 and Float64 types (see: https://github.com/aws/sagemaker-python-sdk/pull/3740)
        self.input_df = self.convert_nullable_types(self.input_df)

        # Create a Feature Group and load our Feature Definitions
        my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)
        my_feature_group.load_feature_definitions(data_frame=self.input_df)
