
ENH: Add to_tf_dataset method to convert Pandas dataframe to TensorFlow dataset #48524


Closed
1 of 3 tasks
jamiecash opened this issue Sep 13, 2022 · 4 comments
Labels
Enhancement; IO Data (IO issues that don't fit into a more specific label); Needs Discussion (Requires discussion from core team before further action)

Comments


jamiecash commented Sep 13, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could use pandas to create TensorFlow datasets

Feature Description

Add a new method to_tf_dataset to DataFrame, specifying columns to use as features and labels.

def to_tf_dataset(self, feature_columns: list[str], label_column: str) -> tf.data.Dataset:
    """
    Convert the DataFrame to a TensorFlow dataset of (features, label) pairs.

    Parameters:
    feature_columns: A list of columns to use as features for the dataset.
    label_column: The column to use as the label for the dataset. Numeric columns are used as is. Non-numeric columns are converted with pd.Categorical.
    """

Alternative Solutions

Manual conversion using tf.data.Dataset.from_tensor_slices
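
For reference, the manual route might look like the minimal sketch below. It assumes pandas and TensorFlow are installed. The helper names `split_features_label` and `manual_tf_dataset` and the sample data are hypothetical, chosen only for illustration, and the label handling mirrors what this issue proposes rather than anything either library does by default.

```python
import pandas as pd


def split_features_label(df: pd.DataFrame, feature_columns: list[str], label_column: str):
    """Return (features, label) as NumPy arrays ready for from_tensor_slices."""
    features = df[feature_columns].to_numpy()
    label = df[label_column]
    # Assumption: non-numeric labels are replaced by integer category codes,
    # as proposed in this issue. Missing values would be encoded as -1.
    if not pd.api.types.is_numeric_dtype(label):
        label = pd.Categorical(label).codes
    else:
        label = label.to_numpy()
    return features, label


def manual_tf_dataset(df: pd.DataFrame, feature_columns: list[str], label_column: str):
    # Imported here so the pandas-only helper above works without TensorFlow.
    import tensorflow as tf

    features, label = split_features_label(df, feature_columns, label_column)
    return tf.data.Dataset.from_tensor_slices((features, label))


# Hypothetical sample data.
df = pd.DataFrame(
    {"height": [1.5, 1.8, 1.7], "weight": [50.0, 80.0, 65.0], "species": ["a", "b", "a"]}
)
features, label = split_features_label(df, ["height", "weight"], "species")
```

Calling `manual_tf_dataset(df, ["height", "weight"], "species")` would then yield one (features, label) pair per row. Converting the features with `to_numpy()` assumes the selected columns share a compatible dtype; columns of mixed types would need a different representation, such as a dict of columns.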

Additional Context

No response

jamiecash added the Enhancement and Needs Triage labels on Sep 13, 2022
@jamiecash
Author

take

@jamiecash
Author

#Method

import pandas as pd
import tensorflow as tf

def to_tf_dataset(self, feature_columns: list[str], label_column: str) -> tf.data.Dataset:
    # Check that the feature columns exist. If not, raise a KeyError.
    missing = set(feature_columns).difference(self.columns)
    if missing:
        raise KeyError(f"Specified feature column(s) {missing} do not exist.")

    # Check that the label column exists.
    if label_column not in self.columns:
        raise KeyError(f"Specified label column '{label_column}' does not exist.")

    # Read the label without copying or mutating the original frame.
    label = self[label_column]

    # If the label is not numeric, convert it to categorical and use the codes.
    # Missing values are encoded as -1.
    if not pd.api.types.is_numeric_dtype(label):
        label = pd.Categorical(label).codes

    # Create and return the dataset of (features, label) pairs.
    return tf.data.Dataset.from_tensor_slices((self[feature_columns], label))

#Tests

  • Test 1 - Invalid feature column should raise a KeyError
  • Test 2 - Invalid label column should raise a KeyError
  • Test 3 - Numeric label column creates dataset with numeric label
  • Test 4 - Non numeric label column creates dataset with numeric label that can be converted back using pd.Categorical
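
The four tests listed above could be sketched roughly as follows. To keep the example runnable without TensorFlow, it exercises only the validation and label-encoding steps through a hypothetical stand-in named `prepare`; the helpers `raises_key_error` and `make_frame` and the sample data are likewise assumptions made for illustration.

```python
import pandas as pd


def prepare(df, feature_columns, label_column):
    """Hypothetical stand-in for the pandas-only part of to_tf_dataset."""
    missing = set(feature_columns).difference(df.columns)
    if missing:
        raise KeyError(f"Specified feature column(s) {missing} do not exist.")
    if label_column not in df.columns:
        raise KeyError(f"Specified label column '{label_column}' does not exist.")
    label = df[label_column]
    categories = None
    if not pd.api.types.is_numeric_dtype(label):
        cat = pd.Categorical(label)
        label, categories = cat.codes, cat.categories
    return df[feature_columns], label, categories


def raises_key_error(func, *args):
    """Return True if calling func(*args) raises KeyError."""
    try:
        func(*args)
    except KeyError:
        return True
    return False


def make_frame():
    return pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [0, 1, 0], "name": ["cat", "dog", "cat"]})


def test_invalid_feature_column_raises():
    assert raises_key_error(prepare, make_frame(), ["missing"], "y")


def test_invalid_label_column_raises():
    assert raises_key_error(prepare, make_frame(), ["x"], "missing")


def test_numeric_label_is_unchanged():
    _, label, categories = prepare(make_frame(), ["x"], "y")
    assert list(label) == [0, 1, 0] and categories is None


def test_non_numeric_label_round_trips():
    df = make_frame()
    _, codes, categories = prepare(df, ["x"], "name")
    # The integer codes plus the stored categories recover the original labels.
    restored = pd.Categorical.from_codes(codes, categories=categories)
    assert list(restored) == list(df["name"])
```

The round trip in the last test only works if the categories are kept alongside the codes, which the proposed method would not return on its own.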

@mroeschke
Member

Thanks for the request, but similar to #46000, I think this IO path would be too niche for pandas to specifically support and would be better as a 3rd party library.
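
One way a third-party package might offer this without changing pandas is the accessor extension mechanism, `pd.api.extensions.register_dataframe_accessor`. The sketch below is only an illustration under assumptions: the accessor name `tfdata`, the class `TensorFlowAccessor`, and the method `to_dataset` are hypothetical and do not refer to any existing package.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("tfdata")
class TensorFlowAccessor:
    """Hypothetical accessor; the name "tfdata" and this class are illustrative only."""

    def __init__(self, pandas_obj: pd.DataFrame):
        self._obj = pandas_obj

    def to_dataset(self, feature_columns: list[str], label_column: str):
        # Imported lazily so that registering the accessor does not require TensorFlow.
        import tensorflow as tf

        label = self._obj[label_column]
        # Assumption: non-numeric labels become integer category codes,
        # as proposed in this issue.
        if not pd.api.types.is_numeric_dtype(label):
            label = pd.Categorical(label).codes
        features = self._obj[feature_columns].to_numpy()
        return tf.data.Dataset.from_tensor_slices((features, label))


# Hypothetical sample data; the accessor is reachable from any DataFrame.
df = pd.DataFrame({"x": [1.0, 2.0], "y": [0, 1]})
accessor = df.tfdata
```

With TensorFlow installed, `df.tfdata.to_dataset(["x"], "y")` would build the dataset, and pandas itself would need no new method.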

mroeschke added the Needs Discussion label and removed the Needs Triage label on Sep 13, 2022
jbrockmendel added the IO Data label on Sep 16, 2023
@mroeschke
Member

Given that there's tf.data.Dataset.from_tensor_slices, I don't think it's sufficiently necessary to extend the pandas API to wrap this method, so closing.


3 participants