
How to split a Pandas dataframe into training and test sets?

Published June 18, 2018 in data

This question came up recently on a project where Pandas data needed to be fed to a TensorFlow classifier. In this case, we wanted to divide the dataframe using random sampling.

Frameworks like scikit-learn have utilities to split data sets into training, test and cross-validation sets. For example, sklearn.model_selection.train_test_split splits numpy arrays or pandas DataFrames into training and test sets, with or without shuffling. Utilities such as K-fold cross-validation support more advanced resampling strategies.

I like to keep code simple and the number of dependencies small when starting a new project or creating a basic prototype, so I think it’s useful to also know how to break up a DataFrame into smaller sets using only Pandas functions.

To split the DataFrame without random shuffling or sampling, slice using DataFrame.loc or DataFrame.iloc depending on the type of index.
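As a minimal sketch of the slicing approach, assuming a small made-up DataFrame and an 80/20 split (the column names and the split ratio are illustrative, not from the original project):

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Assumed 80/20 split. The cut point is a row position, not a label.
split_at = int(len(df) * 0.8)

# .iloc slices by position and excludes the end position.
train = df.iloc[:split_at]
test = df.iloc[split_at:]

# .loc slices by label and includes both endpoints, so with the default
# integer index the last training label is split_at - 1.
train_by_label = df.loc[:split_at - 1]
```

Because nothing is shuffled, this only gives a representative split when the row order carries no meaning, for example when the rows are not sorted by label or by time.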

To randomly sample a fixed number or fraction of items from an axis of a DataFrame (or other pandas type), use DataFrame.sample. The default axis depends on the pandas type; for a DataFrame it is the index axis, so rows are sampled. The sampled DataFrame keeps the index labels of the rows it selected, so passing those labels to DataFrame.drop on the original DataFrame removes the sampled rows and returns the rest.
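A short sketch of the sample-then-drop approach might look like the following. The example data, the 80% fraction and the random_state seed are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Randomly pick 80% of the rows for training. The fraction is an assumed
# value, and random_state is fixed only to make the split reproducible.
train = df.sample(frac=0.8, random_state=42)

# The sampled rows keep their original index labels, so dropping those
# labels from the original leaves the remaining rows as the test set.
test = df.drop(train.index)
```

This relies on the index labels being unique. If the original DataFrame has duplicate labels, drop removes every row that shares a sampled label, so the test set can end up smaller than expected.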

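Both new DataFrames have a different index type than the original DataFrame. The original uses a RangeIndex, while the new ones carry an Int64Index (in pandas 2.0 and later, a plain Index with an int64 dtype), an ordered set of index labels created when labels were selected from the original DataFrame.
Resetting the indexes with DataFrame.reset_index restores the more memory-efficient RangeIndex type. Pass drop=True so the old labels are discarded rather than added as a new column.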
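To illustrate the index reset, here is a sketch that repeats the random split on made-up data and then resets both indexes. The data, fraction and seed are again assumptions:

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# The split frames carry the labels picked from the original DataFrame.
# The exact class name printed here depends on the pandas version.
print(type(train.index).__name__, type(test.index).__name__)

# drop=True discards the old labels instead of keeping them as a column.
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

print(type(train.index).__name__, type(test.index).__name__)
```

After the reset, the rows in each frame are numbered from 0 again, so any mapping back to the original row labels is lost. Keep the old index (omit drop=True) if that mapping is still needed.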
