Tag Archives for "pandas"

How to split a Pandas dataframe into training and test sets?

Published June 18, 2018 in data - 0 Comments

This question came up recently on a project where Pandas data needed to be fed to a TensorFlow classifier. In this case, we wanted to divide the dataframe using random sampling. Frameworks like scikit-learn provide utilities to split data sets into training, test and cross-validation sets. For example, sklearn.model_selection.train_test_split splits NumPy arrays or […]
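As a rough illustration of the random-sampling idea in the excerpt above, here is a minimal sketch that uses only Pandas. The helper name split_train_test, the 80/20 ratio, the seed, and the toy columns are assumptions made for this example, not taken from the post; it also assumes the DataFrame index is unique.

```python
import pandas as pd


def split_train_test(df, test_fraction=0.2, seed=42):
    """Randomly split df into (train, test) row subsets.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    # Sample the test rows without replacement; the seed makes the split repeatable.
    test = df.sample(frac=test_fraction, random_state=seed)
    # Everything that was not sampled becomes the training set.
    train = df.drop(test.index)
    return train, test


# Toy data with made-up column names.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})
train, test = split_train_test(df)
print(len(train), len(test))
```

Dropping the sampled index labels only works cleanly when the index has no duplicates; with a duplicated index, resetting it first would be one way around that.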

Pandas DataFrame axis basics (Part 2)

Published April 13, 2018 in data - 0 Comments

Part 1 covered Pandas DataFrame basics. Pandas offers multiple options for accessing DataFrame values by axis labels using the DataFrame.loc indexer, or by integer positions using the DataFrame.iloc indexer in one or two dimensions. If the DataFrame has a numerical index, using DataFrame.loc and DataFrame.iloc looks the same. Otherwise, use the appropriate axis […]
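The difference between label-based and position-based access can be sketched as follows. The column names, index labels, and values are made up for this example.

```python
import pandas as pd

# A DataFrame with string labels on the 'index' axis.
df = pd.DataFrame(
    {"price": [10, 20, 30], "qty": [1, 2, 3]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "price"]   # row label "b", column label "price"
by_position = df.iloc[1, 0]       # second row, first column

# Label slices include the end label; positional slices exclude the end position.
label_slice = df.loc["a":"b"]
position_slice = df.iloc[0:2]

# With the default numerical index, the row argument looks the same for both.
numeric = pd.DataFrame({"price": [10, 20, 30]})
same_value = numeric.loc[1, "price"] == numeric.iloc[1, 0]
```

Here both slices happen to return the same two rows, but only because "a" and "b" sit at positions 0 and 1.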

Tags: pandas

Pandas DataFrame axis basics (Part 1)

Published April 2, 2018 in data - 0 Comments

By default, a Pandas DataFrame is two-dimensional, with two axes initialized as empty Index structures. Under the basic indexing scheme, the first axis is the ‘index’ axis, which by default is a numerical index starting from 0 (using np.arange) generated for each DataFrame row. The second axis is the ‘columns’ axis, which is the […]
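A short sketch for inspecting the two axes, assuming a reasonably recent Pandas release. The exact Index subclass used for an empty or default axis can vary between versions, so the example checks only lengths and values; the column names are made up.

```python
import pandas as pd

# An empty DataFrame still has two axes, both of length zero.
empty = pd.DataFrame()
empty_axes = empty.axes  # [index axis, columns axis]

# With data but no explicit index, rows are numbered from 0.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
row_labels = list(df.index)
column_labels = list(df.columns)

print(empty.ndim, len(empty_axes), row_labels, column_labels)
```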

Tags: pandas

Importing Stringified JSON Objects Into Pandas (Part 2)

Published November 30, 2017 in data, programming - 0 Comments

All Python code in this post targets Python 3.5+. Continuing from Part 1, I discovered that movies_metadata.csv contains malformed rows that have missing fields, which is what caused the file import to fail. I tried experimenting with some of the more advanced pandas.read_csv parameters to see if I could work around the malformed rows. def main(path: […]

Importing Stringified JSON Objects Into Pandas (Part 1)

Published November 24, 2017 in data, programming - 0 Comments

All Python code in this post targets Python 3.5+. I’m continuing to work with the same Kaggle movies dataset as in the SQL import experiment. This time, I imported the data into Pandas DataFrames. The trickiest dataset to import was movies_metadata.csv. I first tried to use pandas.read_csv with the default settings. import argparse import pandas […]
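The excerpt cuts off before showing the post's own code, so the following is only a generic sketch of one way to decode a stringified JSON column while reading a CSV. The in-memory CSV text, the column names title and genres, and the use of json.loads as a converter are all assumptions made for illustration; they are not taken from the post or from movies_metadata.csv.

```python
import io
import json

import pandas as pd

# Hypothetical CSV: the 'genres' field holds a JSON array serialized as a
# string, with inner double quotes doubled per the usual CSV quoting rules.
csv_text = (
    "title,genres\n"
    '"Example Movie","[{""id"": 1, ""name"": ""Drama""}]"\n'
)

# The converter is applied to each raw string in the column as it is parsed.
df = pd.read_csv(io.StringIO(csv_text), converters={"genres": json.loads})

first_genre = df.loc[0, "genres"][0]["name"]
print(first_genre)
```

This only works when the field is valid JSON; strings in some other serialization would need a different parser.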