- Home>
- pandas
Disclosure: I was given access to Effective Pandas – Standard to review. I was not compensated for writing this post and the links to the course are not affiliate links. All opinions are 100% my own. Quick take: Effective Pandas is a high quality course with well thought out content. I felt like it was […]
This question came up recently on a project where Pandas data needed to be fed to a TensorFlow classifier. In this case, we wanted to divide the dataframe using a random sampling. Frameworks like scikit-learn may have utilities to split data sets into training, test and cross-validation sets. For example, sklearn.model_selection.train_test_split split numpy arrays or […]
Part 1 covered Pandas DataFrame basics. Pandas offers multiple options for accessing DataFrame values by axis labels using he DataFrame.loc function, or by integer indexes using the DataFrame.iloc function in one or two dimensions. If the DataFrame has a numerical index, calling the DataFrame.loc and DataFrame.iloc functions looks the same. Otherwise, use the appropriate axis […]
By default, a Pandas DataFrame is 2 dimensional with 2 axes initialized as empty Index structures. Under the basic indexing scheme, the first axis is the ‘index’ axis, which by default is a numerical index starting from 0 (using np.arange) generated for each DataFrame row. The second axis is the ‘columns’ axis, which is the […]
All python code in this post is Python 3.5+. In my previous post, I described how I got usable Pandas dataframes from the Kaggle movies dataset. My next step was to start exploring the data with simple visualizations. The first feature I wanted explore was the distribution of movies by year in the movies_metadata data […]
All python code in this post is Python 3.5+. This post describes how I parsed movies_metadata.csv from the Kaggle movies dataset; a task I started in Part 1 and Part 2. After some digging into the Pandas documentation and Stack Overflow, I found that the best solution to my parsing problems was to explicitly set […]
All python code in this post is Python 3.5+. Continuing from Part 1, I discovered that movies_metadata.csv contains malformed rows that have missing fields, which is what caused file import to fail. I tried experimenting with some of the more advanced Pandas.read_csv parameters to see if I could work around the malformed rows. def main(path: […]
All python code in this post is Python 3.5+. I’m continuing to work with the same Kaggle movies dataset as in the SQL import experiment. This time, I imported the data into Pandas DataFrames. The trickiest dataset to import was movies_metadata.csv. I first tried to use pandas.read_csv with the default settings. import argparse import pandas […]