
Importing Stringified JSON Objects Into Pandas (Part 3)

Published December 14, 2017 in data, programming

All Python code in this post targets Python 3.5+.

This post describes how I parsed movies_metadata.csv from the Kaggle movies dataset, a task I started in Part 1 and Part 2.

After some digging into the Pandas documentation and Stack Overflow, I found that the best solution to my parsing problems was to set the column names explicitly in the names parameter of pandas.read_csv, write custom converter functions, and pass them in the converters parameter. This also allowed me to assign a missing-data sentinel value per data type (NaN for numerical data, None for objects, etc.). The chapter on handling missing data in the Python Data Science Handbook was useful here.
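As a rough illustration of the names and converters parameters described above, here is a minimal sketch. The three-column sample data and the helper names to_float and to_text are assumptions made for this example; they are not taken from the full script.

```python
import io
import math

import pandas as pd

# Hypothetical three-column sample standing in for movies_metadata.csv.
SAMPLE_CSV = io.StringIO(
    "adult,budget,title\n"
    "False,30000000,Toy Story\n"
    "False,,Jumanji\n"
)


def to_float(value):
    """Return the value as a float, or NaN if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def to_text(value):
    """Return stripped text, or None for an empty string."""
    value = value.strip()
    return value if value else None


df = pd.read_csv(
    SAMPLE_CSV,
    header=0,  # discard the file's own header row
    names=["adult", "budget", "title"],  # explicit column names
    converters={"budget": to_float, "title": to_text},
)
print(df)
```

Passing header=0 together with names replaces the header row in the file instead of reading it as data. Each converter receives the raw string for one cell, so the sentinel for a bad value can be chosen per column.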

All columns expected to have numerical values were converted to floats except for the Movie Database IDs, which were converted to ints. Columns that contain NaNs must be stored as either floats or objects, because NaN is itself a floating-point value.
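The dtype constraint can be seen in a small sketch. The to_int_id helper and the sample values are hypothetical; the point is only that one NaN is enough to stop pandas from storing a column as integers.

```python
import math

import pandas as pd


def to_int_id(value):
    """Convert an ID string to int; NaN marks values that cannot be parsed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return math.nan


# Every value parses, so pandas can store the column as integers.
ids_clean = pd.Series([to_int_id(v) for v in ["862", "8844"]])

# One unparseable value becomes NaN, which forces a float column.
ids_dirty = pd.Series([to_int_id(v) for v in ["862", "not-an-id"]])

print(ids_clean.dtype, ids_dirty.dtype)
```

This is one reason to drop unparseable rows before relying on integer IDs.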

The convert_to_dict_list function uses Python's ast (Abstract Syntax Trees) module to convert the stringified JSON objects to valid dictionary objects by evaluating the strings as lists of literal Python dicts. An alternative approach might have been to replace the single quotes in the data set with double quotes and then attempt to import the values as JSON objects. The literal-evaluation approach, which was suggested in the dataset comments, was simpler and cleaner.
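A minimal sketch of what such a converter might look like follows. The function name comes from this post, but the body is an assumption built on ast.literal_eval and is not copied from the full script; the sample strings are made up.

```python
import ast
import json


def convert_to_dict_list(value):
    """Parse a stringified list of dicts; return None when parsing fails."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return None
    if isinstance(parsed, list) and all(isinstance(item, dict) for item in parsed):
        return parsed
    return None


genres = convert_to_dict_list(
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
)
print([genre["name"] for genre in genres])

# Hypothetical value containing an apostrophe inside a name.
tricky = "[{'id': 1, 'name': \"Director's Cut\"}]"
parsed_tricky = convert_to_dict_list(tricky)

# Swapping quote characters turns the apostrophe into a stray double quote,
# so the result is no longer valid JSON.
try:
    json.loads(tricky.replace("'", '"'))
    quote_swap_ok = True
except ValueError:
    quote_swap_ok = False
```

ast.literal_eval only evaluates Python literals, so it does not execute arbitrary code the way eval would. The second example shows why a blanket quote replacement can fail on values that contain apostrophes.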

This row is typical of the malformed rows that were breaking pandas.read_csv, in which one or more of the first columns are missing. Here, the first nine columns are missing:

Given the pattern of missing data, it made sense to identify and drop these rows before serializing and exporting the DataFrames as pickle files. The first column is a boolean value indicating whether the movie is an adult film. Returning NaN for ‘adult’ values that couldn’t be converted to a valid boolean gave me a convenient list of rows to drop. Interestingly, it turns out that rows with invalid ‘adult’ values also had invalid Movie Database IDs and IMDB IDs. The columns that were expected to contain boolean values were converted to bool after the cleanup steps.
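The drop-then-convert sequence described above might be sketched like this. The to_bool helper, the two-column layout, and the sample rows are assumptions for illustration only.

```python
import io
import math

import pandas as pd


def to_bool(value):
    """Map 'True'/'False' strings to booleans; anything else becomes NaN."""
    if value == "True":
        return True
    if value == "False":
        return False
    return math.nan


# Hypothetical sample: the second data row has a non-boolean 'adult' value.
SAMPLE_CSV = io.StringIO(
    "adult,title\n"
    "False,Toy Story\n"
    "not-a-boolean,Broken Row\n"
    "True,Some Film\n"
)

df = pd.read_csv(
    SAMPLE_CSV,
    header=0,
    names=["adult", "title"],
    converters={"adult": to_bool},
)

# Rows whose 'adult' value could not be converted are the ones to drop.
clean = df.dropna(subset=["adult"]).copy()

# With the NaNs gone, the column can be stored as a real boolean dtype.
clean["adult"] = clean["adult"].astype(bool)
print(clean)
```

The conversion to bool has to wait until after the drop, because NaN is truthy and astype(bool) would otherwise turn it into True.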

I also faced the problem of inconsistent movie ID column names across the datasets. The movie IDs from IMDB are stored inconsistently as well: with ‘tt’ prepended to the numeric ID (consistent with the IDs used in IMDB URLs) in movies_metadata.csv, and without it in links.csv and links_small.csv. Since I was already setting column names explicitly, it made sense to clean up the headers and standardize column names across all the datasets during the import step. Cleaning up the IMDB IDs was also simple to do in the converter function. Reading the chapter on cleaning data in Data Wrangling with Python was helpful during this step.
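A converter along these lines could normalize both ID formats. The to_imdb_id name, the choice to return an int, and the sample IDs are assumptions for this sketch rather than details from the full script.

```python
import math


def to_imdb_id(value):
    """Strip an optional leading 'tt' and return the numeric ID as an int.

    Returns NaN when the remaining text is not a valid integer.
    """
    value = str(value).strip()
    if value.startswith("tt"):
        value = value[2:]
    try:
        return int(value)
    except ValueError:
        return math.nan


# The same movie written in the two styles described above.
from_metadata = to_imdb_id("tt0114709")
from_links = to_imdb_id("0114709")
print(from_metadata, from_links)
```

Once both files use the same representation and the same column name, the ID column can serve as a join key between the datasets.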

I uploaded the full script as a GitHub Gist. The pickled DataFrames are useful for import into Jupyter notebooks to do analysis.
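For reference, a pickle round trip with pandas can be as short as the sketch below. The file name and the tiny DataFrame are placeholders, not the files produced by the full script.

```python
import os
import tempfile

import pandas as pd

# Placeholder DataFrame standing in for the cleaned movie metadata.
df = pd.DataFrame({"tmdb_id": [862], "title": ["Toy Story"], "adult": [False]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movies_metadata.pkl")
    df.to_pickle(path)  # serialize the DataFrame, dtypes included
    restored = pd.read_pickle(path)  # e.g. from inside a Jupyter notebook

print(restored.dtypes)
```

Unlike a CSV export, the pickle keeps the column dtypes and the parsed Python objects, so the converters do not have to run again in the notebook. Pickle files should only be loaded from sources you trust.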
