Importing Stringified JSON Objects Into Pandas (Part 1)
All Python code in this post requires Python 3.5+.
I’m continuing to work with the same Kaggle movies dataset as in the SQL import experiment.
This time, I imported the data into Pandas DataFrames.
The trickiest dataset to import was movies_metadata.csv. I first tried to use pandas.read_csv with the default settings.
import argparse
import pandas as pd


def main(path: str) -> None:
    movies_metadata_df = pd.read_csv('{}/movies_metadata.csv'.format(path))
    print(movies_metadata_df.dtypes)


if '__main__' == __name__:
    parser = argparse.ArgumentParser(
        description='Load and import movies data files to Pandas dataframes. '
                    'Load files from and outputs pickled dataframes to PATH. '
                    'Default is "."')
    parser.add_argument('--path', default='.', required=False)
    args = parser.parse_args()
    main(args.path)
pandas.read_csv loaded the movies_metadata dataset, but the resulting DataFrame did not have the data types I expected. For instance, the first column, adult, should be boolean, yet it came back as object.
Pandas also raised a DtypeWarning for column 10 (popularity, counting from zero). With its default settings, pandas.read_csv did a poor job of inferring column data types.
sys:1: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object
Process finished with exit code 0
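The warning itself suggests specifying the dtype option. As a minimal sketch of that idea (not the approach this post ends up using), a mixed-type column can be read as text and converted afterwards, so that unparseable values become NaN instead of leaving the column as object. The sample CSV below is made up for illustration, and treating popularity as a numeric column is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for movies_metadata.csv: a numeric column
# that contains one stray non-numeric value.
SAMPLE_CSV = io.StringIO(
    "id,popularity\n"
    "1,21.9\n"
    "2,17.0\n"
    "3,not-a-number\n"
)

# Read the ambiguous column as text so pandas does not have to guess its type.
df = pd.read_csv(SAMPLE_CSV, dtype={"popularity": str})

# Convert afterwards; values that are not numbers become NaN instead of raising.
df["popularity"] = pd.to_numeric(df["popularity"], errors="coerce")

print(df.dtypes)
print("Unparseable values:", df["popularity"].isna().sum())
```

Counting the NaN values after the conversion gives a quick measure of how many rows held something other than a number.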
Next, as a test, I tried setting the data types explicitly, starting with just the first two columns (adult and belongs_to_collection):
import argparse
import pandas as pd


def main(path: str) -> None:
    movies_metadata_df = pd.read_csv('{}/movies_metadata.csv'.format(path),
                                     dtype={'adult': 'bool', 'belongs_to_collection': object})


if '__main__' == __name__:
    parser = argparse.ArgumentParser(
        description='Load and import movies data files to Pandas dataframes. '
                    'Load files from and outputs pickled dataframes to PATH. '
                    'Default is "."')
    parser.add_argument('--path', default='.', required=False)
    args = parser.parse_args()
    main(args.path)
That change broke the import:
/home/ayla/miniconda3/envs/ml_env/bin/python /home/ayla/movies/test_movies_metadata.csv.py
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1184, in pandas._libs.parsers.TextReader._convert_tokens (pandas/_libs/parsers.c:15254)
TypeError: Cannot cast array from dtype('O') to dtype('bool') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ayla/movies/test_movies_metadata.csv.py", line 22, in <module>
    main(args.path)
  File "/home/ayla/movies/test_movies_metadata.csv.py", line 7, in main
    dtype={'adult': 'bool', 'belongs_to_collection': object})
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
  File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
  File "pandas/_libs/parsers.pyx", line 989, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:12175)
  File "pandas/_libs/parsers.pyx", line 1117, in pandas._libs.parsers.TextReader._convert_column_data (pandas/_libs/parsers.c:14136)
  File "pandas/_libs/parsers.pyx", line 1192, in pandas._libs.parsers.TextReader._convert_tokens (pandas/_libs/parsers.c:15475)
ValueError: cannot safely convert passed user dtype of bool for object dtyped data in column 0

Process finished with exit code 1
Clearly I needed to try a different approach.
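The ValueError suggests that column 0 (adult) holds at least one value that pandas cannot parse as a boolean. One way to check, sketched below, is to read the column as text and list whatever is not 'True' or 'False'. The sample data is hypothetical, and the assumption that clean rows use exactly the strings 'True' and 'False' would need to be verified against the real file. This is a diagnostic sketch only, not necessarily the approach taken later in this series.

```python
import io

import pandas as pd

# Hypothetical sample: the "adult" column holds one value that is not a boolean.
SAMPLE_CSV = io.StringIO(
    "adult,title\n"
    "False,First Film\n"
    "True,Second Film\n"
    "not a boolean,Broken Row\n"
)

# Read the column as text so the import cannot fail on conversion.
df = pd.read_csv(SAMPLE_CSV, dtype={"adult": str})

# List the rows whose values block a clean cast to bool.
is_boolean_text = df["adult"].isin(["True", "False"])
bad_rows = df[~is_boolean_text]
print(bad_rows)

# Keep only the clean rows, then map the text to real booleans.
clean = df[is_boolean_text].copy()
clean["adult"] = clean["adult"].map({"True": True, "False": False}).astype(bool)
print(clean.dtypes)
```

Dropping the offending rows is only one option; whether to drop, repair, or flag them depends on what the stray values turn out to be.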