Importing Stringified JSON Objects Into Pandas (Part 1)

Published November 24, 2017 in data, programming

All Python code in this post targets Python 3.5+.

I’m continuing to work with the same Kaggle movies dataset as in the SQL import experiment.
This time, I imported the data into Pandas DataFrames.

The trickiest dataset to import was movies_metadata.csv. I first tried to use pandas.read_csv with the default settings.

import argparse
import pandas as pd


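def main(path: str) -> None:
    # Read the CSV with default settings and let Pandas infer each column's type.
    movies_metadata_df = pd.read_csv('{}/movies_metadata.csv'.format(path))
    # Print the inferred dtype of every column.
    print(movies_metadata_df.dtypes)


if __name__ == '__main__':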
    parser = argparse.ArgumentParser(
        description='Load and import movies data files to Pandas dataframes. '
                    'Load files from and outputs pickled dataframes to PATH. '
                    'Default is "."')
    parser.add_argument('--path', default='.', required=False)
    args = parser.parse_args()
    main(args.path)

I was able to read the movies_metadata dataset into Pandas, but the resulting DataFrame did not contain the expected data types. For instance, the first column, adult, should be a boolean, but Pandas read it as object.

There was also a Pandas DtypeWarning to consider: column 10 contained mixed types. With the default settings, pandas.read_csv did a poor job of inferring the column data types.

sys:1: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object

Process finished with exit code 0
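To illustrate how a column that looks boolean can end up with a non-boolean dtype, here is a minimal sketch on made-up inline data. The two CSV strings are hypothetical stand-ins, not rows from movies_metadata.csv, and the stray value is an assumption about what could cause this, not something confirmed from the real file.

```python
import io

import pandas as pd

# Hypothetical sample data: one clean flag column, one with a stray value.
CLEAN_CSV = "adult,title\nTrue,A\nFalse,B\n"
DIRTY_CSV = "adult,title\nTrue,A\nFalse,B\nnot a flag,C\n"

clean_df = pd.read_csv(io.StringIO(CLEAN_CSV))
dirty_df = pd.read_csv(io.StringIO(DIRTY_CSV))

# Every value parses as a boolean, so the column is inferred as bool.
print(clean_df['adult'].dtype)

# One value does not parse as a boolean, so the whole column falls back
# to a non-boolean dtype and the values are kept as text.
print(dirty_df['adult'].dtype)
```

If the real adult column behaves like the second sample, a single unexpected value would be enough to explain the object dtype in the output above.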

Next, I tried setting the data types explicitly, starting with just the first two columns (adult and belongs_to_collection) as a test:

import argparse
import pandas as pd


def main(path: str) -> None:
    movies_metadata_df = pd.read_csv('{}/movies_metadata.csv'.format(path),
                                     dtype={'adult': 'bool', 'belongs_to_collection': object})


if '__main__' == __name__:
    parser = argparse.ArgumentParser(
        description='Load and import movies data files to Pandas dataframes. '
                    'Load files from and outputs pickled dataframes to PATH. '
                    'Default is "."')
    parser.add_argument('--path', default='.', required=False)
    args = parser.parse_args()
    main(args.path)

That change broke the import:

home/ayla/miniconda3/envs/ml_env/bin/python /home/ayla/movies/test_movies_metadata.csv.py
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1184, in pandas._libs.parsers.TextReader._convert_tokens (pandas/_libs/parsers.c:15254)
TypeError: Cannot cast array from dtype('O') to dtype('bool') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ayla/movies/test_movies_metadata.csv.py", line 22, in <module>
    main(args.path)
  File "/home/ayla/movies/test_movies_metadata.csv.py", line 7, in main
    dtype={'adult': 'bool', 'belongs_to_collection': object})
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
  File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
  File "pandas/_libs/parsers.pyx", line 989, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:12175)
  File "pandas/_libs/parsers.pyx", line 1117, in pandas._libs.parsers.TextReader._convert_column_data (pandas/_libs/parsers.c:14136)
  File "pandas/_libs/parsers.pyx", line 1192, in pandas._libs.parsers.TextReader._convert_tokens (pandas/_libs/parsers.c:15475)
ValueError: cannot safely convert passed user dtype of bool for object dtyped data in column 0

Process finished with exit code 1

Clearly I needed to try a different approach.
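One way to investigate an error like this is to read the column as plain text and look for values that are not the literals True or False. The sketch below does that on hypothetical inline data. The sample rows and the variable names (raw_df, is_flag, bad_rows, clean_df) are made up for illustration, and this is not necessarily the approach taken in the rest of this series.

```python
import io

import pandas as pd

# Hypothetical sample: the last row carries a value that is not a flag.
SAMPLE_CSV = "adult,title\nTrue,A\nFalse,B\nnot a flag,C\n"

# Read every column as text so nothing is cast during parsing.
raw_df = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype=str)

# Mark rows whose 'adult' value is one of the two expected literals.
is_flag = raw_df['adult'].isin(['True', 'False'])

# Rows that would block a direct cast to bool.
bad_rows = raw_df[~is_flag]
print(bad_rows)

# Convert only the rows that passed the check.
clean_df = raw_df[is_flag].copy()
clean_df['adult'] = clean_df['adult'].map({'True': True, 'False': False}).astype(bool)
print(clean_df.dtypes)
```

Reading as text first keeps the parse from failing, and the bad rows can then be inspected before deciding whether to drop or repair them.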
