Importing Stringified JSON Objects Into Pandas (Part 2)

Published November 30, 2017 in data, programming

All Python code in this post targets Python 3.5+.

Continuing from Part 1, I discovered that movies_metadata.csv contains malformed rows with missing fields, which is what caused the import to fail. I experimented with some of the more advanced pandas.read_csv parameters to see if I could work around the malformed rows.

import pandas as pd


def main(path: str) -> None:
    # Force dtypes on the first two columns and ask pandas to skip and report bad lines.
    movies_metadata_df = pd.read_csv('{}/movies_metadata.csv'.format(path),
        dtype={'adult': 'bool', 'belongs_to_collection': object},
        error_bad_lines=False, warn_bad_lines=True)
    print(movies_metadata_df.dtypes)

Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1184, in pandas._libs.parsers.TextReader._convert_tokens (pandas/_libs/parsers.c:15254)
TypeError: Cannot cast array from dtype('O') to dtype('bool') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_movies_metadata.csv.py", line 27, in <module>
    main(args.path)
  File "test_movies_metadata.csv.py", line 12, in main
    dtype={'adult': 'bool', 'belongs_to_collection': object}, error_bad_lines=False, warn_bad_lines=True)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
  File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
  File "pandas/_libs/parsers.pyx", line 989, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:12175)
  File "pandas/_libs/parsers.pyx", line 1117, in pandas._libs.parsers.TextReader._convert_column_data (pandas/_libs/parsers.c:14136)
  File "pandas/_libs/parsers.pyx", line 1192, in pandas._libs.parsers.TextReader._convert_tokens (pandas/_libs/parsers.c:15475)
ValueError: cannot safely convert passed user dtype of bool for object dtyped data in column 0

Because the malformed rows have too few fields rather than too many, the error_bad_lines option was not helpful: it only skips rows with extra fields. The default ‘c’ parsing engine still raised a ValueError when it tried to cast the adult column to bool. The Python fixed-width formatted line engine (‘python-fwf’) was also not helpful.
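
To illustrate why a bad-lines option cannot catch this kind of row, here is a minimal sketch on made-up data (the three-column sample is an assumption and is not taken from movies_metadata.csv). It shows that read_csv keeps a row with too few fields and pads the missing trailing field with NaN instead of treating the row as a bad line.

```python
import io

import pandas as pd

# Hypothetical sample: the last data row has two fields instead of three.
SAMPLE = "adult,budget,title\nFalse,100,First\nTrue,200\n"

df = pd.read_csv(io.StringIO(SAMPLE))

# The short row is kept; its missing 'title' field is filled with NaN.
n_rows = len(df)
short_row_title_is_nan = bool(pd.isna(df.loc[1, "title"]))
print(df)
print(n_rows, short_row_title_is_nan)
```

Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 in favour of on_bad_lines, so the calls in this post need adjusting on recent pandas versions.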

Switching to the more flexible ‘python’ engine succeeded when only the first two column types were specified. As soon as I assigned a type to a third column, parsing failed again on a malformed row.

def main(path: str) -> None:
    # Same call as before, now with engine='python' and a third dtype ('budget').
    movies_metadata_df = pd.read_csv('{}/movies_metadata.csv'.format(path),
        dtype={'adult': 'bool', 'belongs_to_collection': object, 'budget': 'float'},
        error_bad_lines=False, warn_bad_lines=True, engine='python')
    print(movies_metadata_df.dtypes)

Traceback (most recent call last):
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1567, in _cast_types
    values = astype_nansafe(values, cast_type, copy=True)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 636, in astype_nansafe
    return arr.astype(dtype)
ValueError: could not convert string to float: '/zaSf5OG7V8X8gqFvly88zDdRm46.jpg'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_movies_metadata.csv.py", line 19, in <module>
    main(args.path)
  File "test_movies_metadata.csv.py", line 8, in main
    error_bad_lines=False, warn_bad_lines=True, engine='python')
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 2180, in read
    data = self._convert_data(data)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 2249, in _convert_data
    clean_conv, clean_dtypes)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1482, in _convert_to_ndarrays
    cvals = self._cast_types(cvals, cast_type, c)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1570, in _cast_types
    "type %s" % (column, cast_type))
ValueError: Unable to convert column budget to type float

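The first of the malformed rows in movies_metadata.csv is on line 129, so it made sense to try skipping it with the skiprows parameter. Line numbers passed to skiprows are zero-indexed, so I used 128. (Passed as a list, skiprows=[128] skips only that line; a bare integer 128 would instead skip the first 128 lines, header included.) Unfortunately, using skiprows also failed with the same ValueError as before, most likely because it removes only the first of the malformed rows.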
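
One way to find every malformed row, and not just the first, is to count fields with the standard-library csv module before handing the file to pandas. This is a sketch under assumptions: find_short_rows is a hypothetical helper and the sample data is made up; it is not how line 129 was located in this post.

```python
import csv
import io
from typing import Iterable, List, Tuple


def find_short_rows(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, field_count) for rows with fewer fields than the header."""
    reader = csv.reader(lines)
    expected = len(next(reader))  # the header defines the expected field count
    short_rows = []
    for row in reader:
        # Ignore blank lines; flag any record with fewer fields than the header.
        if row and len(row) < expected:
            # line_num is the 1-based physical line on which this record ended.
            short_rows.append((reader.line_num, len(row)))
    return short_rows


# Hypothetical sample: the record on line 3 has two fields instead of three.
SAMPLE = "adult,budget,title\nFalse,100,First\nTrue,200\nFalse,300,Third\n"
result = find_short_rows(io.StringIO(SAMPLE))
print(result)
```

For a real file, pass an open file object instead of the StringIO, opened with newline="" as the csv module documentation recommends. Subtracting 1 from each reported line number gives the zero-indexed values that a skiprows list expects.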

My final attempt to parse the file used converter functions, which finally solved my import problems.
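
As a rough idea of what converter functions look like, here is a minimal sketch. The to_bool and to_float helpers and the sample rows are assumptions for illustration and are not the converters actually used for movies_metadata.csv. Each converter receives the raw cell value and returns a missing value when the cell cannot be interpreted, so a shifted or truncated row no longer aborts the import.

```python
import io

import pandas as pd


def to_bool(value):
    """Map 'True'/'False' to bool; anything else becomes a missing value."""
    text = str(value).strip()
    if text == "True":
        return True
    if text == "False":
        return False
    return None


def to_float(value):
    """Parse a float; unparseable or empty cells become NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float("nan")


# Hypothetical sample: the second data row is malformed (stray text, too few fields).
SAMPLE = (
    "adult,belongs_to_collection,budget\n"
    "False,,30000000\n"
    "stray text,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg\n"
    "True,,65000000\n"
)

df = pd.read_csv(
    io.StringIO(SAMPLE),
    converters={"adult": to_bool, "budget": to_float},
)
print(df)
```

Rows that come back with a missing adult value can then be inspected or dropped, for example with df[df["adult"].notna()].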
