Archive: November, 2017
All Python code in this post is Python 3.5+. Continuing from Part 1, I discovered that movies_metadata.csv contains malformed rows with missing fields, which is what caused the file import to fail. I experimented with some of pandas.read_csv's more advanced parameters to see if I could work around the malformed rows.
```python
import pandas as pd


def main(path: str) -> None:
    movies_metadata_df = pd.read_csv(
        '{}/movies_metadata.csv'.format(path),
        dtype={'adult': 'bool', 'belongs_to_collection': object},
        error_bad_lines=False, warn_bad_lines=True)
    print(movies_metadata_df.dtypes)
```
```
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1184, in pandas._libs.parsers.TextReader._convert_tokens (pandas/_libs/parsers.c:15254)
TypeError: Cannot cast array from dtype('O') to dtype('bool') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_movies_metadata.csv.py", line 27, in <module>
    main(args.path)
  File "test_movies_metadata.csv.py", line 12, in main
    dtype={'adult': 'bool', 'belongs_to_collection': object}, error_bad_lines=False, warn_bad_lines=True)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "/home/ayla/miniconda3/envs/ml_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
  File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
  File "pandas/_libs/parsers.pyx", line 989, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:12175)
  File "pandas/_libs/parsers.pyx", line 1117, in pandas._libs.parsers.TextReader._convert_column_data (pandas/_libs/parsers.c:14136)
  File "pandas/_libs/parsers.pyx", line 1192, in pandas._libs.parsers.TextReader._convert_tokens (pandas/_libs/parsers.c:15475)
ValueError: cannot safely convert passed user dtype of bool for object dtyped data in column 0
```
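The error says pandas cannot safely cast the object-typed column to bool, because some cells hold values that aren't 'True' or 'False'. One possible workaround (a minimal sketch with made-up sample rows, not the real file, and under the assumption that only the 'adult' values are corrupted) is to read the column as strings and coerce it to bool afterwards, so malformed values become NaN instead of raising:

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for movies_metadata.csv;
# the last row has a malformed 'adult' value.
csv_data = io.StringIO(
    "adult,title\n"
    "False,Toy Story\n"
    "True,Some Movie\n"
    "- Written by Someone.,Broken Row\n"
)

# Read the ambiguous column as strings first: no unsafe cast is attempted.
df = pd.read_csv(csv_data, dtype={'adult': str})

# Coerce explicitly; anything other than 'True'/'False' maps to NaN.
df['adult'] = df['adult'].map({'True': True, 'False': False})

print(df['adult'].tolist())
```

The malformed rows can then be found with `df['adult'].isna()` and dropped or inspected.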
[…]
All Python code in this post is Python 3.5+. I'm continuing to work with the same Kaggle movies dataset as in the SQL import experiment. This time, I imported the data into pandas DataFrames. The trickiest file to import was movies_metadata.csv. I first tried pandas.read_csv with the default settings.
```python
import argparse

import pandas as pd


def main(path: str) -> None:
    movies_metadata_df = pd.read_csv('{}/movies_metadata.csv'.format(path))
    print(movies_metadata_df.dtypes)


if '__main__' == __name__:
    parser = argparse.ArgumentParser(
        description='Load and import movies data files to Pandas dataframes. '
                    'Load files from and outputs pickled dataframes to PATH. '
                    'Default is "."')
    parser.add_argument('--path', default='.', required=False)
    args = parser.parse_args()
    main(args.path)
```
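The script's description mentions outputting pickled DataFrames; a minimal sketch of that round trip, using a hypothetical DataFrame and a temporary path rather than the real dataset:

```python
import tempfile

import pandas as pd

# A stand-in DataFrame; the real one would come from pd.read_csv.
df = pd.DataFrame({'title': ['Toy Story'], 'budget': [30000000]})

with tempfile.TemporaryDirectory() as path:
    pickle_path = '{}/movies_metadata.pkl'.format(path)
    df.to_pickle(pickle_path)               # write the DataFrame to disk
    restored = pd.read_pickle(pickle_path)  # read it back unchanged

print(restored.equals(df))  # prints True
```

Pickling preserves dtypes exactly, which is handy after the cleanup work above, though the files are Python-specific.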
I was able […]
All Python code is Python 3.5+. The PostgreSQL database version is 10. I recently started digging into the Kaggle movies dataset, which is a collection of CSV files. I was curious to see whether the data could be inserted into a SQL database (PostgreSQL) for further exploration. The credits.csv file contains two columns (cast, crew) of […]
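The cast and crew cells in credits.csv hold serialized, nested structures rather than scalar values. Assuming they are Python-literal strings (the sample cell below is a hypothetical, shortened example of that shape), the standard library can parse one safely before any SQL insert:

```python
import ast

# Hypothetical, shortened example of one serialized 'cast' cell;
# the real cells are much longer lists of dicts.
raw_cell = "[{'cast_id': 14, 'character': 'Woody (voice)', 'name': 'Tom Hanks'}]"

cast = ast.literal_eval(raw_cell)  # safely evaluates the Python literal
print(cast[0]['name'])  # prints Tom Hanks
```

`ast.literal_eval` only accepts literals (no function calls or names), so it is a safe choice for untrusted cell contents where `eval` would not be.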
All Python code is Python 3.5+. A few months ago, I needed to quickly extract a small amount of data from a large, deeply nested JSON file and export it to CSV. I was working in C++ and Python on this project, so my first attempts to extract the data were using the Python json […]
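The shape of that task can be sketched with the standard library alone; the nested document and field names below are hypothetical stand-ins for the real file:

```python
import csv
import io
import json

# Hypothetical nested document standing in for the large JSON file.
raw = json.dumps({
    "report": {
        "entries": [
            {"meta": {"id": 1, "score": 0.9}},
            {"meta": {"id": 2, "score": 0.7}},
        ]
    }
})

data = json.loads(raw)

# Pull out only the fields of interest and write them as CSV rows.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['id', 'score'])
for entry in data['report']['entries']:
    meta = entry['meta']
    writer.writerow([meta['id'], meta['score']])

print(out.getvalue())
```

For a file too large to hold in memory, json.load would be swapped for an incremental parser, but the extract-then-write structure stays the same.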