
JSONB arrays in PostgreSQL tables

Published November 13, 2017 in programming

All Python code is Python 3.5+. The PostgreSQL database version is 10.

I started digging into the Kaggle movies dataset recently, which is a collection of CSV files. I was curious to see if the data could be inserted into a SQL database (PostgreSQL) for further exploration.

The credits.csv file contains two columns (cast, crew) of “stringified” JSON objects, which incidentally do not follow JSON quoting standards (single quotes instead of double). Fortunately there was a useful suggestion in the dataset discussion to use Python’s Abstract Syntax Tree (ast) module to parse the cast and crew columns.
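For example, ast.literal_eval parses these single-quoted records, where json.loads would fail. A minimal sketch, using a shortened record from credits.csv:

import ast

# A shortened, single-quoted "stringified" record like the ones in credits.csv.
raw = "[{'cast_id': 1, 'character': 'Dora Carrington', 'name': 'Emma Thompson'}]"

cast_list = ast.literal_eval(raw)  # json.loads() would reject the single quotes
print(cast_list[0]['name'])        # Emma Thompson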

The datasets contain two different movie IDs: one set from The Movie Database (tmdbId) and one set from IMDB (imdbId). The links.csv and links_small.csv files contain both the Movie Database and IMDB IDs, plus a movieId column that looks like an autoincremented index. The credits.csv file has the Movie Database IDs in an id column. A small number of rows in credits.csv share the same Movie Database ID, and links.csv is missing some tmdbId entries. I decided to build my SQL credits table by renaming the id column to tmdbId, creating a serial autoincremented ID column, and combining the two ID columns into a composite primary key. There is also a keyword conflict with cast, so I renamed the cast and crew columns as well.
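A quick way to confirm the duplicates (a side sketch, not part of the script below) is to count the id values in credits.csv:

import csv
from collections import Counter

# Count how often each Movie Database ID appears in credits.csv;
# a handful of IDs show up more than once.
with open('credits.csv', 'r') as csvfile:
    counts = Counter(row['id'] for row in csv.DictReader(csvfile))

duplicates = {movie_id: n for movie_id, n in counts.items() if n > 1}
print(len(duplicates), 'duplicated tmdbIds')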

I used the PostgreSQL JSONB data type to store the JSON arrays in the credits table. This blog post helped me pick between JSON and JSONB. I used SQLAlchemy to interact with Postgres through its Table, Column and MetaData API. The documentation, this blog post and this Stack Overflow post were useful resources for writing the SQLAlchemy code.

import ast
import csv
import logging

from sqlalchemy import cast, create_engine, Column, Integer, MetaData, Table, Index
from sqlalchemy.dialects.postgresql import ARRAY, JSONB


# Cast the bound Python list to the array type so an ARRAY of JSONB is handled
# correctly on insert (pattern from the SQLAlchemy docs for ARRAY of JSON).
class CastingArray(ARRAY):
    def bind_expression(self, bindvalue):
        return cast(bindvalue, self)


logging.basicConfig()
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

engine = create_engine('postgresql://docker:docker@localhost:5432')
metadata = MetaData(engine)

links_table = Table('links', metadata,
                    Column('movieId', Integer, primary_key=True),
                    Column('imdbId', Integer, unique=True, nullable=False, index=True),
                    Column('tmdbId', Integer))


credits_table = Table('credits', metadata,
                      Column('id', Integer, autoincrement=True, primary_key=True),
                      Column('tmdbId', Integer, primary_key=True),
                      Column('cast_data', CastingArray(JSONB)),
                      Column('crew_data', CastingArray(JSONB)))

Index('credits_pk', credits_table.c.id, credits_table.c.tmdbId)

with engine.connect() as conn:
    metadata.create_all()

    with open('links.csv', 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            columns = []
            for key, value in row.items():
                try:
                    columns.append(int(value))
                except Exception:
                    # links.csv is missing some tmdbId entries; store NULL instead
                    columns.append(None)

            conn.execute(links_table.insert().values(columns))

    # credits.csv contains "stringified" JSON objects in the cast and crew columns
    with open('credits.csv', 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            columns = []
            for key, value in row.items():
                try:
                    if key == 'id':
                        # the Movie Database ID column
                        columns.append(int(value))
                    else:
                        # cast/crew: parse the single-quoted "JSON" with ast
                        converted_value = ast.literal_eval(value)
                        columns.append(converted_value)
                except Exception:
                    # anything unparseable becomes NULL
                    columns.append(None)

            # relies on the credits.csv column order: cast, crew, id
            conn.execute(credits_table.insert().values(tmdbId=columns[2],
                                                       cast_data=columns[0],
                                                       crew_data=columns[1]))

Querying the database for tmdbId == 47018 returns the movie Carrington, whose URL contains that ID.
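The query itself isn't shown above; a minimal SQLAlchemy sketch along these lines produces the logged statement below (note the parameter is passed as the string '47018', matching the log output):

with engine.connect() as conn:
    query = credits_table.select().where(credits_table.c.tmdbId == '47018')
    print(conn.execute(query).fetchone())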

INFO:sqlalchemy.engine.base.Engine:SELECT credits.id, credits."tmdbId", credits.cast_data, credits.crew_data 
FROM credits 
WHERE credits."tmdbId" = %(tmdbId_1)s
INFO:sqlalchemy.engine.base.Engine:{'tmdbId_1': '47018'}
(35, 47018, [{'id': 7056, 'name': 'Emma Thompson', 'order': 0, 'gender': 1, 'cast_id': 1, 'character': 'Dora Carrington', 'credit_id': '52fe471bc3a36847f8123a9f', ... (2109 characters truncated) ... 'gender': 0, 'cast_id': 15, 'character': 'Military Rep', 'credit_id': '57f7feffc3a36875b00009f1', 'profile_path': '/h4nFGD3zGD3cCAJ6dkydQOgiWi9.jpg'}], [{'id': 4034, 'job': 'Costume Design', 'name': 'Penny Rose', 'gender': 1, 'credit_id': '58594c1ac3a3682fb807e4cf', 'department': 'Costume & Make-Up',  ... (4849 characters truncated) ... 'Hairstylist', 'name': 'Elaine Davis', 'gender': 0, 'credit_id': '58594e37c3a36806a705e875', 'department': 'Costume & Make-Up', 'profile_path': None}])
