All Python code in this post targets Python 3.5+.
The first feature I wanted to explore was the distribution of movies by year in the movies_metadata dataset. Extracting the year from the release_date datetime column was straightforward, and it was just as easy to filter out unreleased movies.
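As a rough sketch of that step, the snippet below parses the dates, extracts the year, and keeps only released movies. The tiny DataFrame stands in for movies_metadata, and the `status` column with a `"Released"` value is an assumption about how the dataset marks unreleased movies, so adjust the filter to whatever your copy of the data uses.

```python
import pandas as pd

# Tiny stand-in for movies_metadata. The rows and the `status` column
# are illustrative assumptions, not data from the real file.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": ["1995-10-30", "2017-07-21", "not a date", "2020-12-25"],
    "status": ["Released", "Released", "Released", "Post Production"],
})

# Parse the dates; values that cannot be parsed become NaT instead of raising.
movies["release_date"] = pd.to_datetime(movies["release_date"], errors="coerce")

# Drop rows without a usable date, then pull out the year as an integer.
movies = movies.dropna(subset=["release_date"]).copy()
movies["year"] = movies["release_date"].dt.year.astype(int)

# Keep only movies marked as released (assumed marker, see above).
released = movies[movies["status"] == "Released"]
```

If release_date is already a datetime column, the `pd.to_datetime` call is harmless and the rest works unchanged.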
A quick way to visualize the distribution of movies by year is a histogram, especially since the release dates in the dataset span a continuous range from 1874 to 2020 (when unreleased movies are included).
A histogram summarizes a data set by dividing the data into bins and plotting bars with areas reflecting the amount of data in each bin. One very common use case is showing the frequency of pixel intensities in image processing tools like OpenCV.
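To make the binning idea concrete, here is a minimal sketch of equal-width binning in plain Python. The helper name `bin_counts` and the sample years are made up for illustration. Libraries such as NumPy and Pandas do the same job with more options.

```python
def bin_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # A value sitting exactly on the top edge goes into the last bin.
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


# Made-up years, split into five 2-year bins between 1990 and 2000.
years = [1990, 1991, 1994, 1995, 1995, 1999, 2000]
counts = bin_counts(years, 1990, 2000, 5)
print(counts)  # [2, 0, 3, 0, 2]
```

Because every bin has the same width here, the bar heights alone reflect the amount of data in each bin.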
The Pandas hist function has reasonable defaults, but I wanted a plot with better formatting. To see the years at a finer grain, I chose a bin count that divides the data into roughly 2-year bins. I then used functions from matplotlib's pyplot module to customize the plot further. The x-axis labels were easier to read with ticks at 4-year intervals. I also increased the font size of the plot and year labels for readability.
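A sketch of what that customization might look like is below. The sample years, figure size, label text, and font sizes are all placeholder assumptions rather than the values behind the plot in this post.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative years only; in the post these come from movies_metadata.
years = pd.Series(
    [1990, 1991, 1994, 1995, 1995, 1999, 2000, 2003, 2004, 2006],
    name="year",
)

first, last = int(years.min()), int(years.max())
n_bins = max(1, (last - first) // 2)  # roughly one bin per 2 years

# Series.hist draws the histogram and returns the matplotlib Axes.
ax = years.hist(bins=n_bins, figsize=(12, 6))

# pyplot functions act on the current figure and axes.
plt.xticks(range(first, last + 1, 4), fontsize=12)  # a tick every 4 years
plt.yticks(fontsize=12)
plt.title("Movies by release year", fontsize=16)
plt.xlabel("Year", fontsize=14)
plt.ylabel("Number of movies", fontsize=14)
```

Call `plt.show()` to display the figure interactively, or `plt.savefig("movies_by_year.png")` to write it to a file.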
The plot, not surprisingly, shows that the dataset contains mostly recent movie data. The number of movies recorded per year increases roughly exponentially until about 2014. There’s a noticeable drop in the number of movies counted in the complete dataset around 1944 to 1948. This drop disappears when unreleased movies are excluded from the plot. It would be interesting to explore the frequency of released vs. unreleased movies around World War II further and see which countries are represented.
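One possible starting point for that follow-up is to tabulate per-year counts for all movies next to released movies in the window of interest. The rows below are invented purely to show the shape of the comparison, and the `status` column is again an assumption about the dataset.

```python
import pandas as pd

# Invented rows for illustration; not taken from movies_metadata.
movies = pd.DataFrame({
    "year": [1943, 1944, 1944, 1946, 1948, 1948, 1950],
    "status": [
        "Released", "Released", "Rumored", "Released",
        "Released", "Post Production", "Released",
    ],
})

# Restrict to the years around the drop (both endpoints included).
window = movies[movies["year"].between(1944, 1948)]
is_released = window["status"] == "Released"

# Per-year counts for all movies and for released movies, side by side.
counts = pd.DataFrame({
    "all": window["year"].value_counts(),
    "released": window.loc[is_released, "year"].value_counts(),
}).fillna(0).astype(int).sort_index()

print(counts)
```

Grouping the same window by a country column, if the data has one, would be the natural next step.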