- data wrangling
The Python FuzzyWuzzy module uses Levenshtein edit distance to implement fuzzy string matching. FuzzyWuzzy’s matching tools return results on a scale from 0 to 100. The simplest matching tool FuzzyWuzzy offers is the ratio(..) function: The basic ratio function works well for simple string matching. However, if you’re trying to fuzzy match a single word […]
Reader comments on an old post about the ijson parser prompted me to check out the project’s more recent releases. The latest pre-release (v3.0rc1) added a coroutine interface, which allows users to supply their own file readers and have more control over when the parser is called. It looked like a fun feature to explore, […]
Figuring out how similar two strings are and then making that similarity a quantitative measurement is a basic problem in text analysis, text mining and natural language processing. There are a number of efficient methods to solve this problem. This survey looks at Python implementations of a simple but widely used method: Levenshtein distance as […]
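As a rough illustration of the method named in this excerpt, here is a minimal pure-Python sketch of Levenshtein distance using the standard two-row dynamic programming approach. It is not taken from the post, which is cut off above; the function name `levenshtein` is chosen only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The raw distance grows with string length, so it is often normalized before being used as a similarity score.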
This question came up recently on a project where Pandas data needed to be fed to a TensorFlow classifier. In this case, we wanted to divide the dataframe using random sampling. Frameworks like scikit-learn may have utilities to split data sets into training, test and cross-validation sets. For example, sklearn.model_selection.train_test_split splits numpy arrays or […]
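One possible way to divide a dataframe by random sampling, sketched with `DataFrame.sample` and `DataFrame.drop`. This is an assumption about how such a split could be written, not the approach from the truncated post; the helper name `random_split`, the 80/20 ratio and the toy data are all made up for illustration.

```python
import pandas as pd


def random_split(df: pd.DataFrame, train_frac: float = 0.8, seed: int = 42):
    """Randomly split df into two disjoint frames (assumes a unique index)."""
    # Sample a fraction of the rows without replacement; the seed makes it repeatable.
    train = df.sample(frac=train_frac, random_state=seed)
    # Every row that was not sampled goes into the second frame.
    test = df.drop(train.index)
    return train, test


# Hypothetical toy data, only to show the call.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})
train, test = random_split(df)
print(len(train), len(test))
```

Dropping by index only works cleanly when index labels are unique, so a frame with duplicate labels would need `reset_index(drop=True)` first.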
All Python code in this post is Python 3.5+. This post describes how I parsed movies_metadata.csv from the Kaggle movies dataset, a task I started in Part 1 and Part 2. After some digging into the Pandas documentation and Stack Overflow, I found that the best solution to my parsing problems was to explicitly set […]
All Python code in this post is Python 3.5+. Continuing from Part 1, I discovered that movies_metadata.csv contains malformed rows that have missing fields, which caused the file import to fail. I tried experimenting with some of the more advanced pandas.read_csv parameters to see if I could work around the malformed rows. def main(path: […]
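Rows with missing fields, like those described in this excerpt, can be located before attempting an import. The sketch below uses only the standard `csv` module and assumes the header row is well formed. It is not the code from the post, which is cut off above, and the sample data and the name `find_malformed_rows` are hypothetical.

```python
import csv
import io


def find_malformed_rows(handle):
    """Return the header and a list of (line_number, row) for rows whose
    field count differs from the header's."""
    reader = csv.reader(handle)
    header = next(reader)  # assumption: the first row is a valid header
    malformed = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the number of source lines read so far
            malformed.append((reader.line_num, row))
    return header, malformed


# Hypothetical sample, not taken from movies_metadata.csv.
sample = io.StringIO(
    "id,title,budget\n"
    "1,Toy Story,30000000\n"
    "2,Broken row\n"
    "3,Heat,60000000\n"
)
header, malformed = find_malformed_rows(sample)
print(malformed)  # [(3, ['2', 'Broken row'])]
```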
All Python code in this post is Python 3.5+. I’m continuing to work with the same Kaggle movies dataset as in the SQL import experiment. This time, I imported the data into Pandas DataFrames. The trickiest dataset to import was movies_metadata.csv. I first tried to use pandas.read_csv with the default settings. import argparse import pandas […]
All Python code is Python 3.5+. The PostgreSQL database version is 10. I started digging into the Kaggle movies dataset recently, which is a collection of CSV files. I was curious to see if the data could be inserted into a SQL database (PostgreSQL) for further exploration. The credits.csv file contains two columns (cast, crew) of […]
All Python code is Python 3.5+. A few months ago, I had to quickly extract a small amount of data from a large and deeply nested JSON file and export it to CSV. I was working in C++ and Python on this project, so my first attempts to extract the data were using the Python json […]
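For readers unfamiliar with the task in this excerpt, here is a small sketch of pulling a few values out of nested JSON and writing them as CSV with the standard `json` and `csv` modules. The excerpt is cut off, so this is not the author's code. The document shape (`report`, `stations`, `reading`) and the function name `stations_to_csv` are invented for the example.

```python
import csv
import io
import json

# Hypothetical nested document; the real file's structure is not given above.
RAW = (
    '{"report": {"stations": ['
    '{"id": "a1", "reading": {"temp": 21.5}}, '
    '{"id": "b2", "reading": {"temp": 19.0}}]}}'
)


def stations_to_csv(raw_json: str) -> str:
    """Extract id and temperature for each station and return CSV text."""
    doc = json.loads(raw_json)  # parses the entire document into memory at once
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "temp"])
    for station in doc["report"]["stations"]:
        writer.writerow([station["id"], station["reading"]["temp"]])
    return out.getvalue()


print(stations_to_csv(RAW))
```

Because `json.loads` builds the whole object tree before any value can be read, memory use scales with the size of the file, which matters for large inputs.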