Tag Archives for "data wrangling"

Python text analysis tools: Levenshtein Distance

Published January 31, 2020 in data - 0 Comments

Quantifying how similar two strings are is a basic problem in text analysis, text mining, and natural language processing. There are a number of efficient methods to solve this problem. This survey looks at Python implementations of a simple but widely used method: Levenshtein distance as […]
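
To make the idea concrete, here is a minimal pure-Python sketch of Levenshtein distance using the standard dynamic-programming recurrence. It is an illustration only, not necessarily one of the implementations the post surveys; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i characters into an empty string costs i deletions
        for j, ch_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ch_a != ch_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
```

For "kitten" and "sitting" the distance is 3: two substitutions and one insertion. This version keeps only two rows of the table in memory, so space grows with the shorter string rather than with the product of both lengths.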

How to split a Pandas dataframe into training and test sets?

Published June 18, 2018 in data - 0 Comments

This question came up recently on a project where Pandas data needed to be fed to a TensorFlow classifier. In this case, we wanted to divide the dataframe using random sampling. Frameworks like scikit-learn have utilities to split data sets into training, test and cross-validation sets. For example, sklearn.model_selection.train_test_split splits numpy arrays or […]
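
One way to do a random split with Pandas alone is sketched below. It assumes the dataframe has a unique index; the helper name `split_dataframe`, the 80/20 ratio, the seed, and the toy data are all illustrative choices, not taken from the post.

```python
import pandas as pd


def split_dataframe(df, train_frac=0.8, seed=42):
    """Randomly split `df` into (train, test) dataframes.

    Assumes df.index is unique; otherwise drop() would also remove
    any test rows that share an index label with a sampled row.
    """
    # Draw a random sample of rows for training (reproducible via seed).
    train = df.sample(frac=train_frac, random_state=seed)
    # Everything that was not sampled becomes the test set.
    test = df.drop(train.index)
    return train, test


# Toy data, made up for this example.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})
train_df, test_df = split_dataframe(df)
```

If the index might contain duplicates, calling `df.reset_index(drop=True)` before splitting is one way to make the assumption hold.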

Importing Stringified JSON Objects Into Pandas (Part 2)

Published November 30, 2017 in data, programming - 0 Comments

All Python code in this post is Python 3.5+. Continuing from Part 1, I discovered that movies_metadata.csv contains malformed rows that have missing fields, which is what caused the file import to fail. I tried experimenting with some of the more advanced pandas.read_csv parameters to see if I could work around the malformed rows.
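
As a rough illustration of the problem, the sketch below separates well-formed rows from rows whose field count does not match the header, using only the standard-library csv module. This is a different technique from the read_csv parameters the post experiments with, and the helper name `split_well_formed_rows`, the column names, and the sample data are made up for the example.

```python
import csv
import io


def split_well_formed_rows(text):
    """Return (header, good_rows, bad_rows) for CSV `text`.

    A row counts as malformed when its number of fields differs
    from the header's. bad_rows holds (line_number, row) pairs.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    good_rows, bad_rows = [], []
    for row in reader:
        if len(row) == len(header):
            good_rows.append(row)
        else:
            # reader.line_num is the number of source lines read so far.
            bad_rows.append((reader.line_num, row))
    return header, good_rows, bad_rows


# Hypothetical sample: the row on line 3 is missing its last field.
sample = "id,title,budget\n1,Toy Story,30000000\n2,Broken row\n3,Heat,60000000\n"
header, good_rows, bad_rows = split_well_formed_rows(sample)
```

The well-formed rows can then be loaded with `pandas.DataFrame(good_rows, columns=header)`, while `bad_rows` records which lines need manual inspection.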


Importing Stringified JSON Objects Into Pandas (Part 1)

Published November 24, 2017 in data, programming - 0 Comments

All Python code in this post is Python 3.5+. I’m continuing to work with the same Kaggle movies dataset as in the SQL import experiment. This time, I imported the data into Pandas DataFrames. The trickiest dataset to import was movies_metadata.csv. I first tried to use pandas.read_csv with the default settings.

I was able […]