All posts in "data"

A closer look at Airflow’s KubernetesPodOperator and XCom

Published July 11, 2019 in data - 0 Comments

The KubernetesPodOperator handles communicating XCom values differently than other operators. The basics are described in the operator documentation under the xcom_push parameter. I’ve written up a more detailed example that expands on that documentation. An Airflow task instance described by the KubernetesPodOperator can write a dict to the file /airflow/xcom/return.json (always the same file) that […]
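As a rough sketch of the idea in this excerpt, the container's own code could write its result like this. The helper name `write_xcom_return` and its `xcom_dir` parameter are hypothetical, added only so the example can run outside a pod; only the path /airflow/xcom/return.json comes from the post.

```python
import json
import os


def write_xcom_return(value, xcom_dir="/airflow/xcom"):
    """Write `value` as JSON to <xcom_dir>/return.json and return the path.

    Hypothetical helper: the default directory matches the path named in
    the post; the parameter exists only to make local testing possible.
    """
    os.makedirs(xcom_dir, exist_ok=True)
    path = os.path.join(xcom_dir, "return.json")
    # The file is overwritten on every call, since it is always the same file.
    with open(path, "w") as f:
        json.dump(value, f)
    return path
```

Inside the task's container this would be called once, near the end of the job, for example `write_xcom_return({"rows_processed": 120})`.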

Trigger DAG runs with Airflow REST API

Published June 24, 2019 in data - 0 Comments

This article and code are applicable to Airflow 1.10.13. Hopefully the REST API will mature as Airflow is developed further, and the authentication methods will be easier. The experimental REST API does not use the Airflow role-based users. Instead, it currently requires a SQLAlchemy models.User object whose data is saved in the database. The code […]

Tags: airflow, python
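For orientation, here is a minimal sketch of what a trigger request to the experimental API might look like. It only builds the request and does not send it. The helper name `build_trigger_request`, the base URL, and the DAG id are placeholders, and authentication is left out entirely because it depends on the configured auth backend.

```python
import json
import urllib.request


def build_trigger_request(base_url, dag_id, conf=None):
    """Build (but do not send) a POST request that triggers a DAG run.

    Assumes the Airflow 1.10 experimental endpoint
    /api/experimental/dags/<dag_id>/dag_runs. Any authentication headers
    would have to be added by the caller.
    """
    url = f"{base_url.rstrip('/')}/api/experimental/dags/{dag_id}/dag_runs"
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be a matter of passing the result to `urllib.request.urlopen`, once whatever credentials the webserver expects have been attached.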

Useful Airflow on Kubernetes Features

Published June 7, 2019 in data, devops - 0 Comments

KubernetesExecutor The KubernetesExecutor sets up Airflow to run on a Kubernetes cluster. This executor runs task instances in pods created from the same Airflow Docker image used by the KubernetesExecutor itself, unless configured otherwise (more on that at the end). Getting Airflow deployed with the KubernetesExecutor to a cluster is not a trivial task. I […]

SageMaker lifecycle configuration time constraints

Published May 17, 2019 in data - 0 Comments

Creating AWS SageMaker Lifecycle configuration scripts to customize notebook instances beats installing packages and making other environment changes in notebook instances. One advantage is that the customization code doesn’t need to be copied from notebook to notebook. Another is that the lifecycle configurations are managed outside of and separately from notebook instances, and can be […]

Tags: aws, python

Customize Yellowbrick color palettes

Published January 16, 2019 in data - 0 Comments

I’m starting to experiment with the Yellowbrick machine learning visualizer tools to learn how to visualize models more effectively. The documentation is good, and getting started with the tools is pretty straightforward. I started to get bored with the default color palette after playing with some of the basic visualization examples in the documentation (more […]

Attach to existing SageMaker job

Published July 30, 2018 in data - 0 Comments

The AWS SageMaker ntm_20newsgroups_topic_model example notebook is a simple-to-follow introduction to SageMaker’s pre-packaged Natural Language Processing (NLP) tools. The notebook demonstrates how to use the Neural Topic Model (NTM) algorithm to extract a set of topics from a sample Usenet newsgroups dataset and visualize them as word clouds. It also contains code demonstrating how […]

How to split a Pandas dataframe into training and test sets?

Published June 18, 2018 in data - 0 Comments

This question came up recently on a project where Pandas data needed to be fed to a TensorFlow classifier. In this case, we wanted to divide the dataframe using random sampling. Frameworks like scikit-learn may have utilities to split data sets into training, test and cross-validation sets. For example, sklearn.model_selection.train_test_split splits numpy arrays or […]
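One way to do a random split with Pandas alone is sketched below. This is not necessarily the approach the full post takes; the function name, the 80/20 default, and the seed are illustrative choices, and the sketch assumes the DataFrame index has no duplicate labels.

```python
import pandas as pd


def split_train_test(df, train_frac=0.8, seed=42):
    """Randomly split `df` into (train, test) DataFrames.

    Assumes the index labels are unique, because the test set is built
    by dropping the sampled training labels from the original frame.
    """
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test
```

Fixing `random_state` makes the split reproducible between runs, which helps when comparing models trained on the same data.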

Pandas DataFrame axis basics (Part 2)

Published April 13, 2018 in data - 0 Comments

Part 1 covered Pandas DataFrame basics. Pandas offers multiple options for accessing DataFrame values by axis labels using the DataFrame.loc function, or by integer indexes using the DataFrame.iloc function in one or two dimensions. If the DataFrame has a numerical index, calling the DataFrame.loc and DataFrame.iloc functions looks the same. Otherwise, use the appropriate axis […]

Tags: pandas
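A small illustration of the loc/iloc distinction described in the Part 2 excerpt, using made-up frames and column names:

```python
import pandas as pd

# A frame with string row labels: loc takes labels, iloc takes positions.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])
by_label = df.loc["y", "b"]      # row labelled "y", column labelled "b"
by_position = df.iloc[1, 1]      # second row, second column

# A frame with the default numerical index: the row argument looks the same
# for loc and iloc, although loc still treats it as a label.
df_num = pd.DataFrame({"a": [10, 20, 30]})
num_by_label = df_num.loc[1, "a"]
num_by_position = df_num.iloc[1, 0]
```

Both lookups on `df` reach the same cell, as do both lookups on `df_num`; the difference is only in how the row and column are addressed.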

Pandas DataFrame axis basics (Part 1)

Published April 2, 2018 in data - 0 Comments

By default, a Pandas DataFrame is two-dimensional, with two axes initialized as empty Index structures. Under the basic indexing scheme, the first axis is the ‘index’ axis, which by default is a numerical index starting from 0 (using np.arange) generated for each DataFrame row. The second axis is the ‘columns’ axis, which is the […]

Tags: pandas
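The two axes described in the Part 1 excerpt can be inspected directly. The frames below are invented for illustration:

```python
import pandas as pd

# An empty DataFrame still has two axes; both are empty.
empty = pd.DataFrame()
empty_axis_lengths = [len(axis) for axis in empty.axes]

# With data and no explicit index, rows get a numerical index starting at 0.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
row_labels = list(df.index)        # first axis: 'index'
column_labels = list(df.columns)   # second axis: 'columns'
```

`df.axes` returns the same two objects as a list, with the row index first and the column index second.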