Lightweight yet scalable TensorFlow workflow on Google Cloud

My superpower toolkit: TFRecorder, TensorFlow Cloud, AI Platform Predictions and Weights & Biases

Published in Towards Data Science, Oct 5, 2020

I am just going to say it. I am absolutely overwhelmed and intimidated by the growing breadth and depth of machine learning (ML) today.

Need to build a high-performance data pipeline? Learn Apache Beam or Spark and protocol buffers
Need to scale your model training? Learn AllReduce and multi-node distributed architectures
Need to deploy your models? Learn Kubernetes, TFServing, quantization, and API management
Need to track pipelines? Set up a metadata database, learn Docker, and become a DevOps engineer

This does not even include the algorithm and modeling space, which also makes me feel like an imposter without a research background. There has got to be an easier way!

I spent the last few weeks thinking about this dilemma and what I would recommend to a data scientist with a mindset similar to mine. Many of the topics above are important to learn, especially if you want to focus on the new field of MLOps, but are there tools and technologies that allow you to stand on the shoulders of giants?

Below are 4 such tools that abstract away much of the complexity and allow you to develop, track and scale your ML workflows more efficiently.

TFRecorder (via Dataflow): Turn data into TFRecords with ease by feeding in a CSV file. For images just provide the JPEG URIs and labels in the CSV. Scale to distributed servers with Dataflow without writing any Apache Beam code.
TensorFlow Cloud (via AI Platform Training): Scale your TensorFlow model training to single and multi-node clusters of GPUs on AI Platform Training with a simple API call.
AI Platform Predictions: Deploy your model as an API endpoint to a Kubernetes-backed autoscaling service with GPUs, the same one used by Waze!
Weights & Biases: Log artifacts (datasets and models) to track versions and lineage across your development pipeline. Automatically generates a tree of relationships between your experiments and artifacts.

Workflow Overview

I will use a typical cats vs. dogs computer vision problem to walk through each of these tools. This workflow has the following steps:

Save the raw JPEG images to object storage with each image located under a subfolder specifying its label
Generate the CSV file with image URIs and labels in the required format
Convert images and labels to TFRecords
Create a dataset from the TFRecords and train a CNN model with Keras
Store the model as a SavedModel and deploy as an API endpoint
JPEG images, TFRecords, and SavedModels will all be stored in object storage
Experiments and lineage of artifacts will be tracked with Weights & Biases

The notebooks and scripts I used are in this GitHub repository.

Now let’s dive into each tool.

TFRecorder

TFRecords still confuse me. I understand the performance advantage they provide, but I have always struggled to work with them whenever I start on a new dataset. Apparently I am not the only one, and thankfully the TFRecorder project was recently released. Working with TFRecords has never been easier: it only requires (1) organizing your images in a logical directory format and (2) working with pandas DataFrames and CSVs. Below are the steps I took:

Create a CSV file with 3 columns including an image URI pointing to the directory location of each image
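
As a minimal sketch of this step, the snippet below builds such a CSV from a list of image URIs, deriving each label from the name of the subfolder the image sits in. The column names (split, image_uri, label), the TRAIN and VALIDATION split values, the every-fifth-image validation rule and the example bucket paths are all assumptions for illustration, based on how I understand the TFRecorder README, so check them against the version you install.

```python
import csv
import io
import posixpath

# Assumed column names; verify against the TFRecorder README for your version.
COLUMNS = ["split", "image_uri", "label"]


def build_rows(image_uris, validation_every=5):
    """Derive the label from each URI's parent folder and assign a split.

    Every `validation_every`-th image (starting with the first) goes to
    VALIDATION, the rest go to TRAIN.
    """
    rows = []
    for index, uri in enumerate(sorted(image_uris)):
        label = posixpath.basename(posixpath.dirname(uri))  # e.g. "cat"
        split = "VALIDATION" if index % validation_every == 0 else "TRAIN"
        rows.append({"split": split, "image_uri": uri, "label": label})
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical URIs; replace with a listing of your own bucket.
example_uris = [
    "gs://example-bucket/catsdogs/train/cat/cat.0.jpg",
    "gs://example-bucket/catsdogs/train/dog/dog.0.jpg",
]
csv_text = to_csv(build_rows(example_uris))
```

Writing csv_text to a file gives you something pd.read_csv can load directly in the next step.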

Read the CSV into a pandas DataFrame and call the TFRecorder function to convert the files on Dataflow, specifying an output directory

import pandas as pd
import tfrecorder  # registers the .tensorflow accessor on DataFrames

dfgcs = pd.read_csv(FILENAME)
dfgcs.tensorflow.to_tfr(
    output_dir=TFRECORD_OUTPUT,
    runner='DataFlowRunner',
    project=PROJECT,
    region=REGION,
    tfrecorder_wheel=TFRECORDER_WHEEL)

That’s it! Fewer than 10 lines of code that can scale to converting millions of images into the TFRecord format. As a data scientist, you have just laid the foundation for a high-performance training pipeline. You can also take a look at the Dataflow job graph and metrics in the Dataflow console if you are curious about the magic happening in the background.

After reading through a bit of the GitHub repo, I found that the TFRecord schema is as follows:

tfr_format = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "image_channels": tf.io.FixedLenFeature([], tf.int64),
    "image_height": tf.io.FixedLenFeature([], tf.int64),
    "image_name": tf.io.FixedLenFeature([], tf.string),
    "image_width": tf.io.FixedLenFeature([], tf.int64),
    "label": tf.io.FixedLenFeature([], tf.int64),
    "split": tf.io.FixedLenFeature([], tf.string),
}

You can then read the TFRecords into a TFRecordDataset for a Keras model training pipeline with the code below:

import tensorflow as tf

IMAGE_SIZE = [150, 150]
BATCH_SIZE = 5

def read_tfrecord(example):
    # Parse one serialized example using the schema above.
    image_features = tf.io.parse_single_example(example, tfr_format)
    image_channels = image_features['image_channels']
    image_width = image_features['image_width']
    image_height = image_features['image_height']
    label = image_features['label']
    image_b64_bytes = image_features['image']

    # The image is stored as base64-encoded raw pixel bytes, so decode it
    # and restore its original shape before resizing.
    image_decoded = tf.io.decode_base64(image_b64_bytes)
    image_raw = tf.io.decode_raw(image_decoded, out_type=tf.uint8)
    image = tf.reshape(image_raw, tf.stack([image_height, image_width, image_channels]))
    image_resized = tf.cast(tf.image.resize(image, size=[*IMAGE_SIZE]), tf.uint8)
    return image_resized, label

def get_dataset(filenames):
    dataset = tf.data.TFRecordDataset(filenames=filenames, compression_type='GZIP')
    dataset = dataset.map(read_tfrecord)
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(BATCH_SIZE)
    return dataset

train_dataset = get_dataset(TRAINING_FILENAMES)
valid_dataset = get_dataset(VALID_FILENAMES)
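
To see why read_tfrecord needs the stored height, width and channel count, here is a small sketch of the same round trip in plain Python, without TensorFlow. It assumes, as the parsing code above implies, that the image feature holds web-safe base64 of the raw uint8 pixel bytes. The helper names and the tiny 2 x 3 image are made up for illustration.

```python
import base64


def encode_pixels(pixels):
    """Flatten a height x width x channels nested list into base64 bytes."""
    flat = bytes(value for row in pixels for pixel in row for value in pixel)
    return base64.urlsafe_b64encode(flat)


def decode_pixels(encoded, height, width, channels):
    """Decode the base64 payload and rebuild the nested list from flat bytes."""
    flat = base64.urlsafe_b64decode(encoded)
    if len(flat) != height * width * channels:
        raise ValueError("byte count does not match the stored dimensions")
    pixels = []
    for y in range(height):
        row = []
        for x in range(width):
            start = (y * width + x) * channels
            row.append(list(flat[start:start + channels]))
        pixels.append(row)
    return pixels


# A made-up 2 x 3 RGB image (height=2, width=3, channels=3).
tiny_image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[10, 20, 30], [40, 50, 60], [70, 80, 90]],
]
encoded = encode_pixels(tiny_image)
restored = decode_pixels(encoded, height=2, width=3, channels=3)
```

The flat byte string carries no shape information of its own, which is why the record stores the three dimensions next to the image.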

TensorFlow Cloud (AI Platform Training)

Now that we have a tf.data.Dataset, we can feed it into our model training call. Below is a simple example of a CNN model using the Keras Sequential API.

from tensorflow.keras.optimizers import RMSprop

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(learning_rate=1e-4),
              metrics=['accuracy'])
model.fit(
    train_dataset,
    epochs=10,
    validation_data=valid_dataset,
    verbose=2
)
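
If you want to sanity-check what model.summary() should report, the layer arithmetic is easy to reproduce by hand. The sketch below is plain Python with helper names of my own. It assumes the Keras defaults used above: 3x3 convolutions with stride 1 and 'valid' padding, and 2x2 max pooling with stride 2.

```python
def conv_output(size, kernel=3):
    """Spatial size after a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Spatial size after max pooling with stride equal to the pool size."""
    return size // pool


def conv_params(in_channels, filters, kernel=3):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


def summarize(input_size=150, input_channels=3,
              conv_filters=(16, 32, 64, 64), dense_units=(256, 1)):
    """Walk the conv/pool stack, then the dense layers, and count parameters."""
    size, channels, total = input_size, input_channels, 0
    for filters in conv_filters:
        total += conv_params(channels, filters)
        size = pool_output(conv_output(size))
        channels = filters
    flattened = size * size * channels
    inputs = flattened
    for units in dense_units:
        total += dense_params(inputs, units)
        inputs = units
    return size, flattened, total


final_size, flattened_units, total_params = summarize()
print(final_size, flattened_units, total_params)
```

For this architecture the feature map shrinks from 150 to 7 pixels per side, the Flatten layer emits 3,136 values, and the model has 863,841 trainable parameters, most of them in the first Dense layer.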

I first ran this in my development environment (in my case a Jupyter notebook) on a subset of the images, but I wanted to scale it to all the images and make it faster. TensorFlow Cloud lets me use a single API call that takes care of containerizing my code and submitting it to run as a distributed GPU job.

import tensorflow_cloud as tfc

tfc.run(entry_point='model_training.ipynb',
        chief_config=tfc.COMMON_MACHINE_CONFIGS['T4_4X'],
        requirements_txt='requirements.txt')

This is no April Fools’ joke. The code above (fewer than 5 lines) is the full Python script you need to place in the same directory as your Jupyter notebook. The trickiest part is following all the setup instructions to make sure you are correctly authenticated to your Google Cloud Platform project. This is a massive superpower!

Let’s dive a bit deeper into what this is doing.

First, a Docker container with all your required libraries and your notebook is built and saved in Google Cloud’s Container Registry service.

That container is then submitted to a fully managed and serverless training service, AI Platform Training. Without having to set up any infrastructure or install any GPU libraries, I was able to train this model on a 16 vCPU, 60GB RAM machine with 4 Nvidia T4 GPUs. I only used those resources for the time I needed them (~15 minutes) and could then go back to developing in my local environment with an IDE or Jupyter Notebook.

Finally, the SavedModel is stored in object storage, as specified at the very end of my training script.

import time

MODEL_PATH = time.strftime("gs://mchrestkha-demo-env-ml-examples/catsdogs/models/model_%Y%m%d_%H%M%S")
model.save(MODEL_PATH)
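
The timestamp in the path is what keeps each training run from overwriting the previous model. Here is a small sketch of the same idea as a reusable helper. The function name and the example bucket are hypothetical, and unlike my script above it uses UTC rather than local time so that paths from different machines stay comparable.

```python
import time


def versioned_model_path(base_uri, when=None):
    """Append a UTC timestamp so every training run writes to a new folder."""
    when = time.gmtime() if when is None else when
    return base_uri.rstrip("/") + time.strftime("/model_%Y%m%d_%H%M%S", when)


# Hypothetical bucket; time.gmtime(0) is the Unix epoch, used for a reproducible result.
example_path = versioned_model_path(
    "gs://example-bucket/catsdogs/models", time.gmtime(0))
later_path = versioned_model_path(
    "gs://example-bucket/catsdogs/models", time.gmtime(60))
```

Because the year comes first and every field is zero-padded, sorting the paths alphabetically also sorts them chronologically, which makes the latest model easy to find.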

AI Platform Predictions

With my SavedModel in object storage, I can load it into my development environment and run some sample predictions. But what if I want to let others use it without requiring them to set up a Python environment and learn TensorFlow? This is where AI Platform Predictions comes into the picture. It allows you to deploy model binaries as API endpoints that can be called using REST, the Google Cloud SDK (gcloud), or various other client libraries. All end users need to know is the required input (in our case a JPEG image file converted to a [150,150,3] JSON array), and they can embed your model in their workflow. When you make a change (retraining on a new dataset, a new model architecture, maybe even a new framework) you can publish a new version.
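
To make the input format concrete, here is a rough sketch of how a client could build the JSON body for a prediction request. It assumes the usual online prediction envelope, a JSON object with an "instances" list, and a model that takes a single 150 x 150 x 3 image tensor per instance. The helper name and the all-black placeholder image are mine; a real client would decode and resize a JPEG first.

```python
import json

IMAGE_SIZE = (150, 150)
CHANNELS = 3


def build_request_body(images):
    """Wrap height x width x channels pixel arrays in an {"instances": [...]} body."""
    for image in images:
        if len(image) != IMAGE_SIZE[0] or any(len(row) != IMAGE_SIZE[1] for row in image):
            raise ValueError("each image must be 150 x 150")
        if any(len(pixel) != CHANNELS for row in image for pixel in row):
            raise ValueError("each pixel must have 3 channel values")
    return json.dumps({"instances": images})


# Placeholder all-black image standing in for a decoded, resized JPEG.
black_image = [[[0, 0, 0] for _ in range(IMAGE_SIZE[1])]
               for _ in range(IMAGE_SIZE[0])]
body = build_request_body([black_image])
```

The resulting string is what gets sent to the endpoint, whether through REST or a client library.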

The simple gcloud commands below are the superpower that deploys your models to this Kubernetes-backed autoscaling service.

MODEL_VERSION="v1"
MODEL_NAME="cats_dogs_classifier"
REGION="us-central1"

gcloud ai-platform models create $MODEL_NAME \
  --regions $REGION

# MODEL_PATH is the gs:// location of the SavedModel written by the training script
gcloud ai-platform versions create $MODEL_VERSION \
  --model $MODEL_NAME \
  --runtime-version 2.2 \
  --python-version 3.7 \
  --framework tensorflow \
  --origin $MODEL_PATH

AI Platform Predictions is a service I am particularly excited about, as it eliminates many of the complexities of getting your model out into the world (both internally and externally) so you can start driving value from it. While the scope of this blog is an experimentation workflow, companies like Waze are using AI Platform Predictions to deploy and serve their models in production at scale.

Weights & Biases

At this point I’ve completed one experiment, but what about my future experiments? I may need to:

run many more experiments later this week and track my work
come back in a month and try to remember all the inputs and outputs of each of my experiments
share the work with teammates who can hopefully piece my workflow together

There is lots of work being done in the ML Pipelines space. It is an exciting but nascent space with best practices and industry standards yet to develop. Some great projects to follow include MLflow, Kubeflow Pipelines, TFX, and Comet.ML. For the needs of my workflow, MLOps and continuous delivery were out of scope and I wanted something simple. I chose Weights & Biases (WandB) for its ease of use and its lightweight integration for tracking experiments and artifacts.

Let’s start with experiments. WandB offers lots of customization options, but if you’re using one of the popular frameworks you don’t need to do much. For the TensorFlow Keras API I simply (1) imported the wandb Python library, (2) initialized my experiment run and (3) added a callback function within the model fit step.

import wandb
from wandb.keras import WandbCallback

wandb.init(project='cats-dogs-keras')

model.fit(
    train_dataset,
    epochs=10,
    validation_data=valid_dataset,
    verbose=2,
    callbacks=[WandbCallback()]
)

This automatically streams out-of-the-box metrics into a centralized experiment tracking service. Take a look at all the ways folks are using WandB here.

WandB also provides an artifacts API which fits my needs much more than some of the heavy tooling out there today. I added short code snippets throughout my pipeline to define 4 key items:

Initializing a step in my pipeline
Using an existing artifact (if available) as part of this step
Logging an artifact generated by this step
Stating a step is complete

import wandb

run = wandb.init(project='cats-dogs-keras', job_type='data', name='01_set_up_data')
# <Code to set up initial JPEGs in the appropriate directory structure>
artifact = wandb.Artifact(name='training_images', type='dataset')
artifact.add_reference('gs://mchrestkha-demo-env-ml-examples/catsdogs/train/')
run.log_artifact(artifact)
run.finish()

run = wandb.init(project='cats-dogs-keras', job_type='data', name='02_generate_tfrecords')
artifact = run.use_artifact('training_images:latest')
# <TFRecorder Code>
artifact = wandb.Artifact(name='tfrecords', type='dataset')
artifact.add_reference('gs://mchrestkha-demo-env-ml-examples/catsdogs/tfrecords/')
run.log_artifact(artifact)
run.finish()

run = wandb.init(project='cats-dogs-keras', job_type='train', name='03_train')
artifact = run.use_artifact('tfrecords:latest')
# <TensorFlow Cloud Code>
artifact = wandb.Artifact(name='model', type='model')
artifact.add_reference(MODEL_PATH)
run.log_artifact(artifact)
run.finish()

This simple Artifacts API stores the metadata and lineage of each run and artifact so you have full clarity on your workflow. The UI also has a nice tree diagram to view it graphically.
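
To give a feel for what that lineage tree captures, here is a toy model of it in plain Python. This is only an illustration of the idea, not the WandB API: the RUNS structure and the upstream helper are made up, and only the run and artifact names mirror my three pipeline steps.

```python
# Each run lists the artifacts it consumed ("uses") and produced ("logs").
RUNS = [
    {"name": "01_set_up_data", "uses": [], "logs": ["training_images"]},
    {"name": "02_generate_tfrecords", "uses": ["training_images"], "logs": ["tfrecords"]},
    {"name": "03_train", "uses": ["tfrecords"], "logs": ["model"]},
]


def upstream(artifact, runs=RUNS):
    """Return (run name, input artifact) pairs that led to `artifact`."""
    producers = {out: run for run in runs for out in run["logs"]}
    lineage = []
    pending = [artifact]
    while pending:
        current = pending.pop()
        run = producers.get(current)
        if run is None:
            continue  # nothing recorded as producing this artifact
        for source in run["uses"]:
            lineage.append((run["name"], source))
            pending.append(source)
    return lineage


model_lineage = upstream("model")
```

Walking back from the model reaches the TFRecords and then the raw training images, which is the question I want answered when I come back to an experiment a month later.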

Summary

If you feel intimidated by the endless machine learning topics, tools and technologies, you are not alone. I feel imposter syndrome every day talking to co-workers, partners and customers about various aspects of data science, MLOps, hardware accelerators, data pipelines, AutoML, etc. Just remember it is important to:

Be practical: It is not realistic for all of us to be full stack engineers across DevOps, Data, and Machine Learning. Pick 1 or 2 areas that interest you and work with others to solve system-wide problems.
Focus on your problem: We all get carried away with the new framework, the new research paper, the new tool. Start with your business problem, your dataset, your end-user requirements. Not everyone needs 100s of ML models in production being re-trained daily and served to millions of users (at least not yet).
Identify superpower tools: Find a core set of tools that act as efficiency multipliers, providing scalability and removing complexity. I walked through my toolkit for TensorFlow (TFRecorder + TensorFlow Cloud + AI Platform Predictions + Weights & Biases), but find the right toolkit that maps to your problem and workflow.

Have a question or want to chat? Find me on Twitter

The notebook examples from this blog can be found on my GitHub.

Big thanks to Mike Bernico, Rajesh Thallam and Vaibhav Singh for helping me with this example solution.