- CatBoost, an OSS gradient boosting framework , is the new kid on the block with intriguing benchmark results on model quality and training/serving speeds with CPUs & GPUs
- Catboost is simple to run on Google Cloud’s AI Platform for both the Notebook & Training services. GitHub examples to accompany this post HERE
- While AI Platform Training only lists TensorFlow, XGBoost, & SKLearn as officially supported hosted frameworks, CatBoost works seamlessly by including it in the setup.py file during training job submission
CatBoost for Gradient Boosting
My focus the last few years has been heavily skewed towards deep learning & neural networks but this weekend I wanted to quickly catch up on the world of gradient boosting on structured data. The biggest item I seem to have missed over the last 2.5 years is the launch and growth of CatBoost.
- OSS ML framework from Russia developed by Yandex researchers & engineers. Used as part of key Yandex services including search, weather, and personal assistant
- Model Quality: Better accuracy on several benchmarks vs. XGBoost, LightGBM, & H2O
- Speed: Faster training & serving times on CPUs & GPUs
- Out-of-the-box Usability: Better model quality with default parameters, sophisticated categorical/text support and visual model analysis tools
As with all data science & machine learning problems you should set a baseline and try out a different techniques to see which one might be the best fit for your dataset but CatBoost paints an intriguing picture.
I had 2 quick objectives after reading about CatBoost:
- Run through an end-to-end example on AI Platform Notebooks using the dataset used in the XGBoost tutorial (This can also be run for free on Google Colab with a few changes)
- Put together a simple example of running CatBoost on AI Platform Training using both CPUs & GPUs
AI Platform Notebook
Notebook on GitHub HERE
If you love Jupyter Notebooks, AI Platform Notebooks is Google Cloud’s managed JupyterLab Service. It takes minutes to get started and provides you with pre-built environments for your favorite frameworks, libraries and GPU drivers.
- Create an AI Platform Notebook Instance with the ‘CUDA 10.1’ image with a Nvidia T4 GPU
- Open up the JupyterLab UI
- Open the Terminal and install CatBoost using ‘pip install catboost’. All other libraries (pandas, numpy, sklearn) and drivers (cuda, nccl) are already installed as part fo the base image
- Run through this notebook
- CatBoost works with all your existing python & cuda libraries so setup was a breeze. It took me <5 min to get up and running
- The CatBoost syntax and APIs are well designed and follow familiar patterns used by other libraries
- GPUs were simple to use. I first used a prebuilt utility to check if the T4 GPU was being recognized.
from catboost.utils import get_gpu_device_count
print('I see %i GPU devices' % get_gpu_device_count())
I then simply added task_type=“GPU” in the model definition and the model.fit automatically trained on the GPU.
**NOTE** In this example GPUs did not provide speedup over CPUs since the data was small (in both rows and features). I was however able to run through their GPU example with the epsilon dataset (400K examples, 2K features) and saw a 5–6x speedup with the T4 GPU.
AI Platform Training
Training Package on GitHub HERE
One of the main advantages of using a cloud provider is being able to leverage on-demand infrastructure only when you need it. AI Platform Training allows you submit code (or a docker container) to provision resources only for the time needed to process and train your data. Below are the steps I took to package the training code & dependencies, submit the code to AI Platform, and save the model to Google Cloud Storage.
- Install the Google Cloud SDK on your development machine. If you are using Google Cloud Shell, Google Cloud Compute Engine, or the Terminal on the AI Platform Notebook instance in the previous section the SDK should already be pre-installed.
- Create a Google Cloud Storage bucket to save your model after the training is completed.
- Package the training code in the following directory and file structure:
- train.py file includes the same code from the Jupyter notebook example in the previous section
- train-cloud.sh includes the AI Platform SDK command to submit the training job
- setup.py file includes any python libraries to install on the on-demand server(s) that are not by default part of the hosted service. In this case you only need to specify ‘catboost’.
4. Update the variables in train-cloud.sh and run the script.
gcloud ai-platform jobs submit training $JOB_NAME \
--job-dir $JOB_DIR \
--package-path $TRAINING_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--region $REGION \
5. The training job progress and status will be viewable at http://console.cloud.google.com/ai-platform/jobs and the saved model (in my case ONNX format) will be saved in the appropriate Google Cloud Storage bucket.
- While AI Platform Training only lists TensorFlow, XGBoost, & SKLearn as officially supported hosted frameworks, CatBoost works seamlessly by including it in the setup.py file. You can also use custom containers to submit a job that has more complex runtime requirements.
- TODO: Need to add in a few custom arguments to refactor the code to support both CPUs & GPUs automatically. Currently I have to manually comment in/out the GPU task_type in the train.py code.
- Serve the model for both batch and online needs
- Play with tuning options to maximize model quality
- Play with the built-in visualization and explainability tools