Write training pipelines that will make your MLOps team happy
Write ML pipelines that will make your MLOps team happy: follow a clean separation of responsibility between model code and ops code. This article shows you how to do that.
Why separate the responsibilities?
In my two previous articles on Vertex AI, I showed you how to use the web console to create and deploy an AutoML model and how to take a TensorFlow model that you somehow trained and deploy it to Vertex AI. But neither of those approaches really scales to hundreds of models and large teams.
When you create an AutoML model using the Google Cloud web console, you get back an end-point that can be monitored and on which you can set up continuous evaluation. If you find that the model is drifting, retraining it on new data automatically is difficult — you don’t want to wake up at 2am to use the user interface to train the model. It would be much better if you could train and deploy the model using just code. Code is much easier for your MLOps team to automate.
Taking a TensorFlow model that you trained in your Jupyter notebook and deploying the SavedModel to Vertex AI has the same problem. Retraining is going to be difficult because the ops team will have to set up all of the monitoring and scheduling on top of a workflow that is clunky and hard to automate.
For retraining, it’s much better for the entire process — from dataset creation to training to deployment to be driven by code. Do this, and your operations team will thank you for making their life easy in terms of clearly separating out the model code from the ops code, and expressing everything in Python rather than in notebooks.
How to get this separation in Vertex AI is what I’m going to show you in this article.
Do it in Python files
Jupyter notebooks are great for development, but I strongly recommend against putting those notebooks directly into production (Yes, I do know about Papermill).
What I recommend is that you convert your initial prototyping model code into a Python file and then continue all development in it. Throw away the Jupyter notebook. You will invoke the extracted (and maintained) Python code from a scratch notebook for future experimentation.
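For example, a scratch notebook cell can simply import the maintained module and call it. A minimal sketch (the bucket path is an illustrative placeholder, and train_and_evaluate is introduced in the next section):

    # In a throwaway notebook cell: call the maintained training code.
    # 'my-bucket' is a placeholder for your own GCS bucket.
    from model import train_and_evaluate

    train_and_evaluate(
        train_data_pattern='gs://my-bucket/ch9/data/train*',
        eval_data_pattern='gs://my-bucket/ch9/data/eval*',
        test_data_pattern='gs://my-bucket/ch9/data/test*',
        export_dir='gs://my-bucket/ch9/trained_model/export',
        output_dir='gs://my-bucket/ch9/trained_model')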
You can see my example in https://github.com/GoogleCloudPlatform/data-science-on-gcp/tree/edition2/09_vertexai. See the files model.py and train_on_vertexai.py and use them to follow along.
Writing model.py
The file model.py contains all the Keras model code from my Jupyter notebook (flights_model_tf2.ipynb in the same GitHub directory). The difference is that it is executable, and much of the notebook code is extracted into a function called train_and_evaluate:
def train_and_evaluate(train_data_pattern, eval_data_pattern, test_data_pattern, export_dir, output_dir):
    ...
    train_dataset = read_dataset(train_data_pattern, train_batch_size)
    eval_dataset = read_dataset(eval_data_pattern, eval_batch_size,
                                tf.estimator.ModeKeys.EVAL, num_eval_examples)

    model = create_model()
    history = model.fit(train_dataset,
                        validation_data=eval_dataset,
                        epochs=epochs,
                        steps_per_epoch=steps_per_epoch,
                        callbacks=[cp_callback])

    # export
    logging.info('Exporting to {}'.format(export_dir))
    tf.saved_model.save(model, export_dir)
There are three key things to note:
- The data is read from URIs specified by train_data_pattern, eval_data_pattern, and test_data_pattern for training, validation, and test datasets respectively.
- The model creation code is extracted out to a function called create_model.
- The model is written out to export_dir, and any other intermediate outputs are written to output_dir.
The data patterns and output directories are obtained in model.py from environment variables:
OUTPUT_DIR = 'gs://{}/ch9/trained_model'.format(BUCKET)
OUTPUT_MODEL_DIR = os.getenv("AIP_MODEL_DIR")
TRAIN_DATA_PATTERN = os.getenv("AIP_TRAINING_DATA_URI")
EVAL_DATA_PATTERN = os.getenv("AIP_VALIDATION_DATA_URI")
TEST_DATA_PATTERN = os.getenv("AIP_TEST_DATA_URI")
This is very important, because it is the contract between your code and Vertex AI and is needed in order for all the automagical things to happen.
Chances are, however, that you will need to run this code outside Vertex AI (for example, during development). In that case, the environment variables will not be set, and the variables above will all be None. Look for this case, and set them to values appropriate to your development environment:
if not OUTPUT_MODEL_DIR:
    OUTPUT_MODEL_DIR = os.path.join(
        OUTPUT_DIR,
        'export/flights_{}'.format(time.strftime("%Y%m%d-%H%M%S")))
if not TRAIN_DATA_PATTERN:
    TRAIN_DATA_PATTERN = 'gs://{}/ch9/data/train*'.format(BUCKET)
if not EVAL_DATA_PATTERN:
    EVAL_DATA_PATTERN = 'gs://{}/ch9/data/eval*'.format(BUCKET)
These development data files can be very small, since they only need to support quick local runs. Actual production training happens inside Vertex AI, where the environment variables will be set.
Once you finish writing model.py, make sure it works:
python3 model.py --bucket <bucket-name>
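For that command line to work, model.py needs an ordinary entry point at the bottom. A minimal sketch, assuming the environment-variable fallbacks shown earlier run at module level (the real file may parse additional flags):

    import argparse

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--bucket', required=True,
                            help='GCS bucket to read data from and write the model to')
        args = parser.parse_args()
        BUCKET = args.bucket
        # ... apply the AIP_* fallbacks shown above, then train ...
        train_and_evaluate(TRAIN_DATA_PATTERN, EVAL_DATA_PATTERN, TEST_DATA_PATTERN,
                           OUTPUT_MODEL_DIR, OUTPUT_DIR)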
Now, you are ready to invoke it from a Vertex AI pipeline.
Writing the training pipeline
The training pipeline (See train_on_vertexai.py) needs to do five things in code:
- Load up a managed dataset in Vertex AI.
- Set up training infrastructure to run model.py.
- Run model.py, passing in the managed dataset.
- Find the endpoint to which to deploy the model.
- Deploy the model to the endpoint.
1. Managed Dataset
This is how to load up a tabular dataset (options exist for image, text, etc. datasets, and for tabular data in BigQuery):
data_set = aiplatform.TabularDataset.create(
    display_name='data-{}'.format(ENDPOINT_NAME),
    gcs_source=['gs://{}/ch9/data/all.csv'.format(BUCKET)]
)
Note that I am passing in *all* of the data. Vertex AI will take care of splitting the data into train, validate, and test datasets and sending it to the training program.
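If your data already lives in BigQuery, one of the options mentioned above, only the source argument changes. A sketch (the table reference here is illustrative):

    # Same managed dataset, but sourced from a BigQuery table
    # (table name is an illustrative placeholder).
    data_set = aiplatform.TabularDataset.create(
        display_name='data-{}'.format(ENDPOINT_NAME),
        bq_source='bq://{}.dsongcp.flights_all_data'.format(PROJECT)
    )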
2. Training setup
Next, create a training job passing in model.py, the training container image, and the serving container image:
model_display_name = '{}-{}'.format(ENDPOINT_NAME, timestamp)
job = aiplatform.CustomTrainingJob(
    display_name='train-{}'.format(model_display_name),
    script_path="model.py",
    container_uri=train_image,
    requirements=[],  # any extra Python packages
    model_serving_container_image_uri=deploy_image
)
(For why you want to assign a timestamped name to the model, see How to Deploy a TensorFlow Model to Vertex AI.)
3. Run training job
Running the job involves running model.py on the managed dataset on some hardware:
model = job.run(
    dataset=data_set,
    model_display_name=model_display_name,
    args=['--bucket', BUCKET],
    replica_count=1,
    machine_type='n1-standard-4',
    accelerator_type=aip.AcceleratorType.NVIDIA_TESLA_T4.name,
    accelerator_count=1,
    sync=develop_mode
)
4. Find endpoint
We want to deploy to a preexisting endpoint (see How to Deploy a TensorFlow Model to Vertex AI for an explanation of what an endpoint is). So, find an existing endpoint, or create one if none exists:
endpoints = aiplatform.Endpoint.list(
    filter='display_name="{}"'.format(ENDPOINT_NAME),
    order_by='create_time desc',
    project=PROJECT, location=REGION,
)
if len(endpoints) > 0:
    endpoint = endpoints[0]  # most recently created
else:
    endpoint = aiplatform.Endpoint.create(
        display_name=ENDPOINT_NAME, project=PROJECT, location=REGION
    )
5. Deploy model
Finally, deploy the model to the endpoint:
model.deploy(
    endpoint=endpoint,
    traffic_split={"0": 100},
    machine_type='n1-standard-2',
    min_replica_count=1,
    max_replica_count=1
)
That’s it! Now, you have a Python program that you can run anytime you want to retrain and/or deploy the trained model. Of course, the MLOps person will typically not replace the model wholesale, but send only a small fraction of the traffic to the model. They’ll probably also set up monitoring and continuous evaluation on the endpoint in Vertex AI. But you’ve made it easy for them to do that.
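For instance, a canary rollout is just a different traffic_split. A sketch, assuming the endpoint already has exactly one deployed model (in the traffic_split map, "0" stands for the model being deployed in this call):

    # Canary rollout: send 10% of traffic to the new model.
    # Assumes exactly one model is already deployed on the endpoint.
    previous_model_id = endpoint.list_models()[0].id
    model.deploy(
        endpoint=endpoint,
        traffic_split={"0": 10, previous_model_id: 90},
        machine_type='n1-standard-2',
        min_replica_count=1,
        max_replica_count=1
    )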
End-to-end Auto ML in code
What changes in the above pipeline if I want to use AutoML instead of my custom training job? Well, I don’t need my own model.py of course. So, instead of the CustomTrainingJob, I’ll use AutoML.
Steps 2 and 3 above (setting up and running the training job) now become:
def train_automl_model(data_set, timestamp):
    # train
    model_display_name = '{}-{}'.format(ENDPOINT_NAME, timestamp)
    job = aiplatform.AutoMLTabularTrainingJob(
        display_name='train-{}'.format(model_display_name),
        optimization_prediction_type='classification'
    )
    model = job.run(
        dataset=data_set,
        target_column='ontime',
        model_display_name=model_display_name,
        budget_milli_node_hours=(300 if develop_mode else 2000),
        disable_early_stopping=False
    )
    return job, model
That’s the only change! The rest of the pipeline stays the same. That’s what we mean when we say that you have a unified platform for ML development.
In fact, you can similarly change the ML framework to PyTorch or to sklearn or XGBoost and, as far as the MLOps people are concerned, there are only minimal changes. In my train_on_vertexai.py, I switch between custom Keras code and AutoML with a command-line parameter.
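A sketch of what that switch could look like (the flag and function names here are illustrative; see train_on_vertexai.py for the real ones):

    import argparse
    import time

    parser = argparse.ArgumentParser()
    parser.add_argument('--automl', action='store_true',
                        help='Train with AutoML instead of the custom Keras model')
    args = parser.parse_args()

    timestamp = time.strftime("%Y%m%d-%H%M%S")
    if args.automl:
        job, model = train_automl_model(data_set, timestamp)
    else:
        job, model = train_custom_model(data_set, timestamp)
    # Steps 4 and 5 (find the endpoint, deploy the model) are
    # identical in both branches.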
Splitting the data in a non-default way
By default, Vertex AI does a fractional split of the data (80% to training, 10% each for validation and testing). What if you want to control the split? There are several options available (based on time, etc.).
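One option is to pass explicit fractions when running the job. A sketch, assuming the same CustomTrainingJob as before (the *_fraction_split parameters are in the aiplatform SDK; check your SDK version):

    model = job.run(
        dataset=data_set,
        model_display_name=model_display_name,
        # Ask for an explicit 70/20/10 split instead of the default 80/10/10.
        training_fraction_split=0.7,
        validation_fraction_split=0.2,
        test_fraction_split=0.1,
        args=['--bucket', BUCKET],
        replica_count=1,
        machine_type='n1-standard-4'
    )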
Suppose, instead, that you want a column in your dataset to control the split. You can add one when creating the data:
CREATE OR REPLACE TABLE dsongcp.flights_all_data AS
SELECT
  IF(arr_delay < 15, 1.0, 0.0) AS ontime,
  dep_delay,
  taxi_out,
  ...
  IF(is_train_day = 'True',
     IF(ABS(MOD(FARM_FINGERPRINT(CAST(f.FL_DATE AS STRING)), 100)) < 60,
        'TRAIN', 'VALIDATE'),
     'TEST') AS data_split
FROM dsongcp.flights_tzcorr f
...
Basically, there is a column that I’m calling data_split that takes the values TRAIN, VALIDATE or TEST. So, every row in the managed dataset is assigned to one of these three splits.
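Before training on it, it is worth sanity-checking the resulting proportions. A small sketch using the BigQuery client library (assumes default credentials and that the table is in your default project):

    from google.cloud import bigquery

    # Count rows per split to verify the proportions look right.
    client = bigquery.Client()
    query = """
    SELECT data_split, COUNT(*) AS num_rows
    FROM dsongcp.flights_all_data
    GROUP BY data_split
    """
    for row in client.query(query):
        print(row.data_split, row.num_rows)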
Then, when I run the training job (whether it is a custom model or AutoML), I specify the predefined splitting column:
model = job.run(
    dataset=data_set,
    # See https://googleapis.dev/python/aiplatform/latest/aiplatform.html#
    predefined_split_column_name='data_split',
    model_display_name=model_display_name,
    ...
)
That’s it! Vertex AI will take care of the rest, including assigning all the necessary metadata to the models being trained.
Bottom line: MLOps is getting easier as more and more of it becomes automatically managed. Lean into this by following a clean separation of responsibilities in your code.
Enjoy!
More Reading on Vertex AI:
- Giving Vertex AI, the New Unified ML Platform on Google Cloud, a Spin: Why do we need it, how good is the code-free ML training, really, and what does all this mean for data science jobs?
- How to Deploy a TensorFlow Model to Vertex AI: Working with saved models and endpoints in Vertex AI