TensorFlow Distributed Training (TFJob)¶
Introduction¶
TFJob is a training operator in Kubeflow on vSphere. It is specifically designed to run distributed TensorFlow training jobs on Kubernetes clusters. It provides a simple and consistent way to define, manage, and scale TensorFlow training jobs in a distributed manner, allowing you to easily leverage the power of Kubernetes to accelerate your machine learning (ML) workloads.
With TFJob, you define a TensorFlow training job in a YAML configuration file, specifying details such as the number of workers and parameter servers, the location of the training data, and the type of cluster to use. TFJob then creates and manages the Kubernetes resources required to run the job, including pods, services, and volumes.
TFJob also supports advanced features such as distributed training with data parallelism, model parallelism, and synchronous or asynchronous updates, as well as monitoring and visualization of training metrics using TensorBoard. This makes it a powerful tool for running large-scale TensorFlow training jobs on Kubernetes clusters, whether on-premises or in the cloud.
Get started¶
In this section, you create a training job by defining a TFJob configuration file to train a model. Before that, you need a working Kubeflow on vSphere deployment with the TFJob operator up and running.
Verify TFJob is running¶
Check that the TFJob custom resource definition (CRD) is installed:
$ kubectl get crd
NAME                  CREATED AT
...
tfjobs.kubeflow.org   2023-01-31T06:02:59Z
...
Check that the training operator is running:
$ kubectl get pods -n kubeflow
NAME                  READY   STATUS    RESTARTS       AGE
...
training-operator-0   2/2     Running   4 (6d1h ago)   6d2h
...
Create a TF training job¶
You can create a training job by defining a TFJob configuration file. See the manifests for the MNIST example. You can change the configuration file based on your requirements.
You can deploy the TFJob resource with either CPU or GPU resources. The steps below use a CPU-only YAML file. If you want to deploy the TFJob resource with GPUs, refer to TFJob deployment using GPUs.
Deploy the TFJob resource with CPU to start training:
USER_NAMESPACE=user
kubectl config set-context --current --namespace=$USER_NAMESPACE
# Deploy the TFJob resource with CPU
cat <<EOF | kubectl create -n $USER_NAMESPACE -f -
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: projects.registry.vmware.com/models/kubeflow-docs/model-training-tf-mnist-with-summaries:1.0
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
EOF
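If you prefer generating the manifest from code, the same spec can be built as a Python dict and emitted as JSON, which `kubectl create -f -` also accepts. A minimal sketch; the `make_tfjob` helper is illustrative, not part of any Kubeflow SDK:

```python
import json

def make_tfjob(name, replicas, image, command):
    # Illustrative helper: builds a TFJob spec equivalent to the YAML above.
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": replicas,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "metadata": {
                            "annotations": {"sidecar.istio.io/inject": "false"}
                        },
                        "spec": {
                            "containers": [{
                                "name": "tensorflow",
                                "image": image,
                                "command": command,
                            }]
                        },
                    },
                }
            }
        },
    }

manifest = make_tfjob(
    "tfjob-simple",
    2,
    "projects.registry.vmware.com/models/kubeflow-docs/model-training-tf-mnist-with-summaries:1.0",
    ["python", "/var/tf_mnist/mnist_with_summaries.py"],
)
# Pipe this output to `kubectl create -n $USER_NAMESPACE -f -` to deploy.
print(json.dumps(manifest, indent=2))
```

This approach makes it easy to parameterize the replica count or image tag from a script instead of editing YAML by hand.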
Verify that the number of created pods matches the specified number of replicas:
$ kubectl get pods -l job-name=tfjob-simple -n $USER_NAMESPACE
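Rather than eyeballing the table, you can also count pods by phase from kubectl's JSON output (`kubectl get pods -l job-name=tfjob-simple -o json`). A hedged sketch; `sample_output` below is a stand-in for what a live cluster would return:

```python
import json

# Stand-in for the JSON that `kubectl get pods ... -o json` would print on a
# live cluster, trimmed to the fields this sketch actually reads.
sample_output = json.dumps({
    "items": [
        {"metadata": {"name": "tfjob-simple-worker-0"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "tfjob-simple-worker-1"},
         "status": {"phase": "Running"}},
    ]
})

def count_by_phase(kubectl_json):
    # Tally pods by their status.phase (Pending, Running, Succeeded, ...).
    counts = {}
    for item in json.loads(kubectl_json)["items"]:
        phase = item["status"]["phase"]
        counts[phase] = counts.get(phase, 0) + 1
    return counts

print(count_by_phase(sample_output))
```

For the two-replica job above, you would expect all workers to reach the Running phase shortly after creation.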
Monitoring a TFJob¶
Check the events for your job to see if the pods are created:
$ kubectl describe tfjobs tfjob-simple -n $USER_NAMESPACE
...
Events:
Type    Reason                    Age   From              Message
----    ------                    ----  ----              -------
Normal  SuccessfulCreatePod       78s   tfjob-controller  Created pod: tfjob-simple-worker-0
Normal  SuccessfulCreatePod       77s   tfjob-controller  Created pod: tfjob-simple-worker-1
Normal  SuccessfulCreateService   77s   tfjob-controller  Created service: tfjob-simple-worker-0
Normal  SuccessfulCreateService   77s   tfjob-controller  Created service: tfjob-simple-worker-1
Check the logs to see the training result after the training process completes:
$ kubectl logs -f tfjob-simple-worker-0 -n $USER_NAMESPACE
$ kubectl logs -f tfjob-simple-worker-1 -n $USER_NAMESPACE