Managing and Investigating Service Incidents on GCP

About

Managing and investigating service incidents is an essential part of maintaining a service. The work can be laborious, but with the right organization, an understanding of the systems, knowledge of the processes, and the discipline to adhere to best practices, it can be optimized. This lesson focuses on the main aspects of managing service incidents and on using Google Cloud Platform to support that work.

Perhaps the most important aspect of managing service incidents is managing the personnel involved, which includes defining their roles and responsibilities. This lesson discusses a strategy for assigning those roles and running the team effectively. That includes having a process for turnover of team members, managing the team's workload, developing and scaling a reporting structure, and maintaining team productivity.

Perhaps the second most important aspect of managing service incidents is establishing effective communication. Constant, clear communication both within the team and with people outside it is paramount, especially for keeping stakeholders informed.

The lesson also discusses tooling that aids monitoring and incident resolution, specifically Google Cloud Platform’s Stackdriver service. Stackdriver makes investigating service incidents easier by giving the response team the logging and monitoring information it needs.

If you have any feedback relating to this lesson, please contact us at support@cloudacademy.com.

Learning Objectives

Understand how to handle personnel to aid incident response
Learn how to manage roles within a team
Learn how to investigate incidents effectively

Intended Audience

This lesson is suited to anyone wanting to learn about incident handling using Google Cloud Platform.

Prerequisites

An active Google Cloud Platform account with admin permissions, so that you can administer roles, create test infrastructure, and configure operational tooling
A good understanding of managing service issues
Knowledge of issue mitigation practices
An understanding of logging and monitoring concepts
High-level knowledge of how roles should interact
