How do you promote operational excellence?

Operational Excellence

Best practices

Organization

To set the priorities that will drive business success, your teams need a shared understanding of the entire workload, of each team's role in it, and of the business goals it serves. Well-defined priorities maximize the benefit of your efforts. Evaluate the needs of internal and external customers, involving key stakeholders, including business, development, and operations teams, to determine where to focus efforts. Evaluating customer needs ensures that you have a thorough understanding of the support required to achieve business outcomes. Make sure you are aware of the guidelines or obligations defined by your organization's governance. Evaluate external factors, such as regulatory compliance requirements and industry standards, that may require or emphasize a specific focus. Verify that you have mechanisms in place to identify changes to internal governance and external compliance requirements. If no requirements are identified, make sure that you applied due diligence in reaching that conclusion. Review your priorities regularly so that they can be updated as needs change.

Evaluate threats to the business (for example, business risk and liabilities, and information security threats) and maintain this information in a risk registry. Evaluate the impact of risks and the tradeoffs between competing interests or alternative approaches. For example, accelerating speed to market for new features may be emphasized over cost optimization, or you may choose a relational database for non-relational data to simplify migrating a system without refactoring. Weigh the benefits and risks to make informed decisions about where to focus efforts. Some risks or choices may be acceptable for a time, it may be possible to mitigate the associated risks, or it may become unacceptable to allow a risk to remain, in which case you take action to address it.
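As a minimal sketch of the risk register described above, the following Python records each risk with an owner, a score, and an optional acceptance deadline, then lists the risks that need action. The `Risk` class, the 1 to 5 likelihood and impact scales, and the score threshold of 12 are illustrative assumptions; they are not part of any AWS service or of the framework itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Risk:
    """One entry in a hypothetical risk register."""
    name: str
    owner: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    accepted_until: Optional[date] = None  # None means the risk was never accepted

    @property
    def score(self) -> int:
        # Simple likelihood x impact scoring; a real register may weight differently.
        return self.likelihood * self.impact


def risks_needing_action(register: List[Risk], today: date,
                         threshold: int = 12) -> List[Risk]:
    """Return risks at or above the threshold whose acceptance is missing or expired."""
    due = [
        r for r in register
        if r.score >= threshold
        and (r.accepted_until is None or r.accepted_until < today)
    ]
    # Highest score first, so the most significant risks are reviewed first.
    return sorted(due, key=lambda r: r.score, reverse=True)


register = [
    Risk("Relational DB used for non-relational data", "data-team", 3, 4,
         accepted_until=date(2024, 6, 30)),
    Risk("Unpatched bastion host", "ops-team", 4, 5),
    Risk("Single approver for releases", "dev-team", 2, 3),
]

for risk in risks_needing_action(register, today=date(2024, 9, 1)):
    print(f"{risk.score:>2}  {risk.name}  (owner: {risk.owner})")
```

The `accepted_until` field models the idea that a risk may be acceptable for a time: once the date passes, the risk reappears in the list of items that need a decision.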

Your teams need to understand their part in achieving business outcomes. Teams need to understand their roles in the success of other teams, the role of other teams in their own success, and have shared goals. Understanding responsibility, ownership, how decisions are made, and who has the authority to make them helps focus efforts and maximizes the benefit your teams deliver. A team's needs are shaped by the customers it supports, its organization, its makeup, and the characteristics of its workloads. It is unreasonable to expect a single operating model to support every team and workload in your organization.

Ensure that every application, workload, platform, and infrastructure component has an identified owner, and that every process and procedure has an identified owner responsible for its definition and owners responsible for its performance. Understanding the business value of each component, process, and procedure, why those resources are in place or activities are performed, and why that ownership exists informs the actions of your team members. Clearly define the responsibilities of team members so that they can act appropriately, and have mechanisms to identify responsibility and ownership. Have mechanisms to request additions, changes, and exceptions so that you do not constrain innovation. Define agreements between teams describing how they work together to support each other and your business outcomes.

Provide support for your team members so that they can be more effective in taking action and supporting your business outcomes. Engaged senior leadership should set expectations and measure success. They should be the sponsor, advocate, and driver for the adoption of best practices and the evolution of the organization. Empower team members to take action when outcomes are at risk, to minimize impact. Encourage them to escalate to decision makers and stakeholders when they believe there is a risk, so that it can be addressed and incidents avoided. Communicate known risks and planned events in a timely, clear, and actionable way so that team members can respond appropriately.

Encourage experimentation to accelerate learning and keep team members interested and engaged. Teams must grow their skill sets to adopt new technologies and to support changes in demand and responsibilities. Support and encourage this by providing dedicated, structured time for learning. Ensure that your teams have the resources, both tools and team members, to support your business outcomes. Draw on the diversity across your organization to seek out multiple unique perspectives. Use these perspectives to increase innovation, challenge your assumptions, and reduce the risk of confirmation bias. Grow inclusion, diversity, and accessibility within your teams to gain beneficial perspectives.

If external regulatory or compliance requirements apply to your organization, use the resources provided by AWS Cloud Compliance to educate your teams about the impact on your priorities. The Well-Architected Framework emphasizes learning, measuring, and improving. It provides a consistent approach for evaluating architectures and implementing designs that scale over time. AWS provides the AWS Well-Architected Tool to help you review your approach before development, the state of your workloads before production, and the state of your workloads in production. You can compare them to the latest AWS architectural best practices, monitor the overall status of your workloads, and gain insight into potential risks. AWS Trusted Advisor is a tool that provides access to a core set of checks that recommend optimizations. These recommendations can help shape your priorities. Business and Enterprise Support customers have access to additional checks focusing on security, reliability, performance, and cost optimization that can further help shape their priorities.


Use tools or services that let you centrally govern your environments across accounts, such as AWS Organizations, to help manage your operating models. Services like AWS Control Tower expand this management capability by letting you define blueprints (supporting your operating models) for the setup of accounts, apply ongoing governance using AWS Organizations, and automate the provisioning of new accounts. Managed services providers such as AWS Managed Services, AWS Managed Services Partners, or managed services providers in the AWS Partner Network provide expertise in implementing cloud environments and support your security and compliance requirements and business goals. Adding managed services to your operating model can save you time and resources, and lets you keep your internal teams lean and focused on strategic outcomes that differentiate your business, rather than on developing new skills and capabilities.

The following questions focus on these considerations.

At times, too much attention may be given to a small subset of operational priorities. Use a balanced approach over the long term to ensure that needed capabilities are developed and risks are managed. Review your priorities regularly and update them as needs change. When responsibility and ownership are undefined or unknown, you risk both that necessary actions are not taken in a timely fashion and that redundant, potentially conflicting efforts emerge to address the same needs. Organizational culture has a direct impact on team member job satisfaction and retention. Engage your team members and enable their capabilities to support the success of your business. Experimentation is required for innovation to happen and for ideas to turn into outcomes. Recognize that an undesired result is still a successful experiment: it has identified a path that will not lead to success.

Prepare

To prepare for operational excellence, you have to understand your workloads and their expected behaviors. You can then design them to provide insight into their status and build the procedures to support them.

Design your workload so that it provides the information you need to understand its internal state (for example, metrics, logs, events, and traces) across all components. This increases observability and makes it easier to investigate issues. Iterate to develop the telemetry needed to monitor the health of your workload, identify when outcomes are at risk, and respond effectively. When instrumenting your workload, capture a broad set of information for situational awareness (for example, changes in state, user activity, privileged access, and utilization counters), knowing that you can filter later to select the most useful information.
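One way to capture broad situational information while keeping it filterable later is to emit structured log records. The sketch below uses only the Python standard library; the `JsonFormatter` class, the `context` field convention, and the `order-service` component name are assumptions made for illustration, not an AWS-prescribed format.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} carry the situational detail.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(component: str) -> logging.Logger:
    """Return a logger for one workload component that writes JSON lines."""
    logger = logging.getLogger(component)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("order-service")
log.info(
    "order state changed",
    extra={"context": {"order_id": "o-123", "old_state": "PENDING",
                       "new_state": "PAID", "privileged_access": False}},
)
```

Because every record is a flat JSON object, a log analysis tool can later filter on any field (for example, all records where `privileged_access` is true) without the field having been planned for in advance.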

Adopt approaches that improve the flow of changes into production and that enable refactoring, fast feedback on quality, and fast bug fixing. These accelerate beneficial changes entering production and limit the number of issues deployed. They also let you quickly identify and remediate issues introduced through deployment activities or discovered in your environments.

Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not have the desired outcomes. These practices mitigate the impact of issues introduced by deploying changes. Plan for unsuccessful changes so that you can respond faster if necessary, and test and validate the changes you make. Be aware of planned activities in your environments so that you can manage the risk of changes affecting them. Make frequent, small, reversible changes to limit their scope. This makes troubleshooting easier and remediation faster, with the option to roll back a change. It also means you realize the benefits of valuable changes more often.

Evaluate the operational readiness of your workload, processes and procedures, and personnel so that you understand the operational risks related to your workload. Use a consistent process (including manual or automated checklists) so that you know when you are ready to go live with your workload or a change. This also helps you find any areas that you need to make plans to address. Document your routine activities in runbooks, and use playbooks to guide issue resolution. Understand the benefits and risks so that you can make informed decisions about allowing changes into production.
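A consistent readiness process can combine automated probes with manual sign-offs in a single checklist. The following is a rough sketch under that assumption; the `Check` class, the check descriptions, and the `manual_signoffs` dictionary are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Check:
    description: str
    automated: bool
    passed: Callable[[], bool]  # automated probe, or a lookup of a manual sign-off


def assess_readiness(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check; return overall readiness and the failed check descriptions."""
    failures = [c.description for c in checks if not c.passed()]
    return (not failures, failures)


# Hypothetical sign-offs recorded by people for the manual checklist items.
manual_signoffs: Dict[str, bool] = {
    "runbook reviewed": True,
    "on-call rota staffed": False,
}

checks = [
    Check("alarms have an owner", automated=True, passed=lambda: True),
    Check("rollback tested", automated=True, passed=lambda: True),
    Check("runbook reviewed", automated=False,
          passed=lambda: manual_signoffs["runbook reviewed"]),
    Check("on-call rota staffed", automated=False,
          passed=lambda: manual_signoffs["on-call rota staffed"]),
]

ready, gaps = assess_readiness(checks)
print("ready to go live" if ready else f"not ready, open items: {gaps}")
```

The list of failed checks doubles as the list of areas that still need a plan before the workload or change goes live.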

AWS lets you view your entire workload (applications, infrastructure, policy, governance, and operations) as code. All of it can be defined in and updated using code, which means you can apply the same engineering discipline that you use for application code to every element of your stack. You can share these definitions across teams or organizations to magnify the benefit of development efforts. Using operations as code in the cloud, together with the ability to experiment safely, lets you develop your workload and your operations procedures and practice handling failure. With AWS CloudFormation, you can have consistent, templated sandbox, development, test, and production environments with increasing levels of operations control.
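To illustrate defining infrastructure once and reusing it across environments, the sketch below builds a small CloudFormation template as a Python dictionary and prints it as JSON. The `build_template` function, the `LogBucket` logical ID, and the choice of a versioned S3 bucket are assumptions for the example; only the template keys and the `AWS::S3::Bucket` resource type come from CloudFormation itself.

```python
import json


def build_template(allowed_environments):
    """Return a CloudFormation template (as a dict) shared by all environments."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative log bucket defined once, deployed per environment",
        "Parameters": {
            "Environment": {
                "Type": "String",
                "AllowedValues": list(allowed_environments),
            }
        },
        "Resources": {
            "LogBucket": {  # hypothetical logical ID
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                    "Tags": [
                        {"Key": "Environment", "Value": {"Ref": "Environment"}}
                    ],
                },
            }
        },
    }


template = build_template(["sandbox", "dev", "test", "prod"])
print(json.dumps(template, indent=2))
```

Deploying the same template as one stack per environment, with only the `Environment` parameter changing, is one way to keep sandbox, development, test, and production consistent. Because the template is ordinary data, it can be versioned and reviewed like application code.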

The following questions focus on these considerations.

Invest in implementing operations activities as code to maximize the productivity of operations personnel, minimize error rates, and enable automated responses. Anticipate failure where possible and create procedures where appropriate. Apply metadata using resource tags and AWS Resource Groups, following a consistent tagging strategy, to enable identification of your resources. Tag your resources for organization, cost accounting, access control, and targeting the execution of automated operations activities. Adopt deployment practices that take advantage of the elasticity of the cloud to facilitate development activities and the pre-deployment of systems for faster implementations. When you change the checklists you use to evaluate your workloads, plan what you will do with live systems that no longer comply.
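A consistent tagging strategy is easier to keep consistent when it is checked by code. The sketch below validates resource tags against a small policy; the required keys (`Environment`, `Owner`, `CostCenter`), the allowed values, and the resource IDs are assumptions for illustration, and a real check would read tags from your inventory rather than from a literal dictionary.

```python
from typing import Dict, List

# Assumed tagging policy: required keys and, where restricted, the allowed values.
REQUIRED_TAGS: Dict[str, List[str]] = {
    "Environment": ["sandbox", "dev", "test", "prod"],
    "Owner": [],        # an empty list means any non-empty value is accepted
    "CostCenter": [],
}


def tag_violations(tags: Dict[str, str]) -> List[str]:
    """Return human-readable policy violations for one resource's tags."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key, "").strip()
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed and value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return problems


# Hypothetical inventory of resources and their tags.
resources = {
    "i-0abc": {"Environment": "prod", "Owner": "payments", "CostCenter": "cc-42"},
    "i-0def": {"Environment": "production", "Owner": "search"},
}

for resource_id, tags in resources.items():
    for problem in tag_violations(tags):
        print(f"{resource_id}: {problem}")
```

Restricting `Environment` to a fixed set of values matters because automated operations activities that target resources by tag will silently miss a resource tagged `production` when they select on `prod`.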

Operate

Successful operation of a workload is measured by the achievement of business and customer outcomes. Define expected outcomes, determine how success will be measured, and identify the metrics that will be used in those calculations to determine whether your workload and operations are successful. Operational health includes both the health of the workload and the health and success of the operations activities performed in support of the workload (for example, deployment and incident response). Establish metric baselines for improvement, investigation, and intervention. Collect and analyze your metrics, then validate your understanding of operations success and how it changes over time. Use the collected metrics to determine whether you are satisfying customer and business needs, and to identify areas for improvement.
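As one possible reading of "baselines for improvement, investigation, and intervention", the sketch below derives a baseline from historical samples and classifies new observations by how far they deviate from it. The two and three standard deviation thresholds, the function names, and the latency figures are assumptions chosen for the example, not values recommended by AWS.

```python
from statistics import mean, stdev
from typing import List, Tuple


def baseline(samples: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of historical observations."""
    return mean(samples), stdev(samples)


def classify(value: float, base: Tuple[float, float],
             investigate_at: float = 2.0, intervene_at: float = 3.0) -> str:
    """Map a new observation to 'ok', 'investigate', or 'intervene'."""
    avg, sd = base
    # With zero variance there is no meaningful deviation, so treat it as 0.
    deviation = abs(value - avg) / sd if sd else 0.0
    if deviation >= intervene_at:
        return "intervene"
    if deviation >= investigate_at:
        return "investigate"
    return "ok"


# Hypothetical p95 latency samples (milliseconds) from a normal week.
history = [200.0, 210.0, 190.0, 205.0, 195.0, 200.0, 200.0]
base = baseline(history)

for latency in (204.0, 215.0, 240.0):
    print(latency, classify(latency, base))
```

A fixed multiple of the standard deviation is a simple starting point. Metrics with strong daily or weekly patterns usually need a baseline per time window instead of one global figure.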

Efficient and effective management of operational events is required to achieve operational excellence. This applies to both planned and unplanned operational events. Use established runbooks for well-understood events, and use playbooks to aid investigation and resolution of issues. Prioritize responses to events based on their business and customer impact. Ensure that when an alert is raised in response to an event, there is an associated process to run, with a specifically identified owner. Define in advance the personnel required to resolve an event, and include escalation triggers to engage additional personnel as needed, based on urgency and impact. Identify and engage individuals with the authority to make a decision when a response not previously addressed could have business impact.
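The prioritization and escalation rules described above can be written down explicitly so that they are applied the same way every time. The following sketch is one hypothetical encoding: the `Event` fields, the three priority levels, and the escalation time limits are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    name: str
    customers_affected: int
    revenue_impacting: bool
    minutes_open: int
    owner: str  # every alert is assumed to have a named owner


def priority(event: Event) -> int:
    """Lower number = more urgent. The weighting is an illustrative assumption."""
    if event.revenue_impacting and event.customers_affected > 0:
        return 1
    if event.customers_affected > 0:
        return 2
    return 3


def should_escalate(event: Event) -> bool:
    """Escalate when an event stays open longer than its priority allows."""
    limits = {1: 15, 2: 60, 3: 240}  # assumed minutes before escalation
    return event.minutes_open > limits[priority(event)]


events: List[Event] = [
    Event("disk usage warning", 0, False, 30, "ops-oncall"),
    Event("checkout errors", 1200, True, 20, "payments-oncall"),
    Event("slow search", 300, False, 10, "search-oncall"),
]

for e in sorted(events, key=priority):
    action = "escalate" if should_escalate(e) else "owner handles"
    print(f"P{priority(e)} {e.name}: {action} ({e.owner})")
```

Keeping the escalation limits in one table makes the trigger for involving additional personnel something that is decided in advance, not during the event.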

Communicate the operational status of workloads through dashboards and notifications tailored to the target audience (for example, customer, business, developers, operations) so that they can take appropriate action and know when normal operations will resume.

In AWS, you can generate dashboard views of metrics collected from your workloads and natively from AWS. You can use CloudWatch or third-party applications to aggregate and present business, workload, and operations-level views of operations activities. AWS provides workload insights through logging capabilities including AWS X-Ray, CloudWatch, CloudTrail, and VPC Flow Logs, which let you identify workload issues in support of root cause analysis and remediation.

The following questions focus on these considerations.

Every metric you collect should be aligned to a business need and the outcomes it supports. Develop scripted responses to well-understood events and automate their execution in response to recognizing the event.
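A scripted response can be as simple as a registry that maps each well-understood event type to a function, with a fallback to a person when no script exists. The sketch below shows that pattern; the event types, the `responds_to` decorator, and the response functions are hypothetical, and the functions only report what they would do where a real response would call the relevant API.

```python
from typing import Callable, Dict, List

# Registry mapping an event type to its scripted response.
RESPONSES: Dict[str, Callable[[dict], str]] = {}


def responds_to(event_type: str):
    """Decorator that registers a function as the response for one event type."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        RESPONSES[event_type] = func
        return func
    return register


@responds_to("disk_full")
def expand_volume(event: dict) -> str:
    # A real response would call the relevant API; this sketch only reports intent.
    return f"expand volume on {event['host']}"


@responds_to("unhealthy_instance")
def replace_instance(event: dict) -> str:
    return f"replace instance {event['instance_id']}"


def handle(event: dict) -> str:
    """Run the scripted response, or hand off to a person if none exists."""
    response = RESPONSES.get(event["type"])
    if response is None:
        return f"no runbook for {event['type']}: page the on-call owner"
    return response(event)


detected: List[dict] = [
    {"type": "disk_full", "host": "db-1"},
    {"type": "certificate_expiring", "domain": "example.com"},
]
for event in detected:
    print(handle(event))
```

Events that fall through to the on-call owner are candidates for the next scripted response once they are well understood.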

Evolve

To sustain operational excellence, you need to learn, share knowledge, and improve continuously. Dedicate work cycles to making continuous incremental improvements. Perform post-incident analysis of all customer-impacting events. Identify the contributing factors and the preventive actions that limit or prevent recurrence. Communicate contributing factors to affected communities as appropriate. Regularly evaluate and prioritize opportunities for improvement (for example, feature requests, issue remediation, and compliance requirements), covering both the workload and operations procedures. Include feedback loops in your procedures so that you quickly identify areas for improvement and capture learnings from operating the workload.

Share lessons learned across teams so that everyone benefits. Analyze trends within the lessons learned, and perform cross-team retrospective analysis of operations metrics to identify opportunities and methods for improvement. Implement changes intended to bring about improvement and evaluate the results to determine success.

In AWS, you can export your log data to Amazon S3 or send logs directly to Amazon S3 for long-term storage. Using AWS Glue, you can discover and prepare your historical data in Amazon S3 for analytics, and store the associated metadata in the AWS Glue Data Catalog. Amazon Athena, through its native integration with AWS Glue, can then be used to analyze your historical data, querying it with standard SQL. With a business intelligence tool like Amazon QuickSight, you can visualize, explore, and analyze your data to discover trends and events of interest that may drive improvement.

The following questions focus on these considerations.

Successful evolution of operations is founded on frequent small improvements; safe environments and time to experiment, develop, and test improvements; and an environment in which learning from failures is encouraged. Operations support for sandbox, development, test, and production environments, with increasing levels of operational control, facilitates development and increases the predictability of successful results from changes deployed into production.