97 Things Every SRE Should Know

Collective Wisdom from the Experts

Edited by: Emil Stolarsky & Jaime Woo
Publisher: O'Reilly Media
Published: November 2020
Pages: 250

📚

About This Book

"97 Things Every SRE Should Know" brings together 97 concise and actionable insights from 70+ site reliability engineers across the industry. Edited by Emil Stolarsky and Jaime Woo, co-founders of Incident Labs, this O'Reilly publication explores a broad range of conversations happening in SRE.

From foundational concepts for newcomers to advanced practices for enterprise-level teams, this collection provides practical guidance on SRE adoption, service level objectives, incident response enhancement, and the critical distinction between monitoring and observability.

What Makes This Book Special

  • ✦ Diverse Perspectives: Contributions from 70+ industry practitioners including Nicole Forsgren, Charity Majors, Liz Fong-Jones, Tanya Reilly, and Julia Evans
  • ✦ Practical & Concise: Each of the 97 chapters offers actionable advice you can apply immediately
  • ✦ Scales with You: Organized to address teams from startup to enterprise scale
  • ✦ Timeless Wisdom: Focuses on principles and practices that remain relevant despite technological changes

Book Structure

🌱

Part I: New to SRE

Chapters 1-21

Foundational concepts and principles for those starting their SRE journey. Covers the basics of reliability engineering, essential practices, and core mindsets.

🚀

Part II: Zero to One

Chapters 22-35

Implementation strategies for teams establishing their first SRE practices. Focuses on getting started, building cultural foundations, and early wins.

📈

Part III: One to Ten

Chapters 36-65

Scaling operational practices and team dynamics as you grow. Addresses challenges of expanding SRE teams and maturing processes.

🏢

Part IV: Ten to Hundred

Chapters 66-97

Enterprise-level practices for large-scale SRE organizations. Covers organizational structure, standardization, and sophisticated operations.

My Contribution

Chapter 38 • Part III: One to Ten

Unpacking the On-Call Divide

My chapter explores the often-overlooked challenges and disparities in on-call practices. As teams scale from small startups to larger organizations, the on-call experience can vary dramatically, creating divides that impact team health, service reliability, and individual well-being.

I examine how different approaches to on-call responsibilities can create inequities, leading to burnout and reduced effectiveness. The chapter provides practical strategies for creating fair, sustainable on-call practices that work for teams at this critical growth stage.

Key Insights

  • → Understanding different on-call models and their impacts
  • → Identifying and addressing on-call inequities
  • → Building sustainable on-call rotations
  • → Preventing burnout in on-call engineers

Practical Applications

  • ✓ Designing fair on-call schedules
  • ✓ Compensating on-call work appropriately
  • ✓ Measuring on-call burden and impact
  • ✓ Creating supportive on-call cultures

"Unpacking the On-Call Divide" appears in Part III of the book, which focuses on teams scaling from 1-10 SREs, a critical phase where on-call practices can make or break team morale and service reliability.

Key Topics

🎯

SLOs & Error Budgets

Service level objectives, error budgets, and reliability targets

🚨

Incident Response

Effective incident management and post-mortem practices

📊

Observability

Monitoring, metrics, logging, and understanding system behavior

🔄

Automation

Toil reduction, automation strategies, and tooling

👥

Team Culture

Building effective SRE teams and fostering collaboration

⚡

Performance

Scalability, capacity planning, and optimization

🔥

On-Call & Burnout

Sustainable on-call practices and preventing burnout

🏗️

Infrastructure

Cloud infrastructure, distributed systems, and architecture

📈

Continuous Improvement

Learning from failures and evolving practices

Featured Contributors

This book features insights from 70+ industry leaders and practitioners who share their hard-won lessons from building and operating reliable systems at scale.

Nicole Forsgren

"The Best Advice I Can Give to Teams"

Charity Majors

Co-founder & CTO, Honeycomb

Liz Fong-Jones

Developer Advocate & SRE

Tanya Reilly

Senior Principal Engineer

Julia Evans

Software Engineer & Educator

Alex Hidalgo

SLO Expert & Author

Laura Nolan

SRE & Author

Jason Hand

"Unpacking the On-Call Divide"

...and 60+ more industry practitioners

About the Editors

Emil Stolarsky

Co-founder of Incident Labs, Emil brings deep expertise in incident management and SRE practices. His work focuses on building tools and processes that help teams respond to and learn from incidents.

Jaime Woo

Co-founder of Incident Labs, Jaime specializes in organizational resilience and effective incident communication. Together, Jaime and Emil curated this collection to share the SRE community's collective wisdom.