97 Things Every SRE Should Know
Collective Wisdom from the Experts
Edited by: Emil Stolarsky & Jaime Woo
Publisher: O'Reilly Media
Published: November 2020
Pages: 250
About This Book
"97 Things Every SRE Should Know" brings together 97 concise and actionable insights from 70+ site reliability engineers across the industry. Edited by Emil Stolarsky and Jaime Woo, co-founders of Incident Labs, this O'Reilly publication explores a broad range of conversations happening in SRE.
From foundational concepts for newcomers to advanced practices for enterprise-level teams, this collection provides practical guidance on SRE adoption, service level objectives, incident response enhancement, and the critical distinction between monitoring and observability.
What Makes This Book Special
- Diverse Perspectives: Contributions from 70+ industry practitioners including Nicole Forsgren, Charity Majors, Liz Fong-Jones, Tanya Reilly, and Julia Evans
- Practical & Concise: Each of the 97 chapters offers actionable advice you can apply immediately
- Scales with You: Organized to address teams from startup to enterprise scale
- Timeless Wisdom: Focuses on principles and practices that remain relevant despite technological changes
Book Structure
Part I: New to SRE
Chapters 1-21
Foundational concepts and principles for those starting their SRE journey. Covers the basics of reliability engineering, essential practices, and core mindsets.
Part II: Zero to One
Chapters 22-35
Implementation strategies for teams establishing their first SRE practices. Focuses on getting started, building cultural foundations, and early wins.
Part III: One to Ten
Chapters 36-65
Scaling operational practices and team dynamics as you grow. Addresses challenges of expanding SRE teams and maturing processes.
Part IV: Ten to Hundred
Chapters 66-97
Enterprise-level practices for large-scale SRE organizations. Covers organizational structure, standardization, and sophisticated operations.
My Contribution
Unpacking the On-Call Divide
My chapter explores the often-overlooked challenges and disparities in on-call practices. As teams scale from small startups to larger organizations, the on-call experience can vary dramatically, creating divides that impact team health, service reliability, and individual well-being.
I examine how different approaches to on-call responsibilities can create inequities, leading to burnout and reduced effectiveness. The chapter provides practical strategies for creating fair, sustainable on-call practices that work for teams at this critical growth stage.
Key Insights
- Understanding different on-call models and their impacts
- Identifying and addressing on-call inequities
- Building sustainable on-call rotations
- Preventing burnout in on-call engineers
Practical Applications
- Designing fair on-call schedules
- Compensating on-call work appropriately
- Measuring on-call burden and impact
- Creating supportive on-call cultures
"Unpacking the On-Call Divide" appears in Part III of the book, which focuses on teams scaling from 1-10 SREsβa critical phase where on-call practices can make or break team morale and service reliability.
Key Topics
SLOs & Error Budgets
Service level objectives, error budgets, and reliability targets
Incident Response
Effective incident management and post-mortem practices
Observability
Monitoring, metrics, logging, and understanding system behavior
Automation
Toil reduction, automation strategies, and tooling
Team Culture
Building effective SRE teams and fostering collaboration
Performance
Scalability, capacity planning, and optimization
On-Call & Burnout
Sustainable on-call practices and preventing burnout
Infrastructure
Cloud infrastructure, distributed systems, and architecture
Continuous Improvement
Learning from failures and evolving practices
Featured Contributors
This book features insights from 70+ industry leaders and practitioners who share their hard-won lessons from building and operating reliable systems at scale.
Nicole Forsgren
"The Best Advice I Can Give to Teams"
Charity Majors
Co-founder & CTO, Honeycomb
Liz Fong-Jones
Developer Advocate & SRE
Tanya Reilly
Senior Principal Engineer
Julia Evans
Software Engineer & Educator
Alex Hidalgo
SLO Expert & Author
Laura Nolan
SRE & Author
Jason Hand
"Unpacking the On-Call Divide"
...and 60+ more industry practitioners
About the Editors
Emil Stolarsky
Co-founder of Incident Labs, Emil brings deep expertise in incident management and SRE practices. His work focuses on building tools and processes that help teams respond to and learn from incidents.
Jaime Woo
Co-founder of Incident Labs, Jaime specializes in organizational resilience and effective incident communication. Together, Emil and Jaime curated this collection to share the SRE community's collective wisdom.
Related
Post-Incident Reviews
Deep dive into effective post-incident analysis processes and learning cultures.
VictorOps Journey to SRE
Transformation story from traditional operations to modern SRE practices.
All Publications
View all of my books, reports, and publications on DevOps and SRE.