97 Things Every SRE Should Know

Collective Wisdom from the Experts

Edited by: Emil Stolarsky & Jaime Woo
Publisher: O'Reilly Media
Published: November 2020
Pages: 250

📚

About This Book

"97 Things Every SRE Should Know" brings together 97 concise and actionable insights from 70+ site reliability engineers across the industry. Edited by Emil Stolarsky and Jaime Woo, co-founders of Incident Labs, this O'Reilly publication explores a broad range of conversations happening in SRE.

From foundational concepts for newcomers to advanced practices for enterprise-level teams, this collection provides practical guidance on adopting SRE, setting service level objectives, improving incident response, and understanding the distinction between monitoring and observability.

What Makes This Book Special

✦ Diverse Perspectives: Contributions from 70+ industry practitioners including Nicole Forsgren, Charity Majors, Liz Fong-Jones, Tanya Reilly, and Julia Evans
✦ Practical & Concise: Each of the 97 chapters offers actionable advice you can apply immediately
✦ Scales with You: Organized to address teams from startup to enterprise scale
✦ Timeless Wisdom: Focuses on principles and practices that remain relevant despite technological changes

Book Structure

🌱

Part I: New to SRE

Chapters 1-21

Foundational concepts and principles for those starting their SRE journey. Covers the basics of reliability engineering, essential practices, and core mindsets.

🚀

Part II: Zero to One

Chapters 22-35

Implementation strategies for teams establishing their first SRE practices. Focuses on getting started, building cultural foundations, and early wins.

📈

Part III: One to Ten

Chapters 36-65

Scaling operational practices and team dynamics as you grow. Addresses challenges of expanding SRE teams and maturing processes.

🏢

Part IV: Ten to Hundred

Chapters 66-97

Enterprise-level practices for large-scale SRE organizations. Covers organizational structure, standardization, and sophisticated operations.

My Contribution

Chapter 38 • Part III: One to Ten

Unpacking the On-Call Divide

My chapter explores the often-overlooked challenges and disparities in on-call practices. As teams scale from small startups to larger organizations, the on-call experience can vary dramatically, creating divides that impact team health, service reliability, and individual well-being.

I examine how different approaches to on-call responsibilities can create inequities, leading to burnout and reduced effectiveness. The chapter provides practical strategies for creating fair, sustainable on-call practices that work for teams at this critical growth stage.

Key Insights

→ Understanding different on-call models and their impacts
→ Identifying and addressing on-call inequities
→ Building sustainable on-call rotations
→ Preventing burnout in on-call engineers

Practical Applications

✓ Designing fair on-call schedules
✓ Compensating on-call work appropriately
✓ Measuring on-call burden and impact
✓ Creating supportive on-call cultures

"Unpacking the On-Call Divide" appears in Part III of the book, which focuses on teams scaling from 1-10 SREs—a critical phase where on-call practices can make or break team morale and service reliability.

Key Topics

🎯

SLOs & Error Budgets

Service level objectives, error budgets, and reliability targets

🚨

Incident Response

Effective incident management and post-mortem practices

📊

Observability

Monitoring, metrics, logging, and understanding system behavior

🔄

Automation

Toil reduction, automation strategies, and tooling

👥

Team Culture

Building effective SRE teams and fostering collaboration

⚡

Performance

Scalability, capacity planning, and optimization

🔥

On-Call & Burnout

Sustainable on-call practices and preventing burnout

🏗️

Infrastructure

Cloud infrastructure, distributed systems, and architecture

📈

Continuous Improvement

Learning from failures and evolving practices

Featured Contributors

This book features insights from 70+ industry leaders and practitioners who share their hard-won lessons from building and operating reliable systems at scale.

Nicole Forsgren

"The Best Advice I Can Give to Teams"

Charity Majors

Co-founder & CTO, Honeycomb

Liz Fong-Jones

Developer Advocate & SRE

Tanya Reilly

Senior Principal Engineer

Julia Evans

Software Engineer & Educator

Alex Hidalgo

SLO Expert & Author

Laura Nolan

SRE & Author

Jason Hand

"Unpacking the On-Call Divide"

...and 60+ more industry practitioners

About the Editors

Emil Stolarsky

Co-founder of Incident Labs, Emil brings deep expertise in incident management and SRE practices. His work focuses on building tools and processes that help teams respond to and learn from incidents.

Jaime Woo

Co-founder of Incident Labs, Jaime specializes in organizational resilience and effective incident communication. Together, Emil and Jaime curated this collection to share the SRE community's collective wisdom.
