What to Ask in an SRE Technical Interview

24 March 2021

This post follows directly on from [What I Look for in Interviews](), applying it specifically to SRE.

What is an SRE?

I’m not going to try to answer this, but you should. You should sit back and think about what you need now to keep the lights on, and what you want in the future to make your platform better. What is your platform team’s engagement model? The great SRE wall? The cuddly DevOps kumbaya? The you-build-it-you-run-it approach of embedding platform people in product teams? You need to be able to tell the candidate that, and you need to work out what skills and behaviours it’s going to require.

Then write a Role Spec. Ideally, write a Career Framework, i.e. define what SRE I, SRE II, SRE III, and Senior SRE look like to you. Each of those levels essentially contains a Role Spec, against which you ask questions.

Breadth

Ask questions across the whole range of skills they’re meant to have. An SRE is meant to be a “software engineer who happens to be working on infra problems at the moment.” It requires a lot of soft skills I won’t go into, and two broad technical areas.

Software engineering

Proper SWE. Your test code should be just as high quality as your functional code, and so should your infrastructure “scripts”. A big part of the job these days is writing Kubernetes Operators and Terraform Providers. If you want to be pushing the boundaries, you need a team that could write the next Envoy or Vault.

So not just “I can write spaghetti and the computer will print ‘hello.’” Not “scripting.” Do they understand modularity, testing, commenting, version control, branching and PRs, DI, ORMs, etc?

That’s not to say that day-to-day they won’t be writing a lot of 100-line Bash and Python scripts. But they should be able to do better when necessary, and always have those good software engineering principles in mind.

I would also check two more things that are very relevant and important these days, but where I think interviews often suck: complexity and distributed systems. When I say interviews suck, I’m looking at you, Google. This isn’t a college exam. When they see a piece of code that linearly searches an array, do they quickly check whether it’s on a hot path, and add an index if so? That’s what’s needed to do the job: a sympathy for slow vs fast code. Advanced algebra isn’t; they’re not quants.
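To make the "linear search vs index" point concrete, here is a minimal sketch of the kind of change I'd hope a candidate reaches for. The `User` record and the function names are made up for illustration; the only claim is the general one, that a dict lookup avoids rescanning the list on every call.

```python
from collections import namedtuple

# Hypothetical record type, purely for illustration.
User = namedtuple("User", ["id", "name"])


def find_linear(users, user_id):
    """Scan the whole list: O(n) per lookup. Fine once, costly on a hot path."""
    for user in users:
        if user.id == user_id:
            return user
    return None


def build_index(users):
    """Pay O(n) once to build a dict keyed by id."""
    return {user.id: user for user in users}


def find_indexed(index, user_id):
    """Average O(1) per lookup once the index exists."""
    return index.get(user_id)


users = [User(i, f"user-{i}") for i in range(1000)]
index = build_index(users)
```

The interesting part of the conversation is not the dict itself. It's whether they ask how often the lookup runs and how big the list gets before they bother.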

Operations

This is where a lot of recent graduates and converted software engineers fall down. Some examples of things they don’t teach in Java school, but are needed day-to-day:

  • Networks
  • Linux - leave the OS wars at the door, all production systems are Linux. They need to know their way around the shell, /proc, virtualisation, containerisation. Can they tell you how paged memory works? If they can’t, they’re never gonna debug Prometheus OOMing at 3am.
  • Playbooks & Runbooks
  • Backups - what’s a snapshot vs a journal? How do you test a backup?
  • On call procedures
  • “Security”
  • Public key infrastructures
  • RED metrics, USE metrics, alert tuning
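As an example of the last bullet, RED stands for Rate, Errors and Duration. Below is a rough sketch of computing those three from a batch of request records. The `Request` shape, the "5xx counts as an error" rule and the nearest-rank percentile are all my assumptions for the example; in real life your metrics system does this for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape, assumed for this sketch.
    status: int
    duration_ms: float


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        return None
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests, window_seconds):
    """Rate, Errors, Duration for requests seen in one time window."""
    total = len(requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }
```

A good follow-up question is which of these numbers they would alert on, and at what threshold.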

Depth

Do they really grok what they’re seeing? Do they have enough understanding to synthesise novel things? Have they seen enough to form good, general mental models? Can they use those to teach junior people?

Check if they know how to find what process is bound to a particular port. What’s disk sleep? What’s a zombie process? What does git rebase do to the graph, and when should we use it? How does a Deployment resource in Kubernetes actually work? What’s a container made of? What do level-triggering and eventual consistency actually mean?
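For the level-triggering question, the answer I'm fishing for is roughly "the controller compares desired state with observed state and acts on the difference, rather than reacting to individual events". Here is a toy sketch of that idea. It is not how Kubernetes is implemented; `reconcile`, `converge` and the pod names are invented for the example.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions needed to move observed state to desired state.

    Level-triggered: the decision depends only on the current state,
    not on which events led to it, so a missed event does no harm.
    """
    observed = len(running_pods)
    if observed < desired_replicas:
        return [("create", None)] * (desired_replicas - observed)
    if observed > desired_replicas:
        surplus = sorted(running_pods)[desired_replicas:]
        return [("delete", name) for name in surplus]
    return []


def converge(desired_replicas, running_pods):
    """Apply one round of actions and return the resulting set of pods."""
    pods = set(running_pods)
    created = 0
    for action, name in reconcile(desired_replicas, pods):
        if action == "create":
            created += 1
            pods.add(f"new-pod-{created}")  # hypothetical naming scheme
        else:
            pods.remove(name)
    return pods
```

Running `reconcile` again on the converged state returns no actions, which is the property that makes the loop safe to repeat forever.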

Open-Ended Questions

Very useful for checking depth.

  • What happens when I curl google.com?
  • I’ve cracked your server. How do I exfiltrate data without being seen?
  • Explain what a container is made of

How to Ask it

Especially for SRE roles, I prefer practical tests. They can take some setting up, but they’re so much better at finding on-call buddies than “tell me about a time you helped a team migrate to the cloud.”

Give them a broken k8s cluster and get them to fix it, CKA style. Get them to write practical but difficult code: for example, emit a log that can be quickly searched later, or merge the sets of files in two directories.
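To give a flavour of the "log that can be quickly searched later" exercise, here is one possible shape of an answer: one JSON object per line, with stable field names. It's a sketch under my own assumptions (the field names and the `log_event` and `search` helpers are invented), and a strong candidate may well do it differently.

```python
import io
import json
import time


def log_event(stream, event, **fields):
    """Write one JSON object per line so each record can be parsed alone."""
    record = {"ts": time.time(), "event": event, **fields}
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def search(lines, **criteria):
    """Yield the records whose fields match every given key/value pair."""
    for line in lines:
        record = json.loads(line)
        if all(record.get(key) == value for key, value in criteria.items()):
            yield record


# Usage: an in-memory buffer stands in for a real log file.
buf = io.StringIO()
log_event(buf, "login_failed", user="alice", request_id="r-1")
log_event(buf, "login_ok", user="bob", request_id="r-2")
hits = list(search(buf.getvalue().splitlines(), user="alice"))
```

What I'd look for is the reasoning: why structured fields beat free text, and which fields (request ID, user, timestamp) they expect to search on at 3am.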