Director, Site Reliability Engineering

Location: Redwood City, CA

Department: Engineering

Type: Full Time

Min. Experience: Senior Manager/Supervisor

About AutoGrid:

AutoGrid’s Distributed Energy Resources (DER) flexibility management software allows utilities, electricity retailers, energy service providers and owners of large asset portfolios to deliver clean, affordable and reliable energy. The company’s patented Predictive Controls™ technology leverages advanced data science, machine learning and high-performance computing to predict, control and optimize all DERs at scale and in real time. AutoGrid has more than 2,500 megawatts of flexible capacity from distributed generation, storage and DR resources under contract. Its solutions help more than 30 global energy companies such as National Grid, Florida Power & Light, Eneco, CPS Energy and NextEra Energy to balance supply and demand, increase grid reliability, integrate renewables and deliver value-added services to customers in regulated and deregulated markets worldwide.

Key Responsibilities:

  • Own end-to-end availability, reliability, and performance of AutoGrid’s business-critical software services
  • Lead, manage, and develop a distributed team of Site Reliability Engineers
  • Provide training for all personnel to ensure the highest level of support and customer satisfaction
  • Manage and maintain AutoGrid’s globally distributed deployments to ensure rapid implementation with maximum support
  • Design and implement processes for rolling out software and security updates to deployments, ideally with zero downtime
  • Build and maintain our platform and automation frameworks to ensure maximum uptime and predictability while preventing outages, service interruptions, and degradations
  • Analyze system failures and develop rapid response processes to ensure such failures do not recur
  • Work cross-functionally with Engineering, Program Management and Finance on capacity planning, forecasting and production budget management
  • Partner with Engineering to provide the infrastructure and services required to enable innovation and ensure the highest level of quality and service
  • Work closely with the Head of Cybersecurity to ensure all deployments meet or exceed the industry-required security posture
  • Anticipate and flag potential system vulnerabilities in current and future solutions and implementations, and provide specific recommendations and guidance to address them
  • Develop and manage processes and metrics that ensure maximum reliability and uptime for our customers
  • Analyze, build and maintain all automation tools and processes to ensure the highest standards of reliability and robustness
  • Ensure SRE adoption of Agile and DevOps methodologies and help the SRE teams fully leverage the source control, CI, quality engineering and release management resources available to them
  • Fully understand our customers’ service needs and ensure we meet these needs
  • Participate in 24x7 site reliability rotations and escalation workflows

Requirements:

  • 5+ years of experience in site reliability and technical operations leadership, including building large, geographically dispersed infrastructure supporting business-critical cloud services
  • 5+ years of people management and team leadership experience including headcount planning and developing strong and motivated teams
  • Experience creating effective resource plans and cost-effective budgets that ensure a high level of performance
  • Experience developing repeatable processes and metrics that maximize uptime, reliability, and predictability
  • Experience managing complex deployments
  • Successful track record of building and maintaining infrastructure including required monitoring, testing, and tooling
  • Experience with AWS strongly preferred
  • Solid background in Linux system administration and understanding of deployment automation scripting languages (such as Python) and frameworks (such as Ansible, Chef, etc.)
  • Passion for securing cloud infrastructure (security monitoring, PAM, key-based authentication, role-based authorization, audit logging and patching)
  • Experience assisting with security certification audits a plus
  • Experience with Agile and DevOps methodologies
  • Effective verbal and written communication and interpersonal skills, including the ability to interface with customers professionally and cooperatively
  • Able to develop and maintain strong relationships with AutoGrid customers
  • BS in Computer Science or a related field, or equivalent experience

This role reports to the VP of Engineering.

AutoGrid is an equal opportunity employer.