Site Reliability Engineer, Google Cloud Engine AI SRE
Company: Google
Location: Seattle
Posted on: April 2, 2026
Job Description:
In accordance with Washington state law, we are
highlighting our comprehensive benefits package, which is available
to all eligible US based employees. Benefits for this role include:
Health, dental, vision, life, disability insurance
Retirement Benefits: 401(k) with company match
Paid Time Off: 20 days of vacation per year, accruing at a rate of 6.15 hours per pay period for the first five years of employment
Sick Time: 40 hours/year (statutory, where applicable); 5 days/event (discretionary)
Maternity Leave (Short-Term Disability + Baby Bonding): 28-30 weeks
Baby Bonding Leave: 18 weeks
Holidays: 13 paid days per year

Minimum qualifications:
Bachelor's degree or equivalent practical experience.
5 years of experience working on cloud distributed systems that demand scalability, reliability, throughput and low latency.
3 years of experience coding with one or more programming languages (e.g., Java, C/C++, Python).
2 years of experience with debugging and troubleshooting software issues.

Preferred qualifications:
Master's degree in a technical field or equivalent practical experience.
Experience designing, analyzing and troubleshooting large-scale distributed systems.
Experience designing and developing software oriented towards systems or network automation.
Understanding of Unix/Linux operating systems.
Ability to debug, optimize code, and to automate routine tasks.
Excellent problem-solving and communication skills.

About the job
Site Reliability Engineering (SRE) combines software and systems
engineering to build and run large-scale, massively distributed,
fault-tolerant systems. SRE ensures that Google Cloud's
services (both our internally critical and our externally visible
systems) have reliability, uptime appropriate to customers' needs,
and a fast rate of improvement. Additionally, SREs will keep an
ever-watchful eye on our systems' capacity and performance. Much of
our software development focuses on optimizing existing systems,
building infrastructure and eliminating work through automation. On
the SRE team, you’ll have the opportunity to manage the complex
challenges of scale which are unique to Google Cloud, while using
your expertise in coding, algorithms, complexity analysis and
large-scale system design. SRE's culture of intellectual curiosity,
problem solving and openness is key to its success. Our
organization brings together people with a wide variety of
backgrounds, experiences and perspectives. We encourage them to
collaborate, think big and take risks in a blame-free environment.
We promote self-direction to work on meaningful projects, while we
also strive to create an environment that provides the support and
mentorship needed to learn and grow. Based in Seattle and London,
we manage Google Cloud Engine (GCE) AI/ML workloads and the
critical infrastructure powering them. As a Site Reliability
Engineer (SRE), you will deliver a seamless customer experience.
You will act as a first responder for AI workload health and
customer-facing issues. You will build and support capabilities for
managing ML workloads and influence architecture, standards, and
operational methods for AI services. You will develop advanced
monitoring and alerting to improve GCE visibility and collaborate
with development teams on novel, emerging technologies. Behind
everything our users see online is the architecture built by the
Technical Infrastructure team to keep it running. From developing
and maintaining our data centers to building the next generation of
Google platforms, we make Google's product portfolio possible.
We're proud to be our engineers' engineers and love voiding
warranties by taking things apart so we can rebuild them. We keep
our networks up and running, ensuring our users have the best and
fastest experience possible. The US base salary range for this
full-time position is $174,000-$252,000 + bonus + equity + benefits. Our
salary ranges are determined by role, level, and location. Within
the range, individual pay is determined by work location and
additional factors, including job-related skills, experience, and
relevant education or training. Your recruiter can share more about
the specific salary range for your preferred location during the
hiring process. Please note that the compensation details listed in
US role postings reflect the base salary only, and do not include
bonus, equity, or benefits. Learn more about benefits at Google.

Responsibilities
Act as a first responder for AI workload health and customer-facing issues.
Build and support capabilities for managing ML workloads.
Influence architecture, standards, and operational methods for AI services.
Develop advanced monitoring and alerting to improve GCE visibility.
Collaborate with development teams on novel, emerging technologies.
Bridge the gap between the infrastructure and AI.
Keywords: Google, Redmond, Site Reliability Engineer, Google Cloud Engine AI SRE, IT / Software / Systems, Seattle, Washington