Site Reliability Engineer

Voltage Park’s mission is to make AI infrastructure accessible to all. Today, we own 24,000+ H100s and operate 7+ data centers across the US. We serve customers of all sizes, from small research labs to large enterprises. As part of this effort, we’re hiring a Site Reliability Engineer to be responsible for building out and operating our core infrastructure, including bare metal provisioning, telemetry, storage, and container / VM orchestration.

To succeed in this role, you will need to be comfortable owning the care and feeding of thousands of GPU servers and related support infrastructure, including logging, analytics, automations, testing, and SOPs. You’ll play a pivotal role as a member of the team, responsible for bringing a substantial amount of infrastructure online across multiple data centers. You’ll also have an important role in defining the company’s culture and ensuring mission success.

This is a fully remote role; however, some overlap with core PST work hours is required. You must be located in the United States, and we are unable to provide visa sponsorship at this time.

Responsibilities

  • At the direction of the Manager of Site Reliability Engineering, design, build, and roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features.

  • Deploy updates and improvements to support both Voltage Park’s internal and end customer use cases.

  • Collaborate with colleagues in network engineering, software development, and customer support in a flat organization.

  • Participate in the SRE on-call rotation (1 week on, 5+ weeks off).

Qualifications

  • 8+ years working with Linux as a server / hosting platform; extra points for Ubuntu experience.

  • 5+ years of experience with AWS.

  • 2+ years of experience with Kubernetes and strong container fundamentals.

  • 2+ years of experience with Terraform and Ansible.

  • 2+ years with network attached storage management (via NFS, Ceph, or other protocols). Extra points for experience with VAST storage systems.

  • Experience working in a Slack-first, asynchronous remote work environment.

  • Experience with monitoring systems (Prometheus, ELK stack).

  • Familiarity with the GitOps workflow.

  • Software development experience using Python, Go, Bash, or other languages for the purposes of automation and connecting systems and APIs together.

  • Deep networking fundamentals; extra points for experience with datacenter-level networks, 400Gb Ethernet, and InfiniBand.

  • Experience architecting, building, and delivering complex systems from 0 to 1.

  • Adept at balancing pragmatic development and ideal architectures. Effective at navigating tradeoffs between design, risk, cost, and outcomes.

  • Comfortable with navigating ambiguity.

  • Strong written and oral communication.

Ideal Experiences

  • Experience with bare metal hardware troubleshooting and provisioning; extra points for working with Dell hardware.

  • Experience with GPU servers, either in bare metal form or under virtualization.

  • Deep experience with network switches, routers, and firewalls, particularly SONiC switches and Palo Alto firewalls.

  • Experience with VAST storage systems.

Culture

  • You enjoy working with a small group of friendly, highly motivated, execution-focused colleagues.

  • You’re comfortable with a high degree of autonomy. We expect you to independently prioritize your work and understand how it maps to the overall needs and goals of the company.

  • You’re knowledgeable in your domain but also enjoy wearing multiple hats and venturing outside of your comfort zone when the need arises.

  • You value the ability to write well and understand the importance of good documentation.


Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.

Average salary estimate

$135,000 / year (est.)
Min: $120,000
Max: $150,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Site Reliability Engineer, Voltage Park

At Voltage Park, we believe in making AI infrastructure available to everyone, and as a Site Reliability Engineer, you will be at the heart of this mission! You'll join a passionate team responsible for managing our vast infrastructure that includes over 24,000 H100 GPUs and multiple data centers throughout the US. Your role will involve everything from building out core infrastructure to optimizing operations related to bare metal provisioning, telemetry, and container orchestration. With over 8 years of Linux experience and a solid grasp of AWS, Kubernetes, and innovative automation tools like Terraform and Ansible, you’ll ensure our systems run like a well-oiled machine. You'll collaborate closely with diverse teams including network engineers and developers, helping to shape the company's culture and mission success. Your work will be pivotal in keeping our infrastructure robust and ready to tackle the demands of our clients, ranging from research labs to massive enterprises. If you're someone who thrives on autonomy and loves the challenge of working in a fast-paced, remote environment—this opportunity is for you! And while this is a fully remote position, we do ask for some overlap with Pacific Standard Time. Join us at Voltage Park and help push the boundaries of AI infrastructure!

Frequently Asked Questions (FAQs) for Site Reliability Engineer Role at Voltage Park
What does a Site Reliability Engineer do at Voltage Park?

A Site Reliability Engineer at Voltage Park is responsible for designing, building, and maintaining our core infrastructure. This includes everything from bare metal provisioning to telemetry and container orchestration. In this role, you will tackle challenges related to running thousands of GPU servers, ensuring our systems are optimized for both internal and external users.

What qualifications are needed for the Site Reliability Engineer position at Voltage Park?

Candidates for the Site Reliability Engineer position at Voltage Park should have at least 8 years of experience with Linux systems and at least 5 years of experience with AWS. Familiarity with Kubernetes, Terraform, Ansible, and NAS management is crucial, along with a solid grounding in networking fundamentals. A good mix of technical skills and soft skills is essential.

Is the Site Reliability Engineer role at Voltage Park remote?

Yes, the Site Reliability Engineer role at Voltage Park is fully remote! However, some overlap with core Pacific Standard Time work hours is required to facilitate collaboration with the team.

What kind of team will I be working with as a Site Reliability Engineer at Voltage Park?

At Voltage Park, you will be part of a small, friendly, and highly motivated team. Our flat organizational structure means you’ll collaborate closely with other teams including network engineering, software development, and customer support, fostering a seamless work environment.

How does Voltage Park ensure a good company culture for the Site Reliability Engineer role?

Voltage Park emphasizes a culture of autonomy, collaboration, and effective communication. As a Site Reliability Engineer, you’ll be valued for your knowledge and expected to actively contribute to both team goals and the broader company mission, which fosters an inclusive and engaging work environment.

What are the opportunities for growth in the Site Reliability Engineer role at Voltage Park?

The Site Reliability Engineer role at Voltage Park offers ample opportunities for professional development. You will be encouraged to learn new technologies, contribute to strategic projects, and enhance your skills in a supportive environment, positioning you for future leadership or specialized roles within the company.

What tools and technologies should I be familiar with as a Site Reliability Engineer at Voltage Park?

As a Site Reliability Engineer at Voltage Park, you should have extensive knowledge of Linux, AWS, Kubernetes, Terraform, and Ansible. Familiarity with monitoring systems such as Prometheus and ELK is also valuable, along with experience in software development, particularly using Python or Go for automation.

Common Interview Questions for Site Reliability Engineer
Can you describe your experience with Linux and how it relates to Site Reliability Engineering?

In interviews, emphasize your hands-on experience with Linux systems, particularly any challenges you’ve faced and overcome. Discuss how you've tailored configurations to improve system performance or security, and illustrate how this experience prepares you for the Site Reliability Engineer role at Voltage Park.

How have you utilized AWS in your previous roles?

Prepare to talk about specific AWS services you’ve used, the scale of your implementation, and how it benefitted the organization. Highlight your understanding of cost efficiency, security, and scalability, showing how your AWS expertise aligns with the responsibilities of the Site Reliability Engineer at Voltage Park.

What is your experience working with Kubernetes?

Describe your experience with Kubernetes, including deployments, monitoring, and orchestration of containerized applications. Offer examples of problems you've solved using Kubernetes, demonstrating your ability to leverage this technology effectively in the Site Reliability Engineer role.

How do you approach incident management and postmortems?

In your response, articulate your approach to incident detection, response protocols, and postmortem analysis. Highlight your commitment to continuous improvement and how you've implemented learnings from past incidents to prevent future occurrences, a key responsibility for any Site Reliability Engineer.

Can you give an example of automation you have created?

Share a specific instance where you developed automation scripts using tools like Terraform or Ansible. Discuss the challenges you faced and the impact this automation had on your previous projects, underscoring its relevance for the Site Reliability Engineer position at Voltage Park.

What strategies do you use for effective monitoring and alerting?

Discuss your experience with monitoring systems such as Prometheus and ELK. Elaborate on how you set up alerts that are informative without causing alert fatigue, which is vital for the responsibilities of a Site Reliability Engineer.

How would you prioritize tasks in a fast-paced environment?

Demonstrate your prioritization methods, perhaps using frameworks like the Eisenhower Matrix, and provide examples from past experiences where your prioritization led to successful outcomes. This will show your readiness for the fast-paced nature of the Site Reliability Engineer role.

What is your approach to documentation?

Expound on your belief in maintaining thorough documentation for processes, system architecture, and incident responses. Share how effective documentation supports team collaboration and knowledge sharing, key elements for a Site Reliability Engineer.

How do you handle ambiguity in a project?

Address how you navigate uncertainty, maintaining flexibility while keeping your focus on the end goals. Provide examples of situations where you managed unclear requirements or unexpected changes, emphasizing your problem-solving skills as a Site Reliability Engineer.

What motivates you in your work, particularly in Site Reliability Engineering?

Reflect on what drives you as a Site Reliability Engineer. Whether it’s the satisfaction of solving complex problems, collaborating with innovative teams, or contributing to cutting-edge AI infrastructure, this insight will help interviewers understand your passion and fit for Voltage Park.


Voltage Park is building a new class of cloud infrastructure from the ground up. Join us, we're hiring!

Employment type: Full-time, remote
Date posted: November 28, 2024
