Job details

MLOps Professional Services Engineer (Cloud & AI Infra)

About the Company

Our client is at the forefront of the AI revolution, providing cutting-edge infrastructure that's reshaping the landscape of artificial intelligence. They offer an AI-centric cloud platform that empowers Fortune 500 companies, top-tier innovative startups, and AI researchers to drive breakthroughs in AI. This publicly traded company is committed to building full-stack infrastructure to service the explosive growth of the global AI industry, including large-scale GPU clusters, cloud platforms, tools, and services for developers.

  • Company Type: Publicly traded

  • Product: AI-centric GPU cloud platform & infrastructure for training AI models

  • Candidate Location: Remote anywhere in the US

Their mission is to democratize access to world-class AI infrastructure, enabling organizations of all sizes to turn bold AI ambitions into reality. At the core of their success is a culture that celebrates creativity, embraces challenges, and thrives on collaboration.

The Opportunity

As an MLOps Professional Services Engineer (Remote), you’ll play a key role in designing, implementing, and maintaining large-scale machine learning (ML) training and inference workflows for clients. Working closely with a Solutions Architect and support teams, you’ll provide expert, hands-on guidance to help clients achieve optimal ML pipeline performance and efficiency. 

What You'll Do

  • Design and implement scalable ML training and inference workflows using Kubernetes and Slurm, focusing on containerization (e.g., Docker) and orchestration.

  • Optimize ML model training and inference performance with data scientists and engineers

  • Develop and expand a library of ready-to-deploy, standardized training and inference solutions by designing, deploying, and managing Kubernetes and Slurm clusters for large-scale ML training

  • Integrate Kubernetes and Slurm with popular ML frameworks like TensorFlow, PyTorch, or MXNet, ensuring seamless execution of distributed ML training workloads

  • Develop monitoring and logging tools to track distributed training performance, identify bottlenecks, and troubleshoot issues

  • Create automation scripts and tools to streamline ML training workflows, leveraging technologies like Ansible, Terraform, or Python

  • Participate in industry conferences, meetups, and online forums to stay up-to-date with the latest developments in MLOps, Kubernetes, Slurm, and ML

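To make the workflow-design responsibilities above concrete: the sketch below builds a Kubernetes `batch/v1` Job manifest for a containerized single-node GPU training run as a plain Python dict. The image name, job name, and training command are placeholders; in practice the manifest would be serialized to YAML and submitted with `kubectl apply` or the Kubernetes Python client.

```python
# Minimal sketch: a Kubernetes batch/v1 Job manifest for a containerized
# GPU training run, built as a plain Python dict. Image, name, GPU count,
# and command are all hypothetical placeholders.

def training_job_manifest(name: str, image: str, gpus: int, command: list[str]) -> dict:
    """Build a batch/v1 Job manifest for one GPU training pod."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # fail fast; retries are handled upstream
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "command": command,
                        "resources": {
                            # GPU request via the NVIDIA device plugin resource name
                            "limits": {"nvidia.com/gpu": str(gpus)},
                        },
                    }],
                },
            },
        },
    }

manifest = training_job_manifest(
    "bert-finetune", "registry.example.com/train:latest", 8,
    ["python", "train.py", "--epochs", "3"],
)
print(manifest["spec"]["template"]["spec"]["containers"][0]["resources"])
```

A dict-based manifest like this is easy to template per client and to round-trip through YAML for version control.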

What You Bring

  • At least 3 years of experience in MLOps, DevOps, or a related field

  • Strong experience with Kubernetes and containerization (e.g., Docker)

  • Experience with cloud providers like AWS, GCP, or Azure

  • Familiarity with Slurm or other distributed computing frameworks

  • Proficiency in Python, with experience in ML frameworks such as TensorFlow, PyTorch, or MXNet

  • Knowledge of ML model serving and deployment

  • Familiarity with CI/CD pipelines and tools like Jenkins, GitLab CI/CD or CircleCI

  • Experience with monitoring and logging tools like Prometheus, Grafana or ELK Stack 

  • Solid understanding of distributed computing principles, parallel processing, and job scheduling

  • Experience with automation tools like Ansible, Terraform

Key Attributes for Success

  • Passion for AI and transformative technologies

  • A genuine interest in optimizing and scaling ML solutions for high-impact results

  • Results-driven mindset and problem-solver mentality

  • Adaptability and ability to thrive in a fast-paced startup environment

  • Comfortable working with an international team and diverse client base

  • Communication and collaboration skills, with experience working in cross-functional teams

Why Join?

  • Competitive compensation: $130,000-$175,000 (negotiable based on experience and skills)

  • Full medical and life insurance benefits: 100% coverage for health, vision, and dental insurance for employees and their families

  • 401(k) match program with up to a 4% company match

  • PTO and paid holidays 

  • Flexible remote work environment

  • Reimbursement of up to $85/month for mobile and internet

  • Work with state-of-the-art AI and cloud technologies, including the latest NVIDIA GPUs (H100, L40S, with H200 and Blackwell chips coming soon)

  • Be part of a team that operates one of the most powerful commercially available supercomputers

  • Contribute to sustainable AI infrastructure with energy-efficient data centers that recover waste heat to warm nearby residential buildings

Interviewing Process

  • Level 1: Virtual interview with the Talent Acquisition Lead (General fit, Q&A)

  • Level 2: Virtual interview with the Hiring Manager (Skills assessment)

  • Level 3: Interview with a C-level executive (Final round)

  • Reference and Background Checks: Conducted post-interviews

  • Offer: Extended to the selected candidate

We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of race, color, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity, or expression, sexual orientation, or any other characteristic protected by applicable federal, state or local law.


What You Should Know About MLOps Professional Services Engineer (Cloud & AI Infra), Lavendo

Are you ready to dive into the exciting world of MLOps with a cutting-edge company that's reshaping the AI landscape? Our client, a publicly traded firm based in San Francisco, is on a mission to democratize access to world-class AI infrastructure. The MLOps Professional Services Engineer role offers a remarkable opportunity for innovation as you design and implement scalable ML training and inference workflows to help clients, from Fortune 500 companies to innovative startups. In this remote position, you'll collaborate closely with Solutions Architects and support teams, ensuring optimal ML pipeline performance. Picture yourself working with technologies like Kubernetes, Docker, and Slurm, all while being part of a culture that values creativity and embraces challenges. Not only will you develop a library of robust training solutions, but you'll also gain exposure to the latest in cloud services, engage in industry events, and contribute to sustainable AI infrastructure. With a competitive salary range of $130,000 to $175,000 plus a comprehensive benefits package, this role offers not just a job but a pathway to influence the future of AI.

Frequently Asked Questions (FAQs) for MLOps Professional Services Engineer (Cloud & AI Infra) Role at Lavendo
What does an MLOps Professional Services Engineer do at this company?

An MLOps Professional Services Engineer at our company focuses on designing and implementing large-scale machine learning workflows. This includes optimizing ML model performance, managing Kubernetes and Slurm clusters, and working closely with clients to ensure their ML pipelines are both efficient and effective.

What qualifications are required for an MLOps Professional Services Engineer?

To excel as an MLOps Professional Services Engineer, candidates should have at least 3 years of experience in MLOps or a related field, strong expertise with Kubernetes and Docker, and familiarity with cloud providers such as AWS, GCP, and Azure. Knowledge of Slurm and experience with Python and ML frameworks are also key.

Is this MLOps Professional Services Engineer position remote?

Yes, this position for an MLOps Professional Services Engineer is fully remote, allowing you to work from anywhere in the US while being part of our dynamic team that's leading the charge in AI infrastructure.

What technologies will I work with as an MLOps Professional Services Engineer?

As an MLOps Professional Services Engineer, you will engage with a variety of cutting-edge technologies including Kubernetes, Docker, Slurm, and popular ML frameworks like TensorFlow and PyTorch, along with cloud services from major providers.

What is the salary range for the MLOps Professional Services Engineer role?

The salary for the MLOps Professional Services Engineer role ranges from $130,000 to $175,000, depending on experience and skills, along with a competitive benefits package.

What benefits accompany the MLOps Professional Services Engineer position?

Benefits for the MLOps Professional Services Engineer position include full medical coverage, a 401(k) match program, flexible remote work options, reimbursement for mobile and internet expenses, and an inclusive culture that promotes professional growth.

How can I stay updated on the latest technologies in MLOps as an MLOps Professional Services Engineer?

As an MLOps Professional Services Engineer, you can participate in industry conferences, meetups, and online forums, which are great platforms to stay informed about the latest developments in MLOps and cutting-edge technologies.

Common Interview Questions for MLOps Professional Services Engineer (Cloud & AI Infra)
Can you explain your experience with Kubernetes and how it applies to MLOps?

In interviews, describe specific projects where you've deployed Kubernetes for MLOps solutions. Highlight how you've used Kubernetes to automate ML workflows, manage container orchestration, and ensure scalable model training.

What is Slurm, and how have you utilized it in your previous roles?

Discuss your experience with Slurm in managing batch processing and scheduling for distributed ML workloads. Provide details about specific challenges you faced and how Slurm helped optimize resource management in those scenarios.
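To ground this kind of answer, you could walk through a script like the one below: a hypothetical Python helper that renders a Slurm batch file for a multi-node GPU run. `--nodes`, `--ntasks-per-node`, `--gres`, and `srun` are standard Slurm constructs; every value here is a placeholder.

```python
# Hypothetical helper that renders a Slurm sbatch script for a
# multi-node distributed training run; all values are placeholders.

def render_sbatch(job_name: str, nodes: int, gpus_per_node: int,
                  walltime: str, train_cmd: str) -> str:
    """Render an sbatch script requesting GPUs across several nodes."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # one task per GPU
        f"#SBATCH --gres=gpu:{gpus_per_node}",         # GPUs per node
        f"#SBATCH --time={walltime}",
        "",
        f"srun {train_cmd}",  # srun launches one task per allocated slot
    ]
    return "\n".join(lines)

script = render_sbatch("llm-pretrain", 4, 8, "24:00:00",
                       "python train.py --distributed")
print(script)
```

Generating batch scripts from code like this is one way to standardize submissions across clients instead of hand-editing them per job.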

How do you ensure the performance of ML models during training and inference?

Explain your approach to monitoring model performance through logging and debugging tools. Mention any tools you've implemented, like Prometheus or Grafana, and how your methodologies have led to improved model efficiency.
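One concrete talking point is straggler detection from per-step timings: the illustrative sketch below flags training steps slower than 1.5x the median step time, the kind of rule a Prometheus alert or Grafana panel might encode. The threshold and timing data are made up.

```python
# Illustrative sketch: flag training-step stragglers from per-step
# wall-clock durations -- the kind of signal a monitoring dashboard
# would surface. The 1.5x-median threshold is arbitrary.
from statistics import median

def find_stragglers(step_seconds: list[float], factor: float = 1.5) -> list[int]:
    """Return indices of steps slower than `factor` x the median step time."""
    if not step_seconds:
        return []
    cutoff = factor * median(step_seconds)
    return [i for i, s in enumerate(step_seconds) if s > cutoff]

timings = [0.92, 0.95, 0.91, 2.40, 0.94]  # step 3 stalls (e.g., a slow data shard)
print(find_stragglers(timings))  # → [3]
```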

Describe a challenging ML pipeline issue you encountered and how you resolved it.

Be prepared to recount a specific situation where you faced a challenge in an ML pipeline. Discuss your problem-solving process, the teamwork involved, and the successful outcomes from your actions.

Can you describe your automation experience using tools like Ansible or Terraform?

In your answer, share examples from projects where you utilized Ansible or Terraform to automate deployment processes. Emphasize the benefits that automation brought to your workflows, such as increased efficiency and reduced errors.

How do you integrate ML frameworks like TensorFlow or PyTorch with cloud services?

Discuss how you've set up environments using these frameworks on cloud platforms. Provide insights on the challenges of model deployment and how you ensured seamless integration and scalability.

What strategies do you implement for CI/CD pipelines in ML projects?

Describe your experiences with tools such as Jenkins or GitLab CI/CD to streamline ML model deployment. Highlight your understanding of the CI/CD cycle and how you've incorporated ML-specific practices to enhance deployment efficiency.
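A concrete ML-specific practice worth describing is a promotion gate: block deployment unless the candidate model beats the current baseline on held-out metrics while staying inside a latency budget. The sketch below uses made-up metric names and thresholds.

```python
# Minimal sketch of an ML-specific CI/CD gate: approve promotion only if
# the candidate model improves accuracy and stays within a latency budget.
# Metric names and thresholds are illustrative.

def promotion_gate(candidate: dict, baseline: dict,
                   min_gain: float = 0.0, max_latency_ms: float = 50.0) -> bool:
    """Approve deployment only if accuracy improves and latency fits the budget."""
    gained = candidate["accuracy"] - baseline["accuracy"] >= min_gain
    fast_enough = candidate["p95_latency_ms"] <= max_latency_ms
    return gained and fast_enough

baseline = {"accuracy": 0.912, "p95_latency_ms": 41.0}
candidate = {"accuracy": 0.918, "p95_latency_ms": 44.0}
print(promotion_gate(candidate, baseline))  # → True
```

In a pipeline, a check like this would run as a stage after evaluation, failing the build when it returns False.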

How do you stay current with the latest trends in MLOps?

Share your habits for staying updated, such as following industry blogs, attending webinars, and joining online MLOps communities. Express your passion for continuous learning in the ever-evolving AI sector.

What is your experience with monitoring and logging tools in MLOps?

Detail your experience with monitoring solutions like ELK Stack or Grafana. Discuss how these tools helped you gather insights on model performance and troubleshoot issues effectively.

Describe a successful collaboration with cross-functional teams in an MLOps project.

Outline a specific collaborative project, discussing your role, the teams involved, and the results achieved. Highlight the importance of communication and teamwork in reaching project goals.

Employment Type: Full-time, remote
Date Posted: November 24, 2024