AI Compute Infrastructure Engineer - Internship

Cerebras

Software Engineering, Other Engineering, Data Science
Sunnyvale, CA, USA
Posted on Friday, June 14, 2024

Cerebras Systems has pioneered a groundbreaking chip and system that revolutionizes deep learning applications. Our system empowers ML researchers to achieve unprecedented speeds in training and inference workloads, propelling AI innovation to new horizons.

Condor Galaxy 1 (CG-1) is a supercomputer set to revolutionize the world of artificial intelligence. With an astounding 4 ExaFLOPs of processing power, 54 million cores, and a cutting-edge 64-node architecture, CG-1 is the first milestone of a larger project that will redefine the possibilities of AI.

About The Role

As a Software Quality Engineer on our team, you will use your knowledge to influence better software design, bug-prevention strategies, testability, scalability, and other advanced quality practices. This position plays a major role in the quality of Cerebras software. We are looking for engineers with a broad set of technical skills who are ready to tackle the biggest at-scale problems in hardware-based deep learning accelerators.

The successful completion and deployment of CG-1, the first of nine powerful supercomputers, is a significant achievement for Cerebras. As we enter phase 2 of the project with CG-2, we are taking a bold step toward creating a network of interconnected supercomputers that will collectively deliver a mind-boggling 36 ExaFLOPs of AI compute power upon completion.

Cerebras is building a team of exceptional people to work together on big problems. Join us!

Responsibilities

  • Monitor and oversee CG health to ensure stability and security
  • Manage and customize Kubernetes, cluster, and cloud features on CGs
  • Provide solutions to ML users using tools and components available in a vast Linux-based ecosystem: compute, storage, and networking
  • Configure, deploy, and debug container-based services on orchestration platforms like Kubernetes
  • Provide 24/7 monitoring and support, using both automated tools and hands-on manual troubleshooting
  • Support training and inference workloads in the data center, including LLMs (50B to 500B parameter models), multi-modal models, Mistral, etc.
  • Adapt and make progress in a fast-paced and constantly evolving environment
  • Document the processes and procedures needed to efficiently operate CGs
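To give a flavor of the automated monitoring work described above, here is a minimal, hypothetical sketch of a node-health check. The data shape mimics the `status.conditions` list that `kubectl get nodes -o json` returns; the function name and the sample node names are illustrative, not part of any Cerebras tooling.

```python
def unhealthy_nodes(nodes):
    """Return names of nodes that are not Ready or are under resource pressure.

    Each node is a dict with a "name" and a "conditions" list shaped like
    Kubernetes node status conditions: [{"type": ..., "status": ...}, ...].
    """
    flagged = []
    for node in nodes:
        conditions = {c["type"]: c["status"] for c in node["conditions"]}
        # A healthy node reports Ready == "True" and no pressure conditions.
        not_ready = conditions.get("Ready") != "True"
        pressured = any(
            conditions.get(t) == "True"
            for t in ("MemoryPressure", "DiskPressure", "PIDPressure")
        )
        if not_ready or pressured:
            flagged.append(node["name"])
    return flagged


# Illustrative sample data (node names are made up):
nodes = [
    {"name": "cg1-node-01", "conditions": [{"type": "Ready", "status": "True"}]},
    {"name": "cg1-node-02", "conditions": [{"type": "Ready", "status": "True"},
                                           {"type": "MemoryPressure", "status": "True"}]},
    {"name": "cg1-node-03", "conditions": [{"type": "Ready", "status": "False"}]},
]
print(unhealthy_nodes(nodes))  # ['cg1-node-02', 'cg1-node-03']
```

In practice a check like this would pull live node status from the cluster API and feed an alerting pipeline rather than print to stdout.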

Requirements

  • BS or MS in CS/EE
  • Relevant experience managing and maintaining compute infrastructure
  • Proficiency with Python and other common programming languages
  • Experience with container orchestration and workload scheduling platforms like Kubernetes and Slurm
  • Familiarity with ML frameworks like PyTorch, TensorFlow, etc.
  • Strong knowledge of and demonstrated experience with:
    - Linux-based ecosystems
    - Cloud infrastructure design, deployment, and maintenance

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.

