Network Engineer



Sunnyvale, CA, USA · San Diego, CA, USA · Toronto, ON, Canada
Posted on Friday, April 19, 2024

Cerebras Systems has pioneered a groundbreaking chip and system that revolutionizes deep learning applications. Our system empowers ML researchers to achieve unprecedented speeds in training and inference workloads, propelling AI innovation to new horizons.

The Condor Galaxy 1 (CG-1), unveiled in a recent announcement, stands as a testament to Cerebras' commitment to pushing the boundaries of AI computing. With 4 ExaFLOPS of processing power, 54 million cores, and a 64-node architecture, the CG-1 is the first of nine supercomputers to be built and operated through an exclusive partnership between Cerebras and G42. This strategic collaboration aims to redefine the possibilities of AI by creating a network of interconnected supercomputers that will collectively deliver 36 ExaFLOPS of AI compute power upon completion in 2024.

The Role

We need to build and evolve our network infrastructure to scale to this AI compute power. We need to ensure that the network runs smoothly and meets the stringent performance and availability requirements of RDMA workloads that expect a lossless fabric interconnect. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric and host networking, communication libraries, and scheduling infrastructure.

  • Design, develop, test, and operate networking systems to support large-scale AI training and inference jobs.
  • Develop and deploy a range of technologies and network topologies to evolve and scale our AI networks.
  • Work closely with our hardware, software, and sourcing teams to develop new networking solutions and influence the future of networking and its associated infrastructure.
  • Define and develop optimized network monitoring systems.
  • Build software models of network architecture, ML applications, and other building blocks to realistically simulate end-to-end performance at scale.
  • Participate in on-call rotations to learn from real-world production challenges and apply those lessons to improve current and future-generation products.
Minimum Qualifications
  • Degree in engineering or a related technical discipline, or equivalent experience.
  • Expert knowledge of InfiniBand (IB), RDMA, and RoCE networks.
  • 4+ years of experience working on networks supporting large-scale training workloads.
  • Understanding of RDMA congestion control mechanisms on IB and RoCE networks.
  • Hands-on experience working with state-of-the-art network equipment and vendors (e.g. Broadcom, Mellanox).
  • Experience coding in languages such as Python, C++, or Go.
  • Experience with network automation software that leverages software-defined networking principles.
  • Experience developing or modifying network telemetry and automation tools to make efficient use of infrastructure and resources for performance, operation, testing, and incident management.
  • Experience designing, deploying, and operating networks at scale, such as building and managing large-scale data center networks or building transports for large-scale networks.
Preferred Qualifications
  • MS or PhD in Computer Science or Computer Engineering with a networking focus.
  • Understanding of routing and switching, including hardware design and knowledge of forwarding and data planes.
  • Understanding of AI training workloads and the demands they place on networks.
Why Join Cerebras

People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:

  1. Build a breakthrough AI platform beyond the constraints of the GPU
  2. Publish and open source their cutting-edge AI research
  3. Work on one of the fastest AI supercomputers in the world
  4. Enjoy job stability with startup vitality
  5. Enjoy a simple, non-corporate work culture that respects individual beliefs

Read our blog: Five Reasons to Join Cerebras in 2024.

Apply today and join the forefront of groundbreaking advancements in AI.

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.
