Cloud Intelligence at the speed of 5000 tok/s - with Ce Zhang and Vipul Ved Prakash of Together AI

FromLatent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Start listening View podcast show

Cloud Intelligence at the speed of 5000 tok/s - with Ce Zhang and Vipul Ved Prakash of Together AI

FromLatent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

ratings:

Length:

63 minutes

Released:

Feb 8, 2024

Format:

Podcast episode

Description

Our first ever demo day aimed for 15-20 people and ended up ballooning to >200 and covered in the news. We are now running the 2024 edition in SF on Feb 23: Latent Space Final Frontiers, a startup and research competition in “The Autonomous Workforce”, ”Beyond Transformers & GPUs”, and “Embodied AI”. RSVP here! You can find all LS online/IRL events on our new calendar. Super Early Bird tickets have just gone on sale for AI Engineer World’s Fair, June 25-27!Today we have the honor of hosting two of Together AI’s co-founders: Ce Zhang (CTO) and Vipul Ved Prakash (CEO). This is a rare opportunity to recap the history of the company since our last check-in with Tri Dao (Chief Scientist), some of their big releases, and do a deep dive into the state of the AI inference market. Together has emerged as one of the most consequential new startups in the new AI summer, last announcing a ~$100m Series A raise in November (at a ~$360-565m valuation). But there are at least three Togethers - Together the Research Lab, Together the Fine Tuning & Inference platform, and Together the custom models service. As we clarify on the pod, the overarching philosophy of Together is the ability to improve on all these fronts simultaneously by being “full stack”, from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms.Bringing Research and Industry TogetherIn just one year, Together has been behind some of the most exciting research in AI:* RedPajama, a fully open source dataset for model pre-training which mirrored the Llama1 recipe. Then followed by RedPajama2, a 30T tokens dataset of filtered and de-duplicated tokens. * RedPajama-INCITE-3B and 7B, which were SOTA in a few benchmarks at the time of release. * FlashAttention-2, developed by Together’s Chief Scientist Tri Dao. We covered FA-2 in a previous episode with him.* Mamba-3B, the most promising transformer-alternative model that they released in collaboration with Cartesia. * StripedHyena, a SOTA graft of Hyena state space models and transformer models together* Medusa, an alternative to speculative decoding that lets you use multiple decoding heads instead of a draft model. * MonarchMixer, which was one of the most popular orals at NeurIPS 2023. It’s an approach to transformers that replaces many of its core parts with Monarch matrices for better computational efficiency. And I’m sure we missed something! As Vipul reveals, almost 50% of Together staff is researchers, and two of their co-founders (Chris Ré and Percy Liang) are professors at Stanford, so we can expect a lot more here.Bringing “Disaggregated” GPUs TogetherOn their cloud, they offer inference as a service, fine-tuning, pre-training, etc, but unlike other providers they think of themselves as a disaggregated cloud. Today, they have ~8,000 A100 and H100 GPUs on their platform (an exclusive revealed on the pod!) totaling over 20 exaflops of compute, but instead of just buying more and putting them in a cluster and then exposing a `us-east-1` option for customers, they are taking heterogenous compute sources and adding a unified layer on top of it for developers to consume. Building on Ce’s research, Together’s GPU Clusters are taking on comparable AWS and GCP offerings in both cost and speed:Take the Hessian AI center in Germany or the DoE’s INCITE; they have GPUs that they want to share with researchers, but they lack the cloud layer over it. Similarly, there’s starting to be more and more differentiation amongst types of GPUs: H100s, A100s, MI3000s, etc. Each of them has different availability and performance based on task, and the end user shouldn’t have to be an hardware expert to run inference on a model, so Together abstracts a lot of that away.A big theme of the Together inference stack, a “bag of 50 tricks” that we discuss on the pod, is also “hardware-aware” algorithms like FlashAttention and Mamba, which further emphasiz

Released:

Feb 8, 2024

Format:

Podcast episode

Titles in the series (67)

The podcast by and for AI Engineers! We are the first place over 50k developers hear news and interviews about Software 3.0 - Foundation Models changing every domain in Code Generation, Computer Vision, AI Agents, and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you both the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from tiny (George Hotz), Databricks, Glean, Replit, Roboflow, MosaicML, UC Berkeley, OpenAI, and more. www.latent.space

Skip carousel

More Episodes from Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Skip carousel

Related podcast episodes

Skip carousel

Discover this podcast and so much more

Cloud Intelligence at the speed of 5000 tok/s - with Ce Zhang and Vipul Ved Prakash of Together AI

Cloud Intelligence at the speed of 5000 tok/s - with Ce Zhang and Vipul Ved Prakash of Together AI

Description

Titles in the series (67)

More Episodes from Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Related podcast episodes