Building an LLM pre-training data pipeline in Rust at Aleph Alpha

By Andreas Hartel

Talk - Wednesday, 29 May

At Aleph Alpha, we have built a scalable LLM pre-training data pipeline in Rust to collect trillions of tokens for training our models. This talk highlights some of the internal technology choices we made and the challenges we addressed. The pipeline was built on Linux and ran on a Kubernetes cluster with a Grafana/Loki/Prometheus observability stack, a PostgreSQL instance, and a RabbitMQ instance; its infrastructure requirements are therefore modest. Because it was written in Rust, it ran very stably and performed well once it had passed our CI pipeline.
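To give a flavor of the worker-queue style such a pipeline implies, here is a minimal, self-contained Rust sketch of one processing stage. It is an illustration only, not Aleph Alpha's actual code: standard-library channels stand in for RabbitMQ, a naive whitespace split stands in for real tokenization, and all names (`run_stage`, etc.) are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical pipeline stage: fan documents out to worker threads and
// collect per-document token counts. In a real deployment the work items
// would arrive via RabbitMQ and results would be recorded elsewhere
// (e.g. PostgreSQL); std channels keep this sketch self-contained.
fn run_stage(docs: Vec<String>) -> usize {
    let (tx, rx) = mpsc::channel();
    let workers: Vec<_> = docs
        .into_iter()
        .map(|doc| {
            let tx = tx.clone();
            thread::spawn(move || {
                // Naive whitespace "tokenization" as a placeholder for a
                // real tokenizer.
                let tokens = doc.split_whitespace().count();
                tx.send(tokens).unwrap();
            })
        })
        .collect();
    drop(tx); // drop the original sender so the channel closes when workers finish
    for w in workers {
        w.join().unwrap();
    }
    rx.iter().sum() // total token count across all documents
}

fn main() {
    let docs = vec!["hello world".to_string(), "rust is fast".to_string()];
    println!("total tokens: {}", run_stage(docs));
}
```

The appeal of this shape, which the talk's stack hints at, is that each stage is an independent consumer of a queue, so scaling out is a matter of running more replicas on the cluster.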


Andreas Hartel

Andreas is a Senior Software Engineer at Aleph Alpha. He has been with Aleph Alpha for more than two years and has helped build the Aleph Alpha LLM API, the underlying inference stack, and the pre-training data pipeline.
Andreas holds a PhD in physics and previously worked for four years as a C++ engineer on SAP’s HANA database.