At Aleph Alpha, we have built a scalable LLM pre-training data pipeline in Rust to collect trillions of tokens for training our models. This talk highlights some of the internal technology choices we made and the challenges we addressed. The pipeline was built on Linux and ran on a Kubernetes cluster with a Grafana/Loki/Prometheus observability stack, a PostgreSQL instance, and a RabbitMQ instance, so its infrastructure requirements are modest. Because it was built in Rust, it ran very stably and with great performance once it had passed our CI pipeline.
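To give a flavor of what a pipeline stage in Rust can look like, here is a minimal, hypothetical fan-out/fan-in sketch: one worker per document, per-document token counts collected over a channel and summed. It is an illustration only, not the actual pipeline code; whitespace splitting stands in for a real tokenizer, and an in-memory `Vec` stands in for a RabbitMQ work queue.

```rust
use std::sync::mpsc;
use std::thread;

/// Fan-out/fan-in sketch: spawn one worker per document, collect the
/// per-document token counts over a channel, and sum them.
/// (Hypothetical illustration; whitespace splitting stands in for a
/// real tokenizer, and a Vec stands in for a RabbitMQ work queue.)
fn parallel_token_count(docs: Vec<String>) -> usize {
    let (tx, rx) = mpsc::channel::<usize>();
    let mut handles = Vec::new();

    for doc in docs {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // Each worker tokenizes its document independently.
            let n = doc.split_whitespace().count();
            tx.send(n).unwrap();
        }));
    }
    drop(tx); // close our sender so rx.iter() terminates

    let total = rx.iter().sum();
    for h in handles {
        h.join().unwrap();
    }
    total
}

fn main() {
    let docs = vec!["first raw document".into(), "second one".into()];
    println!("total tokens: {}", parallel_token_count(docs));
    // prints "total tokens: 5"
}
```

A pattern like this scales out naturally: the same worker logic can run in many Kubernetes pods, each pulling work items from a shared queue instead of a local `Vec`.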