Cloudflare, the leading content delivery network and cloud security platform, wants to make AI accessible to developers. It has added GPU-powered infrastructure and model-serving capabilities to its edge network, bringing state-of-the-art foundation models to the masses. Any developer can tap into Cloudflare’s AI platform with a simple REST API call.
Cloudflare introduced Workers, a serverless compute platform at the edge, in 2017. Developers can use this serverless platform to create JavaScript Service Workers that run directly in Cloudflare’s edge locations around the world. With a Worker, a developer can modify a site’s HTTP requests and responses, make parallel requests, and even respond directly from the edge. Cloudflare Workers use an API that is similar to the W3C Service Workers standard.
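A minimal Worker illustrates the model. This is a sketch using the standard module-style fetch handler; the route, response body, and header name are illustrative:

```javascript
// Minimal Cloudflare Worker sketch: respond directly from the edge,
// or pass the request to the origin and modify the response.
const worker = {
  async fetch(request) {
    const url = new URL(request.url);
    if (url.pathname === "/hello") {
      // Respond from the edge without contacting an origin server.
      return new Response("Hello from the edge!", {
        headers: { "content-type": "text/plain" },
      });
    }
    // Otherwise forward the request to the origin and tag the response
    // with an illustrative header before returning it.
    const response = await fetch(request);
    const modified = new Response(response.body, response);
    modified.headers.set("x-edge-worker", "1");
    return modified;
  },
};

export default worker;
```

The `fetch(request, ...)` handler mirrors the event-driven shape of the W3C Service Workers API, which is why the code reads like an in-browser service worker running on the server side.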
The rise of generative AI prompted Cloudflare to augment its Workers with AI capabilities. The platform has three new elements to support AI inference:
- Workers AI runs on NVIDIA GPUs across Cloudflare’s global network, bringing the serverless model to AI inference: users pay only for what they use, so they can spend less time on infrastructure management and more time on their applications.
- Vectorize, a vector database, provides easy, fast, and cost-effective vector indexing and storage, supporting use cases that need hosted models combined with an organization’s own data.
- AI Gateway enables organizations to cache, rate limit, and monitor their AI deployments regardless of the hosting environment.
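From a Worker, inference against Workers AI is a single call on a binding. The following is a hedged sketch: it assumes a Workers AI binding named `AI` has been configured in `wrangler.toml`, and uses the Llama 2 model from the catalog; the request shape is illustrative.

```javascript
// Sketch of a Worker invoking Workers AI through its binding.
// Assumes wrangler.toml declares an AI binding named "AI".
const worker = {
  async fetch(request, env) {
    // Read the prompt from the incoming JSON request body.
    const { prompt } = await request.json();
    // env.AI.run() executes the model on GPUs in Cloudflare's network;
    // there is no infrastructure for the developer to manage.
    const result = await env.AI.run("@cf/meta/llama-2-7b-chat-int8", {
      prompt,
    });
    return Response.json(result);
  },
};

export default worker;
```

Because the GPU fleet is behind the binding, the same code runs unchanged in any of Cloudflare's edge locations, and billing follows the serverless pay-per-use model described above.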
Cloudflare has partnered with NVIDIA, Microsoft, Hugging Face, Databricks, and Meta to bring GPU infrastructure and foundation models to its edge. The platform also hosts embedding models that convert text to vectors. Vectorize can store, index, and query those vectors to give LLMs additional context, reducing hallucinations in responses. AI Gateway adds observability, rate limiting, and caching of frequent queries, cutting costs while improving application performance.
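The retrieval flow described above — embed the query, look up similar vectors, and feed the matches to the LLM as context — can be sketched inside a Worker. This is a hypothetical sketch: it assumes an `AI` binding for Workers AI and a Vectorize index bound as `VECTORIZE_INDEX`, and in real code each match ID would be resolved back to its stored document text.

```javascript
// Sketch: embed a question, retrieve similar vectors from Vectorize,
// and prepend the matches as context for the LLM.
// Assumes bindings: AI (Workers AI) and VECTORIZE_INDEX (a Vectorize index).
async function answerWithContext(env, question) {
  // 1. Convert the question to a vector with a hosted embedding model.
  const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [question],
  });
  const queryVector = embedding.data[0];

  // 2. Find the closest stored vectors in the Vectorize index.
  const { matches } = await env.VECTORIZE_INDEX.query(queryVector, {
    topK: 3,
  });
  // For brevity this sketch uses the match IDs as context; a real
  // application would look up the document text behind each ID.
  const context = matches.map((m) => m.id).join("\n");

  // 3. Ask the LLM, grounding the answer in the retrieved context.
  return env.AI.run("@cf/meta/llama-2-7b-chat-int8", {
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
  });
}
```

Grounding the prompt in retrieved data this way is what lets the model answer from an organization's own content rather than inventing details.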
The model catalog for Workers AI boasts the most recent and some of the best foundation models. From Meta’s Llama 2 to Stable Diffusion XL to Mistral 7B, it has everything developers need to build modern applications powered by generative AI.
Behind the scenes, Cloudflare uses ONNX Runtime, the open source Open Neural Network Exchange runtime led by Microsoft, to optimize model execution in resource-constrained environments. It’s the same technology that Microsoft relies on to run foundation models in Windows.
While developers can write AI inference code in JavaScript and deploy it to Cloudflare’s edge network, the models can also be invoked through a simple REST API from any language. This makes it easy to infuse generative AI into web, desktop, and mobile applications running in diverse environments.
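For runtimes outside Workers, the REST endpoint follows Cloudflare's public API shape, `accounts/{account_id}/ai/run/{model}`. The sketch below only builds the request; the account ID and token are placeholders, and the same call could be made with curl or any HTTP client.

```javascript
// Sketch: constructing a Workers AI REST API call from any runtime.
// accountId and apiToken are placeholders for real Cloudflare credentials.
function buildInferenceRequest(accountId, apiToken, model, prompt) {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      // The request body carries the prompt as JSON.
      body: JSON.stringify({ prompt }),
    },
  };
}

// Usage: const { url, init } = buildInferenceRequest(...);
//        const result = await fetch(url, init).then((r) => r.json());
```

Since the call is plain HTTPS with a bearer token, any language with an HTTP client can use the same models without deploying a Worker at all.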
Workers AI initially launched in September 2023 with inference capabilities in seven cities. Cloudflare’s stated goal was to support Workers AI inference in 100 cities by the end of that year, with near-ubiquitous coverage by the end of 2024.
Cloudflare is among the first CDN and edge network providers to enhance its edge with AI capabilities, through GPU-powered Workers AI, a vector database, and an AI Gateway for managing AI deployments. Partnering with tech giants such as Meta and Microsoft, it offers a broad model catalog and ONNX Runtime optimization.