DeepInfra
DeepInfra offers low-cost, production-ready infrastructure for deploying the latest state-of-the-art ML models quickly and cost-effectively. It is particularly suited to businesses that want to leverage AI models without extensive infrastructure setup, offering serverless GPUs for faster and cheaper deployment.
Key Features
- Rapid Experimentation: Quickly test different ideas against open-source LLMs.
- Low-Latency Inference: Cost-effective infrastructure for deploying ML models into production.
- Serverless Inference as a Service: Simplifies deployment of LLMs and embedding models.
- Transparent Pricing: Clear per-token pricing for text models, making budget planning straightforward.
- Beyond Text: DeepInfra is not limited to text models.
Advantages
- Cost-Effectiveness: Competitive pricing, significantly cheaper than OpenAI.
- Ease of Deployment: Supports both open-source and custom models with minimal setup.
- Flat Pricing: The cost depends on the model size, but it is a flat price per million tokens, the same for input and output. At OpenAI, output tokens cost roughly three times as much as input tokens (see the cost sketch after this list).
- Prebuilt Model Catalog: DeepInfra has already deployed dozens of open-source models, which can save significant time; deploying these models yourself often involves navigating the complexities of AWS.
- OpenAI-Compatible API: They've adopted the same interface as the OpenAI Python library for all models, so developers who already know the OpenAI API don't need to learn new documentation (see the client sketch after this list).
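
To make the flat-versus-split pricing point concrete, here is a back-of-the-envelope cost sketch. The per-million-token rates below are made-up placeholders for illustration, not quoted DeepInfra or OpenAI prices.

```python
# Hypothetical rates in USD per million tokens; placeholders, not real prices.
FLAT_RATE = 0.60                   # flat pricing: input and output cost the same
SPLIT_IN, SPLIT_OUT = 0.50, 1.50   # split pricing with a 3x output premium

def cost_usd(tokens_in: int, tokens_out: int, rate_in: float, rate_out: float) -> float:
    """Cost of one workload given per-million-token rates."""
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

# A chat workload that emits as many tokens as it consumes:
print(f"flat:  {cost_usd(500_000, 500_000, FLAT_RATE, FLAT_RATE):.2f} USD")   # 0.60
print(f"split: {cost_usd(500_000, 500_000, SPLIT_IN, SPLIT_OUT):.2f} USD")    # 1.00
```

The more output-heavy the workload, the more a 3x output premium dominates under split pricing.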
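And a minimal sketch of the OpenAI-compatible interface. The `base_url` below is the OpenAI-compatible endpoint from DeepInfra's docs (verify the current value there), and the model id is a placeholder; substitute any model they host and your own API token.

```python
from openai import OpenAI

# The standard OpenAI Python client, pointed at DeepInfra's
# OpenAI-compatible endpoint instead of api.openai.com.
client = OpenAI(
    api_key="YOUR_DEEPINFRA_TOKEN",                  # a DeepInfra token, not an OpenAI key
    base_url="https://api.deepinfra.com/v1/openai",  # check DeepInfra docs for current URL
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-70b-chat-hf",          # placeholder: any hosted model id
    messages=[{"role": "user", "content": "Why are serverless GPUs cheaper?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```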
Custom Model Deployment
Custom model deployment is available on request. Below is an exchange with their support that gives approximate prices:
- Question: I would like to deploy a 70B-parameter LLM with 5-bit quantization, in GGUF format.
- Answer: An A100 80G is 2 USD/h; an H100 80G is 3.5 USD/h. With 5-bit quantization the model fits on a single card. Setup would not take long, a couple of days. You get the same API endpoint as our public models, but for your private model, accessible only by you. When you deploy custom models, we charge for the time the model is deployed. We offer an option to spin the model down to zero, but if you choose it there will be cold-start times.
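
For budgeting, the quoted hourly rates translate directly into an always-on cost: 2 × 24 × 30 ≈ 1,440 USD per month for a single A100 80G, or 3.5 × 24 × 30 ≈ 2,520 USD for an H100 80G. The spin-down-to-zero option trades that standing cost for cold-start latency on the first request.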
Author: t.me/grayskripko