Limits
Workers AI is now Generally Available. We've updated our rate limits to reflect this.
Note that model inferences in local mode using Wrangler will also count towards these limits. Beta models may have lower rate limits while we work on performance and scale.
Rate limits apply per task type by default, with some per-model limits defined as follows:
Automatic Speech Recognition
- 720 requests per minute

Image Classification
- 3000 requests per minute

Image-to-Text
- 720 requests per minute

Object Detection
- 3000 requests per minute

Summarization
- 1500 requests per minute

Text Classification
- 2000 requests per minute

Text Embeddings
- 3000 requests per minute
- @cf/baai/bge-large-en-v1.5 is 1500 requests per minute
When using @cf/baai/bge embedding models, the following limits apply:
- The maximum token limit per input is 512 tokens.
- The maximum batch size is 100 inputs per request (see the batching sketch below).
- The total number of tokens across all inputs in the batch must not exceed internal processing limits.
- Larger inputs (closer to 512 tokens) may reduce the maximum batch size due to these constraints.
- Exceeding the batch size limit: If more than 100 inputs are provided, a 400 Bad Request error is returned.
- Exceeding the token limit per input: If a single input exceeds 512 tokens, the request will fail with a 400 Bad Request error.
- Combined constraints: Requests with both a high batch size and large token inputs may fail due to exceeding the model's processing limits.
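To stay within these constraints, split large embedding workloads into batches of at most 100 inputs. The sketch below is a minimal illustration, assuming a Worker with an AI binding exposed as env.AI; the character-based cutoff is only a rough stand-in for the 512-token limit, since exact token counts depend on the model's tokenizer.

```ts
// Minimal sketch: batch embedding inputs so each request stays within the
// 100-input and 512-token-per-input limits. Assumes a Workers AI binding
// exposed to the Worker as `env.AI`.
export interface Env {
  AI: { run(model: string, inputs: Record<string, unknown>): Promise<any> };
}

const MAX_BATCH_SIZE = 100;        // documented per-request input limit
const APPROX_MAX_CHARS = 512 * 4;  // rough character proxy for 512 tokens (assumption)

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { documents } = (await request.json()) as { documents: string[] };

    // Clamp each input so it stays under the per-input token limit.
    const clamped = documents.map((doc) => doc.slice(0, APPROX_MAX_CHARS));

    const vectors: number[][] = [];
    for (let i = 0; i < clamped.length; i += MAX_BATCH_SIZE) {
      // Sending at most 100 inputs per call avoids the 400 Bad Request
      // returned when the batch size limit is exceeded.
      const batch = clamped.slice(i, i + MAX_BATCH_SIZE);
      const result = await env.AI.run("@cf/baai/bge-large-en-v1.5", { text: batch });
      vectors.push(...result.data); // `data` holds one embedding per input
    }

    return Response.json({ count: vectors.length });
  },
};
```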

Text Generation
- 300 requests per minute
- @hf/thebloke/mistral-7b-instruct-v0.1-awq is 400 requests per minute
- @cf/microsoft/phi-2 is 720 requests per minute
- @cf/qwen/qwen1.5-0.5b-chat is 1500 requests per minute
- @cf/qwen/qwen1.5-1.8b-chat is 720 requests per minute
- @cf/qwen/qwen1.5-14b-chat-awq is 150 requests per minute
- @cf/tinyllama/tinyllama-1.1b-chat-v1.0 is 720 requests per minute

Text-to-Image
- 720 requests per minute
- @cf/runwayml/stable-diffusion-v1-5-img2img is 1500 requests per minute

Translation
- 720 requests per minute
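All of the limits above are per-minute request caps, so bursty workloads may occasionally be rejected. As a rough, unofficial sketch, a client calling the Workers AI REST API can back off and retry when a request is rejected for exceeding a rate limit (typically surfaced as HTTP 429); the account ID, API token, model input, and backoff values below are placeholders.

```ts
// Minimal sketch: retry a Workers AI REST call with exponential backoff when a
// per-minute rate limit is exceeded. ACCOUNT_ID and API_TOKEN are placeholders.
const ACCOUNT_ID = "YOUR_ACCOUNT_ID";
const API_TOKEN = "YOUR_API_TOKEN";

async function runWithBackoff(
  model: string,
  inputs: unknown,
  maxAttempts = 5,
): Promise<unknown> {
  const url = `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/${model}`;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(url, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(inputs),
    });

    // Anything other than a rate-limit rejection is returned (or thrown) as-is.
    if (response.status !== 429) {
      if (!response.ok) {
        throw new Error(`Workers AI request failed: ${response.status}`);
      }
      return response.json();
    }

    // Rate limited: wait 1s, 2s, 4s, ... before retrying.
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
  }

  throw new Error("Rate limit still exceeded after retries");
}

// Example usage with one of the models listed above:
// await runWithBackoff("@cf/qwen/qwen1.5-0.5b-chat", { prompt: "Hello" });
```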