AI Glossary

Inference

The process of running a trained AI model to generate predictions or outputs. Inference costs are a key factor in the economics of AI — faster inference means lower operational costs.

Understanding Inference

Training builds the model; inference uses it. Every time you send a prompt to ChatGPT or Claude, you're running inference. For businesses, inference is the ongoing operational cost of AI — and it's where architecture decisions have the biggest financial impact.

Inference costs depend on model size, input/output length, and infrastructure. A small fine-tuned model running on optimized hardware can deliver similar quality at 10-50x lower cost per query than a large general-purpose model.
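To make the economics concrete, here is a back-of-envelope cost comparison. All prices and volumes are illustrative assumptions, not quotes for any specific provider or model; `monthly_cost` is a hypothetical helper.

```python
# Back-of-envelope inference cost comparison (all prices hypothetical).
def monthly_cost(queries, in_tokens, out_tokens, price_in, price_out):
    """Monthly cost in dollars; prices are per million tokens."""
    total_in = queries * in_tokens / 1_000_000   # millions of input tokens
    total_out = queries * out_tokens / 1_000_000  # millions of output tokens
    return total_in * price_in + total_out * price_out

# Assume 100k queries/month, 500 input + 300 output tokens per query.
large = monthly_cost(100_000, 500, 300, price_in=15.0, price_out=60.0)
small = monthly_cost(100_000, 500, 300, price_in=0.50, price_out=1.50)
print(f"Large model: ${large:,.2f}/month")  # $2,550.00
print(f"Small model: ${small:,.2f}/month")  # $70.00
print(f"Cost ratio: {large / small:.0f}x")  # 36x
```

Under these assumed prices, the small model lands squarely in the 10-50x savings range described above.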

As your AI usage scales from hundreds to millions of queries, inference optimization becomes critical. Strategies include model distillation, caching common queries, batching requests, and choosing the right model size for each task.

Inference in Canada

Canadian businesses can reduce inference latency for local users by deploying models on Canadian cloud regions (AWS ca-central-1, Azure Canada Central) rather than routing to US data centers.

Frequently Asked Questions

How much does inference cost?

Costs vary widely. Cloud API inference ranges from roughly $0.25 to $60 per million tokens depending on the model. Self-hosted inference on optimized hardware can be 5-10x cheaper at scale.

What factors affect inference speed?

Model size, hardware (GPU vs. CPU), input/output length, batch size, and optimization techniques such as quantization all affect speed. Smaller models on dedicated GPUs deliver the fastest inference.
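Quantization, mentioned above, works by storing each model weight in fewer bits, which directly shrinks the memory a GPU must hold and move. A quick worked calculation shows why it matters; the 7B parameter size is an illustrative assumption.

```python
# Approximate GPU memory needed just to hold model weights.
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    # params * bits / 8 gives bytes; divide by 1e9 for GB.
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits, name in [(16, "fp16"), (8, "int8"), (4, "int4")]:
    print(f"7B model @ {name}: {weight_memory_gb(7, bits):.1f} GB")
# fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```

Halving the bits halves the weight memory, which is why a quantized model can fit on a cheaper GPU and often serves more queries per second (activations and KV-cache add further overhead not counted here).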

See Inference in Action

Book a free 30-minute strategy call. We'll show you how optimized inference can drive real results for your business.