API Latency Optimization: Cut Response Times by 50% or More
Slow APIs kill user experience and cost revenue. Here's how to identify bottlenecks and implement optimizations that cut latency in half.
Every 100 milliseconds of API latency costs you users. Amazon calculated that figure at 1% of sales per 100ms. Google found that 500ms of additional load time reduced search traffic by 20%. These aren't edge cases — they're the economics of speed in 2026.
Your API might work correctly, but if it's slow, it's failing. Users don't wait. Mobile connections amplify delays. And slow APIs cascade through systems, turning minor latency into major outages under load.
Here's how to find what's slowing you down and fix it.
Measuring What Matters
You can't optimize what you don't measure. Before changing anything, establish baseline metrics and identify bottlenecks.
Essential latency metrics:
- P50 (median): Half of requests are faster, half are slower. Your "typical" experience.
- P95: 95% of requests are faster than this. Catches most slow requests.
- P99: 99% of requests are faster. Reveals worst-case scenarios.
- P99.9: For high-traffic APIs, even 0.1% of requests being slow affects thousands of users.
Why percentiles matter more than averages:
An average of 100ms sounds great. But if 95% of requests complete in 50ms and 5% take 1,100ms, the average is still only about 100ms while 1 in 20 users has a terrible experience. Percentiles expose this.
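To make this concrete, here is a minimal sketch that computes the mean and nearest-rank percentiles for exactly the distribution described above (95 requests at 50ms, 5 at 1,100ms):

```python
def percentile(latencies, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(latencies)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 95% of requests at 50ms, 5% at 1,100ms, as in the example above
latencies = [50] * 95 + [1100] * 5

mean = sum(latencies) / len(latencies)
print(f"mean={mean:.1f}ms "
      f"p50={percentile(latencies, 50)}ms "
      f"p99={percentile(latencies, 99)}ms")
# mean=102.5ms p50=50ms p99=1100ms
```

The mean looks fine; only P99 reveals the 1,100ms tail.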
Setting up measurement:
1. Total request duration (client perspective)
2. Server processing time (excluding network)
3. Database query time
4. External API call time
5. Serialization/deserialization time
6. Queue wait time (if applicable)
Tools like Datadog, New Relic, or open-source alternatives (Jaeger, Prometheus + Grafana) provide this visibility. At minimum, log timestamps at each stage of request processing.
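If you don't have a tracing tool yet, stage-level timing can start as simply as a context manager around each step. This is a sketch (the stage names are placeholders, not a real framework API):

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Records per-stage durations (in ms) for one request."""
    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000

timer = StageTimer()
with timer.stage("db_query"):
    time.sleep(0.01)      # stand-in for a database call
with timer.stage("serialize"):
    time.sleep(0.002)     # stand-in for JSON encoding

print({name: f"{ms:.1f}ms" for name, ms in timer.stages.items()})
```

Log these per request and you can compute the percentiles above for each stage, not just the total.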
Establishing baselines:
Before optimizing, document current performance:
- P50, P95, P99 for each endpoint
- Throughput (requests per second)
- Error rates
- Resource utilization (CPU, memory, database connections)
This baseline lets you measure improvement and catch regressions.
The Usual Suspects: Common Latency Causes
Most API latency comes from a handful of common issues. Check these first.
Database Queries
Database operations are the #1 cause of API latency in most applications.
N+1 query problems:
The classic mistake: fetching a list of items, then making a separate query for each item's related data.
```python
# Bad: N+1 queries (Django ORM shown; the pattern applies to any ORM)
users = User.objects.all()            # 1 query
for user in users:
    orders = user.orders.all()        # 1 query per user -> N extra queries

# Good: eager loading
users = User.objects.all().prefetch_related('orders')  # 2 queries total
```
N+1 problems turn a 10ms operation into a 500ms operation as data grows.
Missing indexes:
A query scanning millions of rows instead of using an index can take seconds instead of milliseconds. Use EXPLAIN ANALYZE to identify full table scans.
Over-fetching:
Selecting all columns when you need three wastes bandwidth and processing time. Select only what you need.
Connection pooling:
Opening a new database connection takes 20-50ms. Connection pools maintain ready connections, eliminating this overhead.
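In production you should use your driver's or framework's built-in pool, but the mechanism is just a queue of ready connections. A toy sketch with SQLite:

```python
import queue
import sqlite3

class ConnectionPool:
    """Toy pool: pre-opens connections so requests never pay connect latency."""
    def __init__(self, size, dsn=":memory:"):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    def acquire(self):
        return self._pool.get()     # blocks until a connection is free

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=5)
conn = pool.acquire()
try:
    result = conn.execute("SELECT 1").fetchone()[0]
finally:
    pool.release(conn)              # always return the connection
print(result)
```

The try/finally is the important part: a leaked connection shrinks the pool until requests start blocking.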
Network Round Trips
Every network call adds latency. Minimize them.
External API calls:
If your API calls three external services sequentially, you're adding their latencies together. Parallelize when possible:
```python
# Bad: Sequential (300ms total if each takes 100ms)
result1 = await api1.call()
result2 = await api2.call()
result3 = await api3.call()

# Good: Parallel (100ms total)
result1, result2, result3 = await asyncio.gather(
    api1.call(),
    api2.call(),
    api3.call(),
)
```
Chatty protocols:
Multiple small requests are slower than one larger request due to connection overhead. Batch operations when possible.
Serialization Overhead
Converting data to JSON (or other formats) takes time, especially for large payloads.
Optimization strategies:
- Use faster serialization libraries (orjson vs. standard json in Python, for example)
- Reduce payload size by excluding unnecessary fields
- Consider binary formats (Protocol Buffers, MessagePack) for internal APIs
- Compress responses for large payloads (gzip typically reduces size 70-90%)
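The compression figure is easy to verify with the standard library on a repetitive payload typical of list endpoints:

```python
import gzip
import json

# 1,000 similar records, like a paginated list response
payload = json.dumps([
    {"id": i, "status": "active", "region": "us-east-1"}
    for i in range(1000)
]).encode()

compressed = gzip.compress(payload)
saved = 100 * (1 - len(compressed) / len(payload))
print(f"{len(payload)} bytes -> {len(compressed)} bytes ({saved:.0f}% smaller)")
```

In practice, enable this at the web server or framework layer (e.g. a gzip middleware) rather than compressing by hand, and skip it for small responses where the CPU cost outweighs the bytes saved.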
Synchronous Blocking
Operations that block the request thread while waiting for I/O waste resources and limit throughput.
Move to async:
Modern frameworks (FastAPI, Node.js, Go) handle I/O asynchronously, allowing thousands of concurrent requests without thread exhaustion.
Offload heavy work:
CPU-intensive operations (image processing, complex calculations) should move to background workers, returning results via webhooks or polling.
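A real system would use a task queue (Celery, Sidekiq, etc.), but the submit-then-poll shape can be sketched with a thread pool; `submit_job` and `poll_job` are hypothetical names standing in for your enqueue and status endpoints:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
jobs = {}   # job_id -> Future

def submit_job(func, *args):
    """Returns immediately with a job id (the '202 Accepted' path)."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = executor.submit(func, *args)
    return job_id

def poll_job(job_id):
    """Status endpoint: the result if finished, else None ('still processing')."""
    future = jobs[job_id]
    return future.result() if future.done() else None

def heavy_work(n):   # stand-in for image processing or a big calculation
    return sum(i * i for i in range(n))

job_id = submit_job(heavy_work, 1000)
jobs[job_id].result()       # here we just wait; a real client would poll
print(poll_job(job_id))
```

The request that enqueued the job returns in microseconds; the expensive work happens off the request path.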
Caching: The Biggest Win
Caching is often the single most impactful optimization. A cached response in 1ms beats a computed response in 100ms.
What to cache:
- Frequently accessed, rarely changed data: User profiles, configuration, reference data
- Expensive computations: Aggregations, reports, search results
- External API responses: When freshness requirements allow
Caching layers:
| Layer | Latency | Use Case |
|-------|---------|----------|
| In-memory (application) | under 1ms | Hot data, session state |
| Distributed cache (Redis) | 1-5ms | Shared across instances |
| CDN | 10-50ms | Static assets, API responses |
| Database query cache | 5-20ms | Repeated identical queries |
Cache invalidation strategies:
- TTL (Time to Live): Simple but may serve stale data
- Write-through: Update cache when data changes
- Cache-aside: Application manages cache explicitly
- Event-driven: Invalidate on relevant events
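Cache-aside with a TTL is the simplest combination of the strategies above. A minimal sketch (the `load_profile` function is a stand-in for an expensive database query):

```python
import time

class TTLCache:
    """Cache-aside with TTL: simple, but entries may be stale until expiry."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}   # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[1] < time.monotonic():
            return None    # miss, or expired
        return entry[0]

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

db_calls = 0
def load_profile(user_id):             # stand-in for the slow path
    global db_calls
    db_calls += 1
    return {"id": user_id, "name": "Ada"}

cache = TTLCache(ttl_seconds=60)

def get_profile(user_id):
    # Cache-aside: check cache, on miss load and fill
    profile = cache.get(user_id)
    if profile is None:
        profile = load_profile(user_id)
        cache.set(user_id, profile)
    return profile

get_profile(42)
get_profile(42)
print(db_calls)   # 1 — the second call was a cache hit
```

Counting loads versus lookups, as `db_calls` does here, is also the raw material for the hit-rate metric below.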
Cache hit rate:
Track your cache hit rate. Below 80%, you're not getting full benefit. Above 95%, you're doing well. 99%+ is excellent for read-heavy workloads.
Architecture-Level Optimizations
Sometimes the fix isn't code — it's architecture.
Edge computing:
Move computation closer to users. A request from Tokyo to a server in Virginia adds 150ms of network latency minimum. Edge functions or regional deployments eliminate this.
Read replicas:
Distribute read queries across database replicas. Writes go to the primary; reads go to replicas. This scales read capacity and reduces primary database load.
Async processing:
Not everything needs to happen during the request. Send emails, generate reports, and update analytics asynchronously. Return a 202 Accepted and process in the background.
Request coalescing:
If 100 users request the same data simultaneously, compute it once and serve to all. This prevents thundering herd problems and reduces redundant work.
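The trick is to share one in-flight task per key among all concurrent callers. A sketch with asyncio (the backend call is simulated with a sleep):

```python
import asyncio

in_flight = {}       # key -> Task shared by all concurrent callers
compute_count = 0

async def expensive_fetch(key):
    global compute_count
    compute_count += 1
    await asyncio.sleep(0.05)            # stand-in for a slow backend call
    return f"value-for-{key}"

async def coalesced_get(key):
    task = in_flight.get(key)
    if task is None:
        # First caller starts the work; later callers await the same task
        task = asyncio.ensure_future(expensive_fetch(key))
        in_flight[key] = task
        task.add_done_callback(lambda _: in_flight.pop(key, None))
    return await task

async def main():
    return await asyncio.gather(*(coalesced_get("hot-key") for _ in range(100)))

results = asyncio.run(main())
print(compute_count, len(results))   # computed once, served to 100 callers
```

The done-callback removes the finished task so the next burst of requests recomputes fresh data rather than reusing a stale result forever.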
Optimization Process
Follow a systematic approach rather than guessing.
Step 1: Profile
Identify where time is actually spent. Don't assume — measure. Flame graphs and distributed tracing reveal the truth.
Step 2: Prioritize
Focus on the biggest bottlenecks first. A 50% improvement to a 500ms operation saves 250ms. A 50% improvement to a 10ms operation saves 5ms.
Step 3: Implement
Make one change at a time. Multiple simultaneous changes make it impossible to know what worked.
Step 4: Measure
Compare against your baseline. Did P99 improve? Did throughput increase? Did you introduce any regressions?
Step 5: Iterate
Repeat until you hit your targets. Diminishing returns eventually set in — know when to stop.
Key Takeaways
- Measure with percentiles (P50, P95, P99), not averages. Averages hide problems.
- Database queries are usually the biggest bottleneck. Fix N+1 problems, add indexes, use connection pooling.
- Parallelize external calls. Sequential API calls add latencies together; parallel calls take only as long as the slowest.
- Cache aggressively. A 1ms cache hit beats a 100ms computation every time.
- Consider architecture. Edge computing, read replicas, and async processing solve problems that code optimization can't.
- Profile before optimizing. Guessing wastes time. Measure, identify the actual bottleneck, then fix it.