feat: add per-user API rate limiting (Phase 1 — tracking only)#932
feat: add per-user API rate limiting (Phase 1 — tracking only)#932AchoArnold wants to merge 16 commits into
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Returns subscription.Limit() * 2 for daily API request budgets - Add 4 test cases covering Free, Pro, Ultra, and 200K subscriptions - RateLimit(): Free=400, Pro=10000, Ultra=20000, 200K=400000 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d Redis sync - Add RateLimitService with per-user API request tracking - Implement in-memory counters with 24h sliding windows - Add background flush to Redis every 30s with pipelining - Emit RateLimitExceeded events when limits are exceeded - Support graceful nil handling for tracer, logger, client, dispatcher - Add comprehensive tests for basic counting, weighted costs, limit enforcement, window expiry Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add rateLimitService and redisClient fields to Container struct - Extract RedisClient() method for shared Redis client creation - Refactor Cache() to reuse RedisClient() singleton - Add RateLimitService() lazy singleton initialization - Register RateLimit middleware in App() after API key auth - Add RATE_LIMIT_ENABLED=false to .env.docker Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fication errors - Add Container.Close() to flush rate limit counters on shutdown - Call defer container.Close() in main.go - Check and log errors from Redis notification flag persistence Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n logging - All Redis calls in RateLimitService use 1s context timeout - If Redis is slow/down, rate limiting continues in-memory (fail open) - Added logging to flush loop start/stop and graceful shutdown Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| CodeStyle | 50 minor |
🟢 Metrics 74 complexity · 2 duplication
Metric Results Complexity 74 Duplication 2
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Greptile SummaryThis PR adds Phase 1 of per-user, plan-based API rate limiting: a Fiber middleware tracks requests in memory with periodic Redis sync, emitting a
Confidence Score: 3/5The change is tracking-only and never blocks requests, so merging cannot degrade the user-facing API — but the rate limiting data it produces will be unreliable until the flush and context bugs are fixed. Three concrete defects exist in the new code: dirty counters are zeroed before the Redis write succeeds so usage counts are silently dropped on any Redis hiccup; the exceeded-event goroutine inherits the Fiber request context which is cancelled by the time the goroutine runs, meaning the Redis deduplication flag is never durably written and events fire repeatedly after restarts; and every authenticated request now incurs an extra database round-trip for the subscription plan lookup. These issues do not crash the service but do mean the observability data this feature is designed to produce will be incorrect from day one. api/pkg/services/rate_limit_service.go (flush data-loss and goroutine context), api/pkg/middlewares/rate_limit_middleware.go (per-request DB load) Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Client
participant Fiber as Fiber (Auth MW)
participant RL as RateLimit MW
participant Repo as UserRepository
participant SVC as RateLimitService
participant Redis
participant Dispatcher as EventDispatcher
Client->>Fiber: HTTP request
Fiber->>RL: next (AuthContext in Locals)
RL->>RL: "check RATE_LIMIT_ENABLED & path exclusion"
RL->>Repo: Load(ctx, userID) [DB query per request]
Repo-->>RL: user.SubscriptionName
RL->>SVC: Increment(ctx, userID, plan, cost)
SVC->>SVC: lock mutex
alt counter not in memory
SVC->>Redis: "GET rate_limit:{userID}"
Redis-->>SVC: count + TTL
SVC->>Redis: "EXISTS rate_limit_notified:{userID}"
Redis-->>SVC: notified flag
end
SVC->>SVC: update counter, check limit
alt limit exceeded and not notified
SVC->>SVC: "notified[key] = true"
SVC-->>Dispatcher: go emitExceededEvent(requestCtx)
Note over SVC,Dispatcher: goroutine runs with already-cancelled ctx
end
SVC-->>RL: count, exceeded
RL->>Fiber: c.Next()
Fiber-->>Client: response
loop every 30s (background)
SVC->>SVC: "lock, collect dirty=0 before flush"
SVC->>Redis: INCRBY / ExpireNX
alt Redis error
Note over SVC,Redis: dirty counts permanently lost
end
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant Client
participant Fiber as Fiber (Auth MW)
participant RL as RateLimit MW
participant Repo as UserRepository
participant SVC as RateLimitService
participant Redis
participant Dispatcher as EventDispatcher
Client->>Fiber: HTTP request
Fiber->>RL: next (AuthContext in Locals)
RL->>RL: "check RATE_LIMIT_ENABLED & path exclusion"
RL->>Repo: Load(ctx, userID) [DB query per request]
Repo-->>RL: user.SubscriptionName
RL->>SVC: Increment(ctx, userID, plan, cost)
SVC->>SVC: lock mutex
alt counter not in memory
SVC->>Redis: "GET rate_limit:{userID}"
Redis-->>SVC: count + TTL
SVC->>Redis: "EXISTS rate_limit_notified:{userID}"
Redis-->>SVC: notified flag
end
SVC->>SVC: update counter, check limit
alt limit exceeded and not notified
SVC->>SVC: "notified[key] = true"
SVC-->>Dispatcher: go emitExceededEvent(requestCtx)
Note over SVC,Dispatcher: goroutine runs with already-cancelled ctx
end
SVC-->>RL: count, exceeded
RL->>Fiber: c.Next()
Fiber-->>Client: response
loop every 30s (background)
SVC->>SVC: "lock, collect dirty=0 before flush"
SVC->>Redis: INCRBY / ExpireNX
alt Redis error
Note over SVC,Redis: dirty counts permanently lost
end
end
|
| if counter.dirty > 0 { | ||
| entries = append(entries, flushEntry{ | ||
| key: rateLimitKeyPrefix + key, | ||
| delta: counter.dirty, | ||
| ttl: time.Until(counter.windowExpiry), | ||
| }) | ||
| counter.dirty = 0 | ||
| } | ||
| } | ||
| svc.mu.Unlock() |
There was a problem hiding this comment.
Dirty counter zeroed before flush succeeds — permanent data loss
counter.dirty = 0 is set inside the mutex on line 171, before the Redis pipeline is executed. When Redis is unavailable or the 1-second timeout fires, the error is logged but the increments that were collected in entries are silently discarded — no subsequent flush will know to retry them. On an instance that receives heavy traffic, this means up to 30 seconds of usage counts vanish on every Redis hiccup, causing the in-memory counter to permanently undercount relative to Redis.
The delta should only be zeroed on a successful pipeline exec. One approach is to collect the dirty value into the entry snapshot but leave counter.dirty untouched inside the lock, then subtract the successfully flushed delta after pipe.Exec returns without error (taking the lock again for the subtract, or using a separate failure-requeue path).
| if exceeded && !svc.notified[key] { | ||
| svc.notified[key] = true | ||
| go svc.emitExceededEvent(ctx, userID, counter.count, limit, plan) |
There was a problem hiding this comment.
Request-scoped context captured in background goroutine — Redis SET and event dispatch will fail after the request completes
ctx here is the Fiber request context, which is cancelled (and recycled by Fiber's buffer pool) as soon as the handler chain returns. The goroutine launched on line 104 inherits this cancelled context, so redisCtx, cancel := context.WithTimeout(ctx, ...) inside emitExceededEvent inherits a done parent and the Redis Set call fails immediately. The result is that svc.notified[key] is set to true in memory (preventing duplicate events within the same instance), but the Redis rate_limit_notified: key is never written. Every time a Cloud Run instance cold-starts, the first request past the limit re-fires the event, violating the "once per window" guarantee.
Use context.Background() (or a service-owned context) when spawning emitExceededEvent.
| user, err := userRepository.Load(ctx, authUser.ID) | ||
| if err != nil { | ||
| ctxLogger := tracer.CtxLogger(logger, span) | ||
| ctxLogger.Error(err) | ||
| return c.Next() | ||
| } |
There was a problem hiding this comment.
Unconditional database query on every authenticated request
userRepository.Load is called on every request solely to retrieve user.SubscriptionName. With ~20M requests/month this adds an equivalent number of DB queries — one per API call — just for rate-limit plan lookups. The auth middlewares that run before this one (BearerAuth, APIKeyAuth) already load and validate the user; if the AuthContext stored in c.Locals already carries the subscription plan (or if there is a short-lived cache in the repository), this load could be avoided entirely. If AuthContext doesn't include the plan today, it would be worth adding it there rather than performing a separate load per request.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| func (svc *RateLimitService) Close() { | ||
| if svc.logger != nil { | ||
| svc.logger.Info("RateLimitService shutting down, flushing counters") | ||
| } | ||
| close(svc.done) | ||
| if svc.client != nil { | ||
| svc.flush(context.Background()) | ||
| } | ||
| if svc.logger != nil { | ||
| svc.logger.Info("RateLimitService shutdown complete") | ||
| } | ||
| } |
There was a problem hiding this comment.
Close() panics on second call — no double-close guard
close(svc.done) on line 115 panics if the channel is already closed. In the current wiring defer container.Close() is the sole caller, so this is unlikely in practice, but it is a fragile contract. If tests or a future caller invokes Close() more than once the process crashes.
A common Go pattern is to protect with a sync.Once or use select { case svc.done <- struct{}{}: default: } with a buffered channel.
| notifiedKey := rateLimitNotifiedPrefix + userID | ||
| if exists, _ := svc.client.Exists(redisCtx, notifiedKey).Result(); exists > 0 { | ||
| svc.notified[userID] = true | ||
| } |
There was a problem hiding this comment.
Notification-suppression key TTL drifts from count key — duplicate events in new window
The rate_limit_notified:{user_id} key is set with a fresh 24-hour TTL at the moment the limit is first exceeded. The rate_limit:{user_id} count key has its own 24-hour TTL starting from the user's first request in that window. These two timers are independent and will drift. If a user hits the limit at hour 23 of their window, the notified key gets a fresh 24h TTL that extends one hour into the next window. When the new window opens and a cold-started instance hydrates, it sees the stale notified key and sets svc.notified[userID] = true, suppressing the event for the entire new window even though this is a new 24-hour period. Tying the notified key's TTL to the remaining TTL of the count key would fix the drift.
Replace defer container.Close() in main.go with app.Hooks().OnPreShutdown() which is the idiomatic Fiber v3 approach for graceful shutdown. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Shutdown - Logger is always injected via DI, nil checks are unnecessary - Use Fiber's app.Hooks().OnPreShutdown() for graceful shutdown - Remove defer container.Close() from main.go Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rely when disabled Move RATE_LIMIT_ENABLED check to container.go so the middleware, service, and Redis client are never initialized when rate limiting is disabled. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of creating a separate pipeline per user, all INCRBY+ExpireNX commands are batched into one pipeline.Exec() — one round-trip to Redis regardless of how many users have dirty counters. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace plain service.logger calls with ctxLogger (tracer-aware context logger) in flush, hydrate, and emitExceededEvent methods for trace correlation. Close/flushLoop keep plain logger since no request context. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds per-user, plan-based API rate limiting that tracks usage without blocking requests (Phase 1). When a user exceeds their daily budget, a
ate.limit.exceeded\ CloudEvent is emitted for observability.
Key Design Decisions
Changes
Testing
Future Work (Phase 2)