Skip to content

feat: add per-user API rate limiting (Phase 1 — tracking only)#932

Open
AchoArnold wants to merge 16 commits into
mainfrom
feat/api-rate-limiting
Open

feat: add per-user API rate limiting (Phase 1 — tracking only)#932
AchoArnold wants to merge 16 commits into
mainfrom
feat/api-rate-limiting

Conversation

@AchoArnold

Copy link
Copy Markdown
Member

Summary

Adds per-user, plan-based API rate limiting that tracks usage without blocking requests (Phase 1). When a user exceeds their daily budget, a
ate.limit.exceeded\ CloudEvent is emitted for observability.

Key Design Decisions

  • In-memory counters with periodic Redis sync — counters live in memory and flush to Redis every 30s via atomic \INCRBY. This reduces Redis costs from ~20M ops/month to ~200K-400K ops/month.
  • Fail open — all Redis calls have a 1-second timeout. If Redis is slow/down, rate limiting continues in-memory without impacting request latency.
  • Weighted counting — GET list endpoints with a \limit\ query param count that value (capped at 200) instead of 1.
  • Optional — controlled by \RATE_LIMIT_ENABLED\ env var (default \ alse). Self-hosted users are unaffected.
  • Selective — /v1/events\ is excluded (system-only endpoint).
  • Plan-based — each subscription tier gets \Limit() * 2\ requests per 24h sliding window.

Changes

File Description
\�pi/pkg/entities/user.go\ Add \RateLimit()\ method to \SubscriptionName\
\�pi/pkg/events/rate_limit_exceeded_event.go\ New CloudEvent type
\�pi/pkg/services/rate_limit_service.go\ Core service: in-memory counters, flush loop, hydration, event emission
\�pi/pkg/services/rate_limit_service_test.go\ Unit tests (5 cases)
\�pi/pkg/middlewares/rate_limit_middleware.go\ Fiber middleware
\�pi/pkg/di/container.go\ DI wiring, shared Redis client, shutdown hook
\�pi/main.go\ Graceful shutdown via \defer container.Close()\
\�pi/.env.docker\ Add \RATE_LIMIT_ENABLED=false\

Testing

  • 9 unit tests passing (4 entity + 5 service)
  • Build verification passing
  • \go vet\ clean on new code (pre-existing warnings in other middleware files)

Future Work (Phase 2)

  • Return 429 when limit exceeded
  • Add \X-RateLimit-*\ response headers
  • Usage dashboard for users

AchoArnold and others added 9 commits June 30, 2026 11:21
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Returns subscription.Limit() * 2 for daily API request budgets
- Add 4 test cases covering Free, Pro, Ultra, and 200K subscriptions
- RateLimit(): Free=400, Pro=10000, Ultra=20000, 200K=400000

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d Redis sync

- Add RateLimitService with per-user API request tracking
- Implement in-memory counters with 24h sliding windows
- Add background flush to Redis every 30s with pipelining
- Emit RateLimitExceeded events when limits are exceeded
- Support graceful nil handling for tracer, logger, client, dispatcher
- Add comprehensive tests for basic counting, weighted costs, limit enforcement, window expiry

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add rateLimitService and redisClient fields to Container struct
- Extract RedisClient() method for shared Redis client creation
- Refactor Cache() to reuse RedisClient() singleton
- Add RateLimitService() lazy singleton initialization
- Register RateLimit middleware in App() after API key auth
- Add RATE_LIMIT_ENABLED=false to .env.docker

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fication errors

- Add Container.Close() to flush rate limit counters on shutdown
- Call defer container.Close() in main.go
- Check and log errors from Redis notification flag persistence

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n logging

- All Redis calls in RateLimitService use 1s context timeout
- If Redis is slow/down, rate limiting continues in-memory (fail open)
- Added logging to flush loop start/stop and graceful shutdown

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@codacy-production

codacy-production Bot commented Jun 30, 2026

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 50 minor

Alerts:
⚠ 50 issues (≤ 0 issues of at least minor severity)

Results:
50 new issues

Category Results
CodeStyle 50 minor

View in Codacy

🟢 Metrics 74 complexity · 2 duplication

Metric Results
Complexity 74
Duplication 2

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@greptile-apps

greptile-apps Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds Phase 1 of per-user, plan-based API rate limiting: a Fiber middleware tracks requests in memory with periodic Redis sync, emitting a rate.limit.exceeded CloudEvent when a user's daily budget is exceeded — without ever blocking requests. The implementation is gated behind RATE_LIMIT_ENABLED and excluded for the /v1/events system path.

  • RateLimitService maintains per-user in-memory counters flushed to Redis every 30s via atomic INCRBY; on cold-start, counters are lazily hydrated from Redis to preserve counts across Cloud Run restarts.
  • flush() zeros counter.dirty inside the mutex before the Redis pipeline executes, so any Redis timeout silently discards those increments with no retry path.
  • emitExceededEvent is launched as a goroutine with the Fiber request context, which Fiber cancels and recycles after the handler returns — causing the Redis notified-flag write and event dispatch to fail consistently, breaking the "once per window" deduplication across instance restarts.
  • The middleware calls userRepository.Load on every authenticated request just to read the subscription plan, adding an unconditional DB round-trip to every API call.

Confidence Score: 3/5

The change is tracking-only and never blocks requests, so merging cannot degrade the user-facing API — but the rate limiting data it produces will be unreliable until the flush and context bugs are fixed.

Three concrete defects exist in the new code: dirty counters are zeroed before the Redis write succeeds so usage counts are silently dropped on any Redis hiccup; the exceeded-event goroutine inherits the Fiber request context which is cancelled by the time the goroutine runs, meaning the Redis deduplication flag is never durably written and events fire repeatedly after restarts; and every authenticated request now incurs an extra database round-trip for the subscription plan lookup. These issues do not crash the service but do mean the observability data this feature is designed to produce will be incorrect from day one.

api/pkg/services/rate_limit_service.go (flush data-loss and goroutine context), api/pkg/middlewares/rate_limit_middleware.go (per-request DB load)

Important Files Changed

Filename Overview
api/pkg/services/rate_limit_service.go Core rate limiting logic with three issues: dirty counter zeroed before Redis flush succeeds (data loss), request context passed to background goroutine causing failed notified-flag writes, and no protection against double-close panic.
api/pkg/middlewares/rate_limit_middleware.go Correct tracking-only middleware structure, but calls userRepository.Load on every authenticated request adding an unconditional DB round-trip per API call.
api/pkg/di/container.go Clean refactoring of Redis client into a shared singleton; RateLimitService and its background goroutine are unconditionally instantiated regardless of RATE_LIMIT_ENABLED value.
api/pkg/entities/user.go Simple, correct addition of RateLimit() method delegating to Limit() * 2, well-tested.
api/pkg/events/rate_limit_exceeded_event.go Minimal, correct CloudEvent type definition; no issues.
api/main.go Adds defer container.Close() for graceful shutdown — correct and necessary for the flush goroutine.
api/pkg/services/rate_limit_service_test.go Five unit tests covering basic increment, weighted cost, limit exceedance, multi-user isolation, and window expiry; no Redis integration tests.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Client
    participant Fiber as Fiber (Auth MW)
    participant RL as RateLimit MW
    participant Repo as UserRepository
    participant SVC as RateLimitService
    participant Redis
    participant Dispatcher as EventDispatcher

    Client->>Fiber: HTTP request
    Fiber->>RL: next (AuthContext in Locals)
    RL->>RL: "check RATE_LIMIT_ENABLED & path exclusion"
    RL->>Repo: Load(ctx, userID) [DB query per request]
    Repo-->>RL: user.SubscriptionName
    RL->>SVC: Increment(ctx, userID, plan, cost)
    SVC->>SVC: lock mutex
    alt counter not in memory
        SVC->>Redis: "GET rate_limit:{userID}"
        Redis-->>SVC: count + TTL
        SVC->>Redis: "EXISTS rate_limit_notified:{userID}"
        Redis-->>SVC: notified flag
    end
    SVC->>SVC: update counter, check limit
    alt limit exceeded and not notified
        SVC->>SVC: "notified[key] = true"
        SVC-->>Dispatcher: go emitExceededEvent(requestCtx)
        Note over SVC,Dispatcher: goroutine runs with already-cancelled ctx
    end
    SVC-->>RL: count, exceeded
    RL->>Fiber: c.Next()
    Fiber-->>Client: response

    loop every 30s (background)
        SVC->>SVC: "lock, collect dirty=0 before flush"
        SVC->>Redis: INCRBY / ExpireNX
        alt Redis error
            Note over SVC,Redis: dirty counts permanently lost
        end
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Client
    participant Fiber as Fiber (Auth MW)
    participant RL as RateLimit MW
    participant Repo as UserRepository
    participant SVC as RateLimitService
    participant Redis
    participant Dispatcher as EventDispatcher

    Client->>Fiber: HTTP request
    Fiber->>RL: next (AuthContext in Locals)
    RL->>RL: "check RATE_LIMIT_ENABLED & path exclusion"
    RL->>Repo: Load(ctx, userID) [DB query per request]
    Repo-->>RL: user.SubscriptionName
    RL->>SVC: Increment(ctx, userID, plan, cost)
    SVC->>SVC: lock mutex
    alt counter not in memory
        SVC->>Redis: "GET rate_limit:{userID}"
        Redis-->>SVC: count + TTL
        SVC->>Redis: "EXISTS rate_limit_notified:{userID}"
        Redis-->>SVC: notified flag
    end
    SVC->>SVC: update counter, check limit
    alt limit exceeded and not notified
        SVC->>SVC: "notified[key] = true"
        SVC-->>Dispatcher: go emitExceededEvent(requestCtx)
        Note over SVC,Dispatcher: goroutine runs with already-cancelled ctx
    end
    SVC-->>RL: count, exceeded
    RL->>Fiber: c.Next()
    Fiber-->>Client: response

    loop every 30s (background)
        SVC->>SVC: "lock, collect dirty=0 before flush"
        SVC->>Redis: INCRBY / ExpireNX
        alt Redis error
            Note over SVC,Redis: dirty counts permanently lost
        end
    end
Loading

Comments Outside Diff (1)

  1. api/pkg/di/container.go, line 57-63 (link)

    P2 RateLimitService always instantiated — background goroutine runs even when disabled

    container.RateLimitService() is called unconditionally inside App(), which creates the RateLimitService, starts the background flush goroutine, and ensures a Redis connection is opened — all regardless of whether RATE_LIMIT_ENABLED is "true". For self-hosted users who set RATE_LIMIT_ENABLED=false to opt out, the infrastructure overhead is identical to having the feature enabled. Wrapping the app.Use registration (and therefore the container.RateLimitService() call) in an if os.Getenv("RATE_LIMIT_ENABLED") == "true" guard would avoid the unnecessary goroutine and connection.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "feat(services): add 1s Redis timeout, fa..." | Re-trigger Greptile

Comment thread api/pkg/services/rate_limit_service.go Outdated
Comment on lines +165 to +174
if counter.dirty > 0 {
entries = append(entries, flushEntry{
key: rateLimitKeyPrefix + key,
delta: counter.dirty,
ttl: time.Until(counter.windowExpiry),
})
counter.dirty = 0
}
}
svc.mu.Unlock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Dirty counter zeroed before flush succeeds — permanent data loss

counter.dirty = 0 is set inside the mutex on line 171, before the Redis pipeline is executed. When Redis is unavailable or the 1-second timeout fires, the error is logged but the increments that were collected in entries are silently discarded — no subsequent flush will know to retry them. On an instance that receives heavy traffic, this means up to 30 seconds of usage counts vanish on every Redis hiccup, causing the in-memory counter to permanently undercount relative to Redis.

The delta should only be zeroed on a successful pipeline exec. One approach is to collect the dirty value into the entry snapshot but leave counter.dirty untouched inside the lock, then subtract the successfully flushed delta after pipe.Exec returns without error (taking the lock again for the subtract, or using a separate failure-requeue path).

Comment thread api/pkg/services/rate_limit_service.go Outdated
Comment on lines +102 to +104
if exceeded && !svc.notified[key] {
svc.notified[key] = true
go svc.emitExceededEvent(ctx, userID, counter.count, limit, plan)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Request-scoped context captured in background goroutine — Redis SET and event dispatch will fail after the request completes

ctx here is the Fiber request context, which is cancelled (and recycled by Fiber's buffer pool) as soon as the handler chain returns. The goroutine launched on line 104 inherits this cancelled context, so redisCtx, cancel := context.WithTimeout(ctx, ...) inside emitExceededEvent inherits a done parent and the Redis Set call fails immediately. The result is that svc.notified[key] is set to true in memory (preventing duplicate events within the same instance), but the Redis rate_limit_notified: key is never written. Every time a Cloud Run instance cold-starts, the first request past the limit re-fires the event, violating the "once per window" guarantee.

Use context.Background() (or a service-owned context) when spawning emitExceededEvent.

Comment on lines +60 to +65
user, err := userRepository.Load(ctx, authUser.ID)
if err != nil {
ctxLogger := tracer.CtxLogger(logger, span)
ctxLogger.Error(err)
return c.Next()
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Unconditional database query on every authenticated request

userRepository.Load is called on every request solely to retrieve user.SubscriptionName. With ~20M requests/month this adds an equivalent number of DB queries — one per API call — just for rate-limit plan lookups. The auth middlewares that run before this one (BearerAuth, APIKeyAuth) already load and validate the user; if the AuthContext stored in c.Locals already carries the subscription plan (or if there is a short-lived cache in the repository), this load could be avoided entirely. If AuthContext doesn't include the plan today, it would be worth adding it there rather than performing a separate load per request.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment thread api/pkg/services/rate_limit_service.go Outdated
Comment on lines +111 to +122
func (svc *RateLimitService) Close() {
if svc.logger != nil {
svc.logger.Info("RateLimitService shutting down, flushing counters")
}
close(svc.done)
if svc.client != nil {
svc.flush(context.Background())
}
if svc.logger != nil {
svc.logger.Info("RateLimitService shutdown complete")
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Close() panics on second call — no double-close guard

close(svc.done) on line 115 panics if the channel is already closed. In the current wiring defer container.Close() is the sole caller, so this is unlikely in practice, but it is a fragile contract. If tests or a future caller invokes Close() more than once the process crashes.

A common Go pattern is to protect with a sync.Once or use select { case svc.done <- struct{}{}: default: } with a buffered channel.

Comment on lines +228 to +231
notifiedKey := rateLimitNotifiedPrefix + userID
if exists, _ := svc.client.Exists(redisCtx, notifiedKey).Result(); exists > 0 {
svc.notified[userID] = true
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Notification-suppression key TTL drifts from count key — duplicate events in new window

The rate_limit_notified:{user_id} key is set with a fresh 24-hour TTL at the moment the limit is first exceeded. The rate_limit:{user_id} count key has its own 24-hour TTL starting from the user's first request in that window. These two timers are independent and will drift. If a user hits the limit at hour 23 of their window, the notified key gets a fresh 24h TTL that extends one hour into the next window. When the new window opens and a cold-started instance hydrates, it sees the stale notified key and sets svc.notified[userID] = true, suppressing the event for the entire new window even though this is a new 24-hour period. Tying the notified key's TTL to the remaining TTL of the count key would fix the drift.

AchoArnold and others added 7 commits June 30, 2026 12:10
Replace defer container.Close() in main.go with app.Hooks().OnPreShutdown()
which is the idiomatic Fiber v3 approach for graceful shutdown.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Shutdown

- Logger is always injected via DI, nil checks are unnecessary
- Use Fiber's app.Hooks().OnPreShutdown() for graceful shutdown
- Remove defer container.Close() from main.go

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rely when disabled

Move RATE_LIMIT_ENABLED check to container.go so the middleware,
service, and Redis client are never initialized when rate limiting
is disabled.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of creating a separate pipeline per user, all INCRBY+ExpireNX
commands are batched into one pipeline.Exec() — one round-trip to Redis
regardless of how many users have dirty counters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace plain service.logger calls with ctxLogger (tracer-aware context
logger) in flush, hydrate, and emitExceededEvent methods for trace
correlation. Close/flushLoop keep plain logger since no request context.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant