"Today I'll design a metrics monitoring and visualization system similar to Datadog or Grafana. This system collects time-series metrics from servers, stores them efficiently, and provides real-time dashboards and alerting. As a fullstack engineer, I'll focus on how the frontend and backend work together: shared type definitions, API contracts, real-time data flow, and end-to-end feature implementation."
"Let me confirm the core end-to-end functionality:
- Metrics Ingestion: Agents push metrics to API, stored in time-series database
- Dashboard Viewing: Frontend queries backend, renders charts with auto-refresh
- Dashboard Editing: Drag-and-drop UI, changes persist to backend
- Alert Configuration: Create rules in UI, backend evaluates and sends notifications
- Time Range Selection: Frontend controls time range, backend queries appropriate tables"
"For a fullstack monitoring system:
- End-to-End Latency: User action to UI update < 200ms
- API Contract Stability: Breaking changes require versioning
- Type Safety: Shared types between frontend and backend
- Real-Time Feel: 10-second refresh without flicker"
Types shared by frontend and backend include:
Metrics Types: MetricPoint (name, value, tags, timestamp), MetricDataPoint (time, value), QueryParams (query, start, end, aggregation, step, tags), QueryResult (data array, meta with table/resolution/cached)
Dashboard Types: Dashboard (id, name, description, ownerId, panels, layout, timestamps), Panel (id, dashboardId, title, type, query, options, position), PanelType ('line' | 'area' | 'bar' | 'gauge' | 'stat'), Position (x, y, w, h), PanelLayout (i, x, y, w, h)
Alert Types: AlertRule (id, name, query, condition, threshold, duration, severity, enabled, notification), AlertCondition ('gt' | 'gte' | 'lt' | 'lte' | 'eq' | 'ne'), AlertSeverity ('info' | 'warning' | 'critical'), AlertEvent (id, ruleId, status, value, triggeredAt, resolvedAt)
API Response Types: ApiResponse with data and optional meta (total, page, pageSize), ApiError with error, code, and details
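A few of these shared types could be written out as follows. This is a minimal sketch covering only part of the list above. The field units (epoch milliseconds, ISO 8601 strings) and the `isApiError` helper are assumptions added for illustration and are not part of the original design.

```typescript
// Hypothetical shared module imported by both frontend and backend.

interface MetricPoint {
  name: string;
  value: number;
  tags?: Record<string, string>;
  timestamp?: number; // assumed to be epoch milliseconds
}

interface MetricDataPoint {
  time: string; // assumed to be an ISO 8601 string
  value: number;
}

interface QueryResult {
  data: MetricDataPoint[];
  meta: { table: string; resolution: string; cached: boolean };
}

type PanelType = 'line' | 'area' | 'bar' | 'gauge' | 'stat';
type AlertCondition = 'gt' | 'gte' | 'lt' | 'lte' | 'eq' | 'ne';
type AlertSeverity = 'info' | 'warning' | 'critical';

interface ApiResponse<T> {
  data: T;
  meta?: { total: number; page: number; pageSize: number };
}

interface ApiError {
  error: string;
  code: string;
  details?: unknown;
}

// Runtime check (hypothetical helper) so callers can tell an error body
// apart from a success body before trusting its shape.
function isApiError(body: unknown): body is ApiError {
  if (typeof body !== 'object' || body === null) return false;
  const candidate = body as Record<string, unknown>;
  return typeof candidate.error === 'string' && typeof candidate.code === 'string';
}
```

Because both sides import the same definitions, a renamed field surfaces as a compile error on whichever side was not updated.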
Validation schemas used by both frontend and backend:
- MetricPointSchema: Validates name (regex pattern), value (finite number), tags (optional record), timestamp (optional positive integer)
- QueryParamsSchema: Validates query, start/end (datetime with refinement that start < end), aggregation, step, tags
- CreatePanelSchema: Validates title, type, query, options (unit, color, showLegend, thresholds, calculation, min/max), position (x, y, w, h with constraints)
- CreateAlertRuleSchema: Validates name, query, condition, threshold, duration (regex for interval), severity, notification
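To make the rules concrete, here is a dependency-free sketch of the checks that MetricPointSchema and the QueryParamsSchema refinement describe. In the design these would be Zod schemas. The plain functions below are a stand-in so the example runs without any library, and the metric name pattern is an assumed example, since the actual regex is not given.

```typescript
// Assumed example pattern; the source only says "regex pattern".
const METRIC_NAME_PATTERN = /^[a-zA-Z_][a-zA-Z0-9_.]*$/;

interface FieldIssue {
  field: string;
  message: string;
}

// Stand-in for MetricPointSchema: returns one issue per failing field.
function validateMetricPoint(input: {
  name: string;
  value: number;
  tags?: Record<string, string>;
  timestamp?: number;
}): FieldIssue[] {
  const issues: FieldIssue[] = [];
  if (!METRIC_NAME_PATTERN.test(input.name)) {
    issues.push({ field: 'name', message: 'must match the metric name pattern' });
  }
  if (!Number.isFinite(input.value)) {
    issues.push({ field: 'value', message: 'must be a finite number' });
  }
  if (
    input.timestamp !== undefined &&
    (!Number.isInteger(input.timestamp) || input.timestamp <= 0)
  ) {
    issues.push({ field: 'timestamp', message: 'must be a positive integer' });
  }
  return issues;
}

// Stand-in for the QueryParamsSchema refinement: start must be before end.
function validateTimeRange(start: string, end: string): FieldIssue[] {
  const startMs = Date.parse(start);
  const endMs = Date.parse(end);
  if (Number.isNaN(startMs)) return [{ field: 'start', message: 'must be a datetime' }];
  if (Number.isNaN(endMs)) return [{ field: 'end', message: 'must be a datetime' }];
  return startMs < endMs ? [] : [{ field: 'start', message: 'must be before end' }];
}
```

The per-field issue list is the same shape the backend can return as `details` on a validation error, so the frontend can attach messages to individual form inputs.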
┌─────────────────────────────────────────────────────────────────────────────┐
│                             Dashboard View Flow                             │
└─────────────────────────────────────────────────────────────────────────────┘

1. User navigates to /dashboard/:id

┌──────────────────────┐
│   Frontend Router    │
│  (TanStack Router)   │
└──────────┬───────────┘
           │ Route match → dashboardStore.fetchDashboard(id)
           ▼
┌──────────────────────┐     GET /api/v1/dashboards/:id
│      API Client      │─────────────────────────────────────────┐
│   (fetch wrapper)    │                                         │
└──────────┬───────────┘                                         ▼
           │                                          ┌──────────────────────┐
           │                                          │      API Server      │
           │                                          │      (Express)       │
           │                                          └──────────┬───────────┘
           │                                                     │
           │                                                     ▼
           │                                          ┌──────────────────────┐
           │                                          │      PostgreSQL      │
           │                                          │  SELECT dashboard,   │
           │                                          │     panels JOIN      │
           │                                          └──────────┬───────────┘
           │                                                     │
┌──────────▼───────────┐     { dashboard, panels }               │
│    Zustand Store     │◄────────────────────────────────────────┘
│   (dashboardStore)   │
└──────────┬───────────┘
           │ State update triggers re-render
           ▼
┌──────────────────────┐
│    DashboardGrid     │     For each panel:
│      Component       │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐     POST /api/v1/query
│    DashboardPanel    │─────────────────────────────────────────┐
│    useQuery hook     │                                         │
│    (with polling)    │                                         ▼
└──────────┬───────────┘                              ┌──────────────────────┐
           │                                          │    Query Service     │
           │                                          │  - Cache check       │
           │                                          │  - Table selection   │
           │                                          │  - Query execution   │
           │                                          └──────────┬───────────┘
           │                                                     │
           │                                                     ▼
           │                                          ┌──────────────────────┐
           │                                          │     TimescaleDB      │
           │                                          │  - metrics_raw       │
           │                                          │  - metrics_1min      │
           │                                          │  - metrics_1hour     │
           │                                          └──────────┬───────────┘
           │                                                     │
┌──────────▼───────────┐     { data: [...], meta: {...} }        │
│   Chart Component    │◄────────────────────────────────────────┘
│      (Recharts)      │
└──────────────────────┘

2. Auto-refresh every 10 seconds (polling in useQuery hook)
┌─────────────────────────────────────────────────────────────────────────────┐
│                               Panel Edit Flow                               │
└─────────────────────────────────────────────────────────────────────────────┘

1. User drags panel to new position

┌──────────────────────┐
│  react-grid-layout   │
│    onLayoutChange    │
└──────────┬───────────┘
           │ Debounced callback (500ms)
           ▼
┌──────────────────────┐
│    dashboardStore    │
│    updateLayout()    │
│  - Immediate local   │
└──────────┬───────────┘
           │
           ├──► Optimistic UI update (instant feedback)
           │
           ▼
┌──────────────────────┐     PUT /api/v1/dashboards/:id
│      API Client      │─────────────────────────────────────────┐
│  (async, fire once)  │                                         │
└──────────────────────┘                                         ▼
                                                      ┌──────────────────────┐
                                                      │      API Server      │
                                                      │  - Validate layout   │
                                                      │  - Check ownership   │
                                                      └──────────┬───────────┘
                                                                 │
                                                                 ▼
                                                      ┌──────────────────────┐
                                                      │      PostgreSQL      │
                                                      │  UPDATE panels       │
                                                      │  SET position = ...  │
                                                      └──────────┬───────────┘
                                                                 │
                                                                 │ 200 OK
                                                                 ▼
                                                      ┌──────────────────────┐
                                                      │  Cache Invalidation  │
                                                      │  DEL cache:dash:id   │
                                                      └──────────────────────┘

Success: No visible change (already updated optimistically)
Failure: Show error toast, optionally revert to server state
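The panel edit flow above can be sketched as a small store that updates local state immediately and debounces the save. This is an illustration only: `createLayoutStore`, `SaveLayout`, and the in-closure state are hypothetical names standing in for the Zustand store and API client, and the revert-on-failure policy shown is one possible reading of "optionally revert".

```typescript
interface PanelLayout {
  i: string;
  x: number;
  y: number;
  w: number;
  h: number;
}

// Hypothetical persistence callback, e.g. a PUT /api/v1/dashboards/:id call.
type SaveLayout = (layout: PanelLayout[]) => Promise<void>;

function createLayoutStore(initial: PanelLayout[], save: SaveLayout, delayMs = 500) {
  let layout = initial; // what the UI renders
  let confirmed = initial; // last layout the server accepted
  let timer: ReturnType<typeof setTimeout> | undefined;
  let lastError: string | undefined;

  async function flush(): Promise<void> {
    const pending = layout;
    try {
      await save(pending);
      confirmed = pending;
      lastError = undefined;
    } catch (err) {
      // Revert only if the user has not moved panels again in the meantime.
      if (layout === pending) layout = confirmed;
      lastError = err instanceof Error ? err.message : 'Failed to save layout';
    }
  }

  return {
    getLayout: () => layout,
    getError: () => lastError,
    updateLayout(next: PanelLayout[]): void {
      layout = next; // optimistic: visible immediately
      if (timer !== undefined) clearTimeout(timer);
      timer = setTimeout(() => {
        void flush();
      }, delayMs); // a burst of drags collapses into one save
    },
    flush,
  };
}
```

A UI layer would read `getError()` after a failed flush to show the error toast.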
Dashboard Routes (Express):
- GET /dashboards: List dashboards for the authenticated user
- GET /dashboards/:id: Get a single dashboard with panels (checks ownership or public access)
- POST /dashboards: Create a dashboard with name/description
- PUT /dashboards/:id: Update a dashboard (requires owner or admin), validates layout schema, invalidates cache
- POST /dashboards/:id/panels: Add a panel to a dashboard using CreatePanelSchema validation
All routes use requireAuth middleware and return ApiResponse<T> format with proper error responses.
API client class wrapping fetch with:
- Consistent request method handling (method, path, body)
- Session cookie credentials
- Error parsing to ApiError type
- Methods: getDashboards, getDashboard, createDashboard, updateDashboard, deleteDashboard
- Panel methods: addPanel, updatePanel, deletePanel
- Query methods: executeQuery
- Alert methods: getAlertRules, createAlertRule, updateAlertRule, deleteAlertRule, getAlertHistory, evaluateAlertRule
- Metric methods: ingestMetrics, listMetrics, getMetricTags
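A trimmed-down sketch of such a client is shown below. Only three of the listed methods are included. `ApiRequestError`, `FetchLike`, the `'UNKNOWN'` fallback code, and the injected `fetchImpl` parameter are assumptions made for illustration. The minimal `HttpInit`/`HttpResponse` shapes cover just the parts of the fetch API the client touches.

```typescript
interface ApiError {
  error: string;
  code: string;
  details?: unknown;
}

interface HttpInit {
  method: string;
  credentials: 'include';
  headers: Record<string, string>;
  body?: string;
}

interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

type FetchLike = (url: string, init: HttpInit) => Promise<HttpResponse>;

// Hypothetical error class thrown for any non-2xx response.
class ApiRequestError extends Error {
  constructor(readonly status: number, readonly body: ApiError) {
    super(body.error);
    this.name = 'ApiRequestError';
  }
}

// Normalizes whatever the server sent into the ApiError shape.
function toApiError(payload: unknown, status: number): ApiError {
  if (typeof payload === 'object' && payload !== null) {
    const p = payload as Record<string, unknown>;
    if (typeof p.error === 'string' && typeof p.code === 'string') {
      return { error: p.error, code: p.code, details: p.details };
    }
  }
  return { error: `HTTP ${status}`, code: 'UNKNOWN' };
}

class ApiClient {
  constructor(private readonly baseUrl: string, private readonly fetchImpl: FetchLike) {}

  private async request<T>(method: string, path: string, body?: unknown): Promise<T> {
    const init: HttpInit = {
      method,
      credentials: 'include', // send the session cookie
      headers: { 'Content-Type': 'application/json' },
    };
    if (body !== undefined) init.body = JSON.stringify(body);
    const res = await this.fetchImpl(this.baseUrl + path, init);
    const payload = await res.json().catch(() => null); // tolerate empty bodies
    if (!res.ok) throw new ApiRequestError(res.status, toApiError(payload, res.status));
    return (payload as { data: T } | null)?.data as T;
  }

  getDashboard(id: string): Promise<unknown> {
    return this.request<unknown>('GET', `/dashboards/${encodeURIComponent(id)}`);
  }

  updateDashboard(id: string, changes: unknown): Promise<unknown> {
    return this.request<unknown>('PUT', `/dashboards/${encodeURIComponent(id)}`, changes);
  }

  executeQuery(params: unknown): Promise<unknown> {
    return this.request<unknown>('POST', '/query', params);
  }
}
```

In the browser this could be constructed as `new ApiClient('/api/v1', (url, init) => fetch(url, init))`. Wrapping `fetch` in an arrow function avoids calling it with the client instance as `this`. Injecting the function also makes the client easy to test with a fake.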
The QueryService handles metric queries with automatic table selection and caching:
execute(params) workflow:
- Parse start/end dates
- Generate cache key (hash of normalized params)
- Check Redis cache, return if hit
- Select appropriate table based on time range
- Execute query with circuit breaker protection
- Cache result (shorter TTL for live data: 10s vs 300s for historical)
selectTable(start, end) logic:
- Range <= 1 hour → metrics_raw (1 second resolution)
- Range <= 24 hours → metrics_1min (1 minute resolution)
- Range > 24 hours → metrics_1hour (1 hour resolution)
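The table selection rule translates almost directly into code. A minimal sketch, where the return shape and the resolution labels (`'1s'`, `'1m'`, `'1h'`) are assumptions:

```typescript
type MetricsTable = 'metrics_raw' | 'metrics_1min' | 'metrics_1hour';

const HOUR_MS = 60 * 60 * 1000;

// Picks the coarsest table that still gives a useful number of points
// for the requested range.
function selectTable(start: Date, end: Date): { table: MetricsTable; resolution: string } {
  const rangeMs = end.getTime() - start.getTime();
  if (rangeMs <= HOUR_MS) return { table: 'metrics_raw', resolution: '1s' };
  if (rangeMs <= 24 * HOUR_MS) return { table: 'metrics_1min', resolution: '1m' };
  return { table: 'metrics_1hour', resolution: '1h' };
}
```

For example, a 6-hour window reads from `metrics_1min`, which is at most 360 points per series instead of 21,600 raw samples.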
executeQuery builds SQL with:
- time_bucket for aggregation
- JOIN with metric_definitions
- Optional tag filtering with JSONB @> operator
- GROUP BY and ORDER BY time
generateCacheKey normalizes query params (lowercase, round timestamps to 10s) and hashes with SHA-256.
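The normalization step is what makes the cache effective, so here is a sketch of it in isolation. It produces the canonical string that would then be hashed with SHA-256 (for instance with Node's `crypto.createHash('sha256')`). The hashing call is left out to keep the example dependency-free. Rounding down rather than to the nearest 10 seconds, and sorting tag keys, are assumptions.

```typescript
interface CacheableQuery {
  query: string;
  start: Date;
  end: Date;
  aggregation?: string;
  step?: string;
  tags?: Record<string, string>;
}

const BUCKET_MS = 10 * 1000;

// Floors a timestamp to its 10-second bucket.
function roundDown(date: Date): number {
  return Math.floor(date.getTime() / BUCKET_MS) * BUCKET_MS;
}

// Builds a canonical string: equivalent requests map to the same string
// regardless of casing, tag order, or sub-bucket timestamp differences.
function normalizeForCacheKey(params: CacheableQuery): string {
  const tags = params.tags ?? {};
  const sortedTags = Object.keys(tags)
    .sort()
    .map((key) => `${key}=${tags[key]}`);
  return JSON.stringify({
    query: params.query.trim().toLowerCase(),
    start: roundDown(params.start),
    end: roundDown(params.end),
    aggregation: params.aggregation?.toLowerCase() ?? null,
    step: params.step ?? null,
    tags: sortedTags,
  });
}
```

With this rounding, every panel polling the same query within one 10-second bucket shares a single cache entry, which lines up with the 10-second TTL for live data.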
evaluateAll(): Runs every 10 seconds, queries all enabled alert rules, evaluates each.
evaluateRule(rule) workflow:
- Query recent data based on rule.duration
- Get latest value from result
- Check if condition is met (gt, gte, lt, lte, eq, ne)
- Track state in Redis (firstTriggered, currentValue, firing)
- If condition met for duration → fire alert
- If condition not met and was firing → resolve alert
fireAlert(rule, value):
- Mark as firing in Redis
- Insert alert_event with status='firing'
- Send notification via notificationService
resolveAlert(rule):
- Update alert_events to status='resolved', set resolved_at
- Clear Redis state
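The evaluation logic is essentially a small state machine, sketched below. An in-memory `Map` stands in for the Redis state, `durationMs` stands in for the INTERVAL column, and the returned transition names are hypothetical. The caller would map `'fire'` to fireAlert and `'resolve'` to resolveAlert.

```typescript
type AlertCondition = 'gt' | 'gte' | 'lt' | 'lte' | 'eq' | 'ne';

function conditionMet(value: number, condition: AlertCondition, threshold: number): boolean {
  switch (condition) {
    case 'gt': return value > threshold;
    case 'gte': return value >= threshold;
    case 'lt': return value < threshold;
    case 'lte': return value <= threshold;
    case 'eq': return value === threshold;
    case 'ne': return value !== threshold;
  }
}

interface RuleState {
  firstTriggered: number; // epoch ms when the condition first became true
  currentValue: number;
  firing: boolean;
}

interface EvalRule {
  id: string;
  condition: AlertCondition;
  threshold: number;
  durationMs: number;
}

type Transition = 'ok' | 'pending' | 'fire' | 'firing' | 'resolve';

// Evaluates one rule against its latest value and updates the tracked state.
function evaluateRule(
  rule: EvalRule,
  value: number,
  now: number,
  states: Map<string, RuleState>
): Transition {
  const previous = states.get(rule.id);
  if (!conditionMet(value, rule.condition, rule.threshold)) {
    states.delete(rule.id); // a single healthy sample resets the timer
    return previous?.firing ? 'resolve' : 'ok';
  }
  const next: RuleState = previous
    ? { ...previous, currentValue: value }
    : { firstTriggered: now, currentValue: value, firing: false };
  const heldLongEnough = now - next.firstTriggered >= rule.durationMs;
  const shouldFire = !next.firing && heldLongEnough;
  if (shouldFire) next.firing = true;
  states.set(rule.id, next);
  if (shouldFire) return 'fire';
  return next.firing ? 'firing' : 'pending';
}
```

Requiring the condition to hold for the full duration is what keeps a brief spike from paging anyone: the state is cleared as soon as one evaluation falls back under the threshold.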
useAlerts hook provides:
- State: rules, events, loading, error
- Actions: createRule, updateRule, deleteRule, evaluateRule, refetch
- Auto-polling every 30 seconds for fresh data
- Optimistic updates with error rollback
TimescaleDB Schema:
- users: id, email, password_hash, role, created_at
- metric_definitions: id, name (unique), description, unit, type, created_at (indexed by name)
- metrics_raw: hypertable with time, metric_id, value, tags (JSONB). Indexed on (metric_id, time DESC) and tags with GIN.
Continuous Aggregates:
- metrics_1min: bucket, metric_id, tags, avg_value, min_value, max_value, sample_count. Policy: 1 hour offset, 1 minute schedule.
- metrics_1hour: Same structure, built from metrics_1min. Policy: 1 day offset, 1 hour schedule.
Retention Policies: metrics_raw (7 days), metrics_1min (30 days), metrics_1hour (365 days)
Dashboard/Panel Tables:
- dashboards: id (UUID), name, description, owner_id, is_public, layout (JSONB), timestamps
- panels: id (UUID), dashboard_id (FK with CASCADE), title, type, query, options (JSONB), position (JSONB), timestamps
Alert Tables:
- alert_rules: id (UUID), name, query, condition, threshold, duration (INTERVAL), severity, enabled, notification (JSONB), timestamps
- alert_events: id (UUID), rule_id (FK), status, value, triggered_at, resolved_at. Indexed on (rule_id, triggered_at DESC).
Dashboard Panel Pattern:
- useQuery hook with refetchInterval matching refreshInterval
- staleTime set to 90% of refresh interval to prevent flicker
- Automatic polling without loading state on refetch
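Assuming the `useQuery` hook here is TanStack Query's, the polling settings could be derived from the panel's refresh interval as in this sketch. `refetchInterval` and `staleTime` are real TanStack Query options. The helper names `panelPollingOptions` and `showLoadingState` are hypothetical.

```typescript
interface PanelPollingOptions {
  refetchInterval: number; // how often the panel polls, in ms
  staleTime: number; // how long cached data counts as fresh, in ms
}

// Derives polling options from the dashboard refresh interval.
function panelPollingOptions(refreshIntervalMs: number): PanelPollingOptions {
  return {
    refetchInterval: refreshIntervalMs,
    // 90% of the interval: data stays fresh between polls, so remounts
    // and window refocus do not trigger extra fetches.
    staleTime: Math.round(refreshIntervalMs * 0.9),
  };
}

// Show a spinner only on the very first load; background refetches keep
// rendering the previous data so the chart does not flicker.
function showLoadingState(hasData: boolean, isFetching: boolean): boolean {
  return isFetching && !hasData;
}
```

The options object would be spread into the hook call alongside the query key and fetch function, for example `useQuery({ queryKey, queryFn, ...panelPollingOptions(10000) })`.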
Dashboard Layout Pattern:
- Local state update immediate via updateLayout()
- Debounced save to server (500ms)
- On failure: show toast, optionally revert
Alert Toggle Pattern:
- Optimistic local state change
- Async API call
- Revert and show error toast on failure
Backend: On dashboard/panel update, DEL cache:dashboard:{id} in Redis.
Frontend: Zustand store updates local state after mutation. Query cache is time-based, no explicit invalidation needed.
Express error middleware handles:
- ZodError: 400 with VALIDATION_ERROR code and field details
- NotFoundError: 404 with NOT_FOUND code
- UnauthorizedError: 401 with UNAUTHORIZED code
- ForbiddenError: 403 with FORBIDDEN code
- Unique constraint violations: 409 with CONFLICT code
- Generic errors: 500 with INTERNAL_ERROR code
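The mapping above can be kept in one pure function that the Express error middleware calls. In this sketch `ValidationError` is a hypothetical stand-in for ZodError so the example has no dependencies. The unique-constraint check assumes the database driver exposes the PostgreSQL SQLSTATE as `code`, as node-postgres does (`23505` is unique_violation).

```typescript
class NotFoundError extends Error {}
class UnauthorizedError extends Error {}
class ForbiddenError extends Error {}

// Hypothetical stand-in for ZodError, carrying field-level issues.
class ValidationError extends Error {
  constructor(readonly issues: { path: string; message: string }[]) {
    super('Validation failed');
  }
}

interface ErrorResponse {
  status: number;
  body: { error: string; code: string; details?: unknown };
}

function isUniqueViolation(err: unknown): boolean {
  return typeof err === 'object' && err !== null && (err as { code?: unknown }).code === '23505';
}

// Translates any thrown value into an HTTP status and an ApiError body.
function toErrorResponse(err: unknown): ErrorResponse {
  if (err instanceof ValidationError) {
    return { status: 400, body: { error: err.message, code: 'VALIDATION_ERROR', details: err.issues } };
  }
  if (err instanceof NotFoundError) {
    return { status: 404, body: { error: err.message, code: 'NOT_FOUND' } };
  }
  if (err instanceof UnauthorizedError) {
    return { status: 401, body: { error: err.message, code: 'UNAUTHORIZED' } };
  }
  if (err instanceof ForbiddenError) {
    return { status: 403, body: { error: err.message, code: 'FORBIDDEN' } };
  }
  if (isUniqueViolation(err)) {
    return { status: 409, body: { error: 'Resource already exists', code: 'CONFLICT' } };
  }
  // Never leak internal messages for unexpected errors.
  return { status: 500, body: { error: 'Internal server error', code: 'INTERNAL_ERROR' } };
}
```

Inside a four-argument Express error handler this reduces to `const { status, body } = toErrorResponse(err); res.status(status).json(body);`.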
ErrorBoundary Component: Catches React errors, displays error message with retry button.
API Error Handling in Hooks: Try/catch with error message extraction, optional onError callback, error state management.
| Decision | Chosen | Alternative | Reasoning |
|---|---|---|---|
| Type Sharing | Shared TypeScript types | OpenAPI codegen | Simpler for monorepo, direct imports |
| Validation | Zod (both ends) | Joi, Yup | Type inference, same library both ends |
| Real-time Updates | Polling | WebSocket | Simpler, caching-friendly, sufficient for 10s refresh |
| State Management | Zustand | Redux, Context | Lightweight, TypeScript support |
| Error Handling | Error boundaries + try/catch | Global error store | Idiomatic React pattern, localized recovery |
| Cache Strategy | Redis + short TTL | Stale-while-revalidate | Backend-controlled freshness |
"To summarize the fullstack architecture for this dashboarding system:
- Shared Types: TypeScript interfaces and Zod schemas used by both frontend and backend ensure type safety across the stack
- API Contract: RESTful endpoints with consistent response format, validation errors include field-level details
- Data Flow: Frontend polls backend every 10 seconds, backend routes queries to appropriate TimescaleDB tables based on time range
- State Management: Zustand stores on frontend mirror backend data, optimistic updates provide instant feedback
- Error Handling: Zod validation on both ends, error boundaries in React, consistent error response format
Key fullstack insights:
- Shared types prevent drift between frontend and backend
- Optimistic updates + debounced saves provide responsive UX
- Table routing (raw vs. aggregated) is transparent to frontend
- Cache invalidation is time-based for simplicity
What aspect would you like me to elaborate on?"