
feat: Add feature quality monitoring with statistical metrics, REST API, and CLI #6202

Draft
jyejare wants to merge 1 commit into feast-dev:master from jyejare:monitoring_plus

Conversation

Collaborator

@jyejare jyejare commented Mar 31, 2026


What this PR does / why we need it:

This PR introduces feature quality monitoring capabilities to Feast, enabling proactive tracking of feature distributions and data quality metrics. Currently, Feast has no built-in tools for monitoring feature health in production — ML teams must build custom solutions to detect issues like distribution shifts, elevated null rates, or degraded data quality before they silently impact model performance.

What it adds:

  • Monitoring storage layer (MonitoringStore) — Three dedicated Postgres tables (feast_monitoring_feature_metrics, feast_monitoring_feature_view_metrics, feast_monitoring_feature_service_metrics) with UPSERT operations, baseline management, and filtered reads.

  • PyArrow-based metrics computation (MetricsCalculator) — Backend-agnostic statistical computation supporting:

    • Numeric features: mean, stddev, min/max, percentiles (p50/p75/p90/p95/p99), null rate, histograms
    • Categorical features: top-N value counts with other/unique counts
    • Automatic feature type classification from Feast's PrimitiveFeastType and ValueType
  • Orchestration service (MonitoringService) — Ties registry, offline store, calculator, and storage together. Supports both batch source (via OfflineStore.pull_all_from_table_or_query()) and feature log source (via FeatureService.logging_config destination). Computes and aggregates metrics at feature, feature view, and feature service levels.

  • REST API (/monitoring/) — Six endpoints registered in the registry REST server:

    • POST /monitoring/compute — Trigger on-demand metrics computation
    • GET /monitoring/metrics/features — Feature-level metrics with filtering
    • GET /monitoring/metrics/feature_views — Feature view aggregates
    • GET /monitoring/metrics/feature_services — Feature service aggregates
    • GET /monitoring/metrics/baseline — Baseline distribution retrieval
    • GET /monitoring/metrics/timeseries — Time-series data for trend analysis
    • All endpoints support cascading filters: project, feature_service_name, feature_view_name, feature_name, data_source_type, date range
    • RBAC enforced using existing AuthzedAction.DESCRIBE (read) and AuthzedAction.UPDATE (compute)
  • CLI command (feast monitor run) — CLI entry point for cron/orchestrator integration with options for project, feature view, date range, data source type, and baseline flag.
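To make the per-feature numeric metric set above concrete, here is a pure-Python sketch of the same definitions (mean, stddev, min/max, percentiles, null rate). The PR itself computes these with PyArrow compute kernels; the function and field names below are hypothetical illustrations, not taken from the PR's code.

```python
# Illustrative sketch of the numeric metrics listed above.
# The actual MetricsCalculator uses PyArrow; names here are hypothetical.
import statistics


def percentile(sorted_vals, p):
    """Nearest-rank percentile over an already-sorted list."""
    if not sorted_vals:
        return None
    k = max(0, min(len(sorted_vals) - 1, round(p / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]


def numeric_feature_metrics(values):
    """Compute the metric set for one numeric feature column.

    `values` may contain None entries; they count toward the null rate.
    """
    non_null = sorted(v for v in values if v is not None)
    null_rate = (1 - len(non_null) / len(values)) if values else 0.0
    if not non_null:
        return {"null_rate": null_rate}  # all-null edge case
    return {
        "mean": statistics.fmean(non_null),
        "stddev": statistics.pstdev(non_null),  # 0.0 for a single value
        "min": non_null[0],
        "max": non_null[-1],
        **{f"p{p}": percentile(non_null, p) for p in (50, 75, 90, 95, 99)},
        "null_rate": null_rate,
    }


m = numeric_feature_metrics([1.0, 2.0, 3.0, 4.0, None])
# m["mean"] == 2.5, m["null_rate"] == 0.2
```

Edge cases such as empty and all-null columns (exercised in the test suite below) fall out of the early return.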

Design decisions:

  • Python/PyArrow computation over SQL-based — Supports hybrid offline stores, is backend-agnostic, and extensible for future ML-based metrics (KL divergence, KS test)
  • Separate /monitoring/ route rather than extending existing /metrics/ — The existing metrics route serves registry inventory metadata (resource counts, popular tags); monitoring serves statistical feature quality data with a different data path (offline store vs registry)
  • No DQMJob abstraction — Synchronous in-process computation is sufficient for v1; the architecture supports adding async job dispatch later if scale demands it
  • User-provided baseline rather than automatic on feast apply — Gives users explicit control over what constitutes the reference distribution
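On the "extensible for future ML-based metrics" point: a drift metric such as the two-sample KS statistic can be computed from the same raw column values the calculator already materializes. A pure-Python sketch follows; `ks_statistic` is a hypothetical name and not part of this PR.

```python
# Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
# two empirical CDFs. Illustrative only; not part of the PR's code.
import bisect


def ks_statistic(baseline, current):
    a, b = sorted(baseline), sorted(current)

    def ecdf(xs, t):
        # fraction of samples in xs that are <= t
        return bisect.bisect_right(xs, t) / len(xs)

    # the maximum CDF gap occurs at one of the observed values
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in points)

# identical samples -> 0.0; fully disjoint samples -> 1.0
```

A baseline retrieved via GET /monitoring/metrics/baseline would supply the reference sample for such a comparison.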

Which issue(s) this PR fixes:

Partially Fixes #5919

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests
  • Integration tests

Test coverage (42 tests, all passing):

Test Suite                     | Count | Covers
test_metrics_calculator.py     | 19    | Numeric/categorical computation, edge cases (empty, all-null, single value, high cardinality), type classification
test_monitoring_store.py       | 7     | Table creation, UPSERT, query filters, baseline management, histogram serialization
test_monitoring_integration.py | 16    | End-to-end batch/log computation, baseline flow, view/service aggregation, REST API endpoints (compute, features, baseline, timeseries, validation), CLI (feast monitor run), RBAC enforcement

Snyk SAST scan: 0 vulnerabilities across all new files.



@jyejare jyejare requested a review from a team as a code owner March 31, 2026 10:53
@jyejare jyejare marked this pull request as draft March 31, 2026 10:54
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 3 potential issues.

View 5 additional findings in Devin Review.


Comment on lines +59 to +65
try:
    fv = server.store.registry.get_feature_view(
        name=feature_view_name, project=project
    )
    assert_permissions(fv, actions=[action])
except Exception:
    pass
Contributor


🔴 Permission check silently swallows all exceptions, completely bypassing RBAC

_assert_fv_permission wraps the entire permission check in except Exception: pass, which catches and ignores FeastPermissionError raised by assert_permissions when the user is unauthorized. As confirmed in feast/permissions/enforcer.py, FeastPermissionError inherits from Exception (via feast/errors.py:568). This means every call to _assert_fv_permission is a no-op — unauthorized users can compute metrics (UPDATE action) and read monitoring data (DESCRIBE action) for any feature view without restriction.
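The failure mode can be reproduced in isolation: any subclass of Exception, permission errors included, is swallowed by a bare `except Exception: pass`, so the caller proceeds as if authorized. The class and function names below are stand-ins, not the real Feast implementations.

```python
# Minimal illustration of the RBAC bypass described above.
# FeastPermissionError and assert_permissions are stand-ins here.
class FeastPermissionError(Exception):  # mirrors the Exception inheritance
    pass


def assert_permissions(obj, authorized):
    if not authorized:
        raise FeastPermissionError(f"unauthorized access to {obj}")


def check_with_broad_except(authorized):
    try:
        assert_permissions("feature_view", authorized)
    except Exception:  # also catches FeastPermissionError -> check is a no-op
        pass
    return "proceeded"

# check_with_broad_except(False) still returns "proceeded"
```

Narrowing the handler to the not-found error, as suggested below, lets the permission error propagate while still tolerating a missing feature view.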

Suggested change

    # before
    try:
        fv = server.store.registry.get_feature_view(
            name=feature_view_name, project=project
        )
        assert_permissions(fv, actions=[action])
    except Exception:
        pass

    # after
    try:
        fv = server.store.registry.get_feature_view(
            name=feature_view_name, project=project
        )
        assert_permissions(fv, actions=[action])
    except FeastObjectNotFoundException:
        pass


Comment on lines +100 to +108
if feature_service_name:
    return self._get_metrics_by_service(
        project,
        feature_service_name,
        lambda fv_name: self.monitoring_store.get_feature_metrics(
            project=project, feature_view_name=fv_name, **kwargs
        ),
    )
return self.monitoring_store.get_feature_metrics(project=project, **kwargs)
Contributor


🔴 TypeError from duplicate feature_view_name kwarg when feature_service_name is provided

In MonitoringService.get_feature_metrics and get_feature_view_metrics, when feature_service_name is truthy, the lambda passes feature_view_name=fv_name explicitly AND also spreads **kwargs which already contains feature_view_name from the caller. For example, the REST endpoint at monitoring.py:124-132 calls svc.get_feature_metrics(project=..., feature_service_name=..., feature_view_name=..., ...). In the service method (monitoring_service.py:94-108), feature_view_name ends up in **kwargs, and then the lambda at line 104-106 passes both feature_view_name=fv_name and **kwargs (which contains feature_view_name), causing TypeError: got multiple values for keyword argument 'feature_view_name'. This crashes any request that provides feature_service_name.
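The crash can be reproduced standalone: Python rejects a keyword supplied both explicitly and via `**kwargs` before the function body even runs. The names below are stand-ins for the real store call, and the final lines show the de-duplication fix proposed below.

```python
# Standalone repro of the duplicate-kwarg TypeError described above.
# get_feature_metrics here is a stand-in for the monitoring store method.
def get_feature_metrics(project, feature_view_name=None, **filters):
    return (project, feature_view_name, filters)


caller_kwargs = {"feature_view_name": "driver_stats"}  # forwarded by the endpoint

try:
    # mirrors: lambda fv_name: store.get_feature_metrics(
    #             project=..., feature_view_name=fv_name, **kwargs)
    get_feature_metrics("demo", feature_view_name="resolved_fv", **caller_kwargs)
    err = None
except TypeError as exc:
    err = str(exc)
# err: "... got multiple values for keyword argument 'feature_view_name'"

# the fix: strip the duplicate key before forwarding
filtered = {k: v for k, v in caller_kwargs.items() if k != "feature_view_name"}
result = get_feature_metrics("demo", feature_view_name="resolved_fv", **filtered)
```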

Suggested change

    # before
    if feature_service_name:
        return self._get_metrics_by_service(
            project,
            feature_service_name,
            lambda fv_name: self.monitoring_store.get_feature_metrics(
                project=project, feature_view_name=fv_name, **kwargs
            ),
        )
    return self.monitoring_store.get_feature_metrics(project=project, **kwargs)

    # after
    if feature_service_name:
        filtered_kwargs = {k: v for k, v in kwargs.items() if k != "feature_view_name"}
        return self._get_metrics_by_service(
            project,
            feature_service_name,
            lambda fv_name: self.monitoring_store.get_feature_metrics(
                project=project, feature_view_name=fv_name, **filtered_kwargs
            ),
        )
    return self.monitoring_store.get_feature_metrics(project=project, **kwargs)


Comment on lines +116 to +124
if feature_service_name:
    return self._get_metrics_by_service(
        project,
        feature_service_name,
        lambda fv_name: self.monitoring_store.get_feature_view_metrics(
            project=project, feature_view_name=fv_name, **kwargs
        ),
    )
return self.monitoring_store.get_feature_view_metrics(project=project, **kwargs)
Contributor


🔴 Same duplicate feature_view_name kwarg bug in get_feature_view_metrics

Identical to the issue in get_feature_metrics: when feature_service_name is truthy, the lambda at line 120-122 passes feature_view_name=fv_name explicitly while **kwargs also contains feature_view_name from the REST endpoint call at monitoring.py:149-156. This causes TypeError: got multiple values for keyword argument 'feature_view_name' for any request to /monitoring/metrics/feature_views that includes feature_service_name.

Suggested change

    # before
    if feature_service_name:
        return self._get_metrics_by_service(
            project,
            feature_service_name,
            lambda fv_name: self.monitoring_store.get_feature_view_metrics(
                project=project, feature_view_name=fv_name, **kwargs
            ),
        )
    return self.monitoring_store.get_feature_view_metrics(project=project, **kwargs)

    # after
    if feature_service_name:
        filtered_kwargs = {k: v for k, v in kwargs.items() if k != "feature_view_name"}
        return self._get_metrics_by_service(
            project,
            feature_service_name,
            lambda fv_name: self.monitoring_store.get_feature_view_metrics(
                project=project, feature_view_name=fv_name, **filtered_kwargs
            ),
        )
    return self.monitoring_store.get_feature_view_metrics(project=project, **kwargs)




Development

Successfully merging this pull request may close these issues.

Revamp Data Quality Monitoring
