diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index 7445f81af9b..05ddc3f7be7 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -36,7 +36,7 @@
   * [Offline store](getting-started/components/offline-store.md)
   * [Online store](getting-started/components/online-store.md)
   * [Feature server](getting-started/components/feature-server.md)
-  * [Batch Materialization Engine](getting-started/components/batch-materialization-engine.md)
+  * [Compute Engine](getting-started/components/compute-engine.md)
   * [Provider](getting-started/components/provider.md)
   * [Authorization Manager](getting-started/components/authz_manager.md)
   * [OpenTelemetry Integration](getting-started/components/open-telemetry.md)
@@ -139,10 +139,10 @@
 * [Google Cloud Platform](reference/providers/google-cloud-platform.md)
 * [Amazon Web Services](reference/providers/amazon-web-services.md)
 * [Azure](reference/providers/azure.md)
-* [Batch Materialization Engines](reference/batch-materialization/README.md)
-  * [Snowflake](reference/batch-materialization/snowflake.md)
-  * [AWS Lambda (alpha)](reference/batch-materialization/lambda.md)
-  * [Spark (contrib)](reference/batch-materialization/spark.md)
+* [Compute Engines](reference/compute-engine/README.md)
+  * [Snowflake](reference/compute-engine/snowflake.md)
+  * [AWS Lambda (alpha)](reference/compute-engine/lambda.md)
+  * [Spark (contrib)](reference/compute-engine/spark.md)
 * [Feature repository](reference/feature-repository/README.md)
 * [feature\_store.yaml](reference/feature-repository/feature-store-yaml.md)
 * [.feastignore](reference/feature-repository/feast-ignore.md)
diff --git a/docs/getting-started/architecture/write-patterns.md b/docs/getting-started/architecture/write-patterns.md
index 53a092087f6..513102e4728 100644
--- a/docs/getting-started/architecture/write-patterns.md
+++ b/docs/getting-started/architecture/write-patterns.md
@@ -16,7 +16,7 @@ There are two ways a client (or Data Producer) can *_send_* data to the online s
   - Using a synchronous API call for a small number of entities or a single entity (e.g., using the [`push` or `write_to_online_store` methods](../../reference/data-sources/push.md#pushing-data)) or the Feature Server's [`push` endpoint](../../reference/feature-servers/python-feature-server.md#pushing-features-to-the-online-and-offline-stores))
 2. Asynchronously
   - Using an asynchronous API call for a small number of entities or a single entity (e.g., using the [`push` or `write_to_online_store` methods](../../reference/data-sources/push.md#pushing-data)) or the Feature Server's [`push` endpoint](../../reference/feature-servers/python-feature-server.md#pushing-features-to-the-online-and-offline-stores))
-  - Using a "batch job" for a large number of entities (e.g., using a [batch materialization engine](../components/batch-materialization-engine.md))
+  - Using a "batch job" for a large number of entities (e.g., using a [compute engine](../components/compute-engine.md))
 
 Note, in some contexts, developers may "batch" a group of entities together and write them to the online store in a single API call.
 This is a common pattern when writing data to the online store to reduce write loads but we would not qualify this as a batch job.
diff --git a/docs/getting-started/components/compute-engine.md b/docs/getting-started/components/compute-engine.md
index 9dd92fe514c..baf031b4284 100644
--- a/docs/getting-started/components/compute-engine.md
+++ b/docs/getting-started/components/compute-engine.md
@@ -1,4 +1,4 @@
-# Compute Engine (Batch Materialization Engine)
+# Compute Engine
 
 Note: The materialization is now constructed via unified compute engine interface.
 
@@ -20,8 +20,9 @@ engines.
 ```markdown
 | Compute Engine          | Description                                                                               | Supported | Link |
 |-------------------------|-------------------------------------------------------------------------------------------|-----------|------|
-| LocalComputeEngine      | Runs on Arrow + Pandas/Polars/Dask etc., designed for light weight transformation.         | ✅        |      |
+| LocalComputeEngine      | Runs on Arrow + Pandas/Polars/Dask etc., designed for lightweight transformations.         | ✅        |      |
 | SparkComputeEngine      | Runs on Apache Spark, designed for large-scale distributed feature generation.             | ✅        |      |
+| SnowflakeComputeEngine  | Runs on Snowflake, designed for scalable feature generation using Snowflake SQL.           | ✅        |      |
 | LambdaComputeEngine     | Runs on AWS Lambda, designed for serverless feature generation.                            | ✅        |      |
 | FlinkComputeEngine      | Runs on Apache Flink, designed for stream processing and real-time feature generation.     | ❌        |      |
 | RayComputeEngine        | Runs on Ray, designed for distributed feature generation and machine learning workloads.   | ❌        |      |
@@ -31,7 +32,7 @@
 Batch Engine Config can be configured in the `feature_store.yaml` file, and it serves as the default configuration for all materialization and historical retrieval tasks. The `batch_engine` config in BatchFeatureView. E.g
 ```yaml
 batch_engine:
-  type: SparkComputeEngine
+  type: spark.engine
   config:
     spark_master: "local[*]"
     spark_app_name: "Feast Batch Engine"
@@ -59,7 +60,7 @@ Then, when you materialize the feature view, it will use the batch_engine config
 Stream Engine Config can be configured in the `feature_store.yaml` file, and it serves as the default configuration for all stream materialization and historical retrieval tasks. The `stream_engine` config in FeatureView. E.g
 ```yaml
 stream_engine:
-  type: SparkComputeEngine
+  type: spark.engine
   config:
     spark_master: "local[*]"
     spark_app_name: "Feast Stream Engine"
@@ -108,4 +109,51 @@ defined in the DAG. It handles the execution of transformations, aggregations, j
 
 The Feature resolver is the core component of the compute engine that constructs the execution plan for feature
 generation. It takes the definitions from feature views and builds a directed acyclic graph (DAG) of operations that
-need to be performed to generate the features.
\ No newline at end of file
+need to be performed to generate the features.
+
+#### DAG
+The DAG represents the directed acyclic graph of operations that need to be performed to generate the features. It
+contains nodes for each operation, such as transformations, aggregations, joins, and filters. The DAG is built by the
+Feature Resolver and executed by the Feature Builder.
+
+DAG nodes are defined as follows:
+```
+  +---------------------+
+  | SourceReadNode      |  <- Read data from offline store (e.g. Snowflake, BigQuery, etc.
+                              or custom source)
+  +---------------------+
+            |
+            v
+  +--------------------------------------+
+  | TransformationNode / JoinNode (*)    |  <- Merge data sources, custom transformations by user, or default join
+  +--------------------------------------+
+            |
+            v
+  +---------------------+
+  | FilterNode          |  <- used for point-in-time filtering
+  +---------------------+
+            |
+            v
+  +---------------------+
+  | AggregationNode (*) |  <- only if aggregations are defined
+  +---------------------+
+            |
+            v
+  +---------------------+
+  | DeduplicationNode   |  <- used if no aggregation and for history
+  +---------------------+     retrieval
+            |
+            v
+  +---------------------+
+  | ValidationNode (*)  |  <- optional validation checks
+  +---------------------+
+            |
+            v
+       +----------+
+       |  Output  |
+       +----------+
+         /      \
+        v        v
++----------------+  +-----------------+
+|OnlineStoreWrite|  |OfflineStoreWrite|
++----------------+  +-----------------+
+```
\ No newline at end of file
diff --git a/docs/getting-started/components/overview.md b/docs/getting-started/components/overview.md
index 05c7503d842..2be7b1169bf 100644
--- a/docs/getting-started/components/overview.md
+++ b/docs/getting-started/components/overview.md
@@ -27,7 +27,7 @@ A complete Feast deployment contains the following components:
   * Retrieve online features.
 * **Feature Server:** The Feature Server is a REST API server that serves feature values for a given entity key and feature reference. The Feature Server is designed to be horizontally scalable and can be deployed in a distributed manner.
 * **Stream Processor:** The Stream Processor can be used to ingest feature data from streams and write it into the online or offline stores. Currently, there's an experimental Spark processor that's able to consume data from Kafka.
-* **Batch Materialization Engine:** The [Batch Materialization Engine](batch-materialization-engine.md) component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.
+* **Compute Engine:** The [Compute Engine](compute-engine.md) component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.
 * **Online Store:** The online store is a database that stores only the latest feature values for each entity. The online store is either populated through materialization jobs or through [stream ingestion](../../reference/data-sources/push.md).
 * **Offline Store:** The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. For feature retrieval and materialization, Feast does not manage the offline store directly, but runs queries against it. However, offline stores can be configured to support writes if Feast configures logging functionality of served features.
 * **Authorization Manager**: The authorization manager detects authentication tokens from client requests to Feast servers and uses this information to enforce permission policies on the requested services.
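The Compute Engine described above is what actually runs when materialization is triggered. As a rough illustration of how the configured `batch_engine` is exercised from the Python SDK, here is a minimal sketch (the repo path and feature view name are placeholders, and the engine used is whichever `batch_engine` is set in `feature_store.yaml`):

```python
from datetime import datetime, timedelta

from feast import FeatureStore

# Loads feature_store.yaml, including the batch_engine (compute engine) config.
store = FeatureStore(repo_path=".")

# Backfill a fixed window; the configured compute engine performs the work.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=7),
    end_date=datetime.utcnow(),
    feature_views=["driver_hourly_stats"],  # placeholder feature view name
)

# Or load only data that arrived since the last materialization run.
store.materialize_incremental(end_date=datetime.utcnow())
```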
diff --git a/docs/getting-started/genai.md b/docs/getting-started/genai.md
index e0d058420b0..9c8f0c955d2 100644
--- a/docs/getting-started/genai.md
+++ b/docs/getting-started/genai.md
@@ -162,4 +162,4 @@ For more detailed information and examples:
 * [MCP Feature Server Reference](../reference/feature-servers/mcp-feature-server.md)
 * [Spark Data Source](../reference/data-sources/spark.md)
 * [Spark Offline Store](../reference/offline-stores/spark.md)
-* [Spark Batch Materialization](../reference/batch-materialization/spark.md)
+* [Spark Compute Engine](../reference/compute-engine/spark.md)
diff --git a/docs/how-to-guides/running-feast-in-production.md b/docs/how-to-guides/running-feast-in-production.md
index 65ab2c82d9d..9d2eab50d64 100644
--- a/docs/how-to-guides/running-feast-in-production.md
+++ b/docs/how-to-guides/running-feast-in-production.md
@@ -57,8 +57,8 @@ To keep your online store up to date, you need to run a job that loads feature d
 Out of the box, Feast's materialization process uses an in-process materialization engine. This engine loads all the data being materialized into memory from the offline store, and writes it into the online store.
 This approach may not scale to large amounts of data, which users of Feast may be dealing with in production.
 
-In this case, we recommend using one of the more [scalable materialization engines](./scaling-feast.md#scaling-materialization), such as [Snowflake Materialization Engine](../reference/batch-materialization/snowflake.md).
-Users may also need to [write a custom materialization engine](../how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md) to work on their existing infrastructure.
+In this case, we recommend using one of the more [scalable compute engines](./scaling-feast.md#scaling-materialization), such as the [Snowflake Compute Engine](../reference/compute-engine/snowflake.md).
+Users may also need to [write a custom compute engine](../how-to-guides/customizing-feast/creating-a-custom-compute-engine.md) to work on their existing infrastructure.
 
 ### 2.2 Scheduled materialization with Airflow
diff --git a/docs/how-to-guides/scaling-feast.md b/docs/how-to-guides/scaling-feast.md
index 7e4f27b1dd3..d0bd6aef8a0 100644
--- a/docs/how-to-guides/scaling-feast.md
+++ b/docs/how-to-guides/scaling-feast.md
@@ -20,7 +20,7 @@ The recommended solution in this case is to use the [SQL based registry](../tuto
 The default Feast materialization process is an in-memory process, which pulls data from the offline store before writing it to the online store.
 However, this process does not scale for large data sets, since it's executed on a single-process.
 
-Feast supports pluggable [Materialization Engines](../getting-started/components/batch-materialization-engine.md), that allow the materialization process to be scaled up.
+Feast supports pluggable [Compute Engines](../getting-started/components/compute-engine.md) that allow the materialization process to be scaled up.
 Aside from the local process, Feast supports a [Lambda-based materialization engine](https://rtd.feast.dev/en/master/#alpha-lambda-based-engine), and a [Bytewax-based materialization engine](https://rtd.feast.dev/en/master/#bytewax-engine).
 Users may also be able to build an engine to scale up materialization using existing infrastructure in their organizations.
\ No newline at end of file
diff --git a/docs/reference/batch-materialization/README.md b/docs/reference/batch-materialization/README.md
deleted file mode 100644
index a05d6d75e5d..00000000000
--- a/docs/reference/batch-materialization/README.md
+++ /dev/null
@@ -1,11 +0,0 @@
-# Batch materialization
-
-Please see [Batch Materialization Engine](../../getting-started/components/batch-materialization-engine.md) for an explanation of batch materialization engines.
-
-{% page-ref page="snowflake.md" %}
-
-{% page-ref page="bytewax.md" %}
-
-{% page-ref page="lambda.md" %}
-
-{% page-ref page="spark.md" %}
diff --git a/docs/reference/codebase-structure.md b/docs/reference/codebase-structure.md
index 7077e48fef3..80608b5929a 100644
--- a/docs/reference/codebase-structure.md
+++ b/docs/reference/codebase-structure.md
@@ -34,7 +34,7 @@ There are also several important submodules:
 * `ui/` contains the embedded Web UI, to be launched on the `feast ui` command.
 
 Of these submodules, `infra/` is the most important.
-It contains the interfaces for the [provider](getting-started/components/provider.md), [offline store](getting-started/components/offline-store.md), [online store](getting-started/components/online-store.md), [batch materialization engine](getting-started/components/batch-materialization-engine.md), and [registry](getting-started/components/registry.md), as well as all of their individual implementations.
+It contains the interfaces for the [provider](getting-started/components/provider.md), [offline store](getting-started/components/offline-store.md), [online store](getting-started/components/online-store.md), [compute engine](getting-started/components/compute-engine.md), and [registry](getting-started/components/registry.md), as well as all of their individual implementations.
 ```
 $ tree --dirsfirst -L 1 infra
diff --git a/docs/reference/compute-engine/README.md b/docs/reference/compute-engine/README.md
index 163ce362ccc..c4a2f87f54d 100644
--- a/docs/reference/compute-engine/README.md
+++ b/docs/reference/compute-engine/README.md
@@ -48,18 +48,35 @@ An example of built output from FeatureBuilder:
 
 ## ✨ Available Engines
+
 ### 🔥 SparkComputeEngine
+{% page-ref page="spark.md" %}
+
 - Distributed DAG execution via Apache Spark
 - Supports point-in-time joins and large-scale materialization
 - Integrates with `SparkOfflineStore` and `SparkMaterializationJob`
 
 ### 🧪 LocalComputeEngine
+{% page-ref page="local.md" %}
+
 - Runs on Arrow + Specified backend (e.g., Pandas, Polars)
 - Designed for local dev, testing, or lightweight feature generation
 - Supports `LocalMaterializationJob` and `LocalHistoricalRetrievalJob`
 
+### 🧊 SnowflakeComputeEngine
+
+- Runs entirely in Snowflake
+- Supports Snowflake SQL for feature transformations and aggregations
+- Integrates with `SnowflakeOfflineStore` and `SnowflakeMaterializationJob`
+
+{% page-ref page="snowflake.md" %}
+
+### LambdaComputeEngine
+
+{% page-ref page="lambda.md" %}
+
 ---
 
 ## 🛠️ Feature Builder Flow
diff --git a/docs/reference/batch-materialization/lambda.md b/docs/reference/compute-engine/lambda.md
similarity index 100%
rename from docs/reference/batch-materialization/lambda.md
rename to docs/reference/compute-engine/lambda.md
diff --git a/docs/reference/batch-materialization/snowflake.md b/docs/reference/compute-engine/snowflake.md
similarity index 75%
rename from docs/reference/batch-materialization/snowflake.md
rename to docs/reference/compute-engine/snowflake.md
index c2fa441d6d2..e7b0dc5bd63 100644
--- a/docs/reference/batch-materialization/snowflake.md
+++ b/docs/reference/compute-engine/snowflake.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-The [Snowflake](https://trial.snowflake.com) batch materialization engine provides a highly scalable and parallel execution engine using a Snowflake Warehouse for batch materializations operations (`materialize` and `materialize-incremental`) when using a `SnowflakeSource`.
+The [Snowflake](https://trial.snowflake.com) compute engine provides a highly scalable and parallel execution engine using a Snowflake Warehouse for batch materialization operations (`materialize` and `materialize-incremental`) when using a `SnowflakeSource`.
 
 The engine requires no additional configuration other than for you to supply Snowflake's standard login and context details.
 The engine leverages custom (automatically deployed for you) Python UDFs to do the proper serialization of your offline store data to your online serving tables.
diff --git a/docs/reference/batch-materialization/spark.md b/docs/reference/compute-engine/spark.md
similarity index 51%
rename from docs/reference/batch-materialization/spark.md
rename to docs/reference/compute-engine/spark.md
index 7a61c0ccb6d..510df58ddc1 100644
--- a/docs/reference/batch-materialization/spark.md
+++ b/docs/reference/compute-engine/spark.md
@@ -1,10 +1,19 @@
-# Spark (alpha)
+# Spark
 
 ## Description
 
-The Spark batch materialization engine is considered alpha status. It relies on the offline store to output feature values to S3 via `to_remote_storage`, and then loads them into the online store.
+The Spark Compute Engine provides a distributed execution engine for batch materialization operations (`materialize` and `materialize-incremental`) and historical retrieval operations (`get_historical_features`).
+
+It is designed to handle large-scale data processing and can be used with various offline stores, such as Snowflake, BigQuery, and Spark SQL.
+
+### Design
+The Spark Compute Engine is implemented as a subclass of `feast.infra.compute_engine.ComputeEngine`.
+The offline store is used to read and write data, while the Spark engine performs transformations and aggregations on the data.
+The engine supports the following features:
+- Reading from different data sources, such as Spark SQL, BigQuery, and Snowflake.
+- Distributed execution of feature transformations and aggregations.
+- Custom transformations using Spark SQL or UDFs.
 
-See [SparkMaterializationEngine](https://rtd.feast.dev/en/master/index.html?highlight=SparkMaterializationEngine#feast.infra.materialization.spark.spark_materialization_engine.SparkMaterializationEngineConfig) for configuration options.
 
 ## Example
 
@@ -16,7 +25,12 @@ offline_store:
     ...
 batch_engine:
   type: spark.engine
-  partitions: [optional num partitions to use to write to online store]
+  partitions: 10 # number of partitions when writing to the online or offline store
+  spark_conf:
+    spark.master: "local[*]"
+    spark.app.name: "Feast Spark Engine"
+    spark.sql.shuffle.partitions: 100
+    spark.executor.memory: "4g"
 ```
 {% endcode %}
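With the `batch_engine` configuration above in place, both materialization and historical retrieval run through the Spark engine. A minimal usage sketch follows (the repo path, entity keys, timestamps, and feature references are placeholders):

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity rows with event timestamps for point-in-time correct joins.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],  # placeholder entity key
        "event_timestamp": [datetime(2025, 1, 1), datetime(2025, 1, 2)],
    }
)

# The point-in-time join is executed by the configured Spark compute engine.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],  # placeholder feature reference
).to_df()

# Materialization goes through the same engine configuration.
store.materialize_incremental(end_date=datetime.utcnow())
```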