diff --git a/docs/ai-integration/ai-tasks-list-view.mdx b/docs/ai-integration/ai-tasks-list-view.mdx index fbe09464b0..f6853ea921 100644 --- a/docs/ai-integration/ai-tasks-list-view.mdx +++ b/docs/ai-integration/ai-tasks-list-view.mdx @@ -26,6 +26,10 @@ import LanguageContent from "@site/src/components/LanguageContent"; * In the **AI Tasks - List view**, you can manage RavenDB's AI tasks - create new tasks, edit existing ones, or delete them as needed. +* To inspect errors raised by AI tasks and how those errors affect each task's health, + use the [AI Task Errors view](../monitoring/task-errors/studio-views.mdx#ai-task-errors-view). + See the [Task errors overview](../monitoring/task-errors/overview.mdx) for an introduction. + * In this article: * [AI Tasks - list view](../ai-integration/ai-tasks-list-view.mdx#ai-tasks---list-view) diff --git a/docs/ai-integration/gen-ai-integration/overview.mdx b/docs/ai-integration/gen-ai-integration/overview.mdx index 4e9a825e44..88032e378c 100644 --- a/docs/ai-integration/gen-ai-integration/overview.mdx +++ b/docs/ai-integration/gen-ai-integration/overview.mdx @@ -33,6 +33,7 @@ import LanguageContent from "@site/src/components/LanguageContent"; * [How to create and run a GenAI task](../../ai-integration/gen-ai-integration/overview.mdx#how-to-create-and-run-a-genai-task) * [Runtime](../../ai-integration/gen-ai-integration/overview.mdx#runtime) * [Tracking of processed document parts](../../ai-integration/gen-ai-integration/overview.mdx#tracking-of-processed-document-parts) + * [Monitoring the tasks](../../ai-integration/gen-ai-integration/overview.mdx#monitoring-the-tasks) * [Licensing](../../ai-integration/gen-ai-integration/overview.mdx#licensing) * [Supported services](../../ai-integration/gen-ai-integration/overview.mdx#supported-services) * [Common use cases](../../ai-integration/gen-ai-integration/overview.mdx#common-use-cases) @@ -223,6 +224,24 @@ added or modified.
+## Monitoring the tasks + +* The status and state of each GenAI task are visible in the + [AI Tasks - list view](../../ai-integration/ai-tasks-list-view.mdx). + +* Task performance and activity over time can be analyzed in the _AI Tasks Stats_ view. + Learn more about the stats view in the + [Ongoing Tasks Stats](../../studio/database/stats/ongoing-tasks-stats/overview.mdx) article. + +* Errors raised by GenAI tasks, and how those errors affect each task's health, are tracked + in the [Task Errors view](../../monitoring/task-errors/studio-views.mdx#task-errors-view). + The [AI Task Errors view](../../monitoring/task-errors/studio-views.mdx#ai-task-errors-view), + opened from the `AI Hub`, shows the same errors pre-filtered to AI tasks only. + For an introduction to task error monitoring, see the + [Task errors overview](../../monitoring/task-errors/overview.mdx). + +
+ ## Licensing For RavenDB to support the GenAI Integration feature, you need a `RavenDB AI` license type. diff --git a/docs/ai-integration/generating-embeddings/content/_overview-csharp.mdx b/docs/ai-integration/generating-embeddings/content/_overview-csharp.mdx index 220eee939e..a2bbd44c61 100644 --- a/docs/ai-integration/generating-embeddings/content/_overview-csharp.mdx +++ b/docs/ai-integration/generating-embeddings/content/_overview-csharp.mdx @@ -130,6 +130,13 @@ import Panel from '@site/src/components/Panel'; * [5.1.11.25](../../../server/administration/snmp/snmp-overview.mdx#511125) – Total number of enabled embeddings generation tasks. * [5.1.11.26](../../../server/administration/snmp/snmp-overview.mdx#511126) – Total number of active embeddings generation tasks. +* Errors raised by embeddings generation tasks, and how those errors affect each task's + health, are tracked in the [Task Errors view](../../../monitoring/task-errors/studio-views.mdx#task-errors-view). + The [AI Task Errors view](../../../monitoring/task-errors/studio-views.mdx#ai-task-errors-view), + opened from the `AI Hub`, shows the same errors pre-filtered to AI tasks only. + For an introduction to task error monitoring, see the + [Task errors overview](../../../monitoring/task-errors/overview.mdx). + diff --git a/docs/ai-integration/generating-embeddings/content/_overview-nodejs.mdx b/docs/ai-integration/generating-embeddings/content/_overview-nodejs.mdx index 3bc012b0fa..5e5ef046ba 100644 --- a/docs/ai-integration/generating-embeddings/content/_overview-nodejs.mdx +++ b/docs/ai-integration/generating-embeddings/content/_overview-nodejs.mdx @@ -130,6 +130,13 @@ import Panel from '@site/src/components/Panel'; * [5.1.11.25](../../../server/administration/snmp/snmp-overview.mdx#511125) – Total number of enabled embeddings generation tasks. * [5.1.11.26](../../../server/administration/snmp/snmp-overview.mdx#511126) – Total number of active embeddings generation tasks. 
+* Errors raised by embeddings generation tasks, and how those errors affect each task's + health, are tracked in the [Task Errors view](../../../monitoring/task-errors/studio-views.mdx#task-errors-view). + The [AI Task Errors view](../../../monitoring/task-errors/studio-views.mdx#ai-task-errors-view), + opened from the `AI Hub`, shows the same errors pre-filtered to AI tasks only. + For an introduction to task error monitoring, see the + [Task errors overview](../../../monitoring/task-errors/overview.mdx). + diff --git a/docs/ai-integration/generating-embeddings/content/_overview-python.mdx b/docs/ai-integration/generating-embeddings/content/_overview-python.mdx index 774fbe340e..ba18b1d5d1 100644 --- a/docs/ai-integration/generating-embeddings/content/_overview-python.mdx +++ b/docs/ai-integration/generating-embeddings/content/_overview-python.mdx @@ -130,6 +130,13 @@ import Panel from '@site/src/components/Panel'; * [5.1.11.25](../../../server/administration/snmp/snmp-overview.mdx#511125) – Total number of enabled embeddings generation tasks. * [5.1.11.26](../../../server/administration/snmp/snmp-overview.mdx#511126) – Total number of active embeddings generation tasks. +* Errors raised by embeddings generation tasks, and how those errors affect each task's + health, are tracked in the [Task Errors view](../../../monitoring/task-errors/studio-views.mdx#task-errors-view). + The [AI Task Errors view](../../../monitoring/task-errors/studio-views.mdx#ai-task-errors-view), + opened from the `AI Hub`, shows the same errors pre-filtered to AI tasks only. + For an introduction to task error monitoring, see the + [Task errors overview](../../../monitoring/task-errors/overview.mdx). 
+ diff --git a/docs/monitoring/_category_.json b/docs/monitoring/_category_.json new file mode 100644 index 0000000000..aaab234002 --- /dev/null +++ b/docs/monitoring/_category_.json @@ -0,0 +1,4 @@ +{ + "position": 1, + "label": "Monitoring" +} diff --git a/docs/monitoring/task-errors/_category_.json b/docs/monitoring/task-errors/_category_.json new file mode 100644 index 0000000000..91be008210 --- /dev/null +++ b/docs/monitoring/task-errors/_category_.json @@ -0,0 +1,4 @@ +{ + "position": 1, + "label": "Task Errors" +} diff --git a/docs/monitoring/task-errors/assets/snagit/task-errors_ai-task-errors-view.snagx b/docs/monitoring/task-errors/assets/snagit/task-errors_ai-task-errors-view.snagx new file mode 100644 index 0000000000..6742e37cc0 Binary files /dev/null and b/docs/monitoring/task-errors/assets/snagit/task-errors_ai-task-errors-view.snagx differ diff --git a/docs/monitoring/task-errors/assets/snagit/task-errors_health-indicators.snagx b/docs/monitoring/task-errors/assets/snagit/task-errors_health-indicators.snagx new file mode 100644 index 0000000000..0be0626a2b Binary files /dev/null and b/docs/monitoring/task-errors/assets/snagit/task-errors_health-indicators.snagx differ diff --git a/docs/monitoring/task-errors/assets/snagit/task-errors_ongoing-tasks-bar-expanded.snagx b/docs/monitoring/task-errors/assets/snagit/task-errors_ongoing-tasks-bar-expanded.snagx new file mode 100644 index 0000000000..035eeb0a04 Binary files /dev/null and b/docs/monitoring/task-errors/assets/snagit/task-errors_ongoing-tasks-bar-expanded.snagx differ diff --git a/docs/monitoring/task-errors/assets/snagit/task-errors_ongoing-tasks-view.snagx b/docs/monitoring/task-errors/assets/snagit/task-errors_ongoing-tasks-view.snagx new file mode 100644 index 0000000000..24ad557a34 Binary files /dev/null and b/docs/monitoring/task-errors/assets/snagit/task-errors_ongoing-tasks-view.snagx differ diff --git a/docs/monitoring/task-errors/assets/snagit/task-errors_task-errors-view.snagx 
b/docs/monitoring/task-errors/assets/snagit/task-errors_task-errors-view.snagx new file mode 100644 index 0000000000..b315b529fe Binary files /dev/null and b/docs/monitoring/task-errors/assets/snagit/task-errors_task-errors-view.snagx differ diff --git a/docs/monitoring/task-errors/assets/snagit/task-errors_task-segment.snagx b/docs/monitoring/task-errors/assets/snagit/task-errors_task-segment.snagx new file mode 100644 index 0000000000..5b0d2aa805 Binary files /dev/null and b/docs/monitoring/task-errors/assets/snagit/task-errors_task-segment.snagx differ diff --git a/docs/monitoring/task-errors/assets/task-errors_ai-task-errors-view.png b/docs/monitoring/task-errors/assets/task-errors_ai-task-errors-view.png new file mode 100644 index 0000000000..8dbb1333ff Binary files /dev/null and b/docs/monitoring/task-errors/assets/task-errors_ai-task-errors-view.png differ diff --git a/docs/monitoring/task-errors/assets/task-errors_filter-bar.png b/docs/monitoring/task-errors/assets/task-errors_filter-bar.png new file mode 100644 index 0000000000..0a2fdb6a40 Binary files /dev/null and b/docs/monitoring/task-errors/assets/task-errors_filter-bar.png differ diff --git a/docs/monitoring/task-errors/assets/task-errors_health-indicators.png b/docs/monitoring/task-errors/assets/task-errors_health-indicators.png new file mode 100644 index 0000000000..73eaafe314 Binary files /dev/null and b/docs/monitoring/task-errors/assets/task-errors_health-indicators.png differ diff --git a/docs/monitoring/task-errors/assets/task-errors_ongoing-tasks-bar-expanded.png b/docs/monitoring/task-errors/assets/task-errors_ongoing-tasks-bar-expanded.png new file mode 100644 index 0000000000..72f3725070 Binary files /dev/null and b/docs/monitoring/task-errors/assets/task-errors_ongoing-tasks-bar-expanded.png differ diff --git a/docs/monitoring/task-errors/assets/task-errors_ongoing-tasks-view.png b/docs/monitoring/task-errors/assets/task-errors_ongoing-tasks-view.png new file mode 100644 index 
0000000000..773543f6e0 Binary files /dev/null and b/docs/monitoring/task-errors/assets/task-errors_ongoing-tasks-view.png differ diff --git a/docs/monitoring/task-errors/assets/task-errors_task-errors-view.png b/docs/monitoring/task-errors/assets/task-errors_task-errors-view.png new file mode 100644 index 0000000000..e97a80d508 Binary files /dev/null and b/docs/monitoring/task-errors/assets/task-errors_task-errors-view.png differ diff --git a/docs/monitoring/task-errors/assets/task-errors_task-segment.png b/docs/monitoring/task-errors/assets/task-errors_task-segment.png new file mode 100644 index 0000000000..d95843a552 Binary files /dev/null and b/docs/monitoring/task-errors/assets/task-errors_task-segment.png differ diff --git a/docs/monitoring/task-errors/configuration.mdx b/docs/monitoring/task-errors/configuration.mdx new file mode 100644 index 0000000000..14355d2d82 --- /dev/null +++ b/docs/monitoring/task-errors/configuration.mdx @@ -0,0 +1,140 @@ +--- +title: "Task errors: Configuration" +sidebar_label: "Configuration options" +description: "Configuration keys for task error monitoring." +sidebar_position: 3 +--- + +import Admonition from '@theme/Admonition'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; +import CodeBlock from '@theme/CodeBlock'; +import LanguageSwitcher from "@site/src/components/LanguageSwitcher"; +import LanguageContent from "@site/src/components/LanguageContent"; +import Panel from "@site/src/components/Panel"; +import ContentFrame from "@site/src/components/ContentFrame"; + +# Task errors: Configuration + + + +* This page covers the configuration keys that control task error monitoring. + +* To learn how to apply these keys (where to set them, scope, syntax), see the + [Configuration Overview](../../server/configuration/configuration-options.mdx). + +* To learn about task errors and how task health is determined, see the + [Task errors overview](../../monitoring/task-errors/overview.mdx). 
+ +* In this article: + * [Task health thresholds](../../monitoring/task-errors/configuration.mdx#task-health-thresholds) + * [ETL.ProcessHealthStatusImpairedThreshold](../../monitoring/task-errors/configuration.mdx#etlprocesshealthstatusimpairedthreshold) + * [ETL.ProcessHealthStatusFailedThreshold](../../monitoring/task-errors/configuration.mdx#etlprocesshealthstatusfailedthreshold) + * [Tuning the thresholds](../../monitoring/task-errors/configuration.mdx#tuning-the-thresholds) + * [Validation rules](../../monitoring/task-errors/configuration.mdx#validation-rules) + + + + + +Two configuration keys define the boundaries between the three task health states +(`Healthy`, `Impaired`, and `Failed`). Each task is classified by its error ratio +(described on the +[Task errors overview](../../monitoring/task-errors/overview.mdx#how-health-is-computed)): +`Healthy` below the Impaired threshold, `Impaired` between the two thresholds, and +`Failed` above the Failed threshold. A task moves between states as the ratio crosses +each threshold. + +Both keys can be set server-wide or per database, and both apply to AI tasks +(Embeddings Generation, GenAI) as well as ETL tasks despite their `ETL.` prefix. + + + +### ETL.ProcessHealthStatusImpairedThreshold + +* Error-rate threshold above which a task's health is classified as `Impaired`. +* A task whose recent error rate exceeds this value transitions from `Healthy` to `Impaired`. + +- **Type**: `float` +- **Default**: `0.1` +- **Range**: `[0, 1]` +- **Scope**: Server-wide or per database + + + +--- + + + +### ETL.ProcessHealthStatusFailedThreshold + +* Error-rate threshold above which a task's health is classified as `Failed`. +* A task whose recent error rate exceeds this value transitions from `Impaired` to `Failed`. 
+ +- **Type**: `float` +- **Default**: `0.9` +- **Range**: `[0, 1]` +- **Scope**: Server-wide or per database + + + +--- + + + +### Tuning the thresholds + +The defaults are tuned for typical workloads where most tasks should run cleanly and any +sustained error rate is meaningful. Two situations commonly call for adjusting them: +workloads that legitimately accept a high item-failure rate, and operational environments +that need earlier escalation. + +A per-database setting always overrides the server-wide setting, so different workloads on +the same server can use different sensitivity. + +#### Tuning the Impaired threshold + +The default of `0.1` is conservative. Even a small ratio of recent failures flips a task to +`Impaired`, which makes sense when failures are expected to be rare and the goal is to flag +a task as soon as it starts misbehaving. + +* Raise the threshold (for example to `0.2` or `0.3`) when the workload routinely produces + item errors that you do not want to escalate. A typical case is an ETL or AI task + processing user-generated data that often fails validation; the task is doing its job, + the failures are not actionable, and flipping to `Impaired` on every batch is noisy. + +* Lower the threshold (for example to `0.05`) when you want earlier alerting on tasks that + are starting to slip. The cost is more frequent `Impaired` classifications and the alerts + that ride on them. + +#### Tuning the Failed threshold + +The default of `0.9` is permissive. A task only flips to `Failed` when its recent error +rate is overwhelming - effectively, when most of its recent batches have failed. + +* Raise the threshold (for example to `0.95`) when you want `Failed` to mean "essentially + broken" and tolerate substantial impairment without escalating. Useful when `Failed` + triggers automated responses that should be reserved for genuinely catastrophic states. 
+ +* Lower the threshold (for example to `0.7`) when you want stronger and earlier escalation + on degraded tasks. The cost is more frequent `Failed` classifications and the automated + responses that ride on them. + + + +--- + + + +### Validation rules + +RavenDB validates both keys at server startup. The server refuses to start if any of the +following is violated: + +* Each threshold value must be between `0` and `1`, inclusive. +* `ETL.ProcessHealthStatusFailedThreshold` must be strictly greater than + `ETL.ProcessHealthStatusImpairedThreshold`. Equal values are rejected. + + + + diff --git a/docs/monitoring/task-errors/overview.mdx b/docs/monitoring/task-errors/overview.mdx new file mode 100644 index 0000000000..791d7deb8a --- /dev/null +++ b/docs/monitoring/task-errors/overview.mdx @@ -0,0 +1,257 @@ +--- +title: "Task errors: Overview" +sidebar_label: "Overview" +description: "Track errors raised by ETL and AI tasks, evaluate their impact on each task's health, and clear or retry as needed." +sidebar_position: 1 +--- + +import Admonition from '@theme/Admonition'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; +import CodeBlock from '@theme/CodeBlock'; +import LanguageSwitcher from "@site/src/components/LanguageSwitcher"; +import LanguageContent from "@site/src/components/LanguageContent"; +import Panel from "@site/src/components/Panel"; +import ContentFrame from "@site/src/components/ContentFrame"; + +# Task errors: Overview + + + +* Task errors are raised and stored whenever an ETL task or an AI task fails to process an + item or a batch. Each error records the task name, the time of failure, the processing + step the error occurred at, and the error message. + +* Throughout this section, "AI tasks" means [Embeddings Generation](../../ai-integration/generating-embeddings/overview.mdx) + and [GenAI](../../ai-integration/gen-ai-integration/overview.mdx) tasks. + +* Errors are persisted on disk per task. 
Each task keeps its own error history and that history + survives moves between nodes and server restarts. + +* Each task also has a health classification - `Healthy`, `Impaired`, or `Failed` - that reflects + its recent error rate, independently of the raw number of errors stored. + +* Task errors and the health states they drive are exposed in + [Studio](../../monitoring/task-errors/studio-views.mdx), + [HTTP endpoints](../../server/troubleshooting/debug-routes.mdx#debug-endpoints), + [SNMP OIDs](../../server/administration/snmp/snmp-overview.mdx#list-of-oids), + [Prometheus metrics](../../server/administration/monitoring/prometheus.mdx#metrics-provided-by-the-prometheus-endpoint), + and [monitoring endpoints](../../server/administration/monitoring/telegraf.mdx#monitoring-endpoints). + +* In this article: + * [What task errors are](../../monitoring/task-errors/overview.mdx#what-task-errors-are) + * [Error types](../../monitoring/task-errors/overview.mdx#error-types) + * [Error steps](../../monitoring/task-errors/overview.mdx#error-steps) + * [Where task errors are stored](../../monitoring/task-errors/overview.mdx#where-task-errors-are-stored) + * [Task health](../../monitoring/task-errors/overview.mdx#task-health) + * [Health states](../../monitoring/task-errors/overview.mdx#health-states) + * [How health is computed](../../monitoring/task-errors/overview.mdx#how-health-is-computed) + * [Where to view and manage task errors](../../monitoring/task-errors/overview.mdx#where-to-view-and-manage-task-errors) + + + + + +Task errors are recorded for every ETL provider (RavenDB ETL, SQL, OLAP, ElasticSearch, Kafka, +RabbitMQ, Azure Queue Storage, Amazon SQS, Snowflake) and for AI tasks (Embeddings Generation, +GenAI). Whenever one of these tasks fails to process an item or a batch, an error is added to +the task's error history. 
+ +Every error carries the same set of core fields: the task name, the time the error was created, +the processing step the error occurred at, and the error message. Different error types carry +additional fields specific to what went wrong. + + + +### Error types + +RavenDB classifies every task error as one of two types, based on the scope of what went wrong. + +* **Item error** + An error that occurred while processing a single document. The document was skipped and the + task moved on to the remaining documents in the batch. The error record includes the + document ID. + +* **Process error** + An error that occurred while processing a batch as a whole and may affect multiple documents, + such as a failure to send the batch to its destination. The error record includes the number + of documents the failing batch attempted to handle. + After a process error, the task enters fallback mode and retries the batch periodically. + + + +--- + + + +### Error steps + +Every error records the processing step it occurred at. The available steps depend on the task +type. + +* **Configuration** + The task's configuration was rejected. Typical causes include an invalid script or a missing + destination setting. + +* **Extraction** + The task could not read its source data. This is rare and usually indicates a transient + storage issue. + +* **Transformation** + The transformation script raised an exception while running, such as an unhandled + JavaScript error or a reference to a missing property. + +* **Load** + The task could not send its transformed data to the destination. Typical causes include the + destination being unreachable or rejecting the data. + +* **Persistence** + The task could not save its results back to the database, or could not update its own + process state. Usually caused by storage errors. + +* **Model Inference** (AI tasks only) + The task could not communicate with the AI model. 
Typical causes include the model service + being unreachable or returning an error. + +* **Unknown** + The processing step could not be determined. + + + + + + + +* Each ETL or AI task keeps its errors in two dedicated tables on disk: one for item errors + and one for process errors. + +* Each table is capped at 500 entries per task. When a new error needs to be recorded + after the cap is reached, the oldest entry in that table is evicted to make room. + The cap is not configurable. + +* Retention is per task and per table, so a single noisy task cannot push errors out of + an unrelated task. + +* Task errors are also included in the server's debug package, with separate files for ETL + and AI task errors, so support engineers can capture a full error history without going + through Studio or the HTTP endpoints. + + + + + +Each ETL and AI task carries a health state that summarizes how well it has been processing +recent batches. The health state is exposed everywhere task errors are +(see [Where to view and manage task errors](../../monitoring/task-errors/overview.mdx#where-to-view-and-manage-task-errors)) +and is used by automated monitoring to decide when a task needs attention. + + + +### Health states + +A task is in one of three health states at any time. + +* **Healthy** + No errors recently, or only an occasional one. The task is processing batches normally. + +* **Impaired** + Errors are accumulating at a rate that warrants attention. The task is still making + progress, but it should be looked at. + +* **Failed** + Errors dominate recent batches. The task is effectively not progressing and needs + intervention. + +A task recovers automatically as new batches complete. The health state transitions from +`Failed` back to `Impaired`, and from `Impaired` back to `Healthy`, as the running error rate +falls below each threshold. + +Updating the task's configuration also resets the health state to `Healthy`. 
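The recovery behavior described above can be sketched as follows. This is a minimal illustration, not RavenDB's implementation: it assumes the running error rate is an exponentially weighted ratio updated once per batch, and the smoothing factor `ALPHA`, the class `RunningErrorRatio`, and the functions `classify` and `validate_thresholds` are hypothetical names chosen for the example. Only the two threshold defaults (`0.1` and `0.9`) and the validation rules come from this documentation.

```python
from dataclasses import dataclass

# Defaults of the two documented configuration keys.
IMPAIRED_THRESHOLD = 0.1  # ETL.ProcessHealthStatusImpairedThreshold
FAILED_THRESHOLD = 0.9    # ETL.ProcessHealthStatusFailedThreshold

# Hypothetical smoothing factor; the value RavenDB uses is not documented here.
ALPHA = 0.3


def validate_thresholds(impaired: float, failed: float) -> None:
    """Mirror the documented validation rules for the two keys."""
    for value in (impaired, failed):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each threshold must be between 0 and 1, inclusive")
    if failed <= impaired:
        raise ValueError("the Failed threshold must be strictly greater "
                         "than the Impaired threshold")


def classify(ratio: float,
             impaired: float = IMPAIRED_THRESHOLD,
             failed: float = FAILED_THRESHOLD) -> str:
    """Map an error ratio in [0, 1] to a health state."""
    validate_thresholds(impaired, failed)
    if ratio > failed:
        return "Failed"
    if ratio > impaired:
        return "Impaired"
    return "Healthy"


@dataclass
class RunningErrorRatio:
    """Batch-driven moving average: decay depends on batches, not on time."""
    alpha: float = ALPHA
    value: float = 0.0

    def record_batch(self, failed_items: int, total_items: int) -> float:
        if total_items <= 0:
            return self.value  # nothing was attempted, so nothing changes
        batch_ratio = failed_items / total_items
        # Recent batches weigh more; older batches fade with each update.
        self.value = self.alpha * batch_ratio + (1 - self.alpha) * self.value
        return self.value


ratio = RunningErrorRatio()
for _ in range(10):
    ratio.record_batch(failed_items=10, total_items=10)  # every item fails
state_after_failures = classify(ratio.value)

for _ in range(10):
    ratio.record_batch(failed_items=0, total_items=10)   # clean batches
state_after_recovery = classify(ratio.value)
```

Under these assumptions the task passes through `Impaired` on its way back to `Healthy`, and deleting stored errors plays no part in the calculation, which matches the behavior described in this section.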
+ + +Deleting a task's stored errors clears the rows from the error tables but does not, on its own, +reset the task's health state. +Health is driven by the running error rate, not by the rows in the error tables. A task in +the `Failed` state will recover only when its error rate falls back below the +[configured thresholds](../../monitoring/task-errors/configuration.mdx). + + + + +--- + + + +### How health is computed + +RavenDB watches the ratio between a task's failed items and the total number of items the +task has attempted to process. The ratio is computed as a time-independent EWMA +(Exponentially Weighted Moving Average) - the weight of each batch decays as more batches +complete, not as time passes - and is updated continuously as new batches complete. + +In plain terms, more recent batches weigh more in the calculation than older ones. A fresh +string of failures pushes the ratio up faster than the raw error count would suggest, and a +clean stretch of batches pulls it back down, again with the most recent batches having the +strongest effect. + +The ratio is bounded between `0` and `1`, where `0` means no recent failures and `1` means +recent batches have all failed. Two thresholds determine the transitions between states: + +* The task is classified as `Impaired` when the ratio exceeds + `ETL.ProcessHealthStatusImpairedThreshold` (default `0.1`). +* The task is classified as `Failed` when the ratio exceeds + `ETL.ProcessHealthStatusFailedThreshold` (default `0.9`). + +Both thresholds are configurable, server-wide or per database, and apply to AI tasks as well +as ETL tasks despite the keys' `ETL.` prefix. + +[Task errors configuration](../../monitoring/task-errors/configuration.mdx) covers the two +keys, their valid ranges, and guidance for choosing values. + + + + + + + +Task errors and the resulting health states are exposed in several places.
Most users will +start with Studio; automated monitoring tools usually pull from SNMP OIDs, Prometheus +metrics, or monitoring endpoints. + +[Inspect and manage task errors via the HTTP endpoints](../../server/troubleshooting/debug-routes.mdx) +[Inspect and manage task errors via Studio](../../monitoring/task-errors/studio-views.mdx) + +Where to find them in detail: + +* **HTTP endpoints** + * `GET /databases/*/tasks/errors` returns errors across all ETL and AI tasks. + * `GET /databases/*/etl/errors` and `GET /databases/*/ai/errors` return errors per category. + * `DELETE` variants of each path remove errors in bulk, optionally filtered by task name or + category. For example, `DELETE /databases/*/etl/errors?name=` clears the + errors of one specific ETL task. + * `POST /databases/*/etl/retry-batch` forces an immediate retry of an ETL task currently in + fallback mode. + + See [Debug Endpoints](../../server/troubleshooting/debug-routes.mdx#debug-endpoints) for the full reference. + +* **Studio views** + The `Task Errors` view is reachable from `Tasks` **>** `Task Errors` and from + `AI Hub` **>** `AI Task Errors` (the same view, pre-filtered to AI tasks). + Each ETL and AI task bar on the `Ongoing Tasks` view also shows the task's health state and error count. + See [Task errors Studio views](../../monitoring/task-errors/studio-views.mdx). + +* **SNMP OIDs** + Dedicated OIDs for server-level, database-level, and per-task error counts and health + states. + See [List of OIDs](../../server/administration/snmp/snmp-overview.mdx#list-of-oids). + +* **Prometheus metrics** + Metrics for server, database, and per-task scopes, mirroring the SNMP set. + See [Prometheus integration](../../server/administration/monitoring/prometheus.mdx). + +* **Monitoring Endpoints** + `/admin/monitoring/v1/etls` and `/admin/monitoring/v1/ai-tasks` return per-task health and + error counts as JSON. 
+ See [Monitoring endpoints](../../server/administration/monitoring/telegraf.mdx#monitoring-endpoints). + + diff --git a/docs/monitoring/task-errors/studio-views.mdx b/docs/monitoring/task-errors/studio-views.mdx new file mode 100644 index 0000000000..f235b2c219 --- /dev/null +++ b/docs/monitoring/task-errors/studio-views.mdx @@ -0,0 +1,282 @@ +--- +title: "Task errors: Studio views" +sidebar_label: "Studio views" +description: "Inspect task errors and health states for ETL and AI tasks from the Task Errors view in Studio." +sidebar_position: 2 +--- + +import Admonition from '@theme/Admonition'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; +import CodeBlock from '@theme/CodeBlock'; +import LanguageSwitcher from "@site/src/components/LanguageSwitcher"; +import LanguageContent from "@site/src/components/LanguageContent"; +import Panel from "@site/src/components/Panel"; +import ContentFrame from "@site/src/components/ContentFrame"; + +# Task errors: Studio views + + + +* **`Tasks` > `Task Errors`** + Open the `Task Errors` view from the `Tasks` menu to inspect errors raised by ETL and AI tasks. + You can browse all errors in a unified list or group them by task, apply various filters, + select an error to view it in detail, and see how task health is impacted by recent errors. + +* **`AI Hub` > `AI Task Errors`** + The `AI Task Errors` view, opened from the `AI Hub`, is a pre-filtered subset of the `Task Errors` view. + Use this view to inspect errors raised by `Embeddings Generation` and `GenAI` tasks. + +* Both views display the same errors for listed tasks; deleting a task's errors from one view is reflected + in the other. + +* **`Tasks` > `Ongoing Tasks`** + Each ETL and AI task bar on the `Ongoing Tasks` view shows the task's health state + and error count; expanding the bar reveals additional detail. + +* To learn about task errors and how they impact task health, see the + [Overview](../../monitoring/task-errors/overview.mdx) page. 
+ +* In this article: + * [Task Errors view](../../monitoring/task-errors/studio-views.mdx#task-errors-view) + * [Opening the view](../../monitoring/task-errors/studio-views.mdx#opening-the-view) + * [Task filters](../../monitoring/task-errors/studio-views.mdx#task-filters) + * [Task health indicators](../../monitoring/task-errors/studio-views.mdx#task-health-indicators) + * [Task errors](../../monitoring/task-errors/studio-views.mdx#task-errors) + * [AI Task Errors view](../../monitoring/task-errors/studio-views.mdx#ai-task-errors-view) + * [Task health on the Ongoing Tasks view](../../monitoring/task-errors/studio-views.mdx#task-health-on-the-ongoing-tasks-view) + * [Collapsed view](../../monitoring/task-errors/studio-views.mdx#collapsed-view) + * [Expanded view](../../monitoring/task-errors/studio-views.mdx#expanded-view) + + + + + +In its default layout, `Task Errors` groups errors into per-task segments, each showing the +task's errors in a sortable table. + + + +### Opening the view + +Open the `Task Errors` view from the `Tasks` menu. By default it will open with no filters applied, +showing a segment for every ETL or AI task that currently has any errors. + +![Task Errors view](./assets/task-errors_task-errors-view.png) + +* **A.** Click to open the Tasks menu. + +* **B.** Click to open the Task Errors view. + +* **C.** [Task filters](../../monitoring/task-errors/studio-views.mdx#task-filters) (see below). + +* **D.** [Task health indicators](../../monitoring/task-errors/studio-views.mdx#task-health-indicators) (see below). + +* **E.** Toggle to **group errors by task** or display them in a unified list. + +* **F.** [Task errors](../../monitoring/task-errors/studio-views.mdx#task-errors) (see below). + + + +--- + + + +### Task filters + +Use the filters bar to narrow the listing to specific tasks and errors. 
+ +![Filters bar](./assets/task-errors_filter-bar.png) + +* **`Filter by task/script name`** + Type a task or script name to narrow the listing to matching tasks. + +* **`Filter by node`** + Pick one or more cluster nodes to show only the errors raised on the selected nodes. + +* **`Filter by task type`** + Pick one or more task types (e.g., Kafka ETL) to show only the errors raised by the selected types. + +* **`Filter by task health`** + Pick one or more health states to show only tasks currently in the selected states. + + + +--- + + + +### Task health indicators + +The indicators' colors represent task health states: Green for `Healthy`, yellow for +`Impaired`, and red for `Failed`. +* Hover an indicator to trigger a popup summary of tasks whose health currently matches + the selected state. +* The summary lists only the node currently running the task and any nodes that recorded + errors for it, with the error count per node. + +![Task health indicators](./assets/task-errors_health-indicators.png) + + + +--- + + + +### Task errors + +The image below shows one of the task segments displayed in the task errors view when errors +are grouped by task. + +![Task Errors - per-task segment](./assets/task-errors_task-segment.png) + +1. **Task name** + The name of the ETL or AI task whose errors are displayed here. + +2. **Delete errors** + Click to **remove all errors raised by this task**, including both item and process errors. + + Deleting a task's errors does not, on its own, reset the task's health state. + Health is driven by the running error rate, not by the rows in the error tables. + A task in `Impaired` or `Failed` state will recover only as new batches complete + successfully and its error rate falls back below the configured thresholds. + See the [Overview](../../monitoring/task-errors/overview.mdx#health-states) + for more. + + +3. **Task metadata row** + * A toggle to collapse or expand all errors related to this task. + * Task type. 
+ * Error count for this task. + * The number of scripts that this task runs. + * Task's current health state (`Healthy`, `Impaired`, or `Failed`). + * Tag/s of the cluster node/s currently running the task. + +4. **Script sub-segment details** + Errors for each script the task runs appear in their own sub-segment, with a header showing + the script's name and error count and a toggle to collapse or expand the errors related to this script. + +5. **Errors table** + The script's errors, one row per error. + + * **Column headers** + You can filter or sort the table by the content of each column, using the + funnel (filter) or arrow (sort) icons at the column header. + + * **`Show` column** + Click the eye icon for a specific error to open an error-details dialog with the full error + message. + + * **`Error type` column** + Marks the row as `Item Error` (a single document failure the task skipped past) or + `Process Error` (a batch-scope failure that may affect multiple documents). + + * **`Error step` column** + The processing step the error occurred at: `Configuration`, `Extraction`, + `Transformation`, `Load`, `Persistence`, `Model Inference`, or `Unknown`. + See the [Error steps](../../monitoring/task-errors/overview.mdx#error-steps) reference on the + overview. + + * **`Document` column** + For item errors, the ID of the document being processed when the error occurred, + rendered as a hyperlink to the document. + For process errors, the column shows `-` because the error is not bound to a single document. + + * **`Date` column** + The error's creation timestamp, shown in date form and in relation to the current time + (e.g., "4 hours ago"). + + * **`Affected Documents` column** + For process errors, the number of documents the failing batch attempted to process. + Empty for item errors. + + * **`Error` column** + The error message, truncated to one line. + + * **`Node` column** + The tag of the cluster node that recorded the error. 
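
The note under **Delete errors** above says that health follows the running error rate, not the rows in the error tables. A minimal sketch of that relationship follows, under stated assumptions: the error rate is taken as an already-computed fraction between 0 and 1 (how the server derives it is not shown here), and the thresholds use the documented defaults of `ETL.ProcessHealthStatusImpairedThreshold` (`0.1`) and `ETL.ProcessHealthStatusFailedThreshold` (`0.9`). The names `TaskHealth` and `classify_health` are hypothetical and are not part of any RavenDB API.

```python
from enum import Enum


class TaskHealth(Enum):
    """The three health states shown in the Task Errors view."""
    HEALTHY = "Healthy"
    IMPAIRED = "Impaired"
    FAILED = "Failed"


def classify_health(error_rate: float,
                    impaired_threshold: float = 0.1,
                    failed_threshold: float = 0.9) -> TaskHealth:
    """Hypothetical helper: map an error rate in [0, 1] to a health state.

    Assumption: a state applies only when the rate strictly exceeds its
    threshold, following the wording "exceeds this value".
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be within [0, 1]")
    if error_rate > failed_threshold:
        return TaskHealth.FAILED
    if error_rate > impaired_threshold:
        return TaskHealth.IMPAIRED
    return TaskHealth.HEALTHY


# Deleting recorded errors does not change the rate, so the state stays the same.
# Only successful batches that lower the rate move the task back toward Healthy.
state_before_delete = classify_health(0.5)
state_after_delete = classify_health(0.5)    # rate unchanged by the deletion
state_after_recovery = classify_health(0.05)  # rate fell below the impaired threshold
```

In this sketch the input to the classification is the rate alone, which is why clearing the error tables has no effect on the result.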
+ + + + + + + +The `AI Task Errors` view lists the same errors listed by the `Task Errors` view, +with the same layout, controls, and data, but applies a predefined filter to show only +`Embeddings Generation` and `GenAI` task errors. + +All options documented under +[Task Errors view](../../monitoring/task-errors/studio-views.mdx#task-errors-view) +above apply here without change. + +![AI Task Errors view](./assets/task-errors_ai-task-errors-view.png) + +1. Click to open the **AI Hub**. + +2. Click to open the AI Task Errors view. + + + + + +On the `Ongoing Tasks` view, each ETL or AI task bar displays the task's current health +state and the number of errors recorded for the task. Expanding the bar reveals these +details per node, along with the node's `Connection status`. + + + +### Collapsed view + +![Ongoing Tasks view](./assets/task-errors_ongoing-tasks-view.png) + +1. **Click to open the `Tasks` menu.** + +2. **Click to open the `Ongoing Tasks` view.** + +3. **Task bar** + +4. **Expand details** + Click to expand the bar - see the per-node breakdown below. + +5. **Task health and error count** + * `Health Status` - the task's current state (`Healthy`, `Impaired`, or `Failed`). + * `Errors` - the number of errors currently recorded for the task. + + + +--- + + + +### Expanded view + +![Ongoing Tasks task bar - expanded](./assets/task-errors_ongoing-tasks-bar-expanded.png) + +Each relevant node has its own column showing how the task is doing on that node. Only +the node currently running the task, and any other nodes that recorded errors for it, +are shown. + +* `Connection status` - the state of the node's connection to the task's destination. + The value is `Active` while the connection is up, and `Reconnect` after a failure + while the task waits to retry. + To retry the failing batch immediately, hover `Reconnect` and click the **Retry now** + button that appears. + +* `Errors` - the number of errors the task has raised on this node. 
+ +* `Health status` - the task's classification on this node (`Healthy`, `Impaired`, + or `Failed`). + +* `State` - the task's processing state on this node (such as `UP TO DATE` or + `0% RUNNING`). + + + +
+ +See the [Ongoing Tasks - Overview](../../studio/database/tasks/ongoing-tasks/general-info.mdx#the-ongoing-tasks-list) +page for a full walkthrough of the view, including filters, selection, and per-task +actions. + +
diff --git a/docs/server/administration/monitoring/prometheus.mdx b/docs/server/administration/monitoring/prometheus.mdx index 0092dfcee1..79b56674dc 100644 --- a/docs/server/administration/monitoring/prometheus.mdx +++ b/docs/server/administration/monitoring/prometheus.mdx @@ -64,7 +64,7 @@ or to `false` to include it. `skipCollectionsMetrics` E.g., to skip indexing metrics use - -http://localhost:8080/admin/monitoring/v1/prometheus?skipIndexesMetrics=true +http://localhost:8080/admin/monitoring/v1/prometheus?skipIndexesMetrics=true And to skip both indexing and server metrics use - http://localhost:8080/admin/monitoring/v1/prometheus?skipIndexesMetrics=true&skipServerMetrics=true @@ -74,6 +74,10 @@ Here is the list of metrics made available by the `/admin/monitoring/v1/promethe | Metrics | Description | | - | - | +| ai_task_documents_processed_per_second | Documents processed per second by the AI task (one minute rate) | +| ai_task_errors_count | Number of errors recorded for the AI task | +| ai_task_health_status | AI task health status + `0`/`1`/`2`
0 => Healthy
1 => Impaired
2 => Failed | +| ai_task_last_successful_batch_time_in_seconds | Time since the AI task's last successful batch, in seconds | | archived_data_processing_behavior | Archived data processing behavior + `0`/`1`/`2`
0 => ExcludeArchived
1 => IncludeArchived
2 => ArchivedOnly | | backup_current_number_of_running_backups | Number of currently running backups | | backup_max_number_of_concurrent_backups | Maximum number of concurrent backups | @@ -93,9 +97,19 @@ Here is the list of metrics made available by the `/admin/monitoring/v1/promethe | cpu_processor_count | Number of processors on the machine | | cpu_thread_pool_available_completion_port_threads | Number of available completion port threads in the thread pool | | cpu_thread_pool_available_worker_threads | Number of available worker threads in the thread pool | +| database_ai_tasks_count | Number of AI tasks in the database | +| database_ai_tasks_errors_count | Total number of AI task errors in the database | +| database_ai_tasks_failed_count | Number of AI tasks with `Failed` health status in the database | +| database_ai_tasks_healthy_count | Number of AI tasks with `Healthy` health status in the database | +| database_ai_tasks_impaired_count | Number of AI tasks with `Impaired` health status in the database | | database_alerts_count | Number of alerts | | database_attachments_count | Number of attachments | | database_documents_count | Number of documents | +| database_etls_count | Number of ETL tasks in the database | +| database_etls_errors_count | Total number of ETL errors in the database | +| database_etls_failed_count | Number of ETL tasks with `Failed` health status in the database | +| database_etls_healthy_count | Number of ETL tasks with `Healthy` health status in the database | +| database_etls_impaired_count | Number of ETL tasks with `Impaired` health status in the database | | database_indexes_auto_count | Number of auto indexes | | database_indexes_count | Number of indexes | | database_indexes_errored_count | Number of error indexes | @@ -131,6 +145,10 @@ Here is the list of metrics made available by the `/admin/monitoring/v1/promethe | database_uptime_seconds | Database up-time | | databases_loaded_count | Number of loaded databases | | 
databases_total_count | Number of all databases | +| etl_documents_processed_per_second | Documents processed per second by the ETL task (one minute rate) | +| etl_errors_count | Number of errors recorded for the ETL task | +| etl_health_status | ETL task health status + `0`/`1`/`2`
0 => Healthy
1 => Impaired
2 => Failed | +| etl_last_successful_batch_time_in_seconds | Time since the ETL task's last successful batch, in seconds | | index_entries_count | Number of entries in the index | | index_errors | Number of index errors | | index_is_invalid | Indicates if index is invalid | @@ -161,9 +179,19 @@ Here is the list of metrics made available by the `/admin/monitoring/v1/promethe | network_requests_per_second | Number of requests per second (one minute rate) | | network_tcp_active_connections | Number of active TCP connections | | network_total_requests | Total number of requests since server startup | +| server_ai_tasks_count | Total number of AI tasks across all databases | +| server_ai_tasks_errors_count | Total number of AI task errors across all databases | +| server_ai_tasks_failed_count | Number of AI tasks with `Failed` health status across all databases | +| server_ai_tasks_healthy_count | Number of AI tasks with `Healthy` health status across all databases | +| server_ai_tasks_impaired_count | Number of AI tasks with `Impaired` health status across all databases | | server_disk_remaining_storage_space_percentage | Remaining server storage disk space in % | | server_disk_system_store_total_data_file_size_bytes | Server storage total size | | server_disk_system_store_used_data_file_size_bytes | Server storage used size | +| server_etls_count | Total number of ETL tasks across all databases | +| server_etls_errors_count | Total number of ETL errors across all databases | +| server_etls_failed_count | Number of ETL tasks with `Failed` health status across all databases | +| server_etls_healthy_count | Number of ETL tasks with `Healthy` health status across all databases | +| server_etls_impaired_count | Number of ETL tasks with `Impaired` health status across all databases | | server_info | Server Info | | server_process_id | Server process ID | | server_storage_io_read_operations | Disk IO Read operations | diff --git 
a/docs/server/administration/monitoring/telegraf.mdx b/docs/server/administration/monitoring/telegraf.mdx index c224654793..fb18eac35f 100644 --- a/docs/server/administration/monitoring/telegraf.mdx +++ b/docs/server/administration/monitoring/telegraf.mdx @@ -39,12 +39,14 @@ data tracking dashboard. But this feature is flexible - Telegraf can output data ## Monitoring Endpoints -The monitoring endpoints output data in JSON format. There are four endpoints: +The monitoring endpoints output data in JSON format. There are six endpoints: * `/admin/monitoring/v1/server` * `/admin/monitoring/v1/databases` * `/admin/monitoring/v1/indexes` * `/admin/monitoring/v1/collections` +* `/admin/monitoring/v1/etls` +* `/admin/monitoring/v1/ai-tasks` ## JSON Fields Returned by the Endpoints @@ -52,6 +54,11 @@ The following is a list of JSON fields returned by the endpoints: | Endpoint Suffix | Field Name | Description | | - | - | - | +| `ai-tasks` | `process_name` | The AI task name | +| `ai-tasks` | `errors_count` | Number of errors recorded for the AI task | +| `ai-tasks` | `health_status` | AI task health status (`Healthy`, `Impaired`, or `Failed`) | +| `ai-tasks` | `last_successful_batch_time_in_sec` | Time since the AI task's last successful batch, in seconds | +| `ai-tasks` | `documents_processed_per_second` | Documents processed per second by the AI task (one minute rate) | | `collections` | `collection_name` | Collection name | | `collections` | `database_name` | Name of this collection's database | | `collections` | `documents_count` | Number of documents in collection | @@ -96,6 +103,11 @@ The following is a list of JSON fields returned by the endpoints: | `databases` | `storage_queue_length` | Storage queue length
Optional, Linux only | | `databases` | `time_since_last_backup_in_sec` | LastBackup | | `databases` | `uptime_in_sec` | Database up-time | +| `etls` | `process_name` | The ETL task name | +| `etls` | `errors_count` | Number of errors recorded for the ETL task | +| `etls` | `health_status` | ETL task health status (`Healthy`, `Impaired`, or `Failed`) | +| `etls` | `last_successful_batch_time_in_sec` | Time since the ETL task's last successful batch, in seconds | +| `etls` | `documents_processed_per_second` | Documents processed per second by the ETL task (one minute rate) | | `indexes` | `entries_count` | Number of entries in the index | | `indexes` | `errors` | Number of index errors | | `indexes` | `index_name` | Index name | diff --git a/docs/server/administration/snmp/snmp-overview.mdx b/docs/server/administration/snmp/snmp-overview.mdx index 60ddff546b..e43e15010c 100644 --- a/docs/server/administration/snmp/snmp-overview.mdx +++ b/docs/server/administration/snmp/snmp-overview.mdx @@ -38,6 +38,8 @@ SNMP support is available for [Enterprise](../../../licensing/overview.mdx#enter * [Index OIDs](../../../server/administration/snmp/snmp-overview.mdx#index-oids) * [General OIDs](../../../server/administration/snmp/snmp-overview.mdx#general-oids) * [Ongoing tasks OIDs](../../../server/administration/snmp/snmp-overview.mdx#ongoing-tasks-oids) + * [Per-task ETL OIDs](../../../server/administration/snmp/snmp-overview.mdx#per-task-etl-oids) + * [Per-task AI OIDs](../../../server/administration/snmp/snmp-overview.mdx#per-task-ai-oids) ## Overview @@ -268,6 +270,7 @@ curl -X GET http://live-test.ravendb.net/monitoring/snmp?oid=1.3.6.1.4.1.45751.1 `3` - **a background collection** (this is always a generation 2 collection) * `D` - **Database number** * `I` - **Index number** + * `T` - **Task number** (used for per-task ETL and per-task AI OIDs) @@ -361,6 +364,18 @@ curl -X GET http://live-test.ravendb.net/monitoring/snmp?oid=1.3.6.1.4.1.45751.1 | 1.17.2 | Number of current 
map files in '/proc/self/maps' | | 1.17.3 | Value of the '/proc/sys/kernel/threads-max' parameter | | 1.17.4 | Number of current threads | +| 1.20.1 | Total number of ETL errors | +| 1.20.2 | Number of ETL tasks with `Healthy` health status | +| 1.20.3 | Number of ETL tasks with `Impaired` health status | +| 1.20.4 | Number of ETL tasks with `Failed` health status | +| 1.20.5 | Total number of ETL tasks | +| 1.20.6 | Number of active ETL tasks (processed at least one batch in the last minute) | +| 1.21.1 | Total number of AI task errors | +| 1.21.2 | Number of AI tasks with `Healthy` health status | +| 1.21.3 | Number of AI tasks with `Impaired` health status | +| 1.21.4 | Number of AI tasks with `Failed` health status | +| 1.21.5 | Total number of AI tasks | +| 1.21.6 | Number of active AI tasks (processed at least one batch in the last minute) | @@ -390,6 +405,8 @@ curl -X GET http://live-test.ravendb.net/monitoring/snmp?oid=1.3.6.1.4.1.45751.1 | 5.2.`D`.1.14 | Number of rehabs | | 5.2.`D`.1.15 | Number of performance hints | | 5.2.`D`.1.16 | Number of indexing errors | +| 5.2.`D`.1.17 | Total number of ETL errors in the database | +| 5.2.`D`.1.18 | Total number of AI task errors in the database | | 5.2.`D`.2.1 | Documents storage allocated size in MB | | 5.2.`D`.2.2 | Documents storage used size in MB | | 5.2.`D`.2.3 | Index storage allocated size in MB | @@ -417,6 +434,18 @@ curl -X GET http://live-test.ravendb.net/monitoring/snmp?oid=1.3.6.1.4.1.45751.1 | 5.2.`D`.5.7 | Number of faulty indexes | | 5.2.`D`.6.1 | Number of writes (documents, attachments, counters, timeseries) | | 5.2.`D`.6.2 | Number of bytes written (documents, attachments, counters, timeseries) | +| 5.2.`D`.7.1 | Number of ETL tasks with `Healthy` health status in the database | +| 5.2.`D`.7.2 | Number of ETL tasks with `Impaired` health status in the database | +| 5.2.`D`.7.3 | Number of ETL tasks with `Failed` health status in the database | +| 5.2.`D`.7.4 | Total number of ETL tasks in the 
database | +| 5.2.`D`.7.5 | Number of active ETL tasks in the database | +| 5.2.`D`.7.6 | ETL documents processed per second in the database (one minute rate) | +| 5.2.`D`.8.1 | Number of AI tasks with `Healthy` health status in the database | +| 5.2.`D`.8.2 | Number of AI tasks with `Impaired` health status in the database | +| 5.2.`D`.8.3 | Number of AI tasks with `Failed` health status in the database | +| 5.2.`D`.8.4 | Total number of AI tasks in the database | +| 5.2.`D`.8.5 | Number of active AI tasks in the database | +| 5.2.`D`.8.6 | AI task documents processed per second in the database (one minute rate) | @@ -490,6 +519,27 @@ curl -X GET http://live-test.ravendb.net/monitoring/snmp?oid=1.3.6.1.4.1.45751.1 | 5.1.11.24 | Number of active Snowflake ETL tasks for all databases | | 5.1.11.25 | Number of enabled Embeddings Generation tasks for all databases | | 5.1.11.26 | Number of active Embeddings Generation tasks for all databases | - +| 5.1.12.1 | Total ETL documents processed per second across all databases (one minute rate) | +| 5.1.12.2 | Total AI task documents processed per second across all databases (one minute rate) | + + + +| OID | Metric (Per-task ETL) | +|------------------------------------------------------|----------------------------------------------------------------------| +| 5.2.`D`.1.`T`.1 | Number of errors for the ETL task | +| 5.2.`D`.1.`T`.2 | Health status of the ETL task (`Healthy`, `Impaired`, or `Failed`) | +| 5.2.`D`.1.`T`.3 | Time of the last successful batch processed by the ETL task | +| 5.2.`D`.1.`T`.4 | Documents processed per second by the ETL task (one minute rate) | +| 5.2.`D`.1.`T`.5 | Responsible node tag for the ETL task | + + + +| OID | Metric (Per-task AI) | +|------------------------------------------------------|----------------------------------------------------------------------| +| 5.2.`D`.2.`T`.1 | Number of errors for the AI task | +| 5.2.`D`.2.`T`.2 | Health status of the AI task (`Healthy`, `Impaired`, or 
`Failed`) | +| 5.2.`D`.2.`T`.3 | Time of the last successful batch processed by the AI task | +| 5.2.`D`.2.`T`.4 | Documents processed per second by the AI task (one minute rate) | +| 5.2.`D`.2.`T`.5 | Responsible node tag for the AI task | diff --git a/docs/server/configuration/etl-configuration.mdx b/docs/server/configuration/etl-configuration.mdx index 9434422ea9..9b67801804 100644 --- a/docs/server/configuration/etl-configuration.mdx +++ b/docs/server/configuration/etl-configuration.mdx @@ -23,6 +23,8 @@ import Panel from '@site/src/components/Panel'; * [ETL.MaxNumberOfExtractedDocuments](../../server/configuration/etl-configuration.mdx#etlmaxnumberofextracteddocuments) * [ETL.MaxNumberOfExtractedItems](../../server/configuration/etl-configuration.mdx#etlmaxnumberofextracteditems) * [ETL.OLAP.MaxNumberOfExtractedDocuments](../../server/configuration/etl-configuration.mdx#etlolapmaxnumberofextracteddocuments) + * [ETL.ProcessHealthStatusFailedThreshold](../../server/configuration/etl-configuration.mdx#etlprocesshealthstatusfailedthreshold) + * [ETL.ProcessHealthStatusImpairedThreshold](../../server/configuration/etl-configuration.mdx#etlprocesshealthstatusimpairedthreshold) * [ETL.Queue.AzureQueueStorage.TimeToLiveInSec](../../server/configuration/etl-configuration.mdx#etlqueueazurequeuestoragetimetoliveinsec) * [ETL.Queue.AzureQueueStorage.VisibilityTimeoutInSec](../../server/configuration/etl-configuration.mdx#etlqueueazurequeuestoragevisibilitytimeoutinsec) * [ETL.Queue.Kafka.InitTransactionsTimeoutInSec](../../server/configuration/etl-configuration.mdx#etlqueuekafkainittransactionstimeoutinsec) @@ -112,6 +114,36 @@ Max number of extracted documents in OLAP ETL batch. +## ETL.ProcessHealthStatusFailedThreshold + +* Error-rate threshold for the `Failed` task health state. A task whose recent error rate + exceeds this value is classified as `Failed`. 
+* See the [Task errors](../../monitoring/task-errors/configuration.mdx#etlprocesshealthstatusfailedthreshold) + page to learn how the rate is calculated and how to set a value. + +- **Type**: `float` +- **Default**: `0.9` +- **Range**: `[0, 1]` +- **Scope**: Server-wide or per database + + + + +## ETL.ProcessHealthStatusImpairedThreshold + +* Error-rate threshold for the `Impaired` task health state. A task whose recent error rate + exceeds this value is classified as `Impaired`. +* See the [Task errors](../../monitoring/task-errors/configuration.mdx#etlprocesshealthstatusimpairedthreshold) + page to learn how the rate is calculated and how to set a value. + +- **Type**: `float` +- **Default**: `0.1` +- **Range**: `[0, 1]` +- **Scope**: Server-wide or per database + + + + ## ETL.Queue.AzureQueueStorage.TimeToLiveInSec Lifespan of a message in the queue. diff --git a/docs/server/troubleshooting/debug-routes.mdx b/docs/server/troubleshooting/debug-routes.mdx index f20e90c9c5..713eeb1b83 100644 --- a/docs/server/troubleshooting/debug-routes.mdx +++ b/docs/server/troubleshooting/debug-routes.mdx @@ -47,6 +47,7 @@ For the endpoints that begin with `/databases/*/`, replace `*` with the name of | /build/version | GET | | Returns product build number, major version, commit hash and full version number | | | /databases/*/admin/debug/cluster/txinfo | GET |
  • `from` (Optional)
    Number of results to skip
  • `take` (Optional)
    Number of results to take
| List the incomplete [cluster transaction commands](../clustering/cluster-transactions.mdx#cluster--cluster-wide-transactions) | | | /databases/*/admin/debug/txinfo | GET | | List | | +| /databases/*/ai/errors | GET |
  • `name` (Optional, multi-valued)
    Filter results to errors of the named AI task or tasks.
| List recent errors recorded for AI tasks (Embeddings Generation, GenAI). See [Task errors overview](../../monitoring/task-errors/overview.mdx). | | | /databases/*/debug/documents/huge | GET | | List IDs of documents which exceed `PerformanceHints.`
`Documents.`
`HugeDocumentSizeInMb` setting | | | /databases/*/debug/identities | GET | | | | | /databases/*/debug/info-package | GET | | Save debug package information for later analysis | | @@ -57,6 +58,7 @@ For the endpoints that begin with `/databases/*/`, replace `*` with the name of | /databases/*/debug/script-runners | GET | | | | | /databases/*/debug/storage/all-environments/report | GET | | | | | /databases/*/debug/storage/report | GET | | | | +| /databases/*/etl/errors | GET |
  • `name` (Optional, multi-valued)
    Filter results to errors of the named ETL task or tasks.
| List recent errors recorded for ETL tasks (RavenDB, SQL, OLAP, ElasticSearch, Kafka, RabbitMQ, Azure Queue Storage, Amazon SQS, Snowflake). See [Task errors overview](../../monitoring/task-errors/overview.mdx). | | | /databases/*/indexes | GET | | | | | /databases/*/indexes/errors | GET | | | | | /databases/*/indexes/stats | GET | | | | diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list-0.snagx b/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list-0.snagx new file mode 100644 index 0000000000..4a16d403ff Binary files /dev/null and b/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list-0.snagx differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list-1.snagx b/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list-1.snagx new file mode 100644 index 0000000000..18f72ead14 Binary files /dev/null and b/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list-1.snagx differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list-2.snagx b/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list-2.snagx new file mode 100644 index 0000000000..ac054446dc Binary files /dev/null and b/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list-2.snagx differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list_task-bar_actions.snagx b/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list_task-bar_actions.snagx new file mode 100644 index 0000000000..a9c9b43bea Binary files /dev/null and b/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list_task-bar_actions.snagx differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list_task-bar_info.snagx b/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list_task-bar_info.snagx new file mode 100644 index 0000000000..772e592199 Binary files /dev/null and 
b/docs/studio/database/tasks/ongoing-tasks/assets/snagit/task-list_task-bar_info.snagx differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/task-list-0.png b/docs/studio/database/tasks/ongoing-tasks/assets/task-list-0.png new file mode 100644 index 0000000000..e0ad562492 Binary files /dev/null and b/docs/studio/database/tasks/ongoing-tasks/assets/task-list-0.png differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/task-list-1.png b/docs/studio/database/tasks/ongoing-tasks/assets/task-list-1.png index 00f120d0be..365dfddefd 100644 Binary files a/docs/studio/database/tasks/ongoing-tasks/assets/task-list-1.png and b/docs/studio/database/tasks/ongoing-tasks/assets/task-list-1.png differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/task-list-2.png b/docs/studio/database/tasks/ongoing-tasks/assets/task-list-2.png index 08d858233c..f771bfdfc9 100644 Binary files a/docs/studio/database/tasks/ongoing-tasks/assets/task-list-2.png and b/docs/studio/database/tasks/ongoing-tasks/assets/task-list-2.png differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/task-list-3.png b/docs/studio/database/tasks/ongoing-tasks/assets/task-list-3.png deleted file mode 100644 index 8aaedb49e3..0000000000 Binary files a/docs/studio/database/tasks/ongoing-tasks/assets/task-list-3.png and /dev/null differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/task-list_action-bar.png b/docs/studio/database/tasks/ongoing-tasks/assets/task-list_action-bar.png new file mode 100644 index 0000000000..51d7dfd3c6 Binary files /dev/null and b/docs/studio/database/tasks/ongoing-tasks/assets/task-list_action-bar.png differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/task-list_task-bar_actions.png b/docs/studio/database/tasks/ongoing-tasks/assets/task-list_task-bar_actions.png new file mode 100644 index 0000000000..d3a7e9d9f8 Binary files /dev/null and 
b/docs/studio/database/tasks/ongoing-tasks/assets/task-list_task-bar_actions.png differ diff --git a/docs/studio/database/tasks/ongoing-tasks/assets/task-list_task-bar_info.png b/docs/studio/database/tasks/ongoing-tasks/assets/task-list_task-bar_info.png new file mode 100644 index 0000000000..c756b4f4d0 Binary files /dev/null and b/docs/studio/database/tasks/ongoing-tasks/assets/task-list_task-bar_info.png differ diff --git a/docs/studio/database/tasks/ongoing-tasks/general-info.mdx b/docs/studio/database/tasks/ongoing-tasks/general-info.mdx index dbf289abf7..a22dbd6d4b 100644 --- a/docs/studio/database/tasks/ongoing-tasks/general-info.mdx +++ b/docs/studio/database/tasks/ongoing-tasks/general-info.mdx @@ -11,6 +11,8 @@ import TabItem from '@theme/TabItem'; import CodeBlock from '@theme/CodeBlock'; import LanguageSwitcher from "@site/src/components/LanguageSwitcher"; import LanguageContent from "@site/src/components/LanguageContent"; +import Panel from "@site/src/components/Panel"; +import ContentFrame from "@site/src/components/ContentFrame"; # Ongoing Tasks - Overview @@ -19,40 +21,74 @@ import LanguageContent from "@site/src/components/LanguageContent"; * Each task is assigned a responsible node from the [Database Group nodes](../../../../studio/database/settings/manage-database-group.mdx) to handle the work. * If not specified by the user, the cluster decides which node will be responsible for the task. See [Members Duties](../../../../studio/database/settings/manage-database-group.mdx#database-group-topology---members-duties). - * If a node is down, the cluster will reassign the work to another node for the duration. + * If a node is down, the cluster will reassign the work to another node. -* Once enabled, an **ongoing task** runs in the background, - and its responsible node executes the defined task work whenever relevant data changes occur. 
+* Once enabled, an **ongoing task** runs in the background and executes its defined + work whenever relevant data changes occur. -* In this page: +* Ongoing tasks can also be managed via the Client API. + See [Ongoing tasks operations](../../../../client-api/operations/maintenance/ongoing-tasks/ongoing-task-operations.mdx). + +* In this article: * [The ongoing tasks](../../../../studio/database/tasks/ongoing-tasks/general-info.mdx#the-ongoing-tasks) - * [The ongoing tasks list - View](../../../../studio/database/tasks/ongoing-tasks/general-info.mdx#the-ongoing-tasks-list---view) - * [The ongoing tasks list - Actions](../../../../studio/database/tasks/ongoing-tasks/general-info.mdx#the-ongoing-tasks-list---actions) + * [Creating a new task](../../../../studio/database/tasks/ongoing-tasks/general-info.mdx#creating-a-new-task) + * [Available task types](../../../../studio/database/tasks/ongoing-tasks/general-info.mdx#available-task-types) + * [The ongoing tasks list](../../../../studio/database/tasks/ongoing-tasks/general-info.mdx#the-ongoing-tasks-list) -## The ongoing tasks + + + + +### Creating a new task + +To create a new database task open the **Ongoing Tasks** view, click the **Add a Database Task** button, +and select a task type. + +![Open the Ongoing Tasks view](./assets/task-list-0.png) + + + +--- + + + +### Available task types + +The following task types are available: -The available ongoing tasks are: +![Available task types](./assets/task-list-1.png) -![Figure 3. Ongoing Tasks New Task](./assets/task-list-1.png) +**AI:** + +* **[GenAI](../../../../ai-integration/gen-ai-integration/overview.mdx)** + Analyze and enrich your documents using an LLM. +* **[Embeddings Generation](../../../../ai-integration/generating-embeddings/overview.mdx)** + Automatically generate embeddings from your document content. 
**Replication:** * **[External Replication](../../../../studio/database/tasks/ongoing-tasks/external-replication-task.mdx)** Create a live replica of your database in another RavenDB database in another cluster. This replication is initiated by the source database. -* **[Hub/Sink Replication](../../../../studio/database/tasks/ongoing-tasks/hub-sink-replication/overview.mdx)** - Create a live replica of your database, or a part of it, in another RavenDB database. - The replication is initiated by the *Sink* task. - The replication can be *bidirectional* or limited to a *single direction*. - The replication can be *filtered* to allow the delivery of selected documents. +* **[Replication Hub](../../../../studio/database/tasks/ongoing-tasks/hub-sink-replication/replication-hub-task.mdx)** + Replicate documents to and/or from one or more `Replication Sink` tasks in other RavenDB + databases across different clusters. +* **[Replication Sink](../../../../studio/database/tasks/ongoing-tasks/hub-sink-replication/replication-sink-task.mdx)** + Connect to a central `Replication Hub` in another RavenDB cluster to receive documents, + and optionally replicate back. + The replication can be *bidirectional* or limited to a *single direction*, + and can be *filtered* to allow the delivery of selected documents. -**Backups & Subscriptions:** +**Backups:** * **[Backup](../../../../backup/create/periodic-tasks/database-backup.mdx)** Schedule a backup or a snapshot of the database at a specified point in time. + +**Subscriptions:** + * **[Subscription](../../../../client-api/data-subscriptions/what-are-data-subscriptions.mdx)** - Send batches of documents that match a pre-defined query for processing on a client. + Send batches of documents that match a pre-defined query for processing on a client. 
**ETL (RavenDB => Target):** @@ -62,6 +98,9 @@ The available ongoing tasks are: * **[SQL ETL](../../../../server/ongoing-tasks/etl/sql.mdx)** Write the database data to a relational database. Data can be filtered and modified with transformation scripts. +* **[Snowflake ETL](../../../../studio/database/tasks/ongoing-tasks/snowflake-etl-task.mdx)** + Write all or chosen database documents to a Snowflake database. + Data can be filtered and modified with transformation scripts. * **[OLAP ETL](../../../../studio/database/tasks/ongoing-tasks/olap-etl-task.mdx)** Convert database data to the _Parquet_ file format for OLAP purposes. Data can be filtered and modified with transformation scripts. @@ -77,8 +116,11 @@ The available ongoing tasks are: * **[Azure Queue Storage ETL](../../../../studio/database/tasks/ongoing-tasks/azure-queue-storage-etl.mdx)** Write all or chosen database documents to Azure Queue Storage. Data can be filtered and modified with transformation scripts. +* **[Amazon SQS ETL](../../../../studio/database/tasks/ongoing-tasks/amazon-sqs-etl.mdx)** + Write all or chosen database documents to Amazon SQS queues. + Data can be filtered and modified with transformation scripts. -**Sink (Source => RavendB)** +**Sink (Source => RavenDB):** * **[Kafka Sink](../../../../studio/database/tasks/ongoing-tasks/kafka-queue-sink.mdx)** Consume and process incoming messages from Kafka topics. @@ -87,32 +129,39 @@ The available ongoing tasks are: Consume and process incoming messages from RabbitMQ queues. Add scripts to Load, Put, or Delete documents in RavenDB based on the incoming messages. + + -## The ongoing tasks list - View - -![Figure 1. Ongoing Tasks View](./assets/task-list-2.png) - -1. Navigate to **Tasks > Ongoing Tasks** - -2. The list of the current tasks defined for the database. - -3. The task name. - -4. The node that is currently responsible for executing the task. 
-
+
+The tasks you create are listed in the Ongoing Tasks view, where you can see their status at a glance,
+expand task bars for further details, perform basic actions like disabling or deleting tasks, and open
+any task for editing.
-## The ongoing tasks list - Actions
+![Ongoing Tasks list](./assets/task-list-2.png)
-![Figure 2. Ongoing Tasks Actions](./assets/task-list-3.png)
+1. **Filter by name**
+   Enter a string to list only tasks whose name includes this string.
-1. **Add Task** - Create a new task for the database.
-2. **Enable / Disable** the task.
-3. **Details** - Click to see a short task details summary in this view.
-4. **Edit** - Click to edit the task.
-5. **Delete** the task.
+2. **Filter by type**
+   Click **All** to see tasks of all types.
+   Click a specific task type, e.g. `ETL`, to add tasks of this type to the view.
-The ongoing tasks can also be managed via the Client API. See [Ongoing tasks operations](../../../../client-api/operations/maintenance/ongoing-tasks/ongoing-task-operations.mdx).
+3. **Selection boxes**
+   Select all tasks using the "select all" checkbox at the top.
+   Select individual tasks using task-specific checkboxes.
+   Selecting tasks opens an action bar. Use **Set state** to enable or disable the selected tasks,
+   or **Delete** to remove them.
+   ![Selection action bar](./assets/task-list_action-bar.png)
+4. **Task bar**
+   * Each defined task is represented by a task bar.
+   * A task bar always shows the task's name and type, whether it is enabled, and which cluster node is responsible for running it.
+     ![Task bar - info elements](./assets/task-list_task-bar_info.png)
+   * You can enable, disable, edit, or delete the task.
+     You can also expand each task bar for additional details and options related to the task.
+     ![Task bar - actions](./assets/task-list_task-bar_actions.png)
+   * Other details and available actions vary by task type.
+
diff --git a/sidebars.ts b/sidebars.ts
index 7bdaba7bf7..b378340461 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -97,6 +97,11 @@ const sidebars: SidebarsConfig = {
       label: "AI Integration",
       items: [{ type: "autogenerated", dirName: "ai-integration" }],
     },
+    {
+      type: "category",
+      label: "Monitoring",
+      items: [{ type: "autogenerated", dirName: "monitoring" }],
+    },
     {
       type: "category",
       label: "Glossary",