
Commit 4ce52a7

moved section from replication problems
1 parent 80f2919 commit 4ce52a7

File tree: 1 file changed (+36 −8 lines changed)

  • content/en/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker

content/en/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/_index.md

Lines changed: 36 additions & 8 deletions
@@ -1,20 +1,25 @@
---
-title: "DDLWorker"
-linkTitle: "DDLWorker"
-description: >
-  DDLWorker
+title: "DDLWorker and DDL queue problems"
+linkTitle: "DDLWorker and DDL queue problems"
+description: >
+  Finding and troubleshooting problems in the `distributed_ddl_queue`
+keywords:
+- clickhouse ddl
+- clickhouse replication queue
---
DDLWorker is a subprocess (thread) of `clickhouse-server` that executes `ON CLUSTER` tasks at the node.

-When you execute a DDL query with `ON CLUSTER mycluster` section the query executor at the current node reads the cluster `mycluster` definition (remote_servers / system.clusters) and places tasks into Zookeeper znode `task_queue/ddl/...` for members of the cluster `mycluster`.
+When you execute a DDL query with an `ON CLUSTER mycluster` section, the query executor at the current node reads the cluster `mycluster` definition (remote_servers / system.clusters) and places tasks into the ZooKeeper znode `task_queue/ddl/...` for the members of the cluster `mycluster`.

-DDLWorker at all ClickHouse® nodes constantly check this `task_queue` for their tasks and executes them locally and reports about a result back into `task_queue`.
+The DDLWorker on every ClickHouse® node constantly checks this `task_queue` for its own tasks, executes them locally, and reports the results back into `task_queue`.
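A minimal sketch of how this looks in practice; the cluster name `mycluster`, the `default.events` table, and its columns are placeholders, and `system.distributed_ddl_queue` is queried with the column names used in recent ClickHouse versions:

```sql
-- The initiator writes one task per cluster member into the ZooKeeper znode
-- configured in <distributed_ddl><path> (by default /clickhouse/task_queue/ddl).
CREATE TABLE default.events ON CLUSTER mycluster
(
    id UInt64,
    ts DateTime
)
ENGINE = MergeTree
ORDER BY id;

-- Each node's DDLWorker picks up its task, executes it locally and reports
-- the status back; the progress is visible in system.distributed_ddl_queue.
SELECT entry, host, status, query
FROM system.distributed_ddl_queue
ORDER BY entry DESC
LIMIT 20;
```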

The common issue is the different hostnames/IPAddresses in the cluster definition and locally.

-So a node initiator puts tasks for a host named Host1. But the Host1 thinks about own name as localhost or **xdgt634678d** (internal docker hostname) and never sees tasks for the Host1 because is looking tasks for **xdgt634678d.** The same with internal VS external IP addresses.
+So the initiator node puts tasks for a host named Host1, but Host1 considers its own name to be localhost or **xdgt634678d** (an internal Docker hostname) and never sees the tasks for Host1, because it is looking for tasks addressed to **xdgt634678d**. The same happens with internal vs. external IP addresses.
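To spot such a mismatch, you can compare what the node calls itself with what the cluster definition says; a sketch, with `mycluster` as a placeholder:

```sql
-- The name this node believes it has
SELECT hostName();

-- The names/addresses the cluster definition uses; is_local shows whether
-- ClickHouse recognizes any of these entries as the local node.
SELECT cluster, shard_num, replica_num, host_name, host_address, is_local
FROM system.clusters
WHERE cluster = 'mycluster';
```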

-Another issue that sometimes DDLWorker thread can crash then ClickHouse node stops to execute `ON CLUSTER` tasks.
+## DDLWorker thread crashed
+
+Sometimes the DDLWorker thread crashes; the ClickHouse node then stops executing `ON CLUSTER` tasks.

Check that DDLWorker is alive:

@@ -36,6 +41,7 @@ config.xml
<yandex>
    <distributed_ddl>
        <path>/clickhouse/task_queue/ddl</path>
+        <pool_size>1</pool_size>
        <max_tasks_in_queue>1000</max_tasks_in_queue>
        <task_max_lifetime>604800</task_max_lifetime>
        <cleanup_delay_period>60</cleanup_delay_period>
@@ -50,3 +56,25 @@ Default values:
**task_max_lifetime** = 7 \* 24 \* 60 \* 60 (in seconds = week) – Delete task if its age is greater than that.

**max_tasks_in_queue** = 1000 – How many tasks could be in the queue.
+
+**pool_size** = 1 – How many `ON CLUSTER` queries can be run simultaneously.
+
+## Too intensive stream of ON CLUSTER commands
+
+Generally this is a sign of bad design, but you can increase the `pool_size` setting in the `<distributed_ddl>` section of config.xml (shown above).
+
+## Stuck DDL tasks in the distributed_ddl_queue
+
+Sometimes [DDL tasks](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/) (the ones that use `ON CLUSTER`) can get stuck in the `distributed_ddl_queue`, because the replicas can get overloaded when many DDLs (thousands of CREATE/DROP/ALTER statements) are executed at the same time. This is very common in heavy ETL jobs. It can be detected by checking the `distributed_ddl_queue` table to see whether there are tasks that are not moving or have been stuck for a long time.
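One possible detection query; a sketch whose column names follow recent ClickHouse versions and may differ in older releases:

```sql
-- Unfinished distributed DDL tasks, oldest first
SELECT entry, initiator_host, host, status, query_create_time, query
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time ASC
LIMIT 50;
```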
+
+If these DDLs completed on some replicas but failed on others, the simplest way to solve this is to execute the failed commands on the missing replicas directly, without `ON CLUSTER`. If most of the DDLs failed, also check the number of unfinished records in `distributed_ddl_queue` on the other nodes, because it will most probably be in the thousands there as well.
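To check the other nodes without logging into each of them, something like the following can be used; a sketch where `mycluster` is a placeholder and `clusterAllReplicas` requires interserver access between the nodes:

```sql
-- Distributed DDL tasks per node and status across the whole cluster
SELECT hostName() AS node, status, count() AS tasks
FROM clusterAllReplicas('mycluster', system.distributed_ddl_queue)
GROUP BY node, status
ORDER BY node, status;
```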
+
+First, back up `distributed_ddl_queue` into a table so that you have a snapshot of the tasks and their states. You can do this with the following command:
+
+```sql
+CREATE TABLE default.system_distributed_ddl_queue AS SELECT * FROM system.distributed_ddl_queue;
+```
+
+After this, check in the backup table which tasks did not finish and execute them manually on the missing replicas, and review the pipelines that issue `ON CLUSTER` commands so that they do not abuse them. `CREATE TEMPORARY TABLE` can be used to avoid the `ON CLUSTER` command in some cases: when you need an intermediate table to do some operations, you can `INSERT INTO` the final table from it or do `ALTER TABLE final ATTACH PARTITION FROM TABLE temp`, and this temporary table is dropped automatically after the session is closed.
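For illustration, a sketch of both steps; the snapshot table name comes from the backup command above, while `tmp_batch`, `source_table`, `final`, and the partition value are hypothetical, and the commented `ATTACH PARTITION ... FROM` variant assumes a temporary table whose engine and structure match `final`:

```sql
-- Which queries still have unfinished hosts, according to the snapshot
SELECT query, groupArray(host) AS unfinished_hosts
FROM default.system_distributed_ddl_queue
WHERE status != 'Finished'
GROUP BY query
ORDER BY query;

-- The temporary-table pattern: the intermediate table exists only in the
-- current session, so it needs no ON CLUSTER statement at all.
CREATE TEMPORARY TABLE tmp_batch AS
SELECT id, ts FROM source_table WHERE ts >= today();

-- Move the data into the final (replicated) table:
INSERT INTO final SELECT * FROM tmp_batch;
-- or, if the temporary table was created with a matching MergeTree engine:
-- ALTER TABLE final ATTACH PARTITION '2024-01-01' FROM tmp_batch;

-- tmp_batch is dropped automatically when the session closes.
```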