Debugging and reporting tips for complex issues

Some potential bugs in IRRD are easily isolated and reproducible, like "this object is rejected but that syntax should be accepted". However, there are also some that depend on an interaction of multiple components of IRRD, Redis, PostgreSQL, the OS, client behaviour, the network, etc. These may also appear only some of the time, and the conditions triggering them may be unknown.

This page is meant to provide hints on what you can debug yourself, and what information is helpful in a GitHub issue.

Information about your environment

When reporting, include:

The OS, IRRD, Python, PostgreSQL and Redis versions you are running.
The specifications of the server (memory, CPU cores).
Your settings, particularly the number of workers configured.
Your typical load. Are you mirroring most of the other IRRs? Are you only querying it with your own tool once a day? Do you process authoritative changes? If you know of large query spikes, are these each on separate connections, or on a few connections that each get many queries?
Whether clients connect directly to your IRRD instance, or you have any kind of active networking devices in front, like an IDS, that may interfere with the connection.
Any correlations you suspect based on the usage or other parameters of your instance.
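
Most of the version and hardware details in this list can be gathered with a short script. The following is a minimal sketch, not an official IRRD tool. It assumes that it runs in the same Python environment as IRRD, that the installed distribution is named irrd, and that psql and redis-server are on the PATH. Note that psql reports the client version, which may differ from the server version. The function names are made up for this example.

```python
import importlib.metadata
import os
import platform
import shutil
import subprocess


def tool_version(command):
    """Return the first line of `<command> --version`, or None if unavailable."""
    if shutil.which(command) is None:
        return None
    try:
        result = subprocess.run(
            [command, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None


def irrd_version():
    """Return the installed IRRD version (assumes the distribution is named 'irrd')."""
    try:
        return importlib.metadata.version("irrd")
    except importlib.metadata.PackageNotFoundError:
        return None


def memory_gib():
    """Return total physical memory in GiB where sysconf exposes it, else None."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    except (AttributeError, ValueError, OSError):
        return None


def environment_report():
    """Collect the version and hardware facts into a dict."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "irrd": irrd_version(),
        "postgresql_client": tool_version("psql"),  # client, not server, version
        "redis_server": tool_version("redis-server"),
        "cpu_cores": os.cpu_count(),
        "memory_gib": memory_gib(),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value if value is not None else 'unknown'}")
```

Anything reported as unknown still needs to be filled in by hand, as do your settings and load description.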

Logs to review

The IRRD log
Kernel logs (some complex IRRD issues were triggered by the OOM killer)
PostgreSQL logs
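
To check kernel logs for OOM killer activity, you can search saved log text for the phrases the Linux kernel commonly prints. This is a rough sketch: the exact wording differs between kernel versions, so the pattern is an assumption and a starting point rather than a complete list. The sample log lines are invented for illustration.

```python
import re

# Phrases commonly seen around an OOM kill; wording varies by kernel version.
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory: Kill", re.IGNORECASE)


def find_oom_lines(log_text):
    """Return the log lines that look like OOM killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


# Invented sample, only to show the shape of the input.
SAMPLE_LOG = "\n".join(
    [
        "[1000.000001] eth0: link becomes ready",
        "[2000.000002] python3 invoked oom-killer: gfp_mask=0x0, order=0",
        "[2000.000003] Out of memory: Killed process 4242 (python3)",
    ]
)

if __name__ == "__main__":
    for line in find_oom_lines(SAMPLE_LOG):
        print(line)
```

In practice, the input could be the saved output of dmesg or journalctl -k covering the time of the incident.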

Inspection of the troubled state

While the instance is in a broken or troubled state, start by looking at resource usage. For overall system stats, vmstat is probably your best tool: run it with -w and an interval of 5 to 10 seconds to get reasonable averages. Also use top or one of its variants to see stats for specific processes. Some things to look for:

Is the system swapping in and out a lot (si, so)? That almost certainly means you are low on memory. Using swap in itself is not always an issue.
What is CPU time spent on? Is it mostly IO wait? User? Is the CPU mostly idle? This doesn’t tell you much in itself, but is an important hint for debugging.
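
If you save vmstat output to a file, the swap and CPU columns can be averaged with a few lines of code. The sketch below assumes the usual column header row containing si, so, us, sy, id and wa, and looks columns up by name because the column set differs between vmstat versions. The sample numbers are invented. The first vmstat row reports averages since boot, so the example skips it.

```python
def parse_vmstat(text):
    """Turn vmstat text output into a list of {column: value} samples."""
    columns = None
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            columns = fields  # column header row; vmstat repeats it periodically
        elif columns and len(fields) == len(columns) and all(f.isdigit() for f in fields):
            samples.append(dict(zip(columns, (int(f) for f in fields))))
    return samples


def summarize(samples):
    """Average the swap and CPU columns over the given samples."""
    if not samples:
        return {}
    names = ("si", "so", "us", "sy", "id", "wa")
    return {name: sum(s[name] for s in samples) / len(samples) for name in names}


# Invented sample in the shape of `vmstat -w 5` output.
SAMPLE = """\
--procs-- ----------memory---------- ---swap-- -----io---- -system-- ------cpu------
 r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 8000000 100000 2000000    0    0     5    10  100  200  3  1 95  1  0
 2  3 512000  150000  20000  300000  120  300   900  1500  400  800 10  5 45 40  0
 1  4 600000  140000  20000  280000   80  500   700  1700  420  820  8  4 28 60  0
"""

if __name__ == "__main__":
    # Skip the first row: it covers the time since boot, not the last interval.
    for name, value in summarize(parse_vmstat(SAMPLE)[1:]).items():
        print(f"{name}: {value:.1f}")
```

Sustained non-zero si and so averages point at memory pressure, while a high wa average points at IO wait.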

Socket state is another good hint: inspect it for whois with ss -itape src :43. Are there many open connections? What state are they in?
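
With many connections, a count per state is easier to read than the raw ss output. The following sketch assumes the output has the state in the first column (as is typical when only TCP sockets are listed) and that the extra detail lines printed by -i are indented. The sample output is invented and uses documentation addresses.

```python
from collections import Counter


def count_socket_states(ss_output):
    """Count sockets per state in the text output of ss."""
    states = Counter()
    for line in ss_output.splitlines():
        if not line or line[0].isspace():
            continue  # blank line, or an indented detail line from -i
        state = line.split()[0]
        if state == "State":
            continue  # header row
        states[state] += 1
    return states


# Invented sample in the shape of `ss -itape src :43` output.
SAMPLE_SS = "\n".join(
    [
        "State      Recv-Q Send-Q Local Address:Port Peer Address:Port Process",
        "LISTEN     0      128    0.0.0.0:43         0.0.0.0:*",
        "ESTAB      0      0      192.0.2.10:43      198.51.100.7:51234",
        "\t cubic wscale:7,7 rto:204 rtt:1.5/0.7",
        "ESTAB      0      0      192.0.2.10:43      198.51.100.8:40022",
        "\t cubic wscale:7,7 rto:204 rtt:2.1/0.9",
        "CLOSE-WAIT 1      0      192.0.2.10:43      203.0.113.5:33010",
    ]
)

if __name__ == "__main__":
    for state, count in count_socket_states(SAMPLE_SS).most_common():
        print(f"{state}: {count}")
```

A large number of connections stuck in one state, compared to what you normally see, is worth mentioning in your report.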

From version 4.3 onwards, IRRD processes log a traceback when they receive a SIGUSR1. This is only a snapshot, but it will tell you what a certain process is currently doing or waiting for.
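
Sending the signal from Python is equivalent to using kill -USR1 on the command line. This is a minimal sketch with a made-up function name. How you find the right PID is up to you. Be aware that the default action for SIGUSR1 is to terminate the process, so only send it to processes you know install a handler, such as IRRD 4.3 or later.

```python
import os
import signal


def request_traceback(pid):
    """Send SIGUSR1 to one process so that it logs a traceback."""
    # Caution: a process without a SIGUSR1 handler is terminated by this signal.
    os.kill(pid, signal.SIGUSR1)
```

For example, request_traceback(12345) signals the process with PID 12345, after which the traceback should appear in the IRRD log.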

While it can be a firehose, strace can be a great inspection tool, as it shows all system calls. Run it with -p {pid} -s 1000 -tt -yy to log full strings, timestamps, and information about file descriptors. Check closely which process you are monitoring: IRRD has many processes and only some of them process user queries.
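
To help pick the right process, the sketch below lists candidate PIDs from /proc on Linux and prints a matching strace command for each. It is an illustration under assumptions: matching on the substring "irrd" in the command line is a guess at how your processes show up, the function names are made up, and the optional -o flag (write strace output to a file) is an addition to the options named above.

```python
import os
import shlex


def strace_argv(pid, output_path=None):
    """Build the strace command line with the options recommended above."""
    argv = ["strace", "-p", str(pid), "-s", "1000", "-tt", "-yy"]
    if output_path is not None:
        argv += ["-o", output_path]  # extra: write the trace to a file
    return argv


def find_pids(needle):
    """Return {pid: command line} for processes matching needle (Linux /proc only)."""
    matches = {}
    if not os.path.isdir("/proc"):
        return matches
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as handle:
                raw = handle.read()
        except OSError:
            continue  # process exited in the meantime, or is not readable
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if needle in cmdline:
            matches[int(entry)] = cmdline
    return matches


if __name__ == "__main__":
    # "irrd" is an assumed substring; adjust it to match your process names.
    for pid, cmdline in sorted(find_pids("irrd").items()):
        print(f"# {cmdline}")
        print(shlex.join(strace_argv(pid)))
```

The script only prints commands. Review the list and run strace yourself against the process you actually want to inspect.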

Redis is rarely involved in issues, as IRRD currently uses it to aggregate and transfer preloaded data. This data is then kept in memory - user queries are not passed to Redis. However, if you suspect it could be involved, you can monitor it with redis-cli monitor.
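
The output of redis-cli monitor can be long, so a per-command count may help you judge whether Redis is unusually busy. This sketch assumes the usual monitor line format of a timestamp, a bracketed database number and client address, and then the quoted command. The sample lines and key names are invented and do not reflect what IRRD actually sends.

```python
import re
from collections import Counter

# Timestamp, "[db client]", then the quoted command name.
MONITOR_LINE = re.compile(r'^\d+\.\d+ \[\d+ [^\]]+\] "([^"]*)"')


def count_commands(monitor_output):
    """Count Redis commands seen in saved `redis-cli monitor` output."""
    counts = Counter()
    for line in monitor_output.splitlines():
        match = MONITOR_LINE.match(line)
        if match:
            counts[match.group(1).upper()] += 1
    return counts


# Invented sample lines, only to show the shape of the input.
SAMPLE_MONITOR = "\n".join(
    [
        "OK",
        '1700000000.000001 [0 127.0.0.1:50000] "ping"',
        '1700000000.100002 [0 127.0.0.1:50000] "get" "example-key"',
        '1700000000.200003 [0 127.0.0.1:50001] "GET" "other-key"',
    ]
)

if __name__ == "__main__":
    for command, count in count_commands(SAMPLE_MONITOR).most_common():
        print(f"{command}: {count}")
```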
