Mute/unmute overhead interferes with checkpoint barriers #3120

@slfritchie

Is this a bug, feature request, or feedback?

Bug

What is the current behavior?

Intermittent crash during checkpoint processing

What is the expected behavior?

No crash

What OS and version of Wallaroo are you using?

Ubuntu Bionic/18.04 LTS + Wallaroo @ commit 35d2038

Steps to reproduce?

See README.md in tarball at http://wallaroolabs-dev.s3.amazonaws.com/scott/count2.tar.gz. Instructions include options for building & running a demonstration test via a VM or Docker.

reset.sh
start-cluster.sh 4

Running these two commands can occasionally yield a crash a few seconds after the start-cluster.sh script finishes. See the full logs at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1583892856.tar.gz. On a 1 CPU/5GB RAM virtual machine, the crash seems to happen roughly 50% of the time.

The crash becomes more likely as the cluster size increases. It always seems to occur during the second checkpoint operation.

$ tail /tmp/wallaroo.2
1583892717.934412,Unmuting DataChannel
1583892717.934418,Unmuting DataChannel
1583892717.934425,Unmuting DataChannel
1583892717.934431,Unmuting DataChannel
1583892718.090417,Sent control message to initializer: EventLogAckCheckpointMsg
1583892718.091238,Sent control message to initializer: WorkerAckBarrierMsg
1583892718.102630,Sent control message to initializer: EventLogAckCheckpointIdWrittenMsg
1583892719.111070,ERROR,Step,Invariant violation: received barrier CheckpointBarrierToken(2) is greater than current barrier CheckpointBarrierToken(1) at Step 193591313640807744353045639962347611769

Invariant violated in /build2/.deps/wallaroolabs/wallaroo/lib/wallaroo/core/step/step_phase.pony at line 219
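
To make it easier to confirm the same failure across repeated runs, here is a minimal Python sketch that scans worker log lines for the invariant violation shown above and extracts the two barrier token ids. It assumes only the `timestamp,message` line layout visible in the excerpt. The names `find_violation` and `Violation` are hypothetical helpers for illustration and are not part of Wallaroo.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Matches the violation text as it appears in the log excerpt above.
VIOLATION_RE = re.compile(
    r"received barrier CheckpointBarrierToken\((\d+)\) is greater than "
    r"current barrier CheckpointBarrierToken\((\d+)\) at Step (\d+)"
)


class Violation(NamedTuple):
    """Hypothetical record for one barrier invariant violation."""
    timestamp: float
    received: int
    current: int
    step_id: int


def find_violation(lines: Iterable[str]) -> Optional[Violation]:
    """Return the first barrier invariant violation found, or None."""
    for line in lines:
        # Assumption: each log line is "<epoch seconds>,<message>".
        ts, _, rest = line.partition(",")
        match = VIOLATION_RE.search(rest)
        if match:
            received, current, step_id = (int(g) for g in match.groups())
            return Violation(float(ts), received, current, step_id)
    return None


# Two lines copied from the excerpt above.
sample = [
    "1583892718.091238,Sent control message to initializer: WorkerAckBarrierMsg",
    "1583892719.111070,ERROR,Step,Invariant violation: received barrier "
    "CheckpointBarrierToken(2) is greater than current barrier "
    "CheckpointBarrierToken(1) at Step 193591313640807744353045639962347611769",
]
v = find_violation(sample)
print(v)
```

Applied to a file such as /tmp/wallaroo.2, for example `find_violation(open("/tmp/wallaroo.2"))`, the function could be used to check whether the received token is always 2 while the current token is 1, which would match the observation that the crash happens during the second checkpoint.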
