Hi,
I'm having trouble stopping and restarting a cluster.
Stopping works fine (i.e. flintrock stop my-cluster).
However, when I try to start it again (flintrock start my-cluster), the instances fail 1 of the 2 sanity checks, cannot be reached even with console SSH login, and the cluster won't start.
My guess is that this is related to the ephemeral storage: as the system log below shows, the instance goes into a "recovery mode" because no ext4 filesystem is found on the ephemeral volume.
Mounting /media/ephemeral0...
[ 4.953751] EXT4-fs (nvme1n1): VFS: Can't find ext4 filesystem
Do you have any idea what might be causing this?
Thanks for your kind help.
Andrea
Here is a more complete log file, followed by my flintrock config.
Starting Apply Kernel Variables...
[  OK  ] Started Apply Kernel Variables.
[  OK  ] Created slice system-ec2net\x2difup.slice.
Starting Relabel kernel modules early in the boot, if needed...
[  OK  ] Started Relabel kernel modules early in the boot, if needed.
[  OK  ] Found device Elastic Network Adapter (ENA).
[  OK  ] Started Monitoring of LVM2 mirrors,...ng dmeventd or progress polling.
[  OK  ] Reached target Local File Systems (Pre).
Mounting /media/ephemeral0...
[ 4.953751] EXT4-fs (nvme1n1): VFS: Can't find ext4 filesystem
[FAILED] Failed to mount /media/ephemeral0.
See 'systemctl status media-ephemeral0.mount' for details.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Migrate local... structure to the new structure.
[DEPEND] Dependency failed for Relabel all filesystems, if necessary.
[DEPEND] Dependency failed for Mark the need to relabel after reboot.
Starting Preprocess NFS configuration...
[  OK  ] Reached target Timers.
[  OK  ] Reached target Network (Pre).
[  OK  ] Reached target Cloud-init target.
[  OK  ] Reached target Network.
Starting Initial cloud-init job (metadata service crawler)...
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Paths.
[  OK  ] Reached target Sockets.
Starting Create Volatile Files and Directories...
Starting Tell Plymouth To Write Out Runtime Data...
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.
[  OK  ] Started Preprocess NFS configuration.
[  OK  ] Started Create Volatile Files and Directories.
Starting RPC bind service...
Mounting RPC Pipe File System...
Starting Security Auditing Service...
[ 5.025955] RPC: Registered named UNIX socket transport module.
[ 5.025956] RPC: Registered udp transport module.
[ 5.025957] RPC: Registered tcp transport module.
[ 5.025957] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  OK  ] Started RPC bind service.
[  OK  ] Mounted RPC Pipe File System.
[  OK  ] Started Security Auditing Service.
Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Reached target rpc_pipefs.target.
[  OK  ] Reached target NFS client services.
[  OK  ] Reached target Remote File Systems (Pre).
[  OK  ] Reached target Remote File Systems.
[  OK  ] Started Update UTMP about System Boot/Shutdown.
Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.
[  OK  ] Started Tell Plymouth To Write Out Runtime Data.
[  OK  ] Started udev Wait for Complete Device Initialization.
Starting Activation of DM RAID sets...
[ 5.305390] device-mapper: uevent: version 1.0.3
[ 5.309580] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@redhat.com
[  OK  ] Started Activation of DM RAID sets.
[  OK  ] Reached target Local Encrypted Volumes.
[ 4.977500] cloud-init[2346]: Cloud-init v. 19.3-46.amzn2.0.1 running 'init' at Thu, 02 May 2024 18:43:55 +0000. Up 4.95 seconds.
[ 4.993484] cloud-init[2346]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[ 4.997062] cloud-init[2346]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 4.997895] cloud-init[2346]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[ 4.997985] cloud-init[2346]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 4.999620] cloud-init[2346]: ci-info: | eth0 | False | . | . | . | (masked by me) |
[ 5.013097] cloud-init[2346]: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
[ 5.016240] cloud-init[2346]: ci-info: | lo | True | ::1/128 | . | host | . |
[ 5.017904] cloud-init[2346]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 5.018004] cloud-init[2346]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[ 5.021742] cloud-init[2346]: ci-info: +-------+-------------+---------+-----------+-------+
[ 5.021831] cloud-init[2346]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[ 5.023449] cloud-init[2346]: ci-info: +-------+-------------+---------+-----------+-------+
[ 5.044822] cloud-init[2346]: ci-info: +-------+-------------+---------+-----------+-------+
[  OK  ] Started Initial cloud-init job (metadata service crawler).
[  OK  ] Reached target Cloud-config availability.
[  OK  ] Reached target Network is Online.
Starting Notify NFS peers of a restart...
[  OK  ] Started Notify NFS peers of a restart.
Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to
try again to boot into default mode.
Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.
Press Enter to continue.
services:
  spark:
    version: 3.5.1
    download-source: "s3://xxxx/flintrock/spark/spark-{v}/"
    # executor-instances: 1
  hdfs:
    version: 3.3.6
    download-source: "s3://xxxx/flintrock/hadoop/hadoop-{v}/"

provider: ec2

providers:
  ec2:
    key-name: xxx
    identity-file: /home/xxx/spark/xxx.pem
    instance-type: i4i.xlarge
    #instance-type: m5d.large
    region: eu-central-1
    # availability-zone: <name>
    ami: ami-0a946522147cbcbcc  # Amazon Linux 2, us-east-1
    user: ec2-user
    # spot-price: <price>
    vpc-id: *masked*
    subnet-id: *masked*
    # placement-group: <name>
    security-groups:
      - sg_xxx
      # - group-name2
    instance-profile-name: role_xx
    tags:
      - owner,spark_cluster
      # - key2, value2  # leading/trailing spaces are trimmed
      # - key3,  # value will be empty
    # min-root-ebs-size-gb: <size-gb>
    tenancy: default  # default | dedicated
    ebs-optimized: no  # yes | no
    #min-root-ebs-size-gb: 120
    instance-initiated-shutdown-behavior: terminate  # terminate | stop
    user-data: /home/ec2-user/spark/user-data.sh
    # authorize-access-from:
    #   - 10.0.0.42/32
    #   - sg-xyz4654564xyz

launch:
  num-slaves: 1
  install-hdfs: True
  install-spark: True
  # java-version: 8

debug: true
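
One way to test the ephemeral-storage guess: EC2 instance store volumes (such as the NVMe disk on an i4i.xlarge) lose their contents when an instance is stopped, so after a restart the disk would come back without a filesystem. In the log above, the failed /media/ephemeral0 mount is followed by "Dependency failed for Local File Systems" and then emergency mode. systemd generally treats an /etc/fstab mount without the nofail option as required at boot. The sketch below flags such entries. It is an illustration only: the function name and the sample fstab contents are hypothetical and not taken from the actual instance, and it assumes the mount is defined in /etc/fstab, which the log does not confirm.

```python
def required_fstab_mounts(fstab_text):
    """Return mount points whose fstab entry would block boot if the mount fails.

    Simplified rule (an assumption, not a full systemd model): an entry is
    treated as boot-critical unless its options include 'nofail' or 'noauto'.
    The root filesystem and swap entries are skipped.
    """
    required = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fs_type, options = fields[:4]
        if mount_point == "/" or fs_type == "swap":
            continue
        opts = options.split(",")
        if "nofail" not in opts and "noauto" not in opts:
            required.append(mount_point)
    return required


# Hypothetical fstab contents, for illustration only.
SAMPLE_FSTAB = """\
# device                     mount point        type  options           dump pass
UUID=hypothetical-root-uuid  /                  xfs   defaults,noatime  1    1
/dev/nvme1n1                 /media/ephemeral0  ext4  defaults          0    2
"""

print(required_fstab_mounts(SAMPLE_FSTAB))
```

If the real /etc/fstab on the instance (readable before the first stop, e.g. via cat /etc/fstab) shows the ephemeral mount without nofail, that would be consistent with the boot failure in the log.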