Friday, April 13, 2018

Well-designed filesystem with a poorly designed command-line utility

This is a story about btrfs and a Docker build server. Decommissioned hardware with an AMD Phenom X4 was brought back to life after the Intel Spectre/Meltdown bugs; however, the problem of loosely seated RAM modules went unnoticed. The electrical connection between the pins was unreliable, and from time to time the headless server crashed due to corrupted memory. After a cold restart it worked again. Until it was put under load building Docker images. One day ssh failed with an I/O error from the filesystem, and after a warm reboot the server was dead. After attaching a display I could see that the boot process crashed early in btrfs with open_ctree failed.

OK, the initrd has btrfsck. Let's try btrfsck --repair. According to the output, it is a disaster.

Let's check what else this tool offers: -b, use the first valid backup root copy. The command-line tool did something, but rerunning --repair still changed nothing. Assessment of the situation: btrfs seriously damaged, no backup, since it was a simple server for building Docker images. Nothing important was on the disks; everything important had been pushed to the cloud. Reinstall Linux, this time with ZFS? OK, booting the USB installer. Next, next, hmm, the installer sees the btrfs volumes. How is that possible? Running dmesg. The kernel sees the partition and tries to replay transactions, but fails on checksums.
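For anyone retracing the dmesg step, a minimal sketch of how the kernel log can be narrowed down to the btrfs messages. The helper name btrfs_lines is made up for illustration; it is plain grep underneath.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the original post):
# keep only kernel log lines that mention btrfs, case-insensitively.
# It reads stdin, so it works on live dmesg output or on a saved log file.
btrfs_lines() {
    grep -i 'btrfs'
}

# Typical use from the installer's shell (may require root):
#   dmesg | btrfs_lines
```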

OK, now: btrfsck --repair -b --init-csum-tree. Waiting....

Mounting. Now the kernel fails on extents. How about: btrfsck --repair -b --init-csum-tree --init-extent-tree?

After many minutes fsck finishes and I'm able to mount the filesystem read-write. Listing the contents, and there are /root and /home. Reboot, and the server is fully operational!
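The escalation I went through, collected in one place as a rough sketch. The function name repair_cmd and the device /dev/sdX are placeholders of mine (the real device is not named in this post), and the script only prints the commands instead of running them, because every one of these steps rewrites filesystem metadata.

```shell
#!/bin/sh
# Hypothetical helper: print the btrfsck invocation for a given escalation step.
# Steps mirror the order tried in this post; nothing is executed here.
repair_cmd() {
    step=$1
    dev=$2
    case "$step" in
        1) echo "btrfsck --repair $dev" ;;
        2) echo "btrfsck --repair -b $dev" ;;
        3) echo "btrfsck --repair -b --init-csum-tree $dev" ;;
        4) echo "btrfsck --repair -b --init-csum-tree --init-extent-tree $dev" ;;
        *) echo "unknown step: $step" >&2; return 1 ;;
    esac
}

# Dry run: list the commands in escalation order.
# /dev/sdX is a placeholder, replace it with the damaged btrfs device.
for step in 1 2 3 4; do
    repair_cmd "$step" /dev/sdX
done
```

In my case only the last step produced a mountable filesystem; on other damage an earlier step may already be enough, so it makes sense to try mounting after each one.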

The filesystem is probably rock solid, but fsck is written in a way that does nothing to help people recover a filesystem.







