Regarding kernel testing for production.


Nick Alcock, 22/12/2012

 +Diego Cadogan, +Theodore Ts’o is, naturally, quite right. I think we can safely presume for the time being that if you’re new to ext4 and not mkfsing or mounting filesystems with a bunch of cool-sounding options you plucked out of the manpage, it is not metadata checksums at fault, nor journal checksums, since neither of these is on by default. Distros are not using metadata checksums yet (except perhaps bleeding-edge mad distros like Arch): you will not see effects relating to metadata checksums with a normal distro unless you take special measures.
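
(If you want to verify what your own filesystem actually has enabled, the superblock feature list will tell you. A minimal sketch, in Python for convenience; /dev/sda1 is a placeholder device, and it assumes an e2fsprogs recent enough to know the metadata_csum feature name:

    #!/usr/bin/env python
    # Minimal sketch: report which checksum-related ext4 features are enabled.
    # /dev/sda1 is a placeholder; substitute your block device.
    import subprocess

    DEVICE = "/dev/sda1"

    # `dumpe2fs -h` prints only the superblock summary, feature list included.
    out = subprocess.check_output(["dumpe2fs", "-h", DEVICE],
                                  stderr=subprocess.STDOUT).decode()

    features = []
    for line in out.splitlines():
        if line.startswith("Filesystem features:"):
            features = line.split(":", 1)[1].split()

    # metadata_csum = full metadata checksums; uninit_bg = group-descriptor
    # checksums only; has_journal = a journal exists (checksumming the
    # journal is a separate mount option, not a superblock feature).
    for feat in ("metadata_csum", "uninit_bg", "has_journal"):
        print("%-15s %s" % (feat, "enabled" if feat in features else "off"))

On a stock distro install you should see has_journal and usually uninit_bg, but not metadata_csum.)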

ext4 is incredibly reliable, widely used software. Like all such software, nearly all bugs in it are the result of a long chain of improbable events. While much software is so buggy that you can pick a random feature and probably find a bug in it, ext4 is the opposite: virtually all features, in isolation, work fine. You need to do whole chains of things wrong (or have faulty hardware) to see problems. In my case, for example, it took journal checksums and no-barrier mode and asynchronous journal commit and a really weird non-distro shutdown script that routinely rebooted in the middle of umount to see corruption; even then I didn’t see corruption every time until I intentionally hacked things to amplify the effect. This wasn’t a case of “oops, using one option causes corruption every time”: I have never seen such a thing with production-quality ext4 options, and never expect to.

You also cannot say ‘ooh, similar symptoms, must be the same bug’. Lots of things can almost certainly cause the symptoms that I saw. Note also that your symptoms are clearly different from mine: I was seeing silent corruption, where fsck claimed the filesystem was clean unless you forced a full fsck.
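
(To make the chain concrete: all three of the dangerous pieces are visible as mount options, so you can check whether anything on a running system is mounted that way by scanning /proc/mounts. A minimal sketch; the option names are the real ext4 ones, everything else is illustrative:

    #!/usr/bin/env python
    # Minimal sketch: flag ext4 filesystems mounted with the risky option
    # combination described above (journal checksums + async journal commit
    # + no write barriers).
    RISKY = {"journal_checksum", "journal_async_commit",
             "nobarrier", "barrier=0"}

    with open("/proc/mounts") as mounts:
        for line in mounts:
            # /proc/mounts fields: device mountpoint fstype options dump pass
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype != "ext4":
                continue
            found = RISKY.intersection(options.split(","))
            if found:
                print("%s on %s uses: %s"
                      % (device, mountpoint, ", ".join(sorted(found))))

If that prints nothing, you are not in the configuration I was testing.)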

 

First of all.

Ext4 is absolutely reliable for me, but it is not yet widely trusted or deployed across all the distros you might need in a production office.

I work with several production-ready environments, sometimes serving Samba, NFS, AFS and so on. I built my own kernel by hand so there was the least possible chance of data loss. The latest kernels always have the broadest filesystem compatibility; saying that is not at all the same as advocating the bleeding edge.
* Compatibility is not the same thing as stability, performance, or production-readiness.

>> cannot say ‘ooh, similar symptoms, must be the same bug’.
Yes I can. I have been modifying inodes on over 20 million files monthly across more than 15 years of work, so I am fairly sure of what is hardware-related and what is not.

>> until I intentionally hacked things to amplify the effect.
I agree. But errors from working environments are sometimes impossible to reproduce by deliberately hunting for them by hand; they only appear when you are actually producing something with the environment.
* We could postulate that speeding things up does not, by itself, introduce bugs.

Extending these points, to be clear about what I am talking about:
* I have never been an ext3/ext4 expert; by that I mean I cannot determine when the filesystem is journaling and when it is not.
* I used to determine whether a filesystem was journaling by its “learning times”, i.e. seek times (a rough version of that probe is sketched after this list).
* I assumed it was journaling because of the pattern of writes and checksum writes; you simply know when it is happening.
* I am sure you are talking about metadata checksums, while I was pointing at group-descriptor checksums (perhaps they are the same thing).
* By production-ready kernel numbering I mean, at most, the 2.6 kernels, depending on the platform and on the syscalls used by surrounding products (KVM, Xen, Perl, Python).
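
As a rough illustration of the timing probe mentioned above, a minimal sketch; the path and iteration count are arbitrary placeholders, and on its own this cannot prove journaling is on, only that synchronous writes are paying an extra commit cost:

    #!/usr/bin/env python
    # Rough sketch: time small synchronous writes. On a journalled filesystem
    # each fsync also waits for a journal commit, so per-write latency is
    # visibly higher than on a comparable non-journalled one.
    import os
    import time

    PATH = "/mnt/test/probe.dat"  # placeholder: a file on the fs under test
    ITERATIONS = 200

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.time()
    for i in range(ITERATIONS):
        os.write(fd, b"x" * 4096)  # one 4 KiB block
        os.fsync(fd)               # force it (and any journal commit) to disk
    elapsed = time.time() - start
    os.close(fd)

    print("%.2f ms per fsynced write" % (elapsed * 1000.0 / ITERATIONS))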

* Regarding external media failures and data loss, I am doubtful. I reorder roughly 60+ million files every two months, nearly 3 million of them tied to at least 5 production environments that I use alternately for realtime-application research, hardware testing and data parity itself. And so on; that makes things complicated for me unless I identify all the specific syscalls involved and write dedicated research tools for these topics, i.e. filesystem performance profiling.

* What I am sure of at this point is that I live-updated several kernels, one after another, without any of the supporting updates erring; e2fsprogs, for example, which should have been more or less mandatory given the environment (some bleeding-edge technologies like NFSv4) and the latest kernel line (3.2.35) on a production environment I use.
* Among the past three kernels there was one that worked really nicely, but some syscalls were broken, and the tools then became the slow and problematic ones.
* I have done kernel patching and kernel patch porting between different kernel versions, and it is one of the most difficult tasks I have taken on. I am really sure you are doing a LOT of work extending kernel capabilities and profiling.

* I have seen no guidelines for building production-ready environments from bleeding-edge distros (3.2, 2.6, etc.), not to mention the policy differences among distros, which are not my business so far.

* I assume I was at least lucky to be able to use the alternative superblocks, since the original one was destroyed by the system fsck itself (the salvage sequence is sketched at the end of this list).
* I am not sure at the moment whether I can devote significant time to writing filesystem-checking tools. Most of my checking is done by using the environment itself: Python processing millions of files at once, C# doing sqlite inserts, or the most commonly used performance-testing environments like Perl-BDB.
* I do overclocking for some of my tasks. I go without reboots for weeks or months.
* I test several different media, not just “production ready” ones, and it has never led to this kind of catastrophic versioning incident.
* It is mandatory for me to have *at least* one desktop environment. EXT4 support was *the only reason* I chose these desktops.
* I will likely be far more interested in how things develop from now on. I am certainly not a C++ developer, but I have successfully helped a number of developers find minor bugs in the kernel or in X, for example the SHM overflow found in past years while writing XCB Python and C applications.
* I have not lately seen filesystems without flags suited to the data processing I do. It is important to note that we have no redundant hardware: no RAID or ECC to back it all up.
* I am sorry if I raised an alert about something that might never happen in a “production environment”.
* I have a certain number of brain paralyses, so my daily activity is limited, but I promise I will add a task to do my part toward a production-ready ext4 for all media, not just USB devices or LVMs.
* I find it important to say that I had never done superblock salvage before, so this has been something of a first step toward cooperating with the kernel groups.
* I will be finding ways to stay in touch with active developers, which might be mandatory at this point.
* Thanks for everything, and thanks Teo for using Google Plus for sharing.
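
For reference, the superblock salvage mentioned above roughly follows this sequence; a minimal sketch with a placeholder device. mke2fs -n is a dry run that prints what it would do, including the backup superblock locations, without writing anything (valid only if the filesystem was created with default parameters, and newer mke2fs versions may prompt before proceeding); e2fsck -b N then repairs using the backup superblock at block N:

    #!/usr/bin/env python
    # Minimal sketch: locate backup superblocks, then fsck against one.
    # /dev/sdb1 is a placeholder for the damaged filesystem's device.
    import re
    import subprocess

    DEVICE = "/dev/sdb1"

    # Dry run: mke2fs -n prints the backup superblock locations only.
    out = subprocess.check_output(["mke2fs", "-n", DEVICE]).decode()

    # mke2fs output layout varies between versions, so parse loosely:
    # collect every number after the "Superblock backups" header until
    # the indented list ends.
    backups, collect = [], False
    for line in out.splitlines():
        if "Superblock backups stored on blocks" in line:
            collect = True
            line = line.split(":", 1)[1]
        elif collect and not line.startswith(("\t", " ")):
            collect = False
        if collect:
            backups.extend(int(n) for n in re.findall(r"\d+", line))

    print("backup superblocks at: %s" % backups)
    if backups:
        # Interactive repair against the first backup superblock.
        subprocess.call(["e2fsck", "-b", str(backups[0]), DEVICE])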
