Bug Data Storage Apple

One Developer's Experience With Real Life Bitrot Under HFS+ 396

Posted by timothy
from the so-really-it's-both-plus-and-minus dept.
New submitter jackjeff (955699) writes with an excerpt from developer Aymeric Barthe about data loss suffered under Apple's venerable HFS+ filesystem. HFS+ lost a total of 28 files over the course of 6 years. Most of the corrupted files are completely unreadable. The JPEGs typically decode partially, up to the point of failure. The raw .CR2 files usually turn out to be totally unreadable: either completely black or having a large color overlay on significant portions of the photo. Most of these shots are not so important, but a handful of them are. One of the CR2 files in particular, is a very good picture of my son when he was a baby. I printed and framed that photo, so I am glad that I did not lose the original. (Barthe acknowledges that data loss and corruption certainly aren't limited to HFS+; "bitrot is actually a problem shared by most popular filesystems. Including NTFS and ext4." I wish I'd lost only 28 files over the years.)

  • by carlhaagen (1021273) on Saturday June 14, 2014 @09:30AM (#47235971)
    I had an old partition of some 20,000 files, most of them 10 years or older, in which I found 7 or 8 files - coincidentally jpg images as well - that were corrupted. It struck me as nothing other than filesystem corruption, as the drive was and still is working just fine.
    • by istartedi (132515) on Saturday June 14, 2014 @09:57AM (#47236091) Journal

      coincidentally jpg images as well

      Well, JPGs are usually lossy and thus compressed. Flipping one bit in a compressed image file is likely to have severe consequences. OTOH, you could coXrupt a fewYentire byteZ in an uncompressed text file and it would still be readable. I suspect your drives also had a few "typos" that you didn't notice because of that.
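
      To make the point above concrete, here is a minimal Python sketch. It uses zlib as a stand-in for a compressed format (an assumption for illustration; JPEG uses a different scheme), and the helper names are made up for this example. It flips a single bit in a plain byte string and in its compressed form, then reports how much of each survives.

```python
import zlib
from typing import Tuple


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def survives_bit_flip(original: bytes, bit_index: int) -> Tuple[int, bool]:
    """Flip one bit in the plain and in the zlib-compressed form.

    Returns (number of plain bytes that changed,
             whether the damaged compressed copy still yields the original).
    """
    damaged_plain = flip_bit(original, bit_index)
    changed = sum(a != b for a, b in zip(original, damaged_plain))

    packed = zlib.compress(original)
    damaged_packed = flip_bit(packed, bit_index % (len(packed) * 8))
    try:
        intact = zlib.decompress(damaged_packed) == original
    except zlib.error:
        # Invalid stream or checksum mismatch: the whole payload is rejected.
        intact = False
    return changed, intact
```

      In the plain copy exactly one byte differs and the rest stays readable. The compressed copy typically fails to decode at all, which matches the "partially decoded JPEG" symptom described in the story.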

    • by Jane Q. Public (1010737) on Saturday June 14, 2014 @07:14PM (#47238303)
      I agree with istartedi.

      Ultimately, it isn't a "failure" of HFS+ when your files get corrupted. It was (definitely) a hardware failure. It's just that HFS+ didn't catch the error when it happened.

      Granted, HFS+ is due for an update. That's something I've said myself many times. But blaming it when something goes wrong is like blaming your Honda Civic for smashing your head in when you roll it. It wasn't designed with a roll cage. You knew that but you bought it anyway, and decided to hotdog.

      Checksums also have performance and storage costs. So there are several different ways to look at it. One thing I strongly suggest is keeping records of your drive's S.M.A.R.T. status, and comparing them from time to time. And encourage Apple to update their FS, rather than blaming it for something it didn't cause, or for not doing something it wasn't designed to do.
  • Backup? (Score:4, Insightful)

    by graphius (907855) on Saturday June 14, 2014 @09:36AM (#47235983) Homepage
    shouldn't you have backups?
    • Re:Backup? (Score:5, Insightful)

      by kthreadd (1558445) on Saturday June 14, 2014 @09:42AM (#47236023)

      The problem with bit rot is that backups don't help. The corrupted file goes into the backup and eventually replaces the good copy, depending on retention policy. You need a file system which uses checksums on all data blocks so that it can detect a corrupted block after reading it and flag the file as corrupted, so that you can restore it from a good backup.
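
      As a rough illustration of per-block checksumming, here is a Python sketch under stated assumptions: the 4096-byte block size and the use of CRC32 are arbitrary choices for the example, not what any particular filesystem does, and the function names are hypothetical.

```python
import zlib
from typing import List

BLOCK_SIZE = 4096  # illustrative only, not taken from any real filesystem


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> List[int]:
    """Compute one CRC32 per fixed-size block at write time."""
    return [
        zlib.crc32(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def corrupted_blocks(
    data: bytes, expected: List[int], block_size: int = BLOCK_SIZE
) -> List[int]:
    """At read time, return the indices of blocks whose CRC32 no longer matches."""
    actual = block_checksums(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

      This only detects damage and tells you which block is bad. Repairing it still needs a second copy or parity somewhere.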

      • Re:Backup? (Score:5, Insightful)

        by dgatwood (11270) on Saturday June 14, 2014 @10:09AM (#47236139) Journal

        Depends on the backup methodology. If your backup works the way Apple's backups do, e.g. only modified files get pushed into a giant tree of hard links, then there's a good chance the corrupted data won't ever make it into a backup, because the modification wasn't explicit. Of course, the downside is that if the file never gets modified, you only have one copy of it, so if the backup gets corrupted, you have no backup.

        So yes, in an ideal world, the right answer is proper block checksumming. It's a shame that neither of the two main consumer operating systems currently supports automatic checksumming in the default filesystem.

        • Re: (Score:2, Informative)

          The bitrot will change the checksums and cause the files to show up as modified.

          Moreover, what will you do about a reported bitrotted file unless you have genuine archival backups somewhere else?

          • If I remember correctly, that's not how Apple's current backup system works. Every time a file gets written to, there's a log someplace that records that the file was modified. Next time Time Machine runs, it backs up the files in that log. If the OS didn't actually modify the file, it won't get backed up.

            I may be wrong, but that's how I understood it.
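
            The difference the two posts above are arguing about can be sketched in Python. This is not how Time Machine is implemented; it is only a toy comparison of "did the metadata change?" against "did the content change?", with hypothetical helper names. The corruption helper rewrites one byte and then restores the timestamps, to mimic damage that happened underneath the filesystem.

```python
import hashlib
import os
from typing import Tuple


def metadata_fingerprint(path: str) -> Tuple[int, int]:
    """What a change-log or mtime-based backup effectively looks at."""
    st = os.stat(path)
    return st.st_size, st.st_mtime_ns


def content_fingerprint(path: str) -> str:
    """What a checksum-based tool looks at: a SHA-256 of the actual bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def simulate_silent_corruption(path: str) -> None:
    """Flip one bit of the first byte, then put the old timestamps back."""
    st = os.stat(path)
    with open(path, "r+b") as f:
        first = f.read(1)
        f.seek(0)
        f.write(bytes([first[0] ^ 0x01]))
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
```

            After the simulated corruption the metadata fingerprint is unchanged, so a backup driven by modification events would skip the file, while the content fingerprint differs.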

            • Re: (Score:2, Informative)

              by Anonymous Coward

              Mac OS X Time Machine works by listening to filesystem events, except for the first backup, where everything is copied over as is. Bit rot doesn't get transferred until you overwrite the file, by which time it should have been obvious something was fishy, or the bitrot was negligible and you didn't notice it yourself. There are also situations where Time Machine itself says "this backup is fishy, regenerate from scratch?". Happened last week, but only after a failed drive had to be replaced which caused a 150GB bac

    • by ZosX (517789)

      This is a good idea, but not a solution. Often you have no idea that the file is bad until after the fact, in this case years later. I've had mp3 collections get glitches here and there after a few copies from various drives. If you have no idea the data is bad in the first place, your backup of the data isn't going to be any better. I would say that all of my photography I've collected over the years has stayed readable somehow. I do check in lightroom every once in a while, but I wouldn't be shocked to fi

      • I have close to 4 terabytes of photography and video stored (not that kind of photography and video). I, too, have seen occasional unreadable files, typically in JPEGs but also an occasional TIFF file. Any compressed container (like a JPEG) is going to be more susceptible to this issue, thus JPEGs aren't a great storage format. Video files are harder to figure - a corrupted bit could easily get overlooked.

        I've never actually lost a picture that I was interested in - I always have more than one copy of th

        • by wagnerrp (1305589)

          Video files are harder to figure - a corrupted bit could easily get overlooked.

          Again, it depends on whether it is compressed or not. A corrupted bit in video with only intraframe compression will look just like a damaged JPEG. You may have an unreadable frame, or may have a corrupted macroblock or two in that frame. A corrupted bit in video with interframe compression will smear that corrupted frame or macroblock for potentially several seconds until you hit the next I-frame to flush the image.

          You can spend a lot more money getting near perfect replication but I don't think many people are willing to have a system with ECC memory throughout the chain.

          The common solution to this issue is software, not hardware. You have your filesystem co

    • by rnturn (11092)

      Even if you did have backups how could you even begin to know which saveset to restore from? You could have been backing up a corrupted file for a lo-o-ong time.

      Friends wonder why I still purchase physical books and CDs. This is why. I'll have to come up with a simple 2-3 sentence explanation of the problem the OP was describing for when they ask next time. I've had MP3 files made from my CD collection mysteriously become corrupted over time. No problem, I can just re-rip/convert/etc. but losing the origi

    • by gweihir (88907)

      You should have:
      1. backups
      2. redundancy
      3. regular integrity checks of your data

      Or alternatively, you should have been using an archival grade medium, like archival tape or (historically now unfortunately) MOD.

      What the OP did is just plain incompetent and stupid and if he had spent 15 minutes to find out how to properly archive data, he would now not be in this fix. Instead he made assumptions without understanding or verification against the real world and now blames others for his failure. Pathetic. Dunning-Kr

  • by Gaygirlie (1657131) <(gaygirlie) (at) (hotmail.com)> on Saturday June 14, 2014 @09:40AM (#47236005) Homepage

    Bitrot isn't the fault of the filesystem unless something is badly buggy. It's the fault of the underlying storage-device itself. Attacking HFS+ for something like that is just silly. Now, with that said there are filesystems out there that can guard against bitrot, most notably Btrfs and ZFS. Both Btrfs and ZFS can be used just like a regular filesystem where no parity-information or duplicate copies are saved and in such a case there is no safety against bitrot, but once you enable parity they can silently heal any affected files without issues. The downside? Saving parity consumes a lot more HDD-space, and that's why it's not done by default by most filesystems.
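
    For readers who have not seen how parity can rebuild data, here is a minimal Python sketch of plain XOR parity. It is a textbook illustration, not the on-disk scheme of Btrfs or ZFS, and the function names are made up. One parity block is stored per group of equal-sized data blocks. If a checksum identifies one block as bad, XOR-ing the surviving blocks with the parity block reproduces it.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(blocks: List[bytes], parity: bytes, bad_index: int) -> bytes:
    """Recompute the block at bad_index from the other blocks plus parity."""
    survivors = [b for i, b in enumerate(blocks) if i != bad_index]
    return xor_blocks(survivors + [parity])
```

    With one parity block per N data blocks the space overhead is 1/N, and only one bad block per group can be repaired. Parity alone cannot tell which block is bad; that part has to come from a checksum.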

    • by jbolden (176878)

      There is a 3rd possibility. As the size of the dataset increases you can construct a more complex error correcting code on that dataset with loss of spacing being 1/n. Note that's essentially saving information about the decoding and then the coded information, sort of like how compression works. Which for most files would be essentially free. And of course you could combine with this compression by default which might very well result in a net savings. But then you pick up computational complexity. Wi

      • It's not a matter of CPU load. Suppose you have one checksum block for every eight data blocks. In order to verify the checksum on read, you have to read the checksum block and all eight data blocks. So you have to read a total of nine blocks instead of one. Reading from the disk is one of the slowest operations in a computer, so doing it nine times instead of once slows things down considerably.

        • by jbolden (176878)

          You don't have checksum blocks in the space efficient method. Rather in the computational way I'm talking about it is a transformation. You might have something like every 6354 bits becomes 6311 bits after the complex transformation. It doesn't slow down the read but you have to do math.
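
          A small example of a code that is "a transformation" instead of a separate checksum block is the textbook Hamming(7,4) code, sketched below in Python. This is an assumption about the general family being described, not the specific code the poster has in mind. Hamming(7,4) spends 3 extra bits per 4 data bits, and larger block codes bring that overhead down. Any single flipped bit in a 7-bit codeword is located and corrected by arithmetic alone, with no extra reads.

```python
from typing import List


def hamming74_encode(data_bits: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> List[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```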

    • by jmitchel!jmitchel.co (254506) on Saturday June 14, 2014 @09:52AM (#47236073)
      Even with just checksums, knowing that there is corruption means knowing to restore from backups. And in the consumer space most people have plenty of space to keep parity if it comes to that.
  • Good backups aren't enough. If the filesystem isn't flagging corruption as it happens, the backup software will happily back up your corrupted data over and over until the last backup which has the valid file in it has expired or become unrecoverable itself.
  • by grub (11606) <slashdot@grub.net> on Saturday June 14, 2014 @09:48AM (#47236047) Homepage Journal

    This is why Apple should resurrect its ZFS project. Overnight they would be the largest ZFS vendor to match with being the largest UNIX vendor.
    • I'm curious why it's been ignored or deprecated or whatever Apple did to it. They have the resources to throw at a project like that. Presumably there was some calculation somewhere along the line that didn't make sense. Not that Apple is much for telling us things like that, but it would be fun to know.

      • by jbolden (176878)

        Apple did announce why the project failed. ZFS on consumer grade hardware with consumer interactions was too dangerous. Things like pulling an external drive out mid-write could corrupt an entire ZFS volume. Apple simply couldn't get ZFS to work under the conditions their systems need it to. They had to back out completely and come up with a plan B. The developer who worked on this left Apple and now produces a better ZFS for OS X. That company got bought by Oracle so Oracle owns it now.

    • Due to their commanding smartphone marketshare, along with millions of devices with embedded Linux shipped every year, wouldn't Samsung be the largest UNIX vendor?

      Oh? What's that? You weren't counting embedded Linux and I'm a pedantic #$(*#$&@!!!. Can't argue with that!

      • by jo_ham (604554)

        Now there's a can of worms. I think the question "Is Linux really Unix?" is a guaranteed heat-generator.

        • If you follow the specifications, there's no need for heat. No Linux variant has been certified according to the POSIX standards for UNIX, and most variants diverge from the POSIX standards, at least subtly. Wikipedia has a good note on this at http://en.wikipedia.org/wiki/S... [wikipedia.org]

          Personally, I've found each UNIX to each have some rather strange distinctions from the other UNIX's, and using the GNU software base and the Linux based software packages to assure compatibility among t

  • The solution is to not become too attached to data. It's all ephemeral anyway, in the grand scheme of things.

    • Well, in the "grand scheme of things", so are we.

      Me? I get rather attached to the source file I've been working on for the past 6 months.

    • Yeah, tell that to the IRS when you go to pull your records during an audit... ;-)
  • by cpct0 (558171) <.moc.sianodlehcim. .ta. .todhsals.> on Saturday June 14, 2014 @10:08AM (#47236133) Homepage Journal

    Bitrot is not usually the issue for most files. Sometimes, but it's rare. What I lost is a mayhem repository of hardware and software and human failure. Thanks for backup, life :)

    On Bitrot:

    - MP3s and M4As I had that suddenly started to stutter and jump around. You play the music and it starts to skip. Luckily I have backups (read on for why I have multiple backups of everything :) ) so when I find them, I just revert to the backup.
    - Images having bad sectors like everyone else. Once or twice here or there.

    - A few CDs due to CD degradation. That includes one that I really wish I'd still have, as it was a backup of something I lost. However, the CD takes hours to read, and then eventually either balks up or not for the directory. I won't tell you about actually trying to copy the files, especially with normal timeouts in modern OSes or the hardware pieces or whatnot.

    Not Bitrot:

    - Two RAID Mirror hard drives, as they were both the same company, and purchased at the same time (same batch), in the same condition, they both balked at approximately the same time, not leaving me time to transfer data back.

    - An internal hard drive, as I was making backups to CDs (at that time). For some kind of reason I still cannot explain, the software thought my hard drive was both the source and the destination !!!! Computer froze completely after a minute or two, then I tried rebooting to no avail, and my partition block was now containing a 700 MB CD image, quarter full with my stuff. I still don't know how that's possible, but hey, it did. Since I was actually making my first CD at the time and it was my first backup in a year, I lost countless good files, many I gave up upon (especially my 90's favorite music video sources ripped from the original betacam tapes in 4:2:2 by myself).

    - A full bulk of HDs on Mac when I tried putting the journal to another internal SSD drive. I have dozens of HDDs, and I thought it'd go faster to use that nifty "journal on another drive" option. It did work well, although it was hell to initialize, as I had to create a partition for each HDD, then convert them to journaled partitions. Worked awesomely, very quick, very efficient. One day after weeks of usage, I had to hard close the computer and its HDD. When they remounted, they all remounted in the wrong order, somehow using the bad partition order. So imagine you have perfectly healthy HDDs but thinking they have to use another HDD's journal. Mayhem! Most drives thought they were other ones, so my music HDD became my photos HDD RAID, my system HDD thought it was the backup HDD, but just what was in the journal. It took me weeks sporting DiskWarrior and Data Rescue in order to get 99% of my files back (I'm looking at you, DiskWarrior as a 32 bit app not supporting my 9TB photo drive) with a combination of the original drive files and the backup drive files. Took months to rebuild the Aperture database from that.

    - All my pictures from when I met my wife to our first travels. I had them in a computer, I made a copy for sure. But I cannot find any of that anywhere. Nowhere to be found, no matter where I look. Since that time, many computers happened, so I don't know where it could've been sent. But I'm really sad to have lost these

    - Did a paid photoshoot for a unique event. Took 4 32GB cards worth of priceless pictures. Once done with a card, I was sifting through the pictures with my camera and noticed it had issues reading the card. I removed it immediately. When at home, I put the card in my computer, it had all the troubles in the world reading it (but was able to do so), I was (barely) able to import its contents to Aperture (4-5 pictures didn't make the cut, a few dozens had glitches). It would then (dramatically, as if it had drawn its last breath after relinquishing its precious data) not read or mount anywhere, not even being recognized as a card by the readers. Kids, use new cards regularly for your gigs :)

    - A RAID array b

  • And the story is? (Score:4, Insightful)

    by Immerman (2627577) on Saturday June 14, 2014 @10:08AM (#47236137)

    Bitrot. It's a thing. It's been a thing since at least the very first tape drive - hell it was a thing with punch cards (when it might well have involved actual rot). While the mechanism changes, every single consumer-level data-storage system in the history of computing has suffered from it. It's a physical phenomenon independent of the file system, and impossible to defend against in software unless it transparently invokes the one and only defense: redundant data storage. Preferably in the form of multiple redundant backups.

    So what is the point of this article?

  • The real article would be titled "file systems with no data redundancy and no checksums are vulnerable to bitrot".
    That covers about any file system with the lone exception of ZFS when run on a RAID, maybe Btrfs, and I guess some mainframe stuff.

  • by sribe (304414) on Saturday June 14, 2014 @10:15AM (#47236183)

    In a footnote he admits that the corruption was caused by hardware issues, not HFS+ bugs, and of course the summary ignores that completely.

    So, for that, let me counter his anecdote with my own anecdote: I have an HFS+ volume with a collection of over 3,000,000 files on it. This collection started in 2004, approximately 50 people access thousands of files on it per day, and occasionally after upgrades or problems it gets a full byte-to-byte comparison to one of three warm standbys. No corruption found, ever.
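
    A full byte-to-byte comparison against a standby can be done with Python's standard library. The sketch below is only an illustration of the idea, not the poster's actual tooling, and the function name is hypothetical. It walks the primary tree and reports every file that is missing from the standby or differs in content. Files that exist only on the standby are not reported.

```python
import filecmp
import os
from typing import List


def mismatched_files(primary: str, standby: str) -> List[str]:
    """Return relative paths missing from standby or differing byte for byte."""
    bad = []
    for root, _dirs, files in os.walk(primary):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, primary)
            dst = os.path.join(standby, rel)
            # shallow=False forces a content comparison instead of trusting
            # size and modification time.
            if not os.path.isfile(dst) or not filecmp.cmp(src, dst, shallow=False):
                bad.append(rel)
    return sorted(bad)
```

    A mismatch only says the two copies disagree. Deciding which one is the good copy needs a third copy or a stored checksum.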

    • by pla (258480)
      In a footnote he admits that the corruption was caused by hardware issues, not HFS+ bugs, and of course the summary ignores that completely.

      The summary doesn't claim HFS caused the bitrot, you read that into it. The summary merely points out that HFS doesn't reliably detect and correct flaws in the underlying storage media (and neither does NTFS, nor almost any other widely used filesystem).

      More importantly, while merely detecting this issue may not incur too much overhead, correcting it requires some fairly l
      • by sribe (304414)

        The summary doesn't claim HFS caused the bitrot, you read that into it.

        The summary's first sentence ends: "about data loss suffered under Apple's venerable HFS+ filesystem" and shortly thereafter it continues with: "HFS+ lost a total of 28 files over the course of 6 years." So the chosen wording most certainly does imply that HFS is at fault. One has to click the link to the article, then read all the way through the frickin' footnotes before one encounters anything to explicitly disavow that implication.

  • Clueless article (Score:5, Informative)

    by alexhs (877055) on Saturday June 14, 2014 @10:27AM (#47236227) Homepage Journal

    People talking about "bit rot" usually have no clue, and this guy is no exception.

    It's extremely unlikely that a file would become silently corrupted on disk. Block devices include per-block checksums, and you either have a read error (maybe he has) or the data read is the same as the data previously written. As far as I know, ZFS doesn't help to recover data from read errors. You would need RAID and / or backups.

    Main memory is the weakest link. That's why my next computer will have ECC memory. So, when you copy the file (or otherwise defragment or modify the file, etc), you read a good copy, some bit flips in RAM, and you write back corrupted data. Your disk receives the corrupted data, happily computes a checksum, therefore ensuring you can read back your corrupted data faithfully. That's where ZFS helps. Using checksumming scripts is a good idea, and I do it myself. But I don't have auto-defrag on Linux, so I'm safer : when I detect a corrupted copy, I still have the original.
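
    A checksumming script of the kind mentioned above might look like the following Python sketch. It is not the poster's script; the function names and the choice of SHA-256 are assumptions for the example. It records a hash for every file under a directory, and a later run reports files whose content no longer matches.

```python
import hashlib
import os
from typing import Dict, List


def sha256_of(path: str) -> str:
    """Hash a file in 1 MiB chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def verify_manifest(root: str, manifest: Dict[str, str]) -> List[str]:
    """Return paths that are missing or whose hash differs from the manifest."""
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or sha256_of(full) != expected:
            bad.append(rel)
    return sorted(bad)
```

    The manifest has to be stored somewhere other than the disk being checked, and it needs to be rebuilt after intentional edits. Otherwise every legitimate change shows up as corruption.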

    ext2 was introduced in 1993, and so was NTFS. ext4 is just ext2 updated (ext was a different beast). If anything, HFS+ is more modern, not that it makes a difference. All of them are updated. By the way, I noticed recently that Mac OS X resource forks sometimes contain a CRC32. I noticed it in a file coming from Mavericks.

    • by rabtech (223758)

      People talking about "bit rot" usually have no clue, and this guy is no exception.

      It's extremely unlikely that a file would become silently corrupted on disk. Block devices include per-block checksums, and you either have a read error (maybe he has) or the data read is the same as the data previously written. As far as I know, ZFS doesn't help to recover data from read errors. You would need RAID and / or backups.

      I'm afraid it is you who is clueless. Up until ZFS started gaining traction, we all had the luxury of assuming the storage chain was reliable (RAM, SATA controller, cables, drive firmware, read/write heads, oxide layers, etc). Or at least we would know something went wrong.

      But it was found that in the actual real world, these systems all silently corrupt data from time to time. The problem is much worse as the volume of data grows because the error rates are basically unchanged, meaning what was once expect

  • by Flammon (4726)
    I've slowly been moving all my systems to Btrfs from least important to most important and have had no problems so far.
    • Btrfs "pronounced "Butterface"" - Wikipedia
      Lol.
      Strangely that acronym could also stand for BiT Rot Free System which is pretty ironic, I guess.

  • Some people are talking about the fact that bitrot could happen as a result of bad RAM. Are you talking about bad system RAM or the RAM onboard the HDD's controller board?

    If it was indeed bad system RAM, wouldn't bad system RAM cause a random BSOD (Windows) or Kernel Panic (Linux)? With how much RAM we use these days it's very likely we're going to be using all of the storage capacity of each of the DIMMs that we have in our systems.

    Myself I have 16 GBs of RAM in my Windows machine and at any moment i
    • RAM may have a low error rate, much better than HDDs or SDs. That does not mean that you won't have errors, even if you have a good brand and treat it well. Bit-level errors can and do happen all the time without us knowing; other times they happen in the wrong place and we notice (but think it is something else). It isn't until it gets really bad that we notice.

      Example, say your RAM has a 1% bit loss rate (ignore that is insanely high) well if 90% of your data is not touchy code but data, the odds are that

      • by trparky (846769)
        I have noticed that a lot of OEMs (Dell, HP, Apple, etc.) use a no-name brand of RAM in many of their systems that they build. If you look at them, especially the CAS latency stats, you'll notice that many of the RAM chips found in most pre-made computers are absolutely pitiful (to say the least).

        So with that being said, who knows if this no-name RAM that is installed in many pre-made computers that many people buy is of any real quality. I'm guessing... no. So, with that said perhaps the odds of bitr
      • by gweihir (88907)

        ECC is not what you need for reliable data archiving. What you need is independent checksums and you need to actually compare them to the data on disk. If you store an MD5 or SHA1 hash with all files, corruption from RAM, buses and the like will not go undetected. The way things go today though, most people do not even verify a backup. No surprise they lose data, incompetence and laziness comes at a price. Of course, you should make sure your RAM runs stable, but I have not had a single ECC corrected bit in

    • by fnj (64210)

      If it was indeed bad system RAM, wouldn't bad system RAM cause a random BSOD (Windows) or Kernel Panic (Linux)?

      Likely so, but if we are talking about errors that only show up in 28 file-reads out of millions of file-reads, there is no reason to believe that you would be bound to see such a panic during the period in question.

      BTW, bad RAM anywhere in the chain from disk drive to CPU - main system RAM, CPU cache RAM, hard drive cache RAM, controller RAM, etc - could cause such a panic, since most data travels

  • This sounds like actual disk errors. File systems can't do much about them, you really need something like a RAID.

    • by gweihir (88907)

      The OP did it wrong due to stupidity or laziness and now he is blaming others like an immature, petulant child would do.

    • by kthreadd (1558445)

      The file system can do quite a bit if it actually does consistency checks on the data when reading it. ZFS does this and will alert you if the contents of a file has changed after it was last written, allowing you to restore a good copy from backup and verify that it is still valid.

  • There are only two options for reliable data archiving: 1. Spinning disks with redundancy and regular checks 2. Archival grade tape. There used to be MOD as well, but as nobody cared enough to buy it, development stalled and then died. The OP simply was naive and stupid and did not bother to find out how to archive data properly. It is well-known how to do it and has been for a long time. I have not lost a single bit that I care about. Of course, I have a 3-way RAID1 with regular SMART and RAID consistency

For every complex problem, there is a solution that is simple, neat, and wrong. -- H. L. Mencken
