
Nilfs: An Efficient and Full Featured File System for Linux
Deepgeek
29 July 2009
Abstract: This is a script for a podcast. Deepgeek reviews and discusses "Nilfs2," a log-structured file system that is significantly faster than other file systems, and constantly creates checkpoints that can be used as whole filesystem snapshots for a variety of purposes.
I’ve been using Nilfs, a new file system, for a few weeks now. But before going full blast, a fairer question is why did I seek out another file system?
My regular listeners will know that I am currently using a backup system that creates dated folders for each backup. It functions as a versioning system as well as a backup, because I can go back to a specific date’s backup and pull up a version of a file from that date. This is really great when you want to revert to an old version of a configuration file in /etc because something messed up, or when you want to find media to replay after you deleted it and then changed your mind. The BSD guys are working on some cutting edge file systems that implement “snapshotting” at the file system level, and if I could get this without even mounting my backup disk, it would be great. Since my current backup system takes, like, an hour to make a backup, I fear changing a file that has not been backed up yet during the backup. Being able to snapshot the source file system for the backup, and continue working, would be even better. So I began to hunt for a Linux alternative.
I did find these features in a file system called nilfs2, which was just incorporated into the Linux kernel as of version 2.6.30, and it is even better than snapshotting file systems.
The reason I say Nilfs is better is that it provides these features without having to go through a snapshot process. You see, Nilfs creates checkpoints as it goes along, usually several times a minute, and any of these checkpoints can be promoted to a snapshot of the file system, then mounted read-only at a separate mount point. Unlike some snapshotting systems, it therefore takes no time to actually make a snapshot, and there is no practical limit on the number of snapshots.
As an example, let’s say you’re working along and realize you need some file you deleted that morning. Fortunately, you backed up the day before from a snapshot of the system. Now, backing up the system this way is simple. You use a command to list the checkpoints. Then you change a checkpoint to a snapshot. Then you mount the file system on another mount point with the snapshot’s number as an option to the mount command. This gives you the situation where your data disk is mounted twice: the current version mounted read-write, and a fixed-in-time version mounted read-only on a separate mount point. Got it? That is the same file system mounted simultaneously at two different mount points, but one mount point is totally static and read-only. You then back up that mount point. Retrieving that file the next day is a piece of cake! You log in as “root,” and if you use bash, you can flip through old commands with the arrow keys. You find the mount command you ran for the backup and hit enter. That mount point now shows yesterday, and a simple copy via your favorite file manager will bring the file back, without even having to find your backup medium! (You would use your actual separate backup medium for more disastrous situations.)
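Here is a minimal sketch of those three steps (list, promote, mount), wrapped in a bit of Python so each command line is easy to see. The device /dev/sdb1, the mount point /mnt/snap and checkpoint number 1234 are made up for illustration. The lscp and chcp tools and the cp= mount option are how I understand the nilfs-utils package to work, so check the man pages on your own system before trusting this.

```python
def snapshot_commands(device, mount_point, cno):
    """Build the command lines that turn checkpoint `cno` into a mounted snapshot.

    Nothing is executed here. The lists are meant to be read, or handed to
    subprocess.run() by root once you are happy with them.
    """
    return [
        # 1. List the checkpoints so you can pick one.
        ["lscp", device],
        # 2. Flag the chosen checkpoint as a snapshot ("ss" mode).
        ["chcp", "ss", device, str(cno)],
        # 3. Mount that snapshot read-only next to the live file system.
        ["mount", "-t", "nilfs2", "-r", "-o", f"cp={cno}", device, mount_point],
    ]


# Hypothetical device, mount point and checkpoint number.
for cmd in snapshot_commands("/dev/sdb1", "/mnt/snap", 1234):
    print(" ".join(cmd))
```

The backup tool is then pointed at the read-only mount point instead of the live one.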
By now you should have guessed, this one is not for Granny, as they say. Yes, this one is for my dear listeners who are geeks at heart.
There are other benefits also, such as writes to the disk being faster than almost all other file systems, as in 30% faster. Maybe you put it on Granny’s PC without her knowledge for the speed.
There is also a benefit for data recovery. In its default configuration, there are no real file deletes for at least an hour. Technically, you can give a partition on this file system the proverbial recursive “rm” command, and bring the whole thing back by promoting a good checkpoint made within the last hour to a snapshot and copying it back. Wholly possible.
Another benefit with data recovery. Ever remount a mount point that was not unmounted correctly? The system steps through the file system with integrity checks against its recovery journal. Not so with a log-structured file system like Nilfs. Your last checkpoint is your recovery point, and there just isn’t a step-through process. As a matter of fact, I discovered that on my Debian Lenny system, the shutdown script that unmounts did not recognize the new file system, and it was not unmounting it. It took me days to discover this, then I wrote a script to unmount Nilfs on shutdown myself. Think of it this way: your traditional file system uses a journal to know what it was trying to do when the system crashed. Nilfs is like just using a journal for everything. Your current mount point is just your last checkpoint, which is the last point where the file system is consistent, and is generated every few seconds anyway.
Another possibility is maybe eliminating backups for a certain type of file. I like to watch my TV as video files off the Internet. Now, once I watch something, it is doubtful I want to watch it again, and I normally delete it. I could, say, just not back up a file system, and just promote a checkpoint to a snapshot every day. If I do need to access something after deletion, I can mount an old snapshot, and possibly not bother with backing up those low priority files. Down the road, when the disk gets tight, I can begin changing the oldest snapshots back into normal checkpoints, thus letting the delete become real. Of course, this business of changing a checkpoint to a snapshot and back is just a matter of flagging an existing checkpoint, so it takes no time at all.
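A rough sketch of that "let the oldest snapshots go" idea, assuming you already know each snapshot's checkpoint number and date. The device name, the numbers and the keep count are all hypothetical. The "chcp cp" form for turning a snapshot back into a plain checkpoint is how I understand nilfs-utils to work, so verify it against your man page.

```python
from datetime import date


def demote_commands(snapshots, keep, device):
    """Return chcp command lines that turn all but the `keep` newest
    snapshots back into ordinary checkpoints.

    `snapshots` is a list of (checkpoint_number, date) pairs.
    Nothing is executed; the caller decides whether to run the commands.
    """
    ordered = sorted(snapshots, key=lambda snap: snap[1])   # oldest first
    surplus = ordered[:max(0, len(ordered) - keep)]
    return [["chcp", "cp", device, str(cno)] for cno, _ in surplus]


# Hypothetical daily snapshots; keep only the newest one.
daily = [(30, date(2009, 7, 3)), (10, date(2009, 7, 1)), (20, date(2009, 7, 2))]
for cmd in demote_commands(daily, keep=1, device="/dev/sdb1"):
    print(" ".join(cmd))
```

Once a snapshot is an ordinary checkpoint again, the space it was pinning becomes fair game for reclaiming, which is what lets the delete become real.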
Here’s another one: if you use virtual machines, maybe you snapshot the file system the virtual machine’s files reside on. Then, if you end up wanting to undo it, you can, just by recovering the old version of the virtual machine’s file.
Then there are more possibilities. I like knowing I am backing up from a read-only version of my files while I forge ahead. But maybe you’re at work and your boss wants an 8 AM backup. Now, not only can you have a backup of your system as it was at exactly 8 AM, but if you can’t kick off a backup at a certain time (maybe a strange but pressing conference call), why not back up that 8 AM snapshot at a more convenient time? Heck, why not make a mount point called “yesterday,” and create a cron job script to automatically mount that snapshot in the morning? Then your users, who may want a report from yesterday or something, may just be able to get at it without your assistance. This thing gives you possibilities! Heck, maybe I can replace my current backup system with a simple bunch of rsync commands, and maybe my backups will no longer take an hour. If I can slide this thing under rsync without it knowing, maybe I can pull it off.
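The heart of a “yesterday” cron job is picking the right checkpoint: the last one made before midnight. Below is a small sketch of that selection step. It assumes the checkpoint listing has a header line followed by rows that start with the checkpoint number, date and time. The sample text is invented for illustration and your version of lscp may lay things out differently, so treat the parsing as an assumption.

```python
from datetime import datetime

# Invented sample listing; the column layout is an assumption.
SAMPLE_LISTING = """\
                 CNO        DATE     TIME  MODE  FLG   NBLKINC       ICNT
                 101  2009-07-28 23:59:40   cp    -         12         40
                 102  2009-07-29 00:00:05   cp    -          3         40
                 103  2009-07-29 07:58:10   cp    -         25         41
"""


def last_checkpoint_before(listing, cutoff):
    """Return the number of the newest checkpoint made before `cutoff`,
    or None if there is no such checkpoint."""
    best = None
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header and any blank lines
        stamp = datetime.strptime(fields[1] + " " + fields[2], "%Y-%m-%d %H:%M:%S")
        if stamp < cutoff and (best is None or stamp > best[1]):
            best = (int(fields[0]), stamp)
    return None if best is None else best[0]


midnight = datetime(2009, 7, 29, 0, 0, 0)
print(last_checkpoint_before(SAMPLE_LISTING, midnight))  # prints 101
```

The cron script would then promote that checkpoint to a snapshot and mount it read-only on the “yesterday” mount point.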
Let’s compare the Log File and Traditional File System, and peek at the rationale behind the Log File system. This will clear a lot of ground before we look at Nilfs specifically.
Back in the 1980s, RAM was expensive. If you had 64 Kilobytes on a PC it was a big deal, not like today where we have several Gigabytes on a PC. This actually influenced the design of disk file systems, because with so little memory to cache anything, nearly every read had to come from the disk, and you wanted to get it fast. This is why great care was taken to get bytes “contiguous,” or “next to each other,” on the disk. On the disk, moving the disk head is the slow action, so that is what must be minimized.
This is why formatting a disk is a big deal, you have to lay out the disk in a certain way so that the data falls together in a kinda natural way. Then, you can go back and take the time to optimize the data later.
Times change, and two computer people named Ousterhout and Douglis reasoned that more and more reads would be handled by memory cache as people expanded their computers. They therefore began work on the “Log File” structure where, when you get data for disk storage, you just write it out in the order you receive it.
This is the heart of the concept, because once something is read, you want a second read satisfied from the cache. Don’t make the mistake of thinking that this means you don’t have indexes and pointers in your file system so you can get at the data for that first read fast; you do. It’s just that instead of having them in a certain fixed location on the disk, you are constantly writing them onto the end of the file system. These records are your current checkpoint.
Now I’ve talked about laying out the disk for traditional file systems being important. Well, in Nilfs, formatting a new file system is just writing a few key bytes, and takes a few seconds regardless of disk size. No special places are defined, because everything is just going to be added onto the end of the disk.
By now perhaps you asked yourself, “that is a whole lotta appending, don’t you run out of space eventually?” Well, of course, there is no such thing as an unlimited disk, and some provision must be made for deletion of old data.
Let’s clear up the naming mystery: Nilfs is the file system, and Nilfs2 is the second version, which includes the Cleaner Daemon.
So we have this file system, and the new data is constantly being appended to the back of it. Coming up from behind is the Cleaner Daemon, checking for extents of data that have unused sections to scrub and reclaim the space. Can you imagine what the Cleaner does when it finds a half-used extent? That’s right, it makes a new fully used extent and appends it to the end of the disk!
To really see this in your mind’s eye, you need to imagine the data on a disk crawling across the disk in slow motion.
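If the crawling picture is hard to hold in your head, here is a toy model of it. This is not Nilfs code, just a sketch under simple assumptions: the “disk” is a Python list, every write is appended to the end, and a cleaner takes the oldest extent, copies whatever is still live to the end, and frees the rest. The class and method names are made up for illustration.

```python
class ToyLog:
    """Append-only 'disk' split into fixed-size extents (toy model only)."""

    def __init__(self, blocks_per_extent=4):
        self.blocks_per_extent = blocks_per_extent
        self.log = []      # every block written: (name, version), or None once freed
        self.latest = {}   # name -> position in self.log of the live copy
        self.head = 0      # start of the oldest extent not yet cleaned

    def write(self, name, version):
        self.log.append((name, version))        # always append, never overwrite
        self.latest[name] = len(self.log) - 1

    def delete(self, name):
        self.latest.pop(name, None)             # the old block stays behind as garbage

    def clean_oldest_extent(self):
        """Move live blocks from the oldest extent to the end of the log,
        free the extent, and return how many stale blocks were reclaimed."""
        end = self.head + self.blocks_per_extent
        if end > len(self.log):
            return 0                            # no complete extent left to clean
        reclaimed = 0
        for i in range(self.head, end):
            name, version = self.log[i]
            if self.latest.get(name) == i:      # still the live copy: re-append it
                self.write(name, version)
            else:
                reclaimed += 1                  # overwritten or deleted: real gain
            self.log[i] = None
        self.head = end
        return reclaimed


disk = ToyLog(blocks_per_extent=4)
for name, version in [("a", 1), ("b", 1), ("a", 2), ("c", 1)]:
    disk.write(name, version)
disk.delete("c")
print(disk.clean_oldest_extent())   # prints 2
print(disk.log)                     # four freed slots, then ('b', 1) and ('a', 2)
```

In the demo, the first extent held an outdated copy of “a” and the deleted “c”, so cleaning it wins back two blocks while “b” and the current “a” crawl to the end of the log.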
The Cleaner does have a configuration file so it can be customized. The defaults are sensible for hard disks. They are: don’t touch anything until it is an hour old, then wake up every five seconds and scan the two oldest extents for space to reclaim. On my system, I long ago added a two-disk RAID, and found upon tuning that I needed to have it look at the five oldest extents. That’s like double the defaults, and since a two-disk RAID array is twice as fast as a single disk, this makes sense. Now, I back up to 250 Gigabyte Seagate USB disks, and these are much slower devices. You don’t want the Cleaner to get write bound! That is, it needs to be tuned for slower devices so it runs for about a second out of every five seconds. If this is not done, when the Cleaning starts, it will take away the performance of the system, and your computer will be doing nothing but reorganizing the disk over and over again. The default size of an extent is 8 Megabytes. In order to keep the USB drive on par, I found through trial and error that I had to format it at 256 Kilobytes an extent, and that kept the Cleaner from dominating the disk when mounted.
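To put rough numbers on that tuning, here is a back-of-the-envelope sketch. It assumes 4 Kilobyte blocks and assumes the USB disk keeps the default of two extents every five seconds; both are my assumptions, not measurements. What I call extents here the Nilfs tools call segments. As far as I know, the three defaults correspond to the protection_period, cleaning_interval and nsegments_per_clean settings in the cleaner's configuration file, and mkfs.nilfs2 takes the extent size as a number of blocks per segment, but check the man pages on your system.

```python
KIB = 1024
MIB = 1024 * 1024
BLOCK_SIZE = 4096   # bytes per block; an assumption, check your own format


def blocks_per_extent(extent_bytes, block_size=BLOCK_SIZE):
    """How many blocks make up one extent of the given size."""
    return extent_bytes // block_size


def cleaner_bytes_per_second(extent_bytes, extents_per_pass, interval_seconds):
    """Most data the Cleaner may have to read and rewrite per second, on average."""
    return extent_bytes * extents_per_pass / interval_seconds


# Defaults described above: 8 Megabyte extents, two per pass, every five seconds.
print(blocks_per_extent(8 * MIB))                        # prints 2048
print(cleaner_bytes_per_second(8 * MIB, 2, 5) / MIB)     # prints 3.2

# The USB disk formatted at 256 Kilobytes per extent, same pass settings (assumed).
print(blocks_per_extent(256 * KIB))                      # prints 64
print(cleaner_bytes_per_second(256 * KIB, 2, 5) / KIB)   # prints 102.4
```

So the smaller extents cut the Cleaner's worst-case appetite from a few Megabytes a second to around a hundred Kilobytes a second, which is the kind of load a slow USB disk can take without being dominated by it.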
And it is quite an instantaneous process. The indexes in the file system show the Cleaner exactly which extents have free space in them, and because of this, a simple thing like deleting an old email might instantly trigger a two hour copying of the file system. I will probably begin experimenting with having a very small partition for frequently changed data and excluding it from the backup process; that should quiet things down a bit.
So, there is no panacea, some things just need to be managed differently, that’s all.
I’m so excited about this great technology, I forgot to mention the people who made it!
Nilfs2 is sponsored by NTT, which is Nippon Telegraph and Telephone (Japan’s version of AT&T). This is their foray into sponsoring an open source project.
The team members are Mr.’s Yoshiji, Konishi, Sato, Hifumi, Tamura, Kihara, and Moriai.
I’m hooked on this file system! The peace of mind of backing up a read-only version of my biggest disk space is too good to part with, and the prospect of using simpler (read that as a synonym for faster) tools for backing up means I can’t stop using this file system. The benefits are too great to turn away.
At the same time, however, I know that it may be too much for some people, and it is not everyone’s cup of tea. It happens to fit me like a glove.
Yet I still would not want to replace my entire file system with this, nor the root file system. A very busy file system gets hard to handle when the time comes to manage checkpoints and snapshots. This could be different if the only benefit desired was the speed; then who would care? Even then, though, care should be taken to ensure that the cleaner daemon, as well as the kernel module (for those not using the latest kernel), are accessible in the boot process before any of these partitions are mounted.
So I will be using this file system for a while now, and feel I can safely recommend it to you for your own trials.