March 9, 2004

StaticFS

StaticFS is my made-up, useless filesystem for testing purposes. It is a filesystem that looks like the following:
/--,
   +--a
   |
   `--b
      |
      `--c
So there is a root; a file under it called "a" (the contents of which will be "These are the characters in file a."); a directory "b" also in root; and a file "c" in that directory "b", with the contents "These are the characters in file c.".

This file system is not writeable, being static, so really this test case is to see if I understand the reading portion of a filesystem. I'll worry about writing in a different test filesystem, once I think I understand it a little more.

Initialization

I'm going to tackle this problem in the same order that I went through it in ramfs, starting with how the filesystem initializes itself with the kernel.
static DECLARE_FSTYPE(staticfs_fs_type, "staticfs", staticfs_read_super, FS_LITTER);

static int __init init_staticfs_fs(void)
{
        return register_filesystem(&staticfs_fs_type);
}

static void __exit exit_staticfs_fs(void)
{
        unregister_filesystem(&staticfs_fs_type);
}

module_init(init_staticfs_fs)
module_exit(exit_staticfs_fs)

MODULE_LICENSE("GPL");
It's pretty much cut-and-paste from the ramfs module, creating a structure that defines our filesystem (the FS_LITTER, defined in /usr/src/linux/include/linux/fs.h, tells VFS that this filesystem should "litter" the dentry cache with its entries.)
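As a rough userspace sketch of what the registration step amounts to (this is a model, not kernel code): the kernel keeps a list of known filesystem types and refuses a second entry with the same name. The struct fs_type and the model_* function names below are made up for illustration; only the idea of a name-keyed list and the -EBUSY / -EINVAL return values are taken from how register_filesystem() and unregister_filesystem() appear to behave in a 2.4 tree.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct file_system_type (hypothetical). */
struct fs_type {
    const char *name;
    struct fs_type *next;
};

/* Head of the singly linked list of registered filesystem types. */
static struct fs_type *fs_list = NULL;

/* Append fs to the list; refuse a second registration of the same name. */
static int model_register_filesystem(struct fs_type *fs)
{
    struct fs_type **p;

    for (p = &fs_list; *p; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -EBUSY;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Unlink fs from the list; -EINVAL if it was never registered. */
static int model_unregister_filesystem(struct fs_type *fs)
{
    struct fs_type **p;

    for (p = &fs_list; *p; p = &(*p)->next)
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    return -EINVAL;
}
```

Registering a struct named "staticfs" twice should give 0 the first time and -EBUSY the second, which is presumably why loading the module twice would fail.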

Superblocks

Now we need to define a structure of functions that we're going to give VFS that allow us to deal with our superblock. Back when looking at inodes, we saw the struct super_operations defined, and saw that ramfs implemented only two of them, statfs and put_inode. From grepping all of the other filesystems supplied with Linux, it looks like everyone implements statfs, so let's look at struct statfs in /usr/src/linux/include/asm-i386/statfs.h:
struct statfs {
        long f_type;
        long f_bsize;
        long f_blocks;
        long f_bfree;
        long f_bavail;
        long f_files;
        long f_ffree;
        __kernel_fsid_t f_fsid;
        long f_namelen;
        long f_spare[6];
};
f_type is a magic number for the filesystem. We'll supply one of those. f_bsize is block size, and everyone seems to supply one, so we will too. f_blocks I'm not sure about, but could be the number of blocks used on the filesystem. f_bfree will be blocks free, but what's f_bavail? Many of the other filesystems set it to the same value as f_bfree, so what's the difference? I'll try and figure that out later.

f_files will be the number of files, and f_ffree the number of files available. f_fsid is only set by a few of the existing filesystems. What's it for? No idea. f_namelen looks like a maximum filename size. Everyone seems to set that, too, so we will as well. f_spare doesn't seem used by anyone, so it could be there for expansion, providing space for a pointer to an extension structure or six.
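One way to get a feel for these fields is to look at them from the other side, through the statfs(2) system call that user programs use. The sketch below is plain userspace C for Linux, not part of the module, and print_statfs is a made-up helper name. According to the statfs(2) manual page, f_blocks is the total number of data blocks, f_bfree is the number of free blocks, and f_bavail is the number of free blocks available to unprivileged users, which would explain why filesystems without reserved blocks set the last two to the same value.

```c
#include <stdio.h>
#include <sys/vfs.h>   /* statfs(2) on Linux */

/* Fill *st for the filesystem containing path and print a few fields.
 * Returns 0 on success, -1 on failure (errno is set by statfs). */
static int print_statfs(const char *path, struct statfs *st)
{
    if (statfs(path, st) != 0)
        return -1;
    printf("type=0x%lx bsize=%ld namelen=%ld\n",
           (unsigned long)st->f_type, (long)st->f_bsize,
           (long)st->f_namelen);
    printf("blocks=%llu bfree=%llu bavail=%llu\n",
           (unsigned long long)st->f_blocks,
           (unsigned long long)st->f_bfree,
           (unsigned long long)st->f_bavail);
    return 0;
}
```

Calling print_statfs("/", &st) on a mounted filesystem shows the values that the filesystem's own statfs function handed back to VFS.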

So we'll write a statfs function. Why not. Looks like everyone wants one, and that VFS needs one, so we'll play along.

#define STATICFS_MAGIC 0x61626364

static int staticfs_statfs(struct super_block *sb, struct statfs *buf)
{
        buf->f_type = STATICFS_MAGIC;
        buf->f_bsize = PAGE_CACHE_SIZE;
        buf->f_namelen = 1;
        return 0;
}
I made up the STATICFS_MAGIC, which is "abcd", named after our filesystem contents. I used PAGE_CACHE_SIZE because ramfs did, which, after a bit of searching, is equal to 4096. I used 1 for f_namelen because I wanted to be different. Or difficult.
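The two numbers above are easy to check in a few lines of userspace C. This is only a sanity-check sketch; magic_to_text and size_to_shift are hypothetical helper names, not anything from the kernel.

```c
/* Write the four bytes of magic, most significant first, into out
 * as a NUL-terminated string. out must have room for 5 chars. */
static void magic_to_text(unsigned long magic, char out[5])
{
    int i;

    for (i = 0; i < 4; i++)
        out[i] = (char)((magic >> (8 * (3 - i))) & 0xff);
    out[4] = '\0';
}

/* Shift that corresponds to a power-of-two block size,
 * i.e. the smallest s with (1 << s) >= size. */
static unsigned int size_to_shift(unsigned long size)
{
    unsigned int shift = 0;

    while ((1UL << shift) < size)
        shift++;
    return shift;
}
```

Feeding 0x61626364 to magic_to_text() yields the bytes 'a', 'b', 'c', 'd', and size_to_shift(4096) gives the matching shift for a 4096-byte page.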

Okay, so one function written. The only other one ramfs implemented was put_inode, which "releases" an inode from use. ramfs used force_delete(), as do many others, which from looking at /usr/src/linux/fs/inode.c, sets inode->i_nlink = 0, which will apparently delete it. So why doesn't romfs define a value for this? Is it just lazy and makes the VFS leave these unwanted inodes lying around? So are many of the other filesystems, then, since only 13 of the 40ish that I'm looking at have defined a function for this.

Looking deeper, I think I'm finding out what's going on. There are two functions being used in ramfs and romfs that I haven't looked into yet: iput() (in ramfs) and iget() (in romfs).

March 10, 2004

iget(), which is called in romfs, looks for an inode in the inode cache based on a given inode number (formed in romfs from the offset of the start of the filesystem in the file used as the romfs). If it doesn't find it, it creates a new inode, incrementing usage inside it if necessary. This is done in the romfs_read_super() function, when the superblock structure is being initialized with the root inode for the filesystem.

iput(), seen in ramfs, appears in the same place. ramfs goes through a different method of getting an inode than romfs; it calls its own ramfs_get_inode(), which calls new_inode() (which calls get_empty_inode(), which calls alloc_inode()). romfs just calls iget() (which calls iget4()) which calls get_new_inode(), which calls alloc_inode(). iput() is called if, for some reason, the inode it found can't be allocated as the root inode.
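To keep the iget()/iput() pairing straight, here is a small userspace model of a reference-counted inode cache. It is a sketch under loud assumptions: the m_inode struct, the linked-list cache and the model_* names are all invented, and the real functions do a good deal more (hashing, locking, and cleanup once the count reaches zero). The part taken from the description above is the shape: look up by inode number, create and fill on a miss, bump the count on a hit.

```c
#include <stdlib.h>

struct m_inode {
    unsigned long ino;
    int count;              /* reference count, like i_count */
    struct m_inode *next;
};

static struct m_inode *cache = NULL;
static int reads = 0;       /* how many times "read_inode" ran */

/* Stand-in for the filesystem's read_inode callback. */
static void model_read_inode(struct m_inode *inode)
{
    (void)inode;
    reads++;
}

/* Return the cached inode for ino, creating and filling it on a miss. */
static struct m_inode *model_iget(unsigned long ino)
{
    struct m_inode *i;

    for (i = cache; i; i = i->next)
        if (i->ino == ino) {
            i->count++;
            return i;
        }
    i = calloc(1, sizeof(*i));
    if (!i)
        return NULL;
    i->ino = ino;
    i->count = 1;
    i->next = cache;
    cache = i;
    model_read_inode(i);
    return i;
}

/* Drop one reference. The real iput() does more once the count hits zero. */
static void model_iput(struct m_inode *inode)
{
    if (inode && inode->count > 0)
        inode->count--;
}
```

Two model_iget(0) calls in a row return the same pointer with a count of 2, and the read callback runs only once, on the first call.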

All very interesting, but it didn't really shed any light on why ramfs has force_delete() as the put_inode: entry in the struct super_operations structure, and romfs does not. Is it because romfs's inode was created "internally", as a consequence of an iget(), where ramfs specifically and explicitly created its own with new_inode()? But in the end, they're both made the same way, as seen above.

Okay, so for now, I'm not going to handle put_inode. If it's good enough for romfs, it's good enough for me!

So, looking at struct super_operations, what else do we need to implement? dirty_inode, write_inode, delete_inode, write_super and write_super_lockfs are all about writing, and we're a static filesystem. clear_inode and unlockfs I don't know about. put_super seems to be used for cleaning up any module-specific data kept in the superblock's u union. umount_begin is only used in NFS right now, and probably isn't applicable here. We'll ignore the methods used by knfsd and anything else that seems to be a reiserfs kludge. This leaves, I believe, read_inode.

We've never seen a read_inode function before, so we're not familiar with what it should do. This is a superblock function, so it's looking for information from a superblock's point of view. We're given an inode as a parameter, so we don't have to create one. This also means that it will have some information inside of it, so we know which inode the caller is interested in. I'm going to peek into the romfs code to get an idea of what we need.

Wow. So what does it all do? i->i_ino contains the inode number that we're interested in. The rest of it just populates the struct inode, so let's look at that.

struct inode {
        struct list_head        i_hash;
        struct list_head        i_list;
        struct list_head        i_dentry;

        struct list_head        i_dirty_buffers;
        struct list_head        i_dirty_data_buffers;

        unsigned long           i_ino;
        atomic_t                i_count;
        kdev_t                  i_dev;
        umode_t                 i_mode;
        nlink_t                 i_nlink;
        uid_t                   i_uid;
        gid_t                   i_gid;
        kdev_t                  i_rdev;
        loff_t                  i_size;
        time_t                  i_atime;
        time_t                  i_mtime;
        time_t                  i_ctime;
        unsigned int            i_blkbits;
        unsigned long           i_blksize;
        unsigned long           i_blocks;
        unsigned long           i_version;
        unsigned short          i_bytes;
        struct semaphore        i_sem;
        struct semaphore        i_zombie;
        struct inode_operations *i_op;
        struct file_operations  *i_fop; /* former ->i_op->default_file_ops */
        struct super_block      *i_sb;
        wait_queue_head_t       i_wait;
        struct file_lock        *i_flock;
        struct address_space    *i_mapping;
        struct address_space    i_data;
        struct dquot            *i_dquot[MAXQUOTAS];
        /* These three should probably be a union */
        struct list_head        i_devices;
        struct pipe_inode_info  *i_pipe;
        struct block_device     *i_bdev;
        struct char_device      *i_cdev;

        unsigned long           i_dnotify_mask; /* Directory notify events */
        struct dnotify_struct   *i_dnotify; /* for directory notifications */

        unsigned long           i_state;

        unsigned int            i_flags;
        unsigned char           i_sock;

        atomic_t                i_writecount;
        unsigned int            i_attr_flags;
        __u32                   i_generation;
        union {
                ...
        } u;
};
I clipped out the filesystem-specific names in the u union. I think we'll ignore the first five entries, all of them struct list_head variables, because I believe they're used by VFS and not us. I'll have to look back at that.

So, i_ino comes to us initialized with the inode number we are to fill in. i_count is handled by the inode cache (the iput() and iget() seen earlier). i_dev? Many of the VFS modules that set this value copy it from sb->s_dev, so this is probably the device on which the filesystem is found. Perhaps different inodes could be on different devices, especially in the /dev filesystem, though it doesn't seem to be in use there! Oh well. I'll read about kdev_t later, to see what the values represent.

i_mode is the permissions on the inode. Finally something we might want to set! I think we'll make our staticfs have all files and directories set 444/555, so everyone can read them, but not write, of course. All directories are navigable as well. What would happen if we set write flags on our files and directories, though? We're not implementing any of the write functions... should we try? Sure! Let's make the files LOOK like they're writeable, and see what happens.

Let's start writing this function.

static void staticfs_read_inode(struct inode *i) {
        int ino;

        ino = i->i_ino;
        i->i_mode = S_IRWXUGO;
S_IRWXUGO is a bitmask found in /usr/src/linux/include/linux/stat.h, which sets the permission bits to 777. We'll set i_nlink to 1, because there's only one reference to each of our inodes here. We'll set i_uid and i_gid to 0 so they are owned by root. Why not!
        i->i_nlink = 1; 
        i->i_uid = 0;
        i->i_gid = 0;
Let's see... i_rdev is probably tied to i_dev, so we'll skip it. i_size. The two files contain 35 characters each. The directories, well, it's up to us to say what a directory size represents. ext2 returns the number of bytes, apparently in 4k blocks, that are used to store the files contained within. We'll say that it's how many items are within it, shall we?

Now, that's interesting. How do "." and ".." work? Do I have to have them as entries in my filesystem, or are they special directories that exist everywhere? I have no idea. Let's NOT add them, and see what happens. This means that the root directory is size 2, a is size 35, b is size 1 and c is size 35. To do this, we have to know which file is which. Usually we'd look this up in our filesystem, or, like romfs, use the offset into the file as the inode number. Let's say that "root" is inode 0, "a" is inode 1, "b" is inode 2, and "c" is inode 3.

        switch (ino) {  
        case 0: i->i_size = 2; break;
        case 1: i->i_size = 35; break;
        case 2: i->i_size = 1; break;
        case 3: i->i_size = 35; break;
        default: i->i_size = 0; break;
        }
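The same inode numbering can also be written down once as a table, which may be handier later when names and contents are needed too. This is a userspace sketch of that alternative, not the code staticfs uses; struct static_entry, static_tree and static_size are invented names, and the sizes follow the convention chosen above (bytes for files, number of items for directories).

```c
#include <stddef.h>
#include <string.h>

struct static_entry {
    unsigned long ino;
    unsigned long parent;   /* inode number of the containing directory */
    const char *name;
    const char *contents;   /* NULL for directories */
    int nchildren;          /* used as the size of a directory */
};

static const struct static_entry static_tree[] = {
    { 0, 0, "/", NULL, 2 },
    { 1, 0, "a", "These are the characters in file a.", 0 },
    { 2, 0, "b", NULL, 1 },
    { 3, 2, "c", "These are the characters in file c.", 0 },
};

/* Size as defined above: bytes for files, item count for directories. */
static long static_size(unsigned long ino)
{
    size_t n = sizeof(static_tree) / sizeof(static_tree[0]);
    size_t k;

    for (k = 0; k < n; k++)
        if (static_tree[k].ino == ino)
            return static_tree[k].contents
                ? (long)strlen(static_tree[k].contents)
                : static_tree[k].nchildren;
    return 0;   /* unknown inode, same as the default: case */
}
```

static_size() returns the same values as the switch for inodes 0 through 3, and the file sizes are computed from the strings instead of being typed in by hand.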

March 11, 2004

The next three values, i_atime, i_mtime and i_ctime, can all be set to zero, like in romfs.

        i->i_atime = i->i_mtime = i->i_ctime = 0;
No one seems to set i_blkbits, though they do set i_blksize, so we'll do that. i_blocks seems to be the number of blocks that the inode uses, but doesn't seem to be important (to romfs and ramfs at least!)
        i->i_blksize = PAGE_CACHE_SIZE;
        i->i_blocks = 0;
I've used PAGE_CACHE_SIZE because that seems to be the favorite among the other filesystems. i_version, I believe, is used to keep track of "when" the inode is referenced. I think it's used for directories, so you can tell if the directory has changed as you're reading it (the i_version value won't match the global event value). No one seems to initialize i_bytes either. Both i_sem and i_zombie are struct semaphores, and are likely used by the VFS to lock the inodes as needed.

Finally we're getting to the good stuff. i_op and i_fop are the pointers to our struct inode_operations and struct file_operations respectively. We haven't written ours, yet, but we know what we'll call them.

This brings up another good question. I haven't looked deeply enough into how VFS handles files versus directories, but I wonder: can an inode be treated as a file and a directory? Can I have an object in a filesystem that one can run both cat and cd on? Could I write a zipfs module, something that allowed you to run regular commands on the .zip file (such as unzip), as well as cd into the file and see the files as directory contents? This is my eventual goal, so I hope this is possible. I'll probably change staticfs to treat b as both a file and a directory later on, to test this out.

So for now, we need a structure of inode_operations to pass for file inodes, and one for directory inodes. Or do we? romfs fails to fill in that structure for file inodes. Is there a whole set of default functions that work just fine? Let's go wild and populate the structure, regardless of what type of inode it is! As for i_fop, it looks like we'll supply one structure of our own for directories for now and, much as romfs does, use generic_ro_fops for files, and assume that it does what we need!

        i->i_op = &staticfs_inode_operations;
        switch (ino) {  
        case 0:
        case 2: i->i_fop = &staticfs_dir_operations; break;
        case 1:
        case 3: i->i_fop = &generic_ro_fops; break;
        }
The only remaining members of struct inode that we might want to populate are i_mapping and i_data, both struct address_space members. One of them, i_mapping, is a pointer; the other is not. ramfs sets the a_ops member in i_mapping, while romfs sets a_ops in i_data. Why? What's the difference?

I couldn't figure it out from the kernel source nor the other VFS modules, so I hit the web. This is what I found from Alexander Viro, answering the same question:

i_data is "pages read/written by this inode" 
i_mapping is "whom should I ask for pages?" 


IOW, everything outside of individual filesystems should use the latter. 
They are same if (and only if) inode owns the data. CODA (or anything that 
caches data on a local fs) will have i_mapping pointing to the i_data of 
inode it caches into. Ditto for block devices if/when they go into pagecache - 
we should associate pagecache with struct block_device, since we can have 
many inodes with the same major:minor. IOW, ->i_mapping should be pointing 
to the same place for all of them. 
From this, then, it looks like if you have a filesystem that doesn't handle the data directly, then it should use i_mapping, which is a pointer, to point to the i_data structure of some other inode -- the one that it represents. If that's the case, then a lot of the filesystem modules are doing it wrong. But who am I to say?
        i->i_mapping->a_ops = &staticfs_aops;
}
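The i_data / i_mapping relationship from the quote can be modelled in a few lines of userspace C. Everything here is illustrative: map_inode, m_address_space and the helper names are made up, and the "owner" label exists only so there is something to look at. As far as I can tell from fs/inode.c, a freshly allocated inode starts out with i_mapping pointing at its own i_data, which would mean that setting a_ops through either one lands in the same place for ramfs and romfs.

```c
struct m_address_space {
    const char *owner;                  /* label for illustration only */
};

struct map_inode {
    struct m_address_space i_data;      /* pages this inode owns */
    struct m_address_space *i_mapping;  /* whom to ask for pages */
};

/* Default case: the inode owns its data, so i_mapping points at i_data. */
static void init_own(struct map_inode *inode, const char *label)
{
    inode->i_data.owner = label;
    inode->i_mapping = &inode->i_data;
}

/* Caching case: redirect to the address space of the backing inode. */
static void redirect_to(struct map_inode *inode, struct map_inode *backing)
{
    inode->i_mapping = &backing->i_data;
}

/* True when the inode is the owner of the pages it hands out. */
static int owns_data(const struct map_inode *inode)
{
    return inode->i_mapping == &inode->i_data;
}
```

After init_own() both spellings refer to the same structure; only after redirect_to() do they differ, which is the CODA-style case the quote describes.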
Phew. All that just to read an inode. Now we've just got to put together our super_operations structure, and we're almost done with our superblock work.
static struct super_operations staticfs_ops = {
  read_inode:staticfs_read_inode,
  statfs:staticfs_statfs,
};
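The "field: value" form used in that structure is an older GNU C spelling of what C99 calls designated initializers. Members that are not named are zero-initialized, so any operation left out is a NULL pointer. The userspace sketch below shows the C99 spelling and a caller that checks for NULL before calling, which appears to be how iput() treats put_inode and may be part of why romfs gets away without one. The m_super_ops struct and the m_* names are hypothetical stand-ins, with simplified signatures.

```c
#include <stddef.h>

struct m_super_ops {
    void (*read_inode)(int ino);
    void (*put_inode)(int ino);
    int  (*statfs)(void);
};

static int last_read = -1;

static void m_read_inode(int ino) { last_read = ino; }
static int  m_statfs(void)        { return 0; }

/* C99 spelling of the "field: value" GNU form; put_inode is left out
 * and therefore starts out as NULL. */
static const struct m_super_ops m_ops = {
    .read_inode = m_read_inode,
    .statfs     = m_statfs,
};

/* Call put_inode only if the filesystem supplied one.
 * Returns 1 if it was called, 0 if there was nothing to call. */
static int call_put_inode(const struct m_super_ops *ops, int ino)
{
    if (!ops->put_inode)
        return 0;
    ops->put_inode(ino);
    return 1;
}
```

With this table, call_put_inode(&m_ops, 3) returns 0 without crashing, because the missing member is NULL rather than garbage.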
There. And what do we do with this structure? We put it into the s_op field of the struct super_block that we return from our own staticfs_read_super(). Have we looked at struct super_block yet? I don't think we have.
struct super_block {
        struct list_head        s_list;         /* Keep this first */
        kdev_t                  s_dev;
        unsigned long           s_blocksize;
        unsigned char           s_blocksize_bits;
        unsigned char           s_dirt;
        unsigned long long      s_maxbytes;     /* Max file size */
        struct file_system_type *s_type;
        struct super_operations *s_op;
        struct dquot_operations *dq_op;
        unsigned long           s_flags;
        unsigned long           s_magic;
        struct dentry           *s_root;
        struct rw_semaphore     s_umount;
        struct semaphore        s_lock;
        int                     s_count;
        atomic_t                s_active;

        struct list_head        s_dirty;        /* dirty inodes */
        struct list_head        s_locked_inodes;/* inodes being synced */
        struct list_head        s_files;

        struct block_device     *s_bdev;
        struct list_head        s_instances;
        struct quota_info s_dquot;      /* Diskquota specific options */

        union {
                ...
        } u;
        /*
         * The next field is for VFS *only*. No filesystems have any business
         * even looking at it. You had been warned.
         */
        struct semaphore s_vfs_rename_sem;      /* Kludge */

        /* The next field is used by knfsd when converting a (inode number based)
         * file handle into a dentry. As it builds a path in the dcache tree from
         * the bottom up, there may for a time be a subpath of dentrys which is not
         * connected to the main tree.  This semaphore ensure that there is only ever
         * one such free path per filesystem.  Note that unconnected files (or other
         * non-directories) are allowed, but not unconnected diretories.
         */
        struct semaphore s_nfsd_free_path_sem;
};
As I did with the inode structure, I've hidden the filesystem-specific entries in the u union. So, what do we want to do?

s_list is used by VFS to keep a list of all of the superblocks. We'll let it worry about that. s_dev is the device that this filesystem is on. It's not really a device in our case, and I think this value might already be set by VFS or the kernel, so we'll leave it.

s_blocksize and s_blocksize_bits we'll keep the same, as PAGE_CACHE_SIZE and PAGE_CACHE_SHIFT, like ramfs does.

static struct super_block *staticfs_read_super(struct super_block *sb, void *data, int silent) {
  sb->s_blocksize = PAGE_CACHE_SIZE;
  sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
We still don't know what the data and silent parameters are used for, so we'll also ignore them in our function. The s_dirt is used as a flag to say that the superblock has been "dirtied", or modified. We never write our superblock anywhere, so we don't care. s_maxbytes really doesn't matter, because we don't allow writing. Also, by looking at /usr/src/linux/fs/super.c, if alloc_super() is called to create the struct super_block passed to staticfs_read_super(), then s_maxbytes comes preset with MAX_NON_LFS, so we'll work on the assumption that that's just fine. s_type, a pointer to a struct file_system_type, is probably populated through the call to register_filesystem() that was made in init_staticfs_fs(). No other filesystem sets this value, and it's needed for finding the right VFS module to handle a given filesystem, so I'd say it must be set already.

March 12, 2004

Next are s_op and dq_op, one familiar, one not. s_op is where we stick our struct super_operations. dq_op are disk quota operations, and none of the VFS modules seem to implement their own, so we'll assume that the stock ones found in /usr/src/linux/fs/dquot.c are good enough.

  sb->s_op = &staticfs_ops;
s_flags allows us to define mount flags, as defined in fs.h, such as MS_NOSUID to ignore suid and sgid bits, or MS_NOEXEC to disallow program execution from the mounted filesystem. We'll follow romfs's lead and use MS_RDONLY. Perhaps this is how VFS will catch any attempts to write to our staticfs, even though we're setting the write flag on our inodes?

s_magic gets our magic number, as defined earlier.

  sb->s_flags = MS_RDONLY;
  sb->s_magic = STATICFS_MAGIC;
Next we have s_root. We've touched on this field before, when trying to put some sense into put_inode(). Both ramfs and romfs use d_alloc_root() on an inode to allocate the required struct dentry, so we will as well. Let's try romfs's method of making a root inode, which is to use iget(). We're going to call our root inode 0.
  sb->s_root = d_alloc_root(iget(sb,0));
  if (!sb->s_root) {
    return NULL;
  }
Note that I also added in some code to check that d_alloc_root() returned with a non-NULL value.

It looks like everything else in the superblock is used for maintaining lists, setting semaphores and keeping count of usage, so this is all stuff that VFS gets to set, not us. Let's return the superblock and say we're done!

  return sb;
}