Writing Filesystems - Userdata I/O

From Genunix

This article has been identified as a draft. It is currently undergoing a community review. Please add your comments to the discussion page.

Do not quote any text on this page! It is still a draft!


UNIX provides two different methods for performing userdata I/O operations:

  1. via the read(2) and write(2) system calls and their variants
  2. using memory-mapped file access, mmap(2).

Solaris filesystems perform the actual I/O to the block device in a common code path for both, and that has consequences for what a filesystem implementation must look like. This section shows how userdata I/O works, so that the implementation of VOP_READ() and VOP_WRITE() that follows becomes easy to understand.

Part 1 - mmap-based I/O

Before we look at VOP_READ() and VOP_WRITE(), though, let's see how mmaped I/O works and what vnode operations a filesystem must implement to support it. This is easy to see with DTrace. Try the following C program and the associated D script:

mmaped I/O demonstration program (mmaptest.c), followed by the D script to find filesystem mmap backends:
/*
 * mmaptest.c
 * A simple program to demonstrate mmaped I/O
 */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/param.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int main(int argc, char **argv)
{
	int fd;
	char localbuf[PAGESIZE];
	char *mapbase;

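	/* A file name is required; argv[1] is not valid without it. */
	if (argc < 2) {
		(void) fprintf(stderr, "usage: %s file\n", argv[0]);
		return (-1);
	}

	if ((fd = open(argv[1], O_RDWR)) < 0) {
		perror("open failed");
		return (-1);
	}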

	mapbase = mmap(NULL, PAGESIZE,
	    PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	if (mapbase == MAP_FAILED) {	/* mmap() reports failure as MAP_FAILED, not NULL */
		perror("mmap failed");
		close(fd);
		return (-1);
	}
	close(fd);

	memcpy(localbuf, mapbase, PAGESIZE);
	sleep(5);
	memset(mapbase, 'A', PAGESIZE);
	sleep(5);
	msync(mapbase, PAGESIZE, MS_SYNC);
	sleep(5);
	munmap(mapbase, PAGESIZE);
	return (0);
}

#!/usr/sbin/dtrace -s

syscall:::entry, fbt::trap:entry
/execname == "mmaptest"/
{
	self->t = 1;
}

syscall:::return, fbt::trap:return
/self->t/
{
	self->t = 0;
}

fbt::fop_*:entry
/self->t/
{
	self->t = 2;
}

fbt:pcfs::entry
/self->t == 2/
{
	stack();
	ustack();
	self->t = 1;
}

Running this, or modifying it so that it works on other filesystem types (by replacing the pcfs module name in the last probe), is left as an exercise for the reader. In any case, the important steps are:

C source and backend (each C statement is followed by the filesystem entry points it triggered):
	mapbase = mmap(NULL, PAGESIZE,
	    PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
  1  42626                   pcfs_map:entry 
              genunix`fop_map+0x50
              genunix`smmap_common+0x257
              genunix`smmap32+0xaa
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`mmap+0x7
              mmaptest`0x80508a2

  1  42630                pcfs_addmap:entry 
              genunix`fop_addmap+0x5c
              genunix`segvn_create+0x2b7
              genunix`as_map_locked+0x1a9
              genunix`as_map+0x5a
              pcfs`pcfs_map+0x13e
              genunix`fop_map+0x50
              genunix`smmap_common+0x257
              genunix`smmap32+0xaa
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`mmap+0x7
              mmaptest`0x80508a2
	memcpy(localbuf, mapbase, PAGESIZE);
  1  42622               pcfs_getpage:entry 
              genunix`fop_getpage+0x52
              genunix`segvn_fault+0xdde
              genunix`as_fault+0x61d
              unix`pagefault+0xad
              unix`trap+0xecc
              unix`_cmntrap+0x201

              libc.so.1`memcpy+0xff
              mmaptest`0x80508a2
	memset(mapbase, 'A', PAGESIZE);
(This step does not show up: no filesystem entry point is called.)
	msync(mapbase, PAGESIZE, MS_SYNC);
  0  42624               pcfs_putpage:entry 
              genunix`fop_putpage+0x3a
              genunix`segvn_sync+0x104
              genunix`as_ctl+0x204
              genunix`memcntl+0x77a
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`memcntl+0x7
              libc.so.1`msync+0x97
              mmaptest`main+0x12e
              mmaptest`0x80508a2
	munmap(mapbase, PAGESIZE);
  0  42632                pcfs_delmap:entry 
              genunix`fop_delmap+0x5b
              genunix`segvn_unmap+0x11c
              genunix`as_unmap+0x11e
              genunix`munmap+0x92
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`munmap+0x7
              mmaptest`0x80508a2

[ ... ]

  0  42624               pcfs_putpage:entry 
              genunix`fop_putpage+0x3a
              pcfs`syncpcp+0x43
              pcfs`pc_rele+0x9d
              pcfs`pcfs_inactive+0x7d
              genunix`fop_inactive+0x93
              genunix`vn_rele+0x66
              genunix`segvn_free+0x1f9
              genunix`seg_free+0x40
              genunix`segvn_unmap+0x8e8
              genunix`as_unmap+0x11e
              genunix`munmap+0x92
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`munmap+0x7
              mmaptest`0x80508a2

From this we see that mmap-based I/O uses the following vnode ops, in sequence:

  1. When a mapping is created, VOP_MAP() is called by the framework to indicate the request.
  2. The filesystem's implementation of VOP_MAP() calls as_map() to create a VM segment.
  3. The VM framework will, on completion of the task, call VOP_ADDMAP() as a notification to the filesystem that the mapping is now 'active'.
  4. The first page fault on the new segment (whether caused by a memory load or a store) requires the data backing the mapping to be brought in from the file on disk.
    The segment driver handling the fault calls VOP_GETPAGE() to ask the filesystem to do this.
  5. Further accesses, whether read or write, cause no more calls into the filesystem code until the modified data needs to be synchronized back to disk. This can be triggered by an explicit call to msync(), or by delayed writeback from the page daemon (pageout) or from fsflush, which periodically writes dirty pages back to disk.
    Such a request makes the framework call VOP_PUTPAGE() in the filesystem.
  6. Removing the mapping results in the segment driver calling VOP_DELMAP().

So one definite conclusion is that the filesystem must perform the actual I/O operations in VOP_GETPAGE() and VOP_PUTPAGE() in order to support mmap-based I/O as described above. We will soon see what this code actually looks like.
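
The ordering in the list above can be mimicked with a toy user-space model. This is only a sketch: the toy_* names, the structure and the signatures are hypothetical and are not the real Solaris vnode interface. The model merely records which operation would be invoked at each step of mmaptest.c.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the vnode operations seen in the traces. */
enum toy_op { TOY_MAP, TOY_ADDMAP, TOY_GETPAGE, TOY_PUTPAGE, TOY_DELMAP };

#define	TOY_LOGMAX	16

struct toy_vnode {
	int resident;			/* page already brought in? */
	int dirty;			/* page modified since last writeback? */
	int nlog;			/* number of recorded operations */
	enum toy_op log[TOY_LOGMAX];	/* operations in call order */
};

void toy_record(struct toy_vnode *vp, enum toy_op op)
{
	assert(vp != NULL && vp->nlog < TOY_LOGMAX);
	vp->log[vp->nlog++] = op;
}

/* mmap(): the map operation creates the segment, then addmap is notified. */
void toy_mmap(struct toy_vnode *vp)
{
	toy_record(vp, TOY_MAP);
	toy_record(vp, TOY_ADDMAP);
}

/* Load or store: only the first access faults and needs getpage. */
void toy_access(struct toy_vnode *vp, int is_store)
{
	if (!vp->resident) {
		toy_record(vp, TOY_GETPAGE);
		vp->resident = 1;
	}
	if (is_store)
		vp->dirty = 1;
}

/* msync(): ask the filesystem to write modified data back. */
void toy_msync(struct toy_vnode *vp)
{
	toy_record(vp, TOY_PUTPAGE);
	vp->dirty = 0;
}

/* munmap(): the segment driver announces that the mapping is gone. */
void toy_munmap(struct toy_vnode *vp)
{
	toy_record(vp, TOY_DELMAP);
}

/* Replays the steps of mmaptest.c; returns the number of recorded calls. */
int toy_mmaptest(struct toy_vnode *vp)
{
	memset(vp, 0, sizeof (*vp));
	toy_mmap(vp);
	toy_access(vp, 0);	/* memcpy() from the mapping */
	toy_access(vp, 1);	/* memset() on the mapping: no new call */
	toy_msync(vp);
	toy_munmap(vp);
	return (vp->nlog);
}
```

Calling toy_mmaptest() on a struct toy_vnode leaves five entries in its log: map, addmap, getpage, putpage, delmap. The additional VOP_PUTPAGE() that the real trace shows from pcfs_inactive() is deliberately left out of the model.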

Part 2 - I/O via system calls

But first, something that might be a little surprising. Let's change the C program above to perform the same sequence of I/O operations, but use "normal" system calls instead of mmap(). The DTrace script barely changes, but our C source will now look like this:

syscall I/O demonstration program (readwritetest.c), followed by the D script to find filesystem system call backends:
/*
 * readwritetest.c
 * A simple program to demonstrate
 * systemcall-based I/O
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/param.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int main(int argc, char **argv)
{
	int fd;
	char localbuf[PAGESIZE];

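	/* A file name is required; argv[1] is not valid without it. */
	if (argc < 2) {
		(void) fprintf(stderr, "usage: %s file\n", argv[0]);
		return (-1);
	}

	if ((fd = open(argv[1], O_RDWR)) < 0) {
		perror("open failed");
		return (-1);
	}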

	(void)read(fd, localbuf, PAGESIZE);
	sleep(5);
	memset(localbuf, 'A', PAGESIZE);
	(void)write(fd, localbuf, PAGESIZE);
	sleep(5);
	fsync(fd);
	sleep(5);
	close(fd);
	return (0);
}

#!/usr/sbin/dtrace -s

syscall:::entry
/execname == "readwritetest"/
{
	self->t = 1;
}

syscall:::return
/self->t/
{
	self->t = 0;
}

fbt::fop_*:entry
/self->t/
{
	self->t = 2;
}

fbt:pcfs::entry
/self->t == 2/
{
	stack();
	ustack();
	self->t = 1;
}

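Running this tells us how system-call-based I/O works. Note that the user stacks below come from a binary named rwt; the execname predicate in the D script must match the name of the binary you actually run. We see output like this: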

C source code and backend (each C statement is followed by the filesystem entry points it triggered):
	(void)read(fd, localbuf, PAGESIZE);
  1  42590                  pcfs_read:entry 
              genunix`fop_read+0x43
              genunix`read+0x2a4
              genunix`read32+0x20
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`_read+0x7
              rwt`main+0x84
              rwt`0x8050862

  1  42622               pcfs_getpage:entry 
              genunix`fop_getpage+0x52
              genunix`segmap_fault+0x241
              genunix`as_fault+0x61d
              unix`pagefault+0x226
              unix`trap+0x1596
              unix`_cmntrap+0x201
              unix`kcopy+0x4b
              genunix`uiomove+0x17f
              pcfs`rwpcp+0x4ff
              pcfs`pcfs_read+0x77
              genunix`fop_read+0x43
              genunix`read+0x2a4
              genunix`read32+0x20
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff
	(void)write(fd, localbuf, PAGESIZE);
  1  42594                 pcfs_write:entry 
              genunix`fop_write+0x43
              genunix`write+0x21d
              genunix`write32+0x20
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`_write+0x7
              rwt`main+0xc2
              rwt`0x8050862

  1  42622               pcfs_getpage:entry 
              genunix`fop_getpage+0x52
              genunix`segmap_fault+0x241
              genunix`as_fault+0x61d
              unix`pagefault+0x226
              unix`trap+0x1596
              unix`_cmntrap+0x201
              unix`do_copy_fault_nta+0x35
              genunix`uiomove+0xc8
              pcfs`rwpcp+0x46d
              pcfs`pcfs_write+0x91
              genunix`fop_write+0x43
              genunix`write+0x21d
              genunix`write32+0x20
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`_write+0x7
              rwt`main+0xc2
              rwt`0x8050862

  1  42624               pcfs_putpage:entry 
              genunix`fop_putpage+0x3a
              genunix`segmap_release+0x381
              pcfs`rwpcp+0x546
              pcfs`pcfs_write+0x91
              genunix`fop_write+0x43
              genunix`write+0x21d
              genunix`write32+0x20
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`_write+0x7
              rwt`main+0xc2
              rwt`0x8050862
	fsync(fd);
  1  42602                 pcfs_fsync:entry 
              genunix`fop_fsync+0x31
              genunix`fdsync+0x3b
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`__fdsync+0x7
              libc.so.1`fsync+0x8b
              rwt`main+0xd8
              rwt`0x8050862

  1  42624               pcfs_putpage:entry 
              genunix`fop_putpage+0x3a
              pcfs`syncpcp+0x43
              pcfs`pc_nodesync+0x41
              pcfs`pcfs_fsync+0x70
              genunix`fop_fsync+0x31
              genunix`fdsync+0x3b
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`__fdsync+0x7
              libc.so.1`fsync+0x8b
              rwt`main+0xd8
              rwt`0x8050862
	close(fd);
  1  42588                 pcfs_close:entry 
              genunix`fop_close+0x42
              genunix`closef+0xa1
              genunix`closeandsetf+0x45d
              genunix`close+0x16
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`_close+0x7
              rwt`main+0xee
              rwt`0x8050862

  1  42604              pcfs_inactive:entry 
              genunix`fop_inactive+0x93
              genunix`vn_rele+0x66
              genunix`closef+0xc9
              genunix`closeandsetf+0x45d
              genunix`close+0x16
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`_close+0x7
              rwt`main+0xee
              rwt`0x8050862

  1  42624               pcfs_putpage:entry 
              genunix`fop_putpage+0x3a
              pcfs`syncpcp+0x43
              pcfs`pc_rele+0x9d
              pcfs`pcfs_inactive+0x7d
              genunix`fop_inactive+0x93
              genunix`vn_rele+0x66
              genunix`closef+0xc9
              genunix`closeandsetf+0x45d
              genunix`close+0x16
              genunix`dtrace_systrace_syscall32+0x11f
              unix`sys_syscall32+0x1ff

              libc.so.1`_close+0x7
              rwt`main+0xee
              rwt`0x8050862

These code paths clearly show that the implementations of VOP_READ() and VOP_WRITE() do not perform I/O operations themselves. Instead, they use a specific segment driver, segmap, to create temporary VM mappings, and then delegate the actual I/O to VOP_GETPAGE() and VOP_PUTPAGE(), either by faulting on the mapping (the uiomove() copy in the traces above) or through dedicated segmap functions such as segmap_release().
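
The idea of "read by mapping a temporary window and copying out of it" can be imitated in user space. The following is an analogy only, not how pcfs or segmap are implemented: read_via_mapping() and read_via_mapping_demo() are hypothetical names, and mmap()/memcpy()/munmap() merely stand in for the map, copy and release steps visible in the traces. The sketch assumes a page-aligned offset and a range that lies within the file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * User-space analogy, not kernel code: "read" len bytes at the
 * page-aligned offset off by mapping a temporary window over the file
 * and copying out of it. The caller must keep off + len within the file.
 */
ssize_t read_via_mapping(int fd, void *buf, size_t len, off_t off)
{
	void *win;

	assert(buf != NULL && len > 0);
	win = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	if (win == MAP_FAILED)
		return (-1);
	memcpy(buf, win, len);		/* touching the window faults data in */
	(void) munmap(win, len);	/* drop the temporary mapping again */
	return ((ssize_t)len);
}

/* Hypothetical self-check: returns 0 if the copy matches what was written. */
int read_via_mapping_demo(void)
{
	char path[] = "/tmp/rvmdemoXXXXXX";
	const char msg[] = "data written with write(2)";
	char buf[sizeof (msg)];
	int fd, rc = -1;

	if ((fd = mkstemp(path)) < 0)
		return (-1);
	(void) unlink(path);		/* file disappears once fd is closed */
	if (write(fd, msg, sizeof (msg)) == (ssize_t)sizeof (msg) &&
	    read_via_mapping(fd, buf, sizeof (buf), 0) ==
	    (ssize_t)sizeof (buf) &&
	    memcmp(buf, msg, sizeof (msg)) == 0)
		rc = 0;
	(void) close(fd);
	return (rc);
}
```

The copy inside read_via_mapping() plays the role that uiomove() plays in the kernel traces: the copy itself triggers the fault that brings the data in, so the function never issues an explicit read.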

Phew, that was long. Why do it like this? There are two reasons. The first, not duplicating code, is obvious but alone might not justify the strange segmap effort. The second is compelling: userdata lives in the system's page cache, and it must not matter which code path populates that cache. We must find the same data there whether we use VOP_READ() or VOP_GETPAGE(). Page cache management therefore more or less forces a common backend for mmap-based and system-call-based I/O, and segmap provides it.
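
A small user-space experiment can make the shared page cache visible. The sketch below stores data through a MAP_SHARED mapping and then reads the same range back through a system call, without calling msync() in between. The function name, the temporary file path and the DEMO_MAX limit are assumptions made for this illustration. POSIX does not promise this coherence without msync(), so treat the result as an observation about systems with a common page cache rather than as a guarantee.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

#define	DEMO_MAX	8192

/*
 * Returns 1 if read-style access sees data stored through a shared
 * mapping without any msync(), 0 if it does not, -1 on setup failure.
 */
int mapping_and_read_agree(size_t len)
{
	char path[] = "/tmp/pgcdemoXXXXXX";
	char viaread[DEMO_MAX];
	char *map;
	int fd, same = -1;

	assert(len > 0 && len <= DEMO_MAX);
	if ((fd = mkstemp(path)) < 0)
		return (-1);
	(void) unlink(path);		/* file disappears once fd is closed */
	if (ftruncate(fd, (off_t)len) != 0) {
		(void) close(fd);
		return (-1);
	}
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		(void) close(fd);
		return (-1);
	}
	memset(map, 'A', len);		/* store through the mapping */
	if (pread(fd, viaread, len, 0) == (ssize_t)len)	/* load via syscall */
		same = (memcmp(map, viaread, len) == 0);
	(void) munmap(map, len);
	(void) close(fd);
	return (same);
}
```

On a system where both paths are served from one page cache, as described above, mapping_and_read_agree(4096) is expected to return 1.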
