xv6/web/l-name.html

<title>L11</title>
<html>
<head>
</head>
<body>

<h1>Naming in file systems</h1>

<p>Required reading: nami(), and all other file system code.

<h2>Overview</h2>

<p>To help users to remember where they stored their data, most
systems allow users to assign their own names to their data.
Typically the data is organized in files and users assign names to
files.  To deal with many files, users can organize their files in
directories, in a hierarchical manner.  Each name is a pathname, with
the components separated by "/".

<p>To avoid that users have to type long abolute names (i.e., names
starting with "/" in Unix), users can change their working directory
and use relative names (i.e., naming that don't start with "/").

<p>User file namespace operations include create, mkdir, mv, ln
(link), unlink, and chdir. (How is "mv a b" implemented in xv6?
Answer: "link a b"; "unlink a".)  To be able to name the current
directory and the parent directory every directory includes two
entries "." and "..".  Files and directories can reclaimed if users
cannot name it anymore (i.e., after the last unlink).

<p>Recall from last lecture, all directories entries contain a name,
followed by an inode number. The inode number names an inode of the
file system.  How can we merge file systems from different disks into
a single name space?

<p>A user grafts new file systems on a name space using mount.  Umount
removes a file system from the name space.  (In DOS, a file system is
named by its device letter.)  Mount takes the root inode of the
to-be-mounted file system and grafts it on the inode of the name space
entry where the file system is mounted (e.g., /mnt/disk1). The
in-memory inode of /mnt/disk1 records the major and minor number of
the file system mounted on it.  When namei sees an inode on which a
file system is mounted, it looks up the root inode of the mounted file
system, and proceeds with that inode.

<p>Mount is not a durable operation; it doesn't surive power failures.
After a power failure, the system administrator must remount the file
system (i.e., often in a startup script that is run from init).

<p>Links are convenient, because with users can create synonyms for
  file names.  But, it creates the potential of introducing cycles in
  the naning tree.  For example, consider link("a/b/c", "a").  This
  makes c a synonym for a. This cycle can complicate matters; for
  example:
<ul>
<li>If a user subsequently calls unlink ("a"), then the user cannot
  name the directory "b" and the link "c" anymore, but how can the
  file system decide that?
</ul>

<p>This problem can be solved by detecting cycles.  The second problem
  can be solved by computing with files are reacheable from "/" and
  reclaim all the ones that aren't reacheable.  Unix takes a simpler
  approach: avoid cycles by disallowing users to create links for
  directories.  If there are no cycles, then reference counts can be
  used to see if a file is still referenced. In the inode maintain a
  field for counting references (nlink in xv6's dinode). link
  increases the reference count, and unlink decreases the count; if
  the count reaches zero the inode and disk blocks can be reclaimed.

<p>How to handle symbolic links across file systems (i.e., from one
  mounted file system to another)?  Since inodes are not unique across
  file systems, we cannot create a link across file systems; the
  directory entry only contains an inode number, not the inode number
  and the name of the disk on which the inode is located.  To handle
  this case, Unix provides a second type of link, which are called
  soft links.

<p>Soft links are a special file type (e.g., T_SYMLINK).  If namei
  encounters a inode of type T_SYMLINK, it resolves the the name in
  the symlink file to an inode, and continues from there.  With
  symlinks one can create cycles and they can point to non-existing
  files.

<p>The design of the name system can have security implications. For
  example, if you tests if a name exists, and then use the name,
  between testing and using it an adversary can have change the
  binding from name to object.  Such problems are called TOCTTOU.

<p>An example of TOCTTOU is follows.  Let's say root runs a script
  every night to remove file in /tmp.  This gets rid off the files
  that editors might left behind, but we will never be used again. An
  adversary can exploit this script as follows:
<pre>
    Root                         Attacker
                                 mkdir ("/tmp/etc")
				 creat ("/tmp/etc/passw")
    readdir ("tmp");
    lstat ("tmp/etc");
    readdir ("tmp/etc");
                                 rename ("tmp/etc", "/tmp/x");
				 symlink ("etc", "/tmp/etc");
    unlink ("tmp/etc/passwd");
</pre>
Lstat checks whether /tmp/etc is not symbolic link, but by the time it
runs unlink the attacker had time to creat a symbolic link in the
place of /tmp/etc, with a password file of the adversary's choice.

<p>This problem could have been avoided if every user or process group
  had its own private /tmp, or if access to the shared one was
  mediated.

<h2>V6 code examples</h2>

<p> namei (sheet 46) is the core of the Unix naming system. namei can
  be called in several ways: NAMEI_LOOKUP (resolve a name to an inode
  and lock inode), NAMEI_CREATE (resolve a name, but lock parent
  inode), and NAMEI_DELETE (resolve a name, lock parent inode, and
  return offset in the directory).  The reason is that namei is
  complicated is that we want to atomically test if a name exist and
  remove/create it, if it does; otherwise, two concurrent processes
  could interfere with each other and directory could end up in an
  inconsistent state.

<p>Let's trace open("a", O_RDWR), focussing on namei:
<ul>
<li>5263: we will look at creating a file in a bit.
<li>5277: call namei with NAMEI_LOOKUP
<li>4629: if path name start with "/", lookup root inode (1).
<li>4632: otherwise, use inode for current working directory.
<li>4638: consume row of "/", for example in "/////a////b"
<li>4641: if we are done with NAMEI_LOOKUP, return inode (e.g.,
  namei("/")).
<li>4652: if the inode we are searching for a name isn't of type
  directory, give up.
<li>4657-4661: determine length of the current component of the
  pathname we are resolving.
<li>4663-4681: scan the directory for the component.
<li>4682-4696: the entry wasn't found. if we are the end of the
  pathname and NAMEI_CREATE is set, lock parent directory and return a
  pointer to the start of the component.  In all other case, unlock
  inode of directory, and return 0.
<li>4701: if NAMEI_DELETE is set, return locked parent inode and the
  offset of the to-be-deleted component in the directory.
<li>4707: lookup inode of the component, and go to the top of the loop.
</ul>

<p>Now let's look at creating a file in a directory:
<ul>
<li>5264: if the last component doesn't exist, but first part of the
  pathname resolved to a directory, then dp will be 0, last will point
  to the beginning of the last component, and ip will be the locked
  parent directory.
<li>5266: create an entry for last in the directory.
<li>4772: mknod1 allocates a new named inode and adds it to an
  existing directory.
<li>4776: ialloc. skan inode block, find unused entry, and write
  it. (if lucky 1 read and 1 write.)
<li>4784: fill out the inode entry, and write it. (another write)
<li>4786: write the entry into the directory (if lucky, 1 write)
</ul>

</ul>
Why must the parent directory be locked?  If two processes try to
create the same name in the same directory, only one should succeed
and the other one, should receive an error (file exist).

<p>Link, unlink, chdir, mount, umount could have taken file
descriptors instead of their path argument. In fact, this would get
rid of some possible race conditions (some of which have security
implications, TOCTTOU). However, this would require that the current
working directory be remembered by the process, and UNIX didn't have
good ways of maintaining static state shared among all processes
belonging to a given user. The easiest way is to create shared state
is to place it in the kernel.
   
<p>We have one piece of code in xv6 that we haven't studied: exec.
  With all the ground work we have done this code can be easily
  understood (see sheet 54).

</body>
DO NOT MAIL: xv6 web pages 2008-09-03 04:50:04 +00:00			`<title>L11</title>`
			`<html>`
			`<head>`
			`</head>`
			`<body>`

			`<h1>Naming in file systems</h1>`

			`<p>Required reading: nami(), and all other file system code.`

			`<h2>Overview</h2>`

			`<p>To help users to remember where they stored their data, most`
			`systems allow users to assign their own names to their data.`
			`Typically the data is organized in files and users assign names to`
			`files. To deal with many files, users can organize their files in`
			`directories, in a hierarchical manner. Each name is a pathname, with`
			`the components separated by "/".`

			`<p>To avoid that users have to type long abolute names (i.e., names`
			`starting with "/" in Unix), users can change their working directory`
			`and use relative names (i.e., naming that don't start with "/").`

			`<p>User file namespace operations include create, mkdir, mv, ln`
			`(link), unlink, and chdir. (How is "mv a b" implemented in xv6?`
			`Answer: "link a b"; "unlink a".) To be able to name the current`
			`directory and the parent directory every directory includes two`
			`entries "." and "..". Files and directories can reclaimed if users`
			`cannot name it anymore (i.e., after the last unlink).`

			`<p>Recall from last lecture, all directories entries contain a name,`
			`followed by an inode number. The inode number names an inode of the`
			`file system. How can we merge file systems from different disks into`
			`a single name space?`

			`<p>A user grafts new file systems on a name space using mount. Umount`
			`removes a file system from the name space. (In DOS, a file system is`
			`named by its device letter.) Mount takes the root inode of the`
			`to-be-mounted file system and grafts it on the inode of the name space`
			`entry where the file system is mounted (e.g., /mnt/disk1). The`
			`in-memory inode of /mnt/disk1 records the major and minor number of`
			`the file system mounted on it. When namei sees an inode on which a`
			`file system is mounted, it looks up the root inode of the mounted file`
			`system, and proceeds with that inode.`

			`<p>Mount is not a durable operation; it doesn't surive power failures.`
			`After a power failure, the system administrator must remount the file`
			`system (i.e., often in a startup script that is run from init).`

			`<p>Links are convenient, because with users can create synonyms for`
			`file names. But, it creates the potential of introducing cycles in`
			`the naning tree. For example, consider link("a/b/c", "a"). This`
			`makes c a synonym for a. This cycle can complicate matters; for`
			`example:`
			`<ul>`
			`<li>If a user subsequently calls unlink ("a"), then the user cannot`
			`name the directory "b" and the link "c" anymore, but how can the`
			`file system decide that?`
			`</ul>`

			`<p>This problem can be solved by detecting cycles. The second problem`
			`can be solved by computing with files are reacheable from "/" and`
			`reclaim all the ones that aren't reacheable. Unix takes a simpler`
			`approach: avoid cycles by disallowing users to create links for`
			`directories. If there are no cycles, then reference counts can be`
			`used to see if a file is still referenced. In the inode maintain a`
			`field for counting references (nlink in xv6's dinode). link`
			`increases the reference count, and unlink decreases the count; if`
			`the count reaches zero the inode and disk blocks can be reclaimed.`

			`<p>How to handle symbolic links across file systems (i.e., from one`
			`mounted file system to another)? Since inodes are not unique across`
			`file systems, we cannot create a link across file systems; the`
			`directory entry only contains an inode number, not the inode number`
			`and the name of the disk on which the inode is located. To handle`
			`this case, Unix provides a second type of link, which are called`
			`soft links.`

			`<p>Soft links are a special file type (e.g., T_SYMLINK). If namei`
			`encounters a inode of type T_SYMLINK, it resolves the the name in`
			`the symlink file to an inode, and continues from there. With`
			`symlinks one can create cycles and they can point to non-existing`
			`files.`

			`<p>The design of the name system can have security implications. For`
			`example, if you tests if a name exists, and then use the name,`
			`between testing and using it an adversary can have change the`
			`binding from name to object. Such problems are called TOCTTOU.`

			`<p>An example of TOCTTOU is follows. Let's say root runs a script`
			`every night to remove file in /tmp. This gets rid off the files`
			`that editors might left behind, but we will never be used again. An`
			`adversary can exploit this script as follows:`
			`<pre>`
			`Root Attacker`
			`mkdir ("/tmp/etc")`
			`creat ("/tmp/etc/passw")`
			`readdir ("tmp");`
			`lstat ("tmp/etc");`
			`readdir ("tmp/etc");`
			`rename ("tmp/etc", "/tmp/x");`
			`symlink ("etc", "/tmp/etc");`
			`unlink ("tmp/etc/passwd");`
			`</pre>`
			`Lstat checks whether /tmp/etc is not symbolic link, but by the time it`
			`runs unlink the attacker had time to creat a symbolic link in the`
			`place of /tmp/etc, with a password file of the adversary's choice.`

			`<p>This problem could have been avoided if every user or process group`
			`had its own private /tmp, or if access to the shared one was`
			`mediated.`

			`<h2>V6 code examples</h2>`

			`<p> namei (sheet 46) is the core of the Unix naming system. namei can`
			`be called in several ways: NAMEI_LOOKUP (resolve a name to an inode`
			`and lock inode), NAMEI_CREATE (resolve a name, but lock parent`
			`inode), and NAMEI_DELETE (resolve a name, lock parent inode, and`
			`return offset in the directory). The reason is that namei is`
			`complicated is that we want to atomically test if a name exist and`
			`remove/create it, if it does; otherwise, two concurrent processes`
			`could interfere with each other and directory could end up in an`
			`inconsistent state.`

			`<p>Let's trace open("a", O_RDWR), focussing on namei:`
			`<ul>`
			`<li>5263: we will look at creating a file in a bit.`
			`<li>5277: call namei with NAMEI_LOOKUP`
			`<li>4629: if path name start with "/", lookup root inode (1).`
			`<li>4632: otherwise, use inode for current working directory.`
			`<li>4638: consume row of "/", for example in "/////a////b"`
			`<li>4641: if we are done with NAMEI_LOOKUP, return inode (e.g.,`
			`namei("/")).`
			`<li>4652: if the inode we are searching for a name isn't of type`
			`directory, give up.`
			`<li>4657-4661: determine length of the current component of the`
			`pathname we are resolving.`
			`<li>4663-4681: scan the directory for the component.`
			`<li>4682-4696: the entry wasn't found. if we are the end of the`
			`pathname and NAMEI_CREATE is set, lock parent directory and return a`
			`pointer to the start of the component. In all other case, unlock`
			`inode of directory, and return 0.`
			`<li>4701: if NAMEI_DELETE is set, return locked parent inode and the`
			`offset of the to-be-deleted component in the directory.`
			`<li>4707: lookup inode of the component, and go to the top of the loop.`
			`</ul>`

			`<p>Now let's look at creating a file in a directory:`
			`<ul>`
			`<li>5264: if the last component doesn't exist, but first part of the`
			`pathname resolved to a directory, then dp will be 0, last will point`
			`to the beginning of the last component, and ip will be the locked`
			`parent directory.`
			`<li>5266: create an entry for last in the directory.`
			`<li>4772: mknod1 allocates a new named inode and adds it to an`
			`existing directory.`
			`<li>4776: ialloc. skan inode block, find unused entry, and write`
			`it. (if lucky 1 read and 1 write.)`
			`<li>4784: fill out the inode entry, and write it. (another write)`
			`<li>4786: write the entry into the directory (if lucky, 1 write)`
			`</ul>`

			`</ul>`
			`Why must the parent directory be locked? If two processes try to`
			`create the same name in the same directory, only one should succeed`
			`and the other one, should receive an error (file exist).`

			`<p>Link, unlink, chdir, mount, umount could have taken file`
			`descriptors instead of their path argument. In fact, this would get`
			`rid of some possible race conditions (some of which have security`
			`implications, TOCTTOU). However, this would require that the current`
			`working directory be remembered by the process, and UNIX didn't have`
			`good ways of maintaining static state shared among all processes`
			`belonging to a given user. The easiest way is to create shared state`
			`is to place it in the kernel.`

			`<p>We have one piece of code in xv6 that we haven't studied: exec.`
			`With all the ground work we have done this code can be easily`
			`understood (see sheet 54).`

			`</body>`