DO NOT MAIL: xv6 web pages
This commit is contained in:
parent
ee3f75f229
commit
f53494c28e
37 changed files with 9034 additions and 0 deletions
3
web/Makefile
Normal file
|
@ -0,0 +1,3 @@
|
|||
index.html: index.txt mkhtml
|
||||
mkhtml index.txt >_$@ && mv _$@ $@
|
||||
|
353
web/index.html
Normal file
|
@ -0,0 +1,353 @@
|
|||
<!-- AUTOMATICALLY GENERATED: EDIT the .txt version, not the .html version -->
|
||||
<html>
|
||||
<head>
|
||||
<title>Xv6, a simple Unix-like teaching operating system</title>
|
||||
<style type="text/css"><!--
|
||||
body {
|
||||
background-color: white;
|
||||
color: black;
|
||||
font-size: medium;
|
||||
line-height: 1.2em;
|
||||
margin-left: 0.5in;
|
||||
margin-right: 0.5in;
|
||||
margin-top: 0;
|
||||
margin-bottom: 0;
|
||||
}
|
||||
|
||||
h1 {
|
||||
text-indent: 0in;
|
||||
text-align: left;
|
||||
margin-top: 2em;
|
||||
font-weight: bold;
|
||||
font-size: 1.4em;
|
||||
}
|
||||
|
||||
h2 {
|
||||
text-indent: 0in;
|
||||
text-align: left;
|
||||
margin-top: 2em;
|
||||
font-weight: bold;
|
||||
font-size: 1.2em;
|
||||
}
|
||||
--></style>
|
||||
</head>
|
||||
<body bgcolor=#ffffff>
|
||||
<h1>Xv6, a simple Unix-like teaching operating system</h1>
|
||||
<br><br>
|
||||
Xv6 is a teaching operating system developed
|
||||
in the summer of 2006 for MIT's operating systems course,
|
||||
“6.828: Operating Systems Engineering.”
|
||||
We used it for 6.828 in Fall 2006 and Fall 2007
|
||||
and are using it this semester (Fall 2008).
|
||||
We hope that xv6 will be useful in other courses too.
|
||||
This page collects resources to aid the use of xv6
|
||||
in other courses.
|
||||
|
||||
<h2>History and Background</h2>
|
||||
For many years, MIT had no operating systems course.
|
||||
In the fall of 2002, Frans Kaashoek, Josh Cates, and Emil Sit
|
||||
created a new, experimental course (6.097)
|
||||
to teach operating systems engineering.
|
||||
In the course lectures, the class worked through Sixth Edition Unix (aka V6)
|
||||
using John Lions's famous commentary.
|
||||
In the lab assignments, students wrote most of an exokernel operating
|
||||
system, eventually named Jos, for the Intel x86.
|
||||
Exposing students to multiple systems–V6 and Jos–helped
|
||||
develop a sense of the spectrum of operating system designs.
|
||||
In the fall of 2003, the experimental 6.097 became the
|
||||
official course 6.828; the course has been offered each fall since then.
|
||||
<br><br>
|
||||
V6 presented pedagogic challenges from the start.
|
||||
Students doubted the relevance of an obsolete 30-year-old operating system
|
||||
written in an obsolete programming language (pre-K&R C)
|
||||
running on obsolete hardware (the PDP-11).
|
||||
Students also struggled to learn the low-level details of two different
|
||||
architectures (the PDP-11 and the Intel x86) at the same time.
|
||||
By the summer of 2006, we had decided to replace V6
|
||||
with a new operating system, xv6, modeled on V6
|
||||
but written in ANSI C and running on multiprocessor
|
||||
Intel x86 machines.
|
||||
Xv6's use of the x86 makes it more relevant to
|
||||
students' experience than V6 was
|
||||
and unifies the course around a single architecture.
|
||||
Adding multiprocessor support also helps relevance
|
||||
and makes it easier to discuss threads and concurrency.
|
||||
(In a single processor operating system, concurrency–which only
|
||||
happens because of interrupts–is too easy to view as a special case.
|
||||
A multiprocessor operating system must attack the problem head on.)
|
||||
Finally, writing a new system allowed us to write cleaner versions
|
||||
of the rougher parts of V6, like the scheduler and file system.
|
||||
<br><br>
|
||||
6.828 substituted xv6 for V6 in the fall of 2006.
|
||||
Based on that experience, we cleaned up rough patches
|
||||
of xv6 for the course in the fall of 2007.
|
||||
Since then, xv6 has stabilized, so we are making it
|
||||
available in the hopes that others will find it useful too.
|
||||
<br><br>
|
||||
6.828 uses both xv6 and Jos.
|
||||
Courses taught at UCLA, NYU, and Stanford have used
|
||||
Jos without xv6; we believe other courses could use
|
||||
xv6 without Jos, though we are not aware of any that have.
|
||||
|
||||
<h2>Xv6 sources</h2>
|
||||
The latest xv6 is <a href="xv6-rev2.tar.gz">xv6-rev2.tar.gz</a>.
|
||||
We distribute the sources in electronic form but also as
|
||||
a printed booklet with line numbers that keep everyone
|
||||
together during lectures. The booklet is available as
|
||||
<a href="xv6-rev2.pdf">xv6-rev2.pdf</a>.
|
||||
<br><br>
|
||||
xv6 compiles using the GNU C compiler,
|
||||
targeted at the x86 using ELF binaries.
|
||||
On BSD and Linux systems, you can use the native compilers;
|
||||
on OS X, which doesn't use ELF binaries,
|
||||
you must use a cross-compiler.
|
||||
Xv6 does boot on real hardware, but typically
|
||||
we run it using the Bochs emulator.
|
||||
Both the GCC cross compiler and Bochs
|
||||
can be found on the <a href="../../2007/tools.html">6.828 tools page</a>.
|
||||
|
||||
<h2>Lectures</h2>
|
||||
In 6.828, the lectures in the first half of the course
|
||||
introduce the PC hardware, the Intel x86, and then xv6.
|
||||
The lectures in the second half consider advanced topics
|
||||
using research papers; for some, xv6 serves as a useful
|
||||
base for making discussions concrete.
|
||||
This section describes a typical 6.828 lecture schedule,
|
||||
linking to lecture notes and homework.
|
||||
A course using only xv6 (not Jos) will need to adapt
|
||||
a few of the lectures, but we hope these are a useful
|
||||
starting point.
|
||||
|
||||
<br><br><b><i>Lecture 1. Operating systems</i></b>
|
||||
<br><br>
|
||||
The first lecture introduces both the general topic of
|
||||
operating systems and the specific approach of 6.828.
|
||||
After defining “operating system,” the lecture
|
||||
examines the implementation of a Unix shell
|
||||
to look at the details of the traditional Unix system call interface.
|
||||
This is relevant to both xv6 and Jos: in the final
|
||||
Jos labs, students implement a Unix-like interface
|
||||
culminating in a Unix shell.
|
||||
<br><br>
|
||||
<a href="l1.html">lecture notes</a>
|
||||
|
||||
<br><br><b><i>Lecture 2. PC hardware and x86 programming</i></b>
|
||||
<br><br>
|
||||
This lecture introduces the PC architecture, the 16- and 32-bit x86,
|
||||
the stack, and the GCC x86 calling conventions.
|
||||
It also introduces the pieces of a typical C tool chain–compiler,
|
||||
assembler, linker, loader–and the Bochs emulator.
|
||||
<br><br>
|
||||
Reading: PC Assembly Language
|
||||
<br><br>
|
||||
Homework: familiarize yourself with Bochs
|
||||
<br><br>
|
||||
<a href="l2.html">lecture notes</a>
|
||||
<a href="x86-intro.html">homework</a>
|
||||
|
||||
<br><br><b><i>Lecture 3. Operating system organization</i></b>
|
||||
<br><br>
|
||||
This lecture continues Lecture 1's discussion of what
|
||||
an operating system does.
|
||||
An operating system provides a “virtual computer”
|
||||
interface to user space programs.
|
||||
At a high level, the main job of the operating system
|
||||
is to implement that interface
|
||||
using the physical computer it runs on.
|
||||
<br><br>
|
||||
The lecture discusses four approaches to that job:
|
||||
monolithic operating systems, microkernels,
|
||||
virtual machines, and exokernels.
|
||||
Exokernels might not be worth mentioning
|
||||
except that the Jos labs are built around one.
|
||||
<br><br>
|
||||
Reading: Engler et al., Exokernel: An Operating System Architecture
|
||||
for Application-Level Resource Management
|
||||
<br><br>
|
||||
<a href="l3.html">lecture notes</a>
|
||||
|
||||
<br><br><b><i>Lecture 4. Address spaces using segmentation</i></b>
|
||||
<br><br>
|
||||
This is the first lecture that uses xv6.
|
||||
It introduces the idea of address spaces and the
|
||||
details of the x86 segmentation hardware.
|
||||
It makes the discussion concrete by reading the xv6
|
||||
source code and watching xv6 execute using the Bochs simulator.
|
||||
<br><br>
|
||||
Reading: x86 MMU handout,
|
||||
xv6: bootasm.S, bootother.S, <a href="src/bootmain.c.html">bootmain.c</a>, <a href="src/main.c.html">main.c</a>, <a href="src/init.c.html">init.c</a>, and setupsegs in <a href="src/proc.c.html">proc.c</a>.
|
||||
<br><br>
|
||||
Homework: Bochs stack introduction
|
||||
<br><br>
|
||||
<a href="l4.html">lecture notes</a>
|
||||
<a href="xv6-intro.html">homework</a>
|
||||
|
||||
<br><br><b><i>Lecture 5. Address spaces using page tables</i></b>
|
||||
<br><br>
|
||||
This lecture continues the discussion of address spaces,
|
||||
examining the other x86 virtual memory mechanism: page tables.
|
||||
Xv6 does not use page tables, so there is no xv6 here.
|
||||
Instead, the lecture uses Jos as a concrete example.
|
||||
An xv6-only course might skip or shorten this discussion.
|
||||
<br><br>
|
||||
Reading: x86 manual excerpts
|
||||
<br><br>
|
||||
Homework: stuff about gdt
|
||||
XXX not appropriate; should be in Lecture 4
|
||||
<br><br>
|
||||
<a href="l5.html">lecture notes</a>
|
||||
|
||||
<br><br><b><i>Lecture 6. Interrupts and exceptions</i></b>
|
||||
<br><br>
|
||||
How does a user program invoke the operating system kernel?
|
||||
How does the kernel return to the user program?
|
||||
What happens when a hardware device needs attention?
|
||||
This lecture explains the answers to these questions:
|
||||
interrupt and exception handling.
|
||||
<br><br>
|
||||
It explains the x86 trap setup mechanisms and then
|
||||
examines their use in xv6's SETGATE (<a href="src/mmu.h.html">mmu.h</a>),
|
||||
tvinit (<a href="src/trap.c.html">trap.c</a>), idtinit (<a href="src/trap.c.html">trap.c</a>), <a href="src/vectors.pl.html">vectors.pl</a>, and vectors.S.
|
||||
<br><br>
|
||||
It then traces through a call to the system call open:
|
||||
<a href="src/init.c.html">init.c</a>, usys.S, vector48 and alltraps (vectors.S), trap (<a href="src/trap.c.html">trap.c</a>),
|
||||
syscall (<a href="src/syscall.c.html">syscall.c</a>),
|
||||
sys_open (<a href="src/sysfile.c.html">sysfile.c</a>), fetcharg, fetchint, argint, argptr, argstr (<a href="src/syscall.c.html">syscall.c</a>).
|
||||
<br><br>
|
||||
The interrupt controller, briefly:
|
||||
pic_init and pic_enable (<a href="src/picirq.c.html">picirq.c</a>).
|
||||
The timer and keyboard, briefly:
|
||||
timer_init (<a href="src/timer.c.html">timer.c</a>), console_init (<a href="src/console.c.html">console.c</a>).
|
||||
Enabling and disabling of interrupts.
|
||||
<br><br>
|
||||
Reading: x86 manual excerpts,
|
||||
xv6: trapasm.S, <a href="src/trap.c.html">trap.c</a>, <a href="src/syscall.c.html">syscall.c</a>, and usys.S.
|
||||
Skim <a href="src/lapic.c.html">lapic.c</a>, <a href="src/ioapic.c.html">ioapic.c</a>, <a href="src/picirq.c.html">picirq.c</a>.
|
||||
<br><br>
|
||||
Homework: Explain the 35 words on the top of the
|
||||
stack at first invocation of <code>syscall</code>.
|
||||
<br><br>
|
||||
<a href="l-interrupt.html">lecture notes</a>
|
||||
<a href="x86-intr.html">homework</a>
|
||||
|
||||
<br><br><b><i>Lecture 7. Multiprocessors and locking</i></b>
|
||||
<br><br>
|
||||
This lecture introduces the problems of
|
||||
coordination and synchronization on a
|
||||
multiprocessor
|
||||
and then the solution of mutual exclusion locks.
|
||||
Atomic instructions, test-and-set locks,
|
||||
lock granularity, (the mistake of) recursive locks.
|
||||
<br><br>
|
||||
Although xv6 user programs cannot share memory,
|
||||
the xv6 kernel itself is a program with multiple threads
|
||||
executing concurrently and sharing memory.
|
||||
Illustration: the xv6 scheduler's proc_table_lock (<a href="src/proc.c.html">proc.c</a>)
|
||||
and the spin lock implementation (<a href="src/spinlock.c.html">spinlock.c</a>).
|
||||
<br><br>
|
||||
Reading: xv6: <a href="src/spinlock.c.html">spinlock.c</a>. Skim <a href="src/mp.c.html">mp.c</a>.
|
||||
<br><br>
|
||||
Homework: Interaction between locking and interrupts.
|
||||
Try not disabling interrupts in the disk driver and watch xv6 break.
|
||||
<br><br>
|
||||
<a href="l-lock.html">lecture notes</a>
|
||||
<a href="xv6-lock.html">homework</a>
|
||||
|
||||
<br><br><b><i>Lecture 8. Threads, processes and context switching</i></b>
|
||||
<br><br>
|
||||
The last lecture introduced some of the issues
|
||||
in writing threaded programs, using xv6's processes
|
||||
as an example.
|
||||
This lecture introduces the issues in implementing
|
||||
threads, continuing to use xv6 as the example.
|
||||
<br><br>
|
||||
The lecture defines a thread of computation as a register
|
||||
set and a stack. A process is an address space plus one
|
||||
or more threads of computation sharing that address space.
|
||||
Thus the xv6 kernel can be viewed as a single process
|
||||
with many threads (each user process) executing concurrently.
|
||||
<br><br>
|
||||
Illustrations: thread switching (swtch.S), scheduler (<a href="src/proc.c.html">proc.c</a>), sys_fork (<a href="src/sysproc.c.html">sysproc.c</a>)
|
||||
<br><br>
|
||||
Reading: <a href="src/proc.c.html">proc.c</a>, swtch.S, sys_fork (<a href="src/sysproc.c.html">sysproc.c</a>)
|
||||
<br><br>
|
||||
Homework: trace through stack switching.
|
||||
<br><br>
|
||||
<a href="l-threads.html">lecture notes (need to be updated to use swtch)</a>
|
||||
<a href="xv6-sched.html">homework</a>
|
||||
|
||||
<br><br><b><i>Lecture 9. Processes and coordination</i></b>
|
||||
<br><br>
|
||||
This lecture introduces the idea of sequence coordination
|
||||
and then examines the particular solution illustrated by
|
||||
sleep and wakeup (<a href="src/proc.c.html">proc.c</a>).
|
||||
It introduces and refines a simple
|
||||
producer/consumer queue to illustrate the
|
||||
need for sleep and wakeup
|
||||
and then the sleep and wakeup
|
||||
implementations themselves.
|
||||
<br><br>
|
||||
Reading: <a href="src/proc.c.html">proc.c</a>, sys_exec, sys_sbrk, sys_wait, sys_kill (<a href="src/sysproc.c.html">sysproc.c</a>).
|
||||
<br><br>
|
||||
Homework: Explain how sleep and wakeup would break
|
||||
without proc_table_lock. Explain how devices would break
|
||||
without the second lock argument to sleep.
|
||||
<br><br>
|
||||
<a href="l-coordination.html">lecture notes</a>
|
||||
<a href="xv6-sleep.html">homework</a>
|
||||
|
||||
<br><br><b><i>Lecture 10. Files and disk I/O</i></b>
|
||||
<br><br>
|
||||
This is the first of three file system lectures.
|
||||
This lecture introduces the basic file system interface
|
||||
and then considers the on-disk layout of individual files
|
||||
and the free block bitmap.
|
||||
<br><br>
|
||||
Reading: iread, iwrite, fileread, filewrite, wdir, mknod1, and
|
||||
code related to these calls in <a href="src/fs.c.html">fs.c</a>, <a href="src/bio.c.html">bio.c</a>, <a href="src/ide.c.html">ide.c</a>, and <a href="src/file.c.html">file.c</a>.
|
||||
<br><br>
|
||||
Homework: Add a print statement to bwrite to trace every disk write.
|
||||
Explain the disk writes caused by some simple shell commands.
|
||||
<br><br>
|
||||
<a href="l-fs.html">lecture notes</a>
|
||||
<a href="xv6-disk.html">homework</a>
|
||||
|
||||
<br><br><b><i>Lecture 11. Naming</i></b>
|
||||
<br><br>
|
||||
The last lecture discussed on-disk file system representation.
|
||||
This lecture covers the implementation of
|
||||
file system paths (namei in <a href="src/fs.c.html">fs.c</a>)
|
||||
and also discusses the security problems of a shared /tmp
|
||||
and symbolic links.
|
||||
<br><br>
|
||||
Understanding exec (<a href="src/exec.c.html">exec.c</a>) is left as an exercise.
|
||||
<br><br>
|
||||
Reading: namei in <a href="src/fs.c.html">fs.c</a>, <a href="src/sysfile.c.html">sysfile.c</a>, <a href="src/file.c.html">file.c</a>.
|
||||
<br><br>
|
||||
Homework: Explain how to implement symbolic links in xv6.
|
||||
<br><br>
|
||||
<a href="l-name.html">lecture notes</a>
|
||||
<a href="xv6-names.html">homework</a>
|
||||
|
||||
<br><br><b><i>Lecture 12. High-performance file systems</i></b>
|
||||
<br><br>
|
||||
This lecture is the first of the research paper-based lectures.
|
||||
It discusses the “soft updates” paper,
|
||||
using xv6 as a concrete example.
|
||||
|
||||
<h2>Feedback</h2>
|
||||
If you are interested in using xv6 or have used xv6 in a course,
|
||||
we would love to hear from you.
|
||||
If there's anything that we can do to make xv6 easier
|
||||
to adopt, we'd like to hear about it.
|
||||
We'd also be interested to hear what worked well and what didn't.
|
||||
<br><br>
|
||||
Russ Cox (rsc@swtch.com)<br>
|
||||
Frans Kaashoek (kaashoek@mit.edu)<br>
|
||||
Robert Morris (rtm@mit.edu)
|
||||
<br><br>
|
||||
You can reach all of us at 6.828-staff@pdos.csail.mit.edu.
|
||||
<br><br>
|
||||
<br><br>
|
||||
</body>
|
||||
</html>
|
335
web/index.txt
Normal file
|
@ -0,0 +1,335 @@
|
|||
** Xv6, a simple Unix-like teaching operating system
|
||||
Xv6 is a teaching operating system developed
|
||||
in the summer of 2006 for MIT's operating systems course,
|
||||
``6.828: Operating Systems Engineering.''
|
||||
We used it for 6.828 in Fall 2006 and Fall 2007
|
||||
and are using it this semester (Fall 2008).
|
||||
We hope that xv6 will be useful in other courses too.
|
||||
This page collects resources to aid the use of xv6
|
||||
in other courses.
|
||||
|
||||
* History and Background
|
||||
|
||||
For many years, MIT had no operating systems course.
|
||||
In the fall of 2002, Frans Kaashoek, Josh Cates, and Emil Sit
|
||||
created a new, experimental course (6.097)
|
||||
to teach operating systems engineering.
|
||||
In the course lectures, the class worked through Sixth Edition Unix (aka V6)
|
||||
using John Lions's famous commentary.
|
||||
In the lab assignments, students wrote most of an exokernel operating
|
||||
system, eventually named Jos, for the Intel x86.
|
||||
Exposing students to multiple systems--V6 and Jos--helped
|
||||
develop a sense of the spectrum of operating system designs.
|
||||
In the fall of 2003, the experimental 6.097 became the
|
||||
official course 6.828; the course has been offered each fall since then.
|
||||
|
||||
V6 presented pedagogic challenges from the start.
|
||||
Students doubted the relevance of an obsolete 30-year-old operating system
|
||||
written in an obsolete programming language (pre-K&R C)
|
||||
running on obsolete hardware (the PDP-11).
|
||||
Students also struggled to learn the low-level details of two different
|
||||
architectures (the PDP-11 and the Intel x86) at the same time.
|
||||
By the summer of 2006, we had decided to replace V6
|
||||
with a new operating system, xv6, modeled on V6
|
||||
but written in ANSI C and running on multiprocessor
|
||||
Intel x86 machines.
|
||||
Xv6's use of the x86 makes it more relevant to
|
||||
students' experience than V6 was
|
||||
and unifies the course around a single architecture.
|
||||
Adding multiprocessor support also helps relevance
|
||||
and makes it easier to discuss threads and concurrency.
|
||||
(In a single processor operating system, concurrency--which only
|
||||
happens because of interrupts--is too easy to view as a special case.
|
||||
A multiprocessor operating system must attack the problem head on.)
|
||||
Finally, writing a new system allowed us to write cleaner versions
|
||||
of the rougher parts of V6, like the scheduler and file system.
|
||||
|
||||
6.828 substituted xv6 for V6 in the fall of 2006.
|
||||
Based on that experience, we cleaned up rough patches
|
||||
of xv6 for the course in the fall of 2007.
|
||||
Since then, xv6 has stabilized, so we are making it
|
||||
available in the hopes that others will find it useful too.
|
||||
|
||||
6.828 uses both xv6 and Jos.
|
||||
Courses taught at UCLA, NYU, and Stanford have used
|
||||
Jos without xv6; we believe other courses could use
|
||||
xv6 without Jos, though we are not aware of any that have.
|
||||
|
||||
|
||||
* Xv6 sources
|
||||
|
||||
The latest xv6 is [xv6-rev2.tar.gz].
|
||||
We distribute the sources in electronic form but also as
|
||||
a printed booklet with line numbers that keep everyone
|
||||
together during lectures. The booklet is available as
|
||||
[xv6-rev2.pdf].
|
||||
|
||||
xv6 compiles using the GNU C compiler,
|
||||
targeted at the x86 using ELF binaries.
|
||||
On BSD and Linux systems, you can use the native compilers;
|
||||
on OS X, which doesn't use ELF binaries,
|
||||
you must use a cross-compiler.
|
||||
Xv6 does boot on real hardware, but typically
|
||||
we run it using the Bochs emulator.
|
||||
Both the GCC cross compiler and Bochs
|
||||
can be found on the [../../2007/tools.html | 6.828 tools page].
|
||||
|
||||
|
||||
* Lectures
|
||||
|
||||
In 6.828, the lectures in the first half of the course
|
||||
introduce the PC hardware, the Intel x86, and then xv6.
|
||||
The lectures in the second half consider advanced topics
|
||||
using research papers; for some, xv6 serves as a useful
|
||||
base for making discussions concrete.
|
||||
This section describes a typical 6.828 lecture schedule,
|
||||
linking to lecture notes and homework.
|
||||
A course using only xv6 (not Jos) will need to adapt
|
||||
a few of the lectures, but we hope these are a useful
|
||||
starting point.
|
||||
|
||||
|
||||
Lecture 1. Operating systems
|
||||
|
||||
The first lecture introduces both the general topic of
|
||||
operating systems and the specific approach of 6.828.
|
||||
After defining ``operating system,'' the lecture
|
||||
examines the implementation of a Unix shell
|
||||
to look at the details of the traditional Unix system call interface.
|
||||
This is relevant to both xv6 and Jos: in the final
|
||||
Jos labs, students implement a Unix-like interface
|
||||
culminating in a Unix shell.
|
||||
|
||||
[l1.html | lecture notes]
|
||||
|
||||
|
||||
Lecture 2. PC hardware and x86 programming
|
||||
|
||||
This lecture introduces the PC architecture, the 16- and 32-bit x86,
|
||||
the stack, and the GCC x86 calling conventions.
|
||||
It also introduces the pieces of a typical C tool chain--compiler,
|
||||
assembler, linker, loader--and the Bochs emulator.
|
||||
|
||||
Reading: PC Assembly Language
|
||||
|
||||
Homework: familiarize yourself with Bochs
|
||||
|
||||
[l2.html | lecture notes]
|
||||
[x86-intro.html | homework]
|
||||
|
||||
|
||||
Lecture 3. Operating system organization
|
||||
|
||||
This lecture continues Lecture 1's discussion of what
|
||||
an operating system does.
|
||||
An operating system provides a ``virtual computer''
|
||||
interface to user space programs.
|
||||
At a high level, the main job of the operating system
|
||||
is to implement that interface
|
||||
using the physical computer it runs on.
|
||||
|
||||
The lecture discusses four approaches to that job:
|
||||
monolithic operating systems, microkernels,
|
||||
virtual machines, and exokernels.
|
||||
Exokernels might not be worth mentioning
|
||||
except that the Jos labs are built around one.
|
||||
|
||||
Reading: Engler et al., Exokernel: An Operating System Architecture
|
||||
for Application-Level Resource Management
|
||||
|
||||
[l3.html | lecture notes]
|
||||
|
||||
|
||||
Lecture 4. Address spaces using segmentation
|
||||
|
||||
This is the first lecture that uses xv6.
|
||||
It introduces the idea of address spaces and the
|
||||
details of the x86 segmentation hardware.
|
||||
It makes the discussion concrete by reading the xv6
|
||||
source code and watching xv6 execute using the Bochs simulator.
|
||||
|
||||
Reading: x86 MMU handout,
|
||||
xv6: bootasm.S, bootother.S, bootmain.c, main.c, init.c, and setupsegs in proc.c.
|
||||
|
||||
Homework: Bochs stack introduction
|
||||
|
||||
[l4.html | lecture notes]
|
||||
[xv6-intro.html | homework]
|
||||
|
||||
|
||||
Lecture 5. Address spaces using page tables
|
||||
|
||||
This lecture continues the discussion of address spaces,
|
||||
examining the other x86 virtual memory mechanism: page tables.
|
||||
Xv6 does not use page tables, so there is no xv6 here.
|
||||
Instead, the lecture uses Jos as a concrete example.
|
||||
An xv6-only course might skip or shorten this discussion.
|
||||
|
||||
Reading: x86 manual excerpts
|
||||
|
||||
Homework: stuff about gdt
|
||||
XXX not appropriate; should be in Lecture 4
|
||||
|
||||
[l5.html | lecture notes]
|
||||
|
||||
|
||||
Lecture 6. Interrupts and exceptions
|
||||
|
||||
How does a user program invoke the operating system kernel?
|
||||
How does the kernel return to the user program?
|
||||
What happens when a hardware device needs attention?
|
||||
This lecture explains the answers to these questions:
|
||||
interrupt and exception handling.
|
||||
|
||||
It explains the x86 trap setup mechanisms and then
|
||||
examines their use in xv6's SETGATE (mmu.h),
|
||||
tvinit (trap.c), idtinit (trap.c), vectors.pl, and vectors.S.
|
||||
|
||||
It then traces through a call to the system call open:
|
||||
init.c, usys.S, vector48 and alltraps (vectors.S), trap (trap.c),
|
||||
syscall (syscall.c),
|
||||
sys_open (sysfile.c), fetcharg, fetchint, argint, argptr, argstr (syscall.c).
|
||||
|
||||
The interrupt controller, briefly:
|
||||
pic_init and pic_enable (picirq.c).
|
||||
The timer and keyboard, briefly:
|
||||
timer_init (timer.c), console_init (console.c).
|
||||
Enabling and disabling of interrupts.
|
||||
|
||||
Reading: x86 manual excerpts,
|
||||
xv6: trapasm.S, trap.c, syscall.c, and usys.S.
|
||||
Skim lapic.c, ioapic.c, picirq.c.
|
||||
|
||||
Homework: Explain the 35 words on the top of the
|
||||
stack at first invocation of <code>syscall</code>.
|
||||
|
||||
[l-interrupt.html | lecture notes]
|
||||
[x86-intr.html | homework]
|
||||
|
||||
|
||||
Lecture 7. Multiprocessors and locking
|
||||
|
||||
This lecture introduces the problems of
|
||||
coordination and synchronization on a
|
||||
multiprocessor
|
||||
and then the solution of mutual exclusion locks.
|
||||
Atomic instructions, test-and-set locks,
|
||||
lock granularity, (the mistake of) recursive locks.
|
||||
|
||||
Although xv6 user programs cannot share memory,
|
||||
the xv6 kernel itself is a program with multiple threads
|
||||
executing concurrently and sharing memory.
|
||||
Illustration: the xv6 scheduler's proc_table_lock (proc.c)
|
||||
and the spin lock implementation (spinlock.c).
|
||||
|
||||
Reading: xv6: spinlock.c. Skim mp.c.
|
||||
|
||||
Homework: Interaction between locking and interrupts.
|
||||
Try not disabling interrupts in the disk driver and watch xv6 break.
|
||||
|
||||
[l-lock.html | lecture notes]
|
||||
[xv6-lock.html | homework]
|
||||
|
||||
|
||||
Lecture 8. Threads, processes and context switching
|
||||
|
||||
The last lecture introduced some of the issues
|
||||
in writing threaded programs, using xv6's processes
|
||||
as an example.
|
||||
This lecture introduces the issues in implementing
|
||||
threads, continuing to use xv6 as the example.
|
||||
|
||||
The lecture defines a thread of computation as a register
|
||||
set and a stack. A process is an address space plus one
|
||||
or more threads of computation sharing that address space.
|
||||
Thus the xv6 kernel can be viewed as a single process
|
||||
with many threads (each user process) executing concurrently.
|
||||
|
||||
Illustrations: thread switching (swtch.S), scheduler (proc.c), sys_fork (sysproc.c)
|
||||
|
||||
Reading: proc.c, swtch.S, sys_fork (sysproc.c)
|
||||
|
||||
Homework: trace through stack switching.
|
||||
|
||||
[l-threads.html | lecture notes (need to be updated to use swtch)]
|
||||
[xv6-sched.html | homework]
|
||||
|
||||
|
||||
Lecture 9. Processes and coordination
|
||||
|
||||
This lecture introduces the idea of sequence coordination
|
||||
and then examines the particular solution illustrated by
|
||||
sleep and wakeup (proc.c).
|
||||
It introduces and refines a simple
|
||||
producer/consumer queue to illustrate the
|
||||
need for sleep and wakeup
|
||||
and then the sleep and wakeup
|
||||
implementations themselves.
|
||||
|
||||
Reading: proc.c, sys_exec, sys_sbrk, sys_wait, sys_kill (sysproc.c).
|
||||
|
||||
Homework: Explain how sleep and wakeup would break
|
||||
without proc_table_lock. Explain how devices would break
|
||||
without the second lock argument to sleep.
|
||||
|
||||
[l-coordination.html | lecture notes]
|
||||
[xv6-sleep.html | homework]
|
||||
|
||||
|
||||
Lecture 10. Files and disk I/O
|
||||
|
||||
This is the first of three file system lectures.
|
||||
This lecture introduces the basic file system interface
|
||||
and then considers the on-disk layout of individual files
|
||||
and the free block bitmap.
|
||||
|
||||
Reading: iread, iwrite, fileread, filewrite, wdir, mknod1, and
|
||||
code related to these calls in fs.c, bio.c, ide.c, and file.c.
|
||||
|
||||
Homework: Add a print statement to bwrite to trace every disk write.
|
||||
Explain the disk writes caused by some simple shell commands.
|
||||
|
||||
[l-fs.html | lecture notes]
|
||||
[xv6-disk.html | homework]
|
||||
|
||||
|
||||
Lecture 11. Naming
|
||||
|
||||
The last lecture discussed on-disk file system representation.
|
||||
This lecture covers the implementation of
|
||||
file system paths (namei in fs.c)
|
||||
and also discusses the security problems of a shared /tmp
|
||||
and symbolic links.
|
||||
|
||||
Understanding exec (exec.c) is left as an exercise.
|
||||
|
||||
Reading: namei in fs.c, sysfile.c, file.c.
|
||||
|
||||
Homework: Explain how to implement symbolic links in xv6.
|
||||
|
||||
[l-name.html | lecture notes]
|
||||
[xv6-names.html | homework]
|
||||
|
||||
|
||||
Lecture 12. High-performance file systems
|
||||
|
||||
This lecture is the first of the research paper-based lectures.
|
||||
It discusses the ``soft updates'' paper,
|
||||
using xv6 as a concrete example.
|
||||
|
||||
|
||||
* Feedback
|
||||
|
||||
If you are interested in using xv6 or have used xv6 in a course,
|
||||
we would love to hear from you.
|
||||
If there's anything that we can do to make xv6 easier
|
||||
to adopt, we'd like to hear about it.
|
||||
We'd also be interested to hear what worked well and what didn't.
|
||||
|
||||
Russ Cox (rsc@swtch.com)<br>
|
||||
Frans Kaashoek (kaashoek@mit.edu)<br>
|
||||
Robert Morris (rtm@mit.edu)
|
||||
|
||||
You can reach all of us at 6.828-staff@pdos.csail.mit.edu.
|
||||
|
||||
|
187
web/l-bugs.html
Normal file
|
@ -0,0 +1,187 @@
|
|||
<html>
|
||||
<head>
|
||||
<title>OS Bugs</title>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<h1>OS Bugs</h1>
|
||||
|
||||
<p>Required reading: Bugs as deviant behavior
|
||||
|
||||
<h2>Overview</h2>
|
||||
|
||||
<p>Operating systems must obey many rules for correctness and
|
||||
performance. Example rules:
|
||||
<ul>
|
||||
<li>Do not call blocking functions with interrupts disabled or a spin
|
||||
lock held
|
||||
<li>Check for NULL results
|
||||
<li>Do not allocate large stack variables
|
||||
<li>Do not re-use already-allocated memory
|
||||
<li>Check user pointers before using them in kernel mode
|
||||
<li>Release acquired locks
|
||||
</ul>
|
||||
|
||||
<p>In addition, there are standard software engineering rules, like
|
||||
using function results in consistent ways.
|
||||
|
||||
<p>These rules are typically not checked by a compiler, even though
|
||||
they could be checked by a compiler, in principle. The goal of the
|
||||
meta-level compilation project is to allow system implementors to
|
||||
write system-specific compiler extensions that check the source code
|
||||
for rule violations.
|
||||
|
||||
<p>The results are good: many new bugs found (500-1000) in Linux
|
||||
alone. The paper for today studies these bugs and attempts to draw
|
||||
lessons from them.
|
||||
|
||||
<p>Are kernel errors worse than user-level errors? That is, if we get
|
||||
the kernel correct, then we won't have system crashes?
|
||||
|
||||
<h2>Errors in JOS kernel</h2>

<p>What are unstated invariants in JOS?
<ul>
<li>Interrupts are disabled in kernel mode
<li>Only env 1 has access to the disk
<li>All registers are saved &amp; restored on context switch
<li>Application code is never executed with CPL 0
<li>Don't allocate an already-allocated physical page
<li>Propagate error messages to user applications (e.g., out of
resources)
<li>Map the pipe before the fd
<li>Unmap the fd before the pipe
<li>A spawned program should have only file descriptors 0, 1, and 2 open.
<li>Pass size sometimes in bytes and sometimes in block numbers to a
given file system function.
<li>User pointers should be run through TRUP before being used by the kernel
</ul>
<p>Could these errors have been caught by metacompilation?  Would
metacompilation have caught the pipe race condition?  (Probably not:
it happens in only one place.)

<p>How confident are you that your code is correct?  For example,
are you sure interrupts are always disabled in kernel mode?  How would
you test it?
<h2>Metacompilation</h2>

<p>A system programmer writes the rule checkers in a high-level,
state-machine language (metal).  These checkers are dynamically linked
into an extensible version of g++, xg++.  Xg++ applies the rule
checkers to every possible execution path of a function that is being
compiled.

<p>An example rule from
the <a
href="http://www.stanford.edu/~engler/exe-ccs-06.pdf">OSDI
paper</a>:
<pre>
sm check_interrupts {
   decl { unsigned } flags;
   pat enable  = { sti(); } | { restore_flags(flags); } ;
   pat disable = { cli(); };

   is_enabled: disable ==> is_disabled | enable ==> { err("double enable") };
   ...
</pre>
A more complete version found 82 errors in the Linux 2.3.99 kernel.
<p>Common mistake:
<pre>
get_free_buffer ( ... ) {
   ....
   save_flags (flags);
   cli ();
   if ((bh = sh->buffer_pool) == NULL)
      return NULL;   /* BUG: returns with interrupts still disabled */
   ....
}
</pre>
<p>(Figure 2 also lists a simple metarule.)

<p>Some checkers produce false positives, because of limitations of
both static analysis and the checkers themselves, which mostly use local
analysis.

<p>How does the <b>block</b> checker work?  The first pass is a rule
that marks functions as potentially blocking.  After processing a
function, the checker emits the function's flow graph to a file
(including annotations and functions called).  The second pass takes
the merged flow graph of all function calls and produces a file with
all functions that have a path in the control-flow graph to a blocking
function call.  For the Linux kernel this results in 3,000 functions
that could potentially call sleep.  Yet another checker, like
check_interrupts, checks whether a function calls any of the 3,000 functions
with interrupts disabled.  Etc.
<h2>This paper</h2>

<p>Writing rules is painful.  First, you have to write them.  Second,
how do you decide what to check?  Was it easy to enumerate all
the conventions for JOS?

<p>Insight: infer programmer "beliefs" from the code and cross-check
them for contradictions.  If <i>cli</i> is always followed by <i>sti</i>,
except in one case, perhaps something is wrong.  This simplifies
life because we can write generic checkers instead of checkers
that specifically check for <i>sti</i>, and perhaps we get lucky
and find other temporal ordering conventions.

<p>Do we know which case is wrong?  The 999 times <i>sti</i> is
present, or the 1 time it is absent?  (No, this method cannot figure
out what the correct sequence is, but it can flag that something is
weird, which in practice is useful.)  The method just detects
inconsistencies.

<p>Is every inconsistency an error?  No, some inconsistencies don't
indicate an error.  If a call to function <i>f</i> is often followed
by a call to function <i>g</i>, does that imply that f should always be
followed by g?  (No!)
<p>Solution: MUST beliefs and MAYBE beliefs.  MUST beliefs are
invariants that must hold; any inconsistency indicates an error.  If a
pointer is dereferenced, then the programmer MUST believe that the
pointer is pointing to something that can be dereferenced (i.e., the
pointer is definitely not zero).  MUST beliefs can be checked using
"internal inconsistencies".

<p>As an aside, can null pointer dereferences be detected at runtime?
(Sure, unmap the page at address zero.)  Why is metacompilation still
valuable?  (At runtime you will find only the null pointers that your
test code dereferenced, not all possible dereferences of null
pointers.)  An even more convincing example for metacompilation is
tracking user pointers that the kernel dereferences.  (Is this a MUST
belief?)

<p>MAYBE beliefs are invariants that are suggested by the code, but
they may be coincidences.  MAYBE beliefs are ranked by statistical
analysis, and perhaps augmented with input about function names
(e.g., alloc and free are important).  Is it computationally feasible
to check every MAYBE belief?  Could there be much noise?
<p>What errors won't this approach catch?

<h2>Paper discussion</h2>

<p>This paper is best discussed by studying every code fragment.  Most
code fragments are pieces of code from Linux distributions; these
mistakes are real!

<p>Section 3.1: what is the error?  How does metacompilation catch it?

<p>Figure 1: what is the error?  Is there one?

<p>Code fragments from 6.1: what is the error?  How does metacompilation catch it?

<p>Figure 3: what is the error?  How does metacompilation catch it?

<p>Section 8.3: what is the error?  How does metacompilation catch it?

</body>
</html>
354
web/l-coordination.html
Normal file
@ -0,0 +1,354 @@
<html>
<head>
<title>L9</title>
</head>
<body>

<h1>Coordination and more processes</h1>

<p>Required reading: remainder of proc.c, sys_exec, sys_sbrk,
sys_wait, sys_exit, and sys_kill.

<h2>Overview</h2>

<p>Big picture: more programs than processors.  How to share the
limited number of processors among the programs?  Last lecture
covered the basic mechanism: threads and the distinction between a process
and a thread.  Today we expand: how to coordinate the interactions
between threads explicitly, and some operations on processes.

<p>Sequence coordination.  This is a different type of coordination
from mutual-exclusion coordination (whose goal is to make
actions atomic so that threads don't interfere).  The goal of
sequence coordination is for threads to coordinate the sequences in
which they run.

<p>For example, a thread may want to wait until another thread
terminates.  One way to do so is to have the thread run periodically,
let it check if the other thread terminated, and if not give up the
processor again.  This is wasteful, especially if there are many
threads.

<p>With primitives for sequence coordination one can do better.  The
thread could tell the thread manager that it is waiting for an event
(e.g., another thread terminating).  When the other thread
terminates, it explicitly wakes up the waiting thread.  This is more
work for the programmer, but more efficient.

<p>Sequence coordination often interacts with mutual-exclusion
coordination, as we will see below.

<p>The operating system literature has a rich set of primitives for
sequence coordination.  We study a very simple version of condition
variables in xv6: sleep and wakeup, with a single lock.

<h2>xv6 code examples</h2>

<h3>Sleep and wakeup - usage</h3>

Let's consider implementing a producer/consumer queue
(like a pipe) that can be used to hold a single non-null char pointer:
<pre>
struct pcq {
    void *ptr;
};

void*
pcqread(struct pcq *q)
{
    void *p;

    while((p = q->ptr) == 0)
        ;
    q->ptr = 0;
    return p;
}

void
pcqwrite(struct pcq *q, void *p)
{
    while(q->ptr != 0)
        ;
    q->ptr = p;
}
</pre>
<p>Easy and correct, at least assuming there is at most one
reader and at most one writer at a time.

<p>Unfortunately, the while loops are inefficient.
Instead of polling, it would be great if there were
primitives saying ``wait for some event to happen''
and ``this event happened''.
That's what sleep and wakeup do.

<p>Second try:
<pre>
void*
pcqread(struct pcq *q)
{
    void *p;

    if(q->ptr == 0)
        sleep(q);
    p = q->ptr;
    q->ptr = 0;
    wakeup(q);  /* wake pcqwrite */
    return p;
}

void
pcqwrite(struct pcq *q, void *p)
{
    if(q->ptr != 0)
        sleep(q);
    q->ptr = p;
    wakeup(q);  /* wake pcqread */
}
</pre>
That's better, but there is still a problem.
What if the wakeup happens between the check in the if
and the call to sleep?

<p>Add locks:
<pre>
struct pcq {
    void *ptr;
    struct spinlock lock;
};

void*
pcqread(struct pcq *q)
{
    void *p;

    acquire(&q->lock);
    if(q->ptr == 0)
        sleep(q, &q->lock);
    p = q->ptr;
    q->ptr = 0;
    wakeup(q);  /* wake pcqwrite */
    release(&q->lock);
    return p;
}

void
pcqwrite(struct pcq *q, void *p)
{
    acquire(&q->lock);
    if(q->ptr != 0)
        sleep(q, &q->lock);
    q->ptr = p;
    wakeup(q);  /* wake pcqread */
    release(&q->lock);
}
</pre>
This is okay, and now safer for multiple readers and writers,
except that wakeup wakes up everyone who is asleep on chan,
not just one guy.
So some of the guys who wake up from sleep might not
be cleared to read or write from the queue.  Have to go back to looping:
<pre>
struct pcq {
    void *ptr;
    struct spinlock lock;
};

void*
pcqread(struct pcq *q)
{
    void *p;

    acquire(&q->lock);
    while(q->ptr == 0)
        sleep(q, &q->lock);
    p = q->ptr;
    q->ptr = 0;
    wakeup(q);  /* wake pcqwrite */
    release(&q->lock);
    return p;
}

void
pcqwrite(struct pcq *q, void *p)
{
    acquire(&q->lock);
    while(q->ptr != 0)
        sleep(q, &q->lock);
    q->ptr = p;
    wakeup(q);  /* wake pcqread */
    release(&q->lock);
}
</pre>
The difference between this and our original version is that
the body of the while loop is a much more efficient way to pause.

<p>Now we've figured out how to use sleep and wakeup, but we
still need to figure out how to implement them.
<h3>Sleep and wakeup - implementation</h3>
<p>
Simple implementation:

<pre>
void
sleep(void *chan, struct spinlock *lk)
{
    struct proc *p = curproc[cpu()];

    release(lk);
    p->chan = chan;
    p->state = SLEEPING;
    sched();
}

void
wakeup(void *chan)
{
    for(each proc p) {
        if(p->state == SLEEPING && p->chan == chan)
            p->state = RUNNABLE;
    }
}
</pre>
<p>What's wrong?  What if the wakeup runs right after
the release(lk) in sleep?
The wakeup still misses the sleep.

<p>Move the lock down:
<pre>
void
sleep(void *chan, struct spinlock *lk)
{
    struct proc *p = curproc[cpu()];

    p->chan = chan;
    p->state = SLEEPING;
    release(lk);
    sched();
}

void
wakeup(void *chan)
{
    for(each proc p) {
        if(p->state == SLEEPING && p->chan == chan)
            p->state = RUNNABLE;
    }
}
</pre>
<p>This almost works.  Recall from last lecture that we also need
to acquire the proc_table_lock before calling sched, to
protect p->jmpbuf.

<pre>
void
sleep(void *chan, struct spinlock *lk)
{
    struct proc *p = curproc[cpu()];

    p->chan = chan;
    p->state = SLEEPING;
    acquire(&proc_table_lock);
    release(lk);
    sched();
}
</pre>
<p>The problem is that now we're using lk to protect
access to the p->chan and p->state variables,
but other routines besides sleep and wakeup
(in particular, proc_kill) will need to use them and won't
know which lock protects them.
So instead of protecting them with lk, let's use proc_table_lock:
<pre>
void
sleep(void *chan, struct spinlock *lk)
{
    struct proc *p = curproc[cpu()];

    acquire(&proc_table_lock);
    release(lk);
    p->chan = chan;
    p->state = SLEEPING;
    sched();
}

void
wakeup(void *chan)
{
    acquire(&proc_table_lock);
    for(each proc p) {
        if(p->state == SLEEPING && p->chan == chan)
            p->state = RUNNABLE;
    }
    release(&proc_table_lock);
}
</pre>
<p>One could probably make things work with lk as above,
but the relationship between data and locks would be
more complicated, with no real benefit.  Xv6 takes the easy way out
and says that elements in the proc structure are always protected
by proc_table_lock.

<h3>Use example: exit and wait</h3>

<p>If proc_wait decides there are children to be waited for,
it calls sleep at line 2462.
When a process exits, proc_exit scans the process table
to find the parent and wakes it at 2408.

<p>Which lock protects sleep and wakeup from missing each other?
Proc_table_lock.  We have to tweak sleep again to avoid a double acquire:

<pre>
if(lk != &proc_table_lock) {
    acquire(&proc_table_lock);
    release(lk);
}
</pre>

<h3>New feature: kill</h3>

<p>Proc_kill marks a process as killed (line 2371).
When the process finally exits the kernel to user space,
or if a clock interrupt happens while it is in user space,
it will be destroyed (lines 2886, 2890, 2912).

<p>Why wait until the process ends up in user space?

<p>What if the process is stuck in sleep?  It might take a long
time to get back to user space.
We don't want to have to wait for it, so make sleep wake up early
(line 2373).

<p>This means all callers of sleep should check
whether they have been killed, but none do.
Bug in xv6.

<h3>System call handlers</h3>

<p>Sheet 32

<p>Fork: we discussed copyproc in earlier lectures.
Sys_fork (line 3218) just calls copyproc
and marks the new proc runnable.
Does fork create a new process or a new thread?
Is there any shared context?

<p>Exec: we'll talk about exec later, when we talk about file systems.

<p>Sbrk: we saw growproc earlier.  Why setupsegs before returning?
222
web/l-fs.html
Normal file
@ -0,0 +1,222 @@
<html>
<head>
<title>L10</title>
</head>
<body>

<h1>File systems</h1>

<p>Required reading: iread, iwrite, and wdir, and code related to
these calls in fs.c, bio.c, ide.c, file.c, and sysfile.c

<h2>Overview</h2>

<p>The next 3 lectures are about file systems:
<ul>
<li>Basic file system implementation
<li>Naming
<li>Performance
</ul>

<p>Users want to store their data durably, so that it survives when
the user turns off the computer.  The primary media for doing so are
magnetic disks, flash memory, and tapes.  We focus on magnetic disks
(e.g., through the IDE interface in xv6).

<p>To allow users to remember where they stored a file, they can
assign a symbolic name to a file, which appears in a directory.

<p>The data in a file can be organized in a structured way or not.
The structured variant is often called a database.  UNIX uses the
unstructured variant: files are streams of bytes.  Any particular
structure is likely to be useful to only a small class of
applications, and other applications will have to work hard to fit
their data into one of the pre-defined structures.  Besides, if you
want structure, you can easily write a user-mode library program that
imposes that format on any file.  The end-to-end argument in action.
(Databases have special requirements and support an important class of
applications, and thus have a specialized plan.)

<p>The API for a minimal file system consists of: open, read, write,
seek, close, and stat.  Dup duplicates a file descriptor.  For example:
<pre>
fd = open("x", O_RDWR);
read(fd, buf, 100);
write(fd, buf, 512);
close(fd);
</pre>
<p>Maintaining the file offset behind the read/write interface is an
interesting design decision.  The alternative is that the state of a
read operation should be maintained by the process doing the reading
(i.e., that the pointer should be passed as an argument to read).
This argument is compelling in view of the UNIX fork() semantics,
which clone a process that shares the file descriptors of its
parent.  A read by the parent on a shared file descriptor (e.g.,
stdin) changes the read pointer seen by the child.  On the other
hand, the alternative would make it difficult to get "(data; ls) > x"
right.
<p>The Unix API doesn't specify that the effects of a write are on
disk before the write returns.  That is up to the implementation
of the file system, within certain bounds.  Choices include (and they
aren't mutually exclusive):
<ul>
<li>At some point in the future, if the system stays up (e.g., after
30 seconds);
<li>Before the write returns;
<li>Before close returns;
<li>User specified (e.g., before fsync returns).
</ul>
<p>A design issue is the semantics of a file system operation that
requires multiple disk writes.  In particular, what happens if the
logical update requires writing multiple disk blocks and the power
fails during the update?  For example, creating a new file
requires allocating an inode (which requires updating the list of
free inodes on disk) and writing a directory entry to record the
allocated i-node under the name of the new file (which may require
allocating a new block and updating the directory inode).  If the
power fails during the operation, the list of free inodes and blocks
may be inconsistent with the blocks and inodes in use.  Again, it is
up to the implementation of the file system to keep on-disk data
structures consistent:
<ul>
<li>Don't worry about it much, but use a recovery program to bring the
file system back into a consistent state.
<li>Journaling file system.  Never let the file system get into an
inconsistent state.
</ul>
<p>Another design issue is the semantics of concurrent writes to
the same data item.  What is the order of two updates that happen at
the same time?  For example, two processes open the same file and write
to it.  Modern Unix operating systems allow the application to lock a
file to get exclusive access.  If file locking is not used and the
file descriptor is shared, then the bytes of the two writes will get
into the file in some order (this happens often for log files).  If
the file descriptor is not shared, the end result is not defined.  For
example, one write may overwrite the other (e.g., if they are
writing to the same part of the file).
<p>An implementation issue is performance, because writing to a magnetic
disk is relatively expensive compared to computing.  Three primary ways
to improve performance are: careful file system layout that induces
few seeks, an in-memory cache of frequently-accessed blocks, and
overlapping I/O with computation so that file operations don't have to
wait for completion and so that the disk driver has more
data to write, which allows disk scheduling.  (We will talk about
performance in detail later.)
<h2>xv6 code examples</h2>

<p>xv6 implements a minimal Unix file system interface.  xv6 doesn't
pay attention to file system layout.  It overlaps computation and I/O,
but doesn't do any disk scheduling.  Its cache is write-through, which
simplifies keeping on-disk data structures consistent, but is bad for
performance.
<p>On disk, files are represented by an inode (struct dinode in fs.h)
and blocks.  Small files have up to 12 block addresses in their inode;
large files use the last address in the inode as a disk address
for a block with 128 disk addresses (512/4).  The size of a file is
thus limited to 12 * 512 + 128 * 512 bytes.  What would you change to
support larger files?  (Ans: e.g., double indirect blocks.)
<p>Directories are files with a bit of structure to them.  The file
consists of records of type struct dirent.  Each entry contains the
name of a file (or directory) and its corresponding inode number.
How many files can appear in a directory?

<p>In memory, files are represented by struct inode in fsvar.h.  What is
the role of the additional fields in struct inode?
<p>What is xv6's disk layout?  How does xv6 keep track of free blocks
and inodes?  See balloc()/bfree() and ialloc()/ifree().  Is this
layout a good one for performance?  What are other options?

<p>Let's assume that an application created a file x that
contains 512 bytes, and that the application now calls read(fd, buf,
100); that is, it is requesting to read 100 bytes into buf.
Furthermore, let's assume that the inode for x is i.  Let's pick
up what happens by investigating readi(), line 4483.
<ul>
<li>4488-4492: can iread be called on objects other than files?  (Yes.
For example, read from the keyboard.)  Everything is a file in Unix.
<li>4495: what does bmap do?
<ul>
<li>4384: what block is being read?
</ul>
<li>4483: what does bread do?  Does bread always cause a read from disk?
<ul>
<li>4006: what does bget do?  It implements a simple cache of
recently-read disk blocks.
<ul>
<li>How big is the cache?  (see param.h)
<li>3972: look if the requested block is in the cache by walking down
a circular list.
<li>3977: we had a match.
<li>3979: some other process has "locked" the block; wait until it
releases.  The other process releases the block using brelse().
Why lock a block?
<ul>
<li>Atomic read and update.  For example, allocating an inode: read the
block containing the inode, mark it allocated, and write it back.  This
operation must be atomic.
</ul>
<li>3982: it is ours now.
<li>3987: it is not in the cache; we need to find a cache entry to
hold the block.
<li>3987: what is the cache replacement strategy?  (see also brelse())
<li>3988: found an entry that we are going to use.
<li>3989: mark it ours but don't mark it valid (there is no valid data
in the entry yet).
</ul>
<li>4007: if the block was in the cache and the entry has the block's
data, return.
<li>4010: if the block wasn't in the cache, read it from disk.  Are
reads synchronous or asynchronous?
<ul>
<li>3836: a bounded buffer of outstanding disk requests.
<li>3809: tell the disk to move the arm and generate an interrupt.
<li>3851: go to sleep and let some other process run.  Time sharing
in action.
<li>3792: interrupt: the arm is in the right position; wake up the requester.
<li>3856: read the block from the disk.
<li>3860: remove the request from the bounded buffer.  Wake up processes that
are waiting for a slot.
<li>3864: start the next disk request, if any.  xv6 can overlap I/O with
computation.
</ul>
<li>4011: mark the cache entry as holding the data.
</ul>
<li>4498: to where is the block copied?  Is dst a valid user address?
</ul>
<p>Now let's suppose that the process is writing 512 bytes at the end
of the file x.  How many disk writes will happen?
<ul>
<li>4567: allocate a new block
<ul>
<li>4518: allocate a block: scan the block map, and write the entry
<li>4523: how many disk operations if the process had been appending
to a large file?  (Answer: read the indirect block, scan the block map,
write the block map.)
</ul>
<li>4572: read the block that the process will be writing, in case the
process writes only part of the block.
<li>4574: write it.  Is it synchronous or asynchronous?  (Ans:
synchronous, but with timesharing.)
</ul>

<p>Lots of code to implement reading and writing of files.  How about
directories?
<ul>
<li>4722: look for the directory, reading directory blocks to see if a
directory entry is unused (inum == 0).
<li>4729: use it and update it.
<li>4735: write the modified block.
</ul>
<p>Reading and writing of directories is trivial.

</body>
</html>
174
web/l-interrupt.html
Normal file
@ -0,0 +1,174 @@
<html>
<head><title>Lecture 6: Interrupts &amp; Exceptions</title></head>
<body>

<h1>Interrupts &amp; Exceptions</h1>

<p>
Required reading: xv6 <code>trapasm.S</code>, <code>trap.c</code>, <code>syscall.c</code>, <code>usys.S</code>.
<br>
You will need to consult the
<a href="../readings/ia32/IA32-3.pdf">IA32 System
Programming Guide</a> chapter 5 (skip 5.7.1, 5.8.2, 5.12.2).

<h2>Overview</h2>

<p>
Big picture: the kernel is the trusted third party that runs the machine.
Only the kernel can execute privileged instructions (e.g.,
changing MMU state).
The processor enforces this protection through the ring bits
in the code segment.
If a user application needs to carry out a privileged operation
or other kernel-only service,
it must ask the kernel nicely.
How can a user program change to the kernel address space?
How can the kernel transfer to a user address space?
What happens when a device attached to the computer
needs attention?
These are the topics for today's lecture.

<p>
There are three kinds of events that must be handled
by the kernel, not user programs:
(1) a system call invoked by a user program,
(2) an illegal instruction or other kind of bad processor state (memory fault, etc.),
and
(3) an interrupt from a hardware device.
<p>
Although these three events are different, they all use the same
mechanism to transfer control to the kernel.
This mechanism consists of three steps that execute as one atomic unit:
(a) change the processor to kernel mode;
(b) save the old processor state somewhere (usually the kernel stack);
and (c) change the processor state to the values set up as
the “official kernel entry values.”
The exact implementation of this mechanism differs
from processor to processor, but the idea is the same.

<p>
We'll work through examples of these today in lecture.
You'll see all three in great detail in the labs as well.
<p>
A note on terminology: sometimes we'll
use interrupt (or trap) to mean both interrupts and exceptions.

<h2>
Setting up traps on the x86
</h2>

<p>
See handout Table 5-1, Figure 5-1, Figure 5-2.

<p>
xv6 Sheet 07: <code>struct gatedesc</code> and <code>SETGATE</code>.

<p>
xv6 Sheet 28: <code>tvinit</code> and <code>idtinit</code>.
Note the setting of the gate for <code>T_SYSCALL</code>.

<p>
xv6 Sheet 29: <code>vectors.pl</code> (also see the generated <code>vectors.S</code>).
<h2>
System calls
</h2>

<p>
xv6 Sheet 16: <code>init.c</code> calls <code>open("console")</code>.
How is that implemented?

<p>
xv6 <code>usys.S</code> (not in book).
(No saving of registers.  Why?)

<p>
Breakpoint <code>0x1b:"open"</code>,
step past the <code>int</code> instruction into the kernel.

<p>
See handout Figure 9-4 [sic].

<p>
xv6 Sheet 28: in <code>vectors.S</code> briefly, then in <code>alltraps</code>.
Step through to <code>call trap</code>, examine registers and stack.
How will the kernel find the argument to <code>open</code>?

<p>
xv6 Sheet 29: <code>trap</code>, on to <code>syscall</code>.

<p>
xv6 Sheet 31: <code>syscall</code> looks at <code>eax</code> and
calls <code>sys_open</code>.

<p>
(Briefly)
xv6 Sheet 52: <code>sys_open</code> uses <code>argstr</code> and <code>argint</code>
to get its arguments.  How do they work?

<p>
xv6 Sheet 30: <code>fetchint</code>, <code>fetcharg</code>, <code>argint</code>,
<code>argptr</code>, <code>argstr</code>.
<p>
What happens if a user program divides by zero
or accesses unmapped memory?
Exception.  Same path as a system call until <code>trap</code>.

<p>
What happens if the kernel divides by zero or accesses unmapped memory?

<h2>
Interrupts
</h2>

<p>
Like system calls, except:
devices generate them at any time,
there are no arguments in CPU registers,
there is nothing to return to,
and you usually can't ignore them.

<p>
How do they get generated?
The device essentially phones up the
interrupt controller and asks to talk to the CPU.
The interrupt controller then buzzes the CPU and
tells it, “keyboard on line 1.”
The interrupt controller is essentially the CPU's
<strike>secretary</strike> administrative assistant,
managing the phone lines on the CPU's behalf.

<p>
We have to set up the interrupt controller.

<p>
(Briefly) xv6 Sheet 63: <code>pic_init</code> sets up the interrupt controller,
and <code>irq_enable</code> tells the interrupt controller to let the given
interrupt through.

<p>
(Briefly) xv6 Sheet 68: <code>pit8253_init</code> sets up the clock chip,
telling it to interrupt on <code>IRQ_TIMER</code> 100 times/second.
<code>console_init</code> sets up the keyboard, enabling <code>IRQ_KBD</code>.

<p>
In Bochs, set a breakpoint at 0x8:"vector0"
and continue, loading the kernel.
Step through the clock interrupt, look at the
stack and registers.

<p>
Was the processor executing in kernel or user mode
at the time of the clock interrupt?
Why?  (Have any user-space instructions executed at all?)

<p>
Can the kernel get an interrupt at any time?
Why or why not?  <code>cli</code> and <code>sti</code>,
<code>irq_enable</code>.

</body>
</html>
322
web/l-lock.html
Normal file
@ -0,0 +1,322 @@
<title>L7</title>
<html>
<head>
</head>
<body>

<h1>Locking</h1>

<p>Required reading: spinlock.c

<h2>Why coordinate?</h2>

<p>Mutual-exclusion coordination is an important topic in operating
systems, because many operating systems run on
multiprocessors. Coordination techniques protect variables that are
shared among multiple threads and updated concurrently. These
techniques allow programmers to implement atomic sections so that one
thread can safely update the shared variables without having to worry
about another thread intervening. For example, processes in xv6 may
run concurrently on different processors and, in kernel mode, share
kernel data structures. We must ensure that these updates happen
correctly.

<p>List and insert example:
<pre>
struct List {
  int data;
  struct List *next;
};

List *list = 0;

insert(int data) {
  List *l = new List;
  l->data = data;
  l->next = list;  // A
  list = l;        // B
}
</pre>

<p>What needs to be atomic? The two statements labeled A and B should
always be executed together, as an indivisible fragment of code. If
two processors execute A and B interleaved, then we end up with an
incorrect list. To see that this is the case, draw out the list after
the sequence A1 (statement A executed by processor 1), A2 (statement A
executed by processor 2), B2, and B1.

<p>How could this erroneous sequence happen? The variable <i>list</i>
lives in physical memory shared among multiple processors, connected
by a bus. The accesses to the shared memory will be ordered in some
total order by the bus/memory system. If the programmer doesn't
coordinate the execution of the statements A and B, any order can
happen, including the erroneous one.

<p>The erroneous case is called a race condition. The problem with
races is that they are difficult to reproduce. For example, if you
put print statements in to debug the incorrect behavior, you might
change the timing and the race might not happen anymore.

<h2>Atomic instructions</h2>

<p>The programmer must be able to express that A and B should be executed
as a single atomic unit. We generally use a concept like locks
to mark an atomic region, acquiring the lock at the beginning of the
section and releasing it at the end:

<pre>
void acquire(int *lock) {
  while (TSL(lock) != 0)
    ;
}

void release(int *lock) {
  *lock = 0;
}
</pre>
<p>Acquire and release, of course, need to be atomic too, which can,
for example, be done with a hardware atomic TSL (test-and-set-lock)
instruction.

<p>The semantics of TSL are:
<pre>
R <- [mem]   // load content of mem into register R
[mem] <- 1   // store 1 in mem
</pre>

<p>In a hardware implementation, the bus arbiter guarantees that both
the load and the store are executed without any other loads/stores coming
in between.

<p>We can use locks to implement an atomic insert, or we can use
TSL directly:
<pre>
int insert_lock = 0;

insert(int data) {
  /* acquire the lock: */
  while(TSL(&insert_lock) != 0)
    ;

  /* critical section: */
  List *l = new List;
  l->data = data;
  l->next = list;
  list = l;

  /* release the lock: */
  insert_lock = 0;
}
</pre>

<p>It is the programmer's job to make sure that locks are respected. If
a programmer writes another function that manipulates the list, the
programmer must make sure that the new function acquires and
releases the appropriate locks. If the programmer doesn't, race
conditions occur.

<p>This code assumes that stores commit to memory in program order and
that all stores by other processors started before insert got the lock
are observable by this processor. That is, after the other processor
releases a lock, all its previous stores are committed to memory. If
a processor executes instructions out of order, this assumption won't
hold and we must, for example, insert a barrier instruction that makes the
assumption true.

<h2>Example: Locking on x86</h2>

<p>Here is one way we can implement acquire and release using the x86
xchgl instruction:

<pre>
struct Lock {
  unsigned int locked;
};

acquire(Lock *lck) {
  while(TSL(&(lck->locked)) != 0)
    ;
}

release(Lock *lck) {
  lck->locked = 0;
}

int
TSL(int *addr)
{
  register int content = 1;
  // xchgl content, *addr
  // xchgl exchanges the values of its two operands, while
  // locking the memory bus to exclude other operations.
  asm volatile ("xchgl %0,%1" :
                "=r" (content),
                "=m" (*addr) :
                "0" (content),
                "m" (*addr));
  return(content);
}
</pre>
<p>The instruction "XCHG %eax, (content)" works as follows:
<ol>
<li> freeze other CPUs' memory activity
<li> temp := content
<li> content := %eax
<li> %eax := temp
<li> un-freeze other CPUs
</ol>

<p>Steps 1 and 5 make XCHG special: it is "locked," using special signal
lines on the inter-CPU bus for bus arbitration.

<p>This implementation doesn't scale to a large number of processors;
in a later lecture we will see how we could do better.

<h2>Lock granularity</h2>

<p>Acquire/release is ideal for short atomic sections: incrementing a
counter, searching the i-node cache, allocating a free buffer.

<p>What are spin locks not so great for? Long atomic sections may
waste waiters' CPU time, and it is unwise to sleep while holding locks. In
xv6 we try to avoid long atomic sections by careful coding (can
you find an example?). xv6 doesn't release the processor when
holding a lock, but has an additional set of coordination primitives
(sleep and wakeup), which we will study later.

<p>My list_lock protects all lists; inserts to different lists are
blocked. A lock per list would waste less time spinning, so you might
want "fine-grained" locks, one for every object. BUT acquire/release
are expensive (500 cycles on my 3 GHz machine) because they need to
talk off-chip.

<p>Also, "correctness" is not that simple with fine-grained locks if you
need to maintain global invariants; e.g., "every buffer must be on
exactly one of the free list and the device list". Per-list locks are
irrelevant for this invariant. So you might want "large-grained" locks,
which reduce overhead but also reduce concurrency.
<p>This tension is hard to get right. One often starts out with
"large-grained locks" and measures the performance of the system on
some workloads. When more concurrency is desired (to get better
performance), an implementor may switch to a more fine-grained
scheme. Operating system designers fiddle with this all the time.

<h2>Recursive locks and modularity</h2>

<p>When designing a system we desire clean abstractions and good
modularity. We like a caller not to have to know how a callee
implements a particular function. Locks make achieving modularity
more complicated. For example, what to do when the caller holds a
lock, then calls a function, which also needs the lock to perform
its job?

<p>There are no transparent solutions that allow the caller and callee
to be unaware of which locks they use. One transparent, but
unsatisfactory, option is recursive locks: if a callee asks for a
lock that its caller holds, then we allow the callee to proceed.
Unfortunately, this solution is not ideal either.

<p>Consider the following. If lock x protects the internals of some
struct foo, then if the caller acquires lock x, it knows that the
internals of foo are in a sane state and it can fiddle with them.
And then the caller must restore them to a sane state before releasing
lock x, but until then anything goes.

<p>This assumption doesn't hold with recursive locking. After
acquiring lock x, the acquirer knows only that either it is the first to
get this lock, in which case the internals are in a sane state, or
maybe some caller holds the lock, has messed up the internals, and
didn't realize when calling the callee that it was going to try to
look at them too. So the fact that a function acquired lock x
doesn't guarantee anything at all. In short, locks protect against
callers and callees just as much as they protect against other
threads.

<p>Since transparent solutions aren't ideal, it is better to consider
locks part of the function specification. The programmer must
arrange that a caller doesn't invoke another function while holding
a lock that the callee also needs.

<h2>Locking in xv6</h2>

<p>xv6 runs on a multiprocessor and is programmed to allow multiple
threads of computation to run concurrently. In xv6 an interrupt might
run on one processor and a process in kernel mode may run on another
processor, sharing a kernel data structure with the interrupt routine.
xv6 uses locks, implemented using an atomic instruction, to coordinate
concurrent activities.

<p>Let's check out why xv6 needs locks by following what happens when
we start a second processor:
<ul>
<li>1516: mp_init (called from main0)
<li>1606: mp_startthem (called from main0)
<li>1302: mpmain
<li>2208: scheduler.
<br>Now we have several processors invoking the scheduler
function. xv6 had better ensure that multiple processors don't run the
same process! Does it?
<br>Yes: if multiple schedulers run concurrently, only one will
acquire proc_table_lock and proceed looking for a runnable
process. If it finds a process, it will mark it running, longjmp to
it, and the process will release proc_table_lock. The next instance
of scheduler will skip this entry, because it is marked running, and
look for another runnable process.
</ul>

<p>Why hold proc_table_lock during a context switch? It protects
p->state; the process has to hold some lock to avoid a race with
wakeup() and yield(), as we will see in the next lectures.

<p>Why not a lock per proc entry? It might be expensive in whole-table
scans (in wait, wakeup, and scheduler). proc_table_lock also
protects some larger invariants; for example, it might be hard to get
proc_wait() right with just per-entry locks. Right now the check to
see if there are any exited children and the sleep are atomic -- but
that would be hard with per-entry locks. One could have both, but
that would probably be neither clean nor fast.

<p>Of course, there is only one processor searching the proc table if
acquire is implemented correctly. Let's check out acquire in
spinlock.c:
<ul>
<li>1807: no recursive locks!
<li>1811: why disable interrupts on the current processor? (If
interrupt code itself tries to take a held lock, xv6 will deadlock;
the panic will fire on 1808.)
<ul>
<li>can a process on a processor hold multiple locks?
</ul>
<li>1814: the (hopefully) atomic instruction.
<ul>
<li>see sheet 4, line 0468.
</ul>
<li>1819: make sure that stores issued on other processors before we
got the lock are observed by this processor. These may be stores to
the shared data structure that is protected by the lock.
</ul>

<h2>Locking in JOS</h2>

<p>JOS is meant to run on single-CPU machines, and the plan can be
simple. The simple plan is disabling/enabling interrupts in the
kernel (the IF flag in the EFLAGS register). Thus, in the kernel,
threads release the processor only when they want to and can ensure
that they don't release the processor during a critical section.

<p>In user mode, JOS runs with interrupts enabled, but Unix user
applications don't share data structures. The data structures that
must be protected, however, are the ones shared in the library
operating system (e.g., pipes). In JOS we will use special-case
solutions, as you will find out in lab 6. For example, to implement a
pipe we will assume there is one reader and one writer. The reader
and writer never update each other's variables; they only read each
other's variables. By carefully programming using this rule we can
avoid races.
262
web/l-mkernel.html
Normal file
@ -0,0 +1,262 @@
<title>Microkernel lecture</title>
<html>
<head>
</head>
<body>

<h1>Microkernels</h1>

<p>Required reading: Improving IPC by kernel design

<h2>Overview</h2>

<p>This lecture looks at the microkernel organization. In a
microkernel, services that a monolithic kernel implements in the
kernel run as user-level programs. For example, the file
system, UNIX process management, pager, and network protocols each run
in a separate user-level address space. The microkernel itself
supports only the services that are necessary to allow system services
to run well in user space; a typical microkernel has at least support
for creating address spaces, threads, and interprocess communication.

<p>The potential advantages of a microkernel are simplicity of the
kernel (small), isolation of operating system components (each runs in
its own user-level address space), and flexibility (we can have both a file
server and a database server). One potential disadvantage is
performance loss, because what in a monolithic kernel requires a
single system call may require in a microkernel multiple system calls
and context switches.

<p>One way in which microkernels differ from each other is the exact
kernel API they implement. For example, Mach (a system developed at
CMU, which influenced a number of commercial operating systems) has
the following system calls: processes (create, terminate, suspend,
resume, priority, assign, info, threads), threads (fork, exit, join,
detach, yield, self), ports and messages (a port is a unidirectional
communication channel with a message queue and supporting primitives
to send, destroy, etc.), and regions/memory objects (allocate,
deallocate, map, copy, inherit, read, write).

<p>Some microkernels are more "microkernel" than others. For example,
some microkernels implement the pager in user space but the basic
virtual memory abstractions in the kernel (e.g., Mach); others are
more extreme, and implement most of the virtual memory in user space
(L4). Yet others are less extreme: many servers run in their own
address space, but in kernel mode (Chorus).

<p>All microkernels support multiple threads per address space. xv6
and, until recently, Unix didn't; why? Because in Unix, system services
are typically implemented in the kernel, and those are the primary
programs that need multiple threads to handle events concurrently
(waiting for disk and processing new I/O requests). In microkernels,
these services are implemented in user-level address spaces, and so
they need a mechanism to deal with handling operations concurrently.
(Of course, one can argue that if fork is efficient enough, there is
no need to have threads.)

<h2>L3/L4</h2>

<p>L3 is a predecessor to L4. L3 provides data persistence, DOS
emulation, and an ELAN runtime system. L4 is a reimplementation of L3,
but without the data persistence. L4KA is a project at
sourceforge.net, and you can download the code for the latest
incarnation of L4 from there.

<p>L4 is a "second-generation" microkernel, with 7 calls: IPC (of
which there are several types), id_nearest (find a thread with an ID
close to the given ID), fpage_unmap (unmap pages; mapping is done as a
side-effect of IPC), thread_switch (hand the processor to the specified
thread), lthread_ex_regs (manipulate thread registers),
thread_schedule (set scheduling policies), and task_new (create a new
address space with some default number of threads). These calls
provide address spaces, tasks, threads, interprocess communication,
and unique identifiers. An address space is a set of mappings.
Multiple threads may share mappings, and a thread may grant mappings to
another thread (through IPC). A task is the set of threads sharing an
address space.

<p>A thread is the execution abstraction; it belongs to an address
space and has a UID, a register set, a page fault handler, and an
exception handler. The UID of a thread is its task number plus the
number of the thread within that task.

<p>IPC passes data by value or by reference to another address space.
It also provides for sequence coordination. It is used for
communication between clients and servers, to pass interrupts to a
user-level exception handler, and to pass page faults to an external
pager. In L4, device drivers are implemented as user-level
processes with the device mapped into their address space.
Linux runs as a user-level process.

<p>L4 provides quite a range of message types: inline-by-value,
strings, and virtual memory mappings. The send and receive descriptors
specify how many of each, if any.

<p>In addition, there is a system call for timeouts and controlling
thread scheduling.

<h2>L3/L4 paper discussion</h2>

<ul>

<li>This paper is about performance. What is a microsecond? Is 100
usec bad? Is 5 usec so much better that we care? How many instructions
does a 50-MHz x86 execute in 100 usec? What can we compute with that
number of instructions? How many disk operations in that time? How
many interrupts can we take? (The livelock paper, which we cover in a
few lectures, mentions 5,000 network pkts per second, and each packet
generates two interrupts.)

<li>In performance calculations, what is the appropriate/better metric?
Microseconds or cycles?

<li>Goal: improve IPC performance by a factor of 10 by careful kernel
design that is fully aware of the hardware it is running on.
Principle: performance rules! Optimize for the common case. Because
in L3 interrupts are propagated to user level using IPC, the system
may have to be able to support many IPCs per second (as many as the
device can generate interrupts).

<li>IPC consists of transferring control and transferring data. The
minimal cost for transferring control is 127 cycles, plus 45 cycles for
TLB misses (see table 3). What are the x86 instructions to enter and
leave the kernel? (int, iret) Why do they consume so much time?
(They flush the pipeline.) Do modern processors perform these operations
more efficiently? Worse now: faster processors are optimized for
straight-line code; traps/exceptions flush a deeper pipeline, and cache
misses cost more cycles.

<li>What are the 5 TLB misses? 1) B's thread control block; loading %cr3
flushes the TLB, so 2) kernel text causes a miss; iret accesses both
3) the stack and 4+5) user text -- two pages when B's user code looks at
the message.

<li>Interface:
<ul>
<li>call (threadID, send-message, receive-message, timeout);
<li>reply_and_receive (reply-message, receive-message, timeout);
</ul>

<li>Optimizations:
<ul>

<li>New system call: reply_and_receive. Effect: 2 system calls per
RPC.

<li>Complex messages: direct strings, indirect strings, and memory
objects.

<li>Direct transfer by temporary mapping through a communication
window. The communication window is mapped in B's address space and in
A's kernel address space; why is this better than just mapping a page
shared between A's and B's address spaces? 1) Multi-level security:
sharing makes it hard to reason about information flow; 2) the receiver
can't check message legality (it might change after the check); 3) when
a server has many clients, it could run out of virtual address space,
and the shared memory region would have to be established ahead of
time; 4) it is not application friendly, since the data may already be
at another address, i.e., applications would have to copy
anyway--possibly more copies.

<li>Why not use the following approach: map the region copy-on-write
(or read-only) in A's address space after send and read-only in B's
address space? Now B may have to copy data or cannot receive data in
its final destination.

<li>On the x86 this is implemented by copying B's PDE into A's address
space. Why two PDEs? (The maximum message size is 4 Meg, so the
transfer is guaranteed to work if the message starts in the bottom
4 Mbyte of an 8 Mbyte mapped region.) Why not just copy PTEs? That
would be much more expensive.

<li>What does it mean for the TLB to be "window clean"? Why do we
care? It means the TLB contains no mappings within the communication
window. We care because mapping is cheap (copy a PDE), but invalidation
is not; the x86 only lets you invalidate one page at a time, or the
whole TLB. Does TLB invalidation of the communication window turn out
to be a problem? Not usually, because we have to load %cr3 during IPC
anyway.

<li>Thread control block: registers, links to various doubly-linked
lists, pgdir, uid, etc. The lower part of a thread UID contains the
TCB number. One can also deduce the TCB address from the stack by
taking SP AND bitmask (the SP comes out of the TSS when just switching
to kernel).

<li>The kernel stack is on the same page as the TCB. Why? 1) It
minimizes TLB misses (since accessing the kernel stack will bring in
the TCB); 2) it allows very efficient access to the TCB -- just mask
off the lower 12 bits of %esp; 3) with VM, one can use the lower 32
bits of the thread id to indicate which TCB; using one page per TCB
means there is no need to check if a thread is swapped out (simply
don't map that TCB if it shouldn't be accessed).

<li>Invariant on queues: queues always hold in-memory TCBs.

<li>Wakeup queue: a set of 8 unordered wakeup lists (wakeup time mod 8),
and a smart representation of time so that 32-bit integers can be used
in the common case (base + offset in msec; bump the base and recompute
all offsets every ~4 hours; the maximum timeout is ~24 days, 2^31 msec).

<li>What is the problem addressed by lazy scheduling?
The conventional approach to scheduling:
<pre>
A sends message to B:
  Move A from ready queue to waiting queue
  Move B from waiting queue to ready queue
This requires 58 cycles, including 4 TLB misses. What are the TLB misses?
  One each for head of ready and waiting queues
  One each for previous queue element during the remove
</pre>

<li> Lazy scheduling:
<pre>
Ready queue must contain all ready threads except current one
  Might contain other threads that aren't actually ready, though
Each wakeup queue contains all threads waiting in that queue
  Again, might contain other threads, too
Scheduler removes inappropriate queue entries when scanning queue
</pre>

<li>Why does this help performance? There are only three situations in
which a thread gives up the CPU but stays ready: the send syscall (as
opposed to call), preemption, and hardware interrupts. So very often
one can IPC into a thread while not putting it on the ready list.

<li>Direct process switch. This section just says you should use
kernel threads instead of continuations.

<li>Short messages via registers.

<li>Avoiding unnecessary copies. Basically one can send and receive
messages with the same vector. This makes forwarding efficient, which
is important for the Clans/Chiefs model.

<li>Segment register optimization. Loading segment registers is
slow: the CPU has to access the GDT, etc. But the common case is that
users don't change their segment registers. Observation: it is faster
to check a segment descriptor than to load it. So just check that the
segment registers are okay; only load them if user code changed them.

<li>Registers for parameter passing wherever possible: system calls
and IPC.

<li>Minimizing TLB misses. Try to cram as many things as possible onto
the same page: IPC kernel code, GDT, IDT, TSS, all on the same page.
Actually maybe the whole tables can't fit, but put the important parts
of the tables on the same page (maybe the beginning of the TSS, IDT,
or GDT only?).

<li>Coding tricks: short offsets, avoid jumps, avoid checks, pack
often-used data on the same cache lines, lazily save/restore CPU state
like debug and FPU registers. Much of the kernel is written in
assembly!

<li>What are the results? Figures 7 and 8 look good.

<li>Is fast IPC enough to get good overall system performance? This
paper doesn't make a statement either way; we have to read their 1997
paper to find the answer to that question.

<li>Is the principle of optimizing for performance right? In general,
it is wrong to optimize for performance; other things matter more. Is
IPC the one exception? Maybe, perhaps not. Was Liedtke fighting a
losing battle against CPU makers? Should fast IPC be a hardware issue,
or just an OS issue?

</ul>
</body>
181
web/l-name.html
Normal file
@ -0,0 +1,181 @@
<title>L11</title>
<html>
<head>
</head>
<body>

<h1>Naming in file systems</h1>

<p>Required reading: namei(), and all other file system code.

<h2>Overview</h2>

<p>To help users remember where they stored their data, most
systems allow users to assign their own names to their data.
Typically the data is organized in files and users assign names to
files. To deal with many files, users can organize their files in
directories, in a hierarchical manner. Each name is a pathname, with
the components separated by "/".

<p>To avoid users having to type long absolute names (i.e., names
starting with "/" in Unix), users can change their working directory
and use relative names (i.e., names that don't start with "/").

<p>User file namespace operations include create, mkdir, mv, ln
(link), unlink, and chdir. (How is "mv a b" implemented in xv6?
Answer: "link a b"; "unlink a".) To be able to name the current
directory and the parent directory, every directory includes two
entries, "." and "..". Files and directories can be reclaimed if users
cannot name them anymore (i.e., after the last unlink).

<p>Recall from last lecture that all directory entries contain a name,
followed by an inode number. The inode number names an inode of the
file system. How can we merge file systems from different disks into
a single name space?

<p>A user grafts new file systems onto a name space using mount. Umount
removes a file system from the name space. (In DOS, a file system is
named by its device letter.) Mount takes the root inode of the
to-be-mounted file system and grafts it onto the inode of the name space
entry where the file system is mounted (e.g., /mnt/disk1). The
in-memory inode of /mnt/disk1 records the major and minor number of
the file system mounted on it. When namei sees an inode on which a
file system is mounted, it looks up the root inode of the mounted file
system, and proceeds with that inode.

<p>Mount is not a durable operation; it doesn't survive power failures.
After a power failure, the system administrator must remount the file
system (i.e., often in a startup script that is run from init).

<p>Links are convenient, because with them users can create synonyms for
file names. But they create the potential of introducing cycles in
the naming tree. For example, consider link("a/b/c", "a"). This
makes c a synonym for a. This cycle can complicate matters; for
example:
<ul>
<li>If a user subsequently calls unlink("a"), then the user cannot
name the directory "b" and the link "c" anymore, but how can the
file system decide that?
</ul>

<p>This problem can be solved by detecting cycles. The reclamation
problem can be solved by computing which files are reachable from "/"
and reclaiming all the ones that aren't reachable. Unix takes a simpler
approach: avoid cycles by disallowing users to create links for
directories. If there are no cycles, then reference counts can be
used to see if a file is still referenced. The inode maintains a
field for counting references (nlink in xv6's dinode). link
increases the reference count, and unlink decreases the count; if
the count reaches zero the inode and disk blocks can be reclaimed.
|
||||
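The reference-counting rule above can be sketched in C. This is a toy
model, not xv6's actual code: the field `reclaimed` and the functions
`do_link`/`do_unlink` are illustrative stand-ins for freeing the inode
and its disk blocks.

```c
#include <assert.h>

/* Toy model of nlink reference counting; `reclaimed` stands in for
   freeing the inode and its disk blocks. Not xv6's actual code. */
struct dinode {
    short nlink;      /* number of directory entries naming this inode */
    int   reclaimed;
};

void do_link(struct dinode *ip)   { ip->nlink++; }

void do_unlink(struct dinode *ip) {
    if (--ip->nlink == 0)
        ip->reclaimed = 1;   /* last name gone: reclaim */
}
```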
|
||||
<p>How to handle links across file systems (i.e., from one mounted
file system to another)? Since inode numbers are not unique across
file systems, we cannot create a hard link across file systems; the
directory entry contains only an inode number, not an inode number
plus the name of the disk on which the inode is located. To handle
this case, Unix provides a second type of link, called a soft (or
symbolic) link.
|
||||
|
||||
<p>Soft links are a special file type (e.g., T_SYMLINK). If namei
encounters an inode of type T_SYMLINK, it resolves the name stored in
the symlink file to an inode and continues from there. With symlinks
one can create cycles, and symlinks can point to non-existent files.
|
||||
|
||||
<p>The design of the name system can have security implications. For
example, if you test whether a name exists and then use that name, an
adversary can change the binding from name to object between the test
and the use. Such problems are called TOCTTOU
(time-of-check-to-time-of-use) bugs.
|
||||
|
||||
<p>An example of a TOCTTOU race follows. Say root runs a script every
night to remove files in /tmp. This gets rid of files that editors
might have left behind but that will never be used again. An
adversary can exploit this script as follows:
|
||||
<pre>
Root                           Attacker
                               mkdir("/tmp/etc")
                               creat("/tmp/etc/passwd")
readdir("/tmp");
lstat("/tmp/etc");
readdir("/tmp/etc");
                               rename("/tmp/etc", "/tmp/x");
                               symlink("etc", "/tmp/etc");
unlink("/tmp/etc/passwd");
</pre>
|
||||
Lstat checks that /tmp/etc is not a symbolic link, but by the time
root runs unlink the attacker has had time to put a symbolic link in
the place of /tmp/etc, pointing to a password file of the adversary's
choice.
|
||||
|
||||
<p>This problem could have been avoided if every user or process group
|
||||
had its own private /tmp, or if access to the shared one was
|
||||
mediated.
|
||||
|
||||
<h2>V6 code examples</h2>
|
||||
|
||||
<p> namei (sheet 46) is the core of the Unix naming system. namei can
be called in several ways: NAMEI_LOOKUP (resolve a name to an inode
and lock the inode), NAMEI_CREATE (resolve a name, but lock the
parent inode), and NAMEI_DELETE (resolve a name, lock the parent
inode, and return the offset in the directory). The reason namei is
complicated is that we want to atomically test whether a name exists
and remove or create it if it does; otherwise, two concurrent
processes could interfere with each other and the directory could end
up in an inconsistent state.
|
||||
|
||||
<p>Let's trace open("a", O_RDWR), focusing on namei:
<ul>
<li>5263: we will look at creating a file in a bit.
<li>5277: call namei with NAMEI_LOOKUP.
<li>4629: if the path name starts with "/", look up the root inode (1).
<li>4632: otherwise, use the inode for the current working directory.
<li>4638: consume a run of "/"s, for example in "/////a////b".
<li>4641: if we are done with NAMEI_LOOKUP, return the inode (e.g.,
namei("/")).
<li>4652: if the inode in which we are searching for the name isn't a
directory, give up.
<li>4657-4661: determine the length of the current component of the
pathname we are resolving.
<li>4663-4681: scan the directory for the component.
<li>4682-4696: the entry wasn't found. If we are at the end of the
pathname and NAMEI_CREATE is set, lock the parent directory and
return a pointer to the start of the component. In all other cases,
unlock the directory's inode and return 0.
<li>4701: if NAMEI_DELETE is set, return the locked parent inode and
the offset of the to-be-deleted component in the directory.
<li>4707: look up the inode of the component, and go to the top of
the loop.
</ul>
|
||||
|
||||
<p>Now let's look at creating a file in a directory:
<ul>
<li>5264: if the last component doesn't exist, but the first part of
the pathname resolved to a directory, then dp will be 0, last will
point to the beginning of the last component, and ip will be the
locked parent directory.
<li>5266: create an entry for last in the directory.
<li>4772: mknod1 allocates a new named inode and adds it to an
existing directory.
<li>4776: ialloc: scan the inode blocks, find an unused entry, and
write it (if lucky, 1 read and 1 write).
<li>4784: fill out the inode entry and write it (another write).
<li>4786: write the entry into the directory (if lucky, 1 write).
</ul>
|
||||
|
||||
</ul>
|
||||
Why must the parent directory be locked? If two processes try to
create the same name in the same directory, only one should succeed
and the other should receive an error (file exists).
|
||||
|
||||
<p>Link, unlink, chdir, mount, and umount could have taken file
descriptors instead of their path arguments. In fact, this would get
rid of some possible race conditions (some of which have security
implications, e.g., TOCTTOU). However, this would require that the
current working directory be remembered by the process, and Unix
didn't have good ways of maintaining static state shared among all
processes belonging to a given user. The easiest way to create
shared state is to place it in the kernel.
|
||||
|
||||
<p>We have one piece of code in xv6 that we haven't studied: exec.
With all the groundwork we have done, this code can be easily
understood (see sheet 54).
|
||||
|
||||
</body>
|
249
web/l-okws.txt
Normal file
|
@ -0,0 +1,249 @@
|
|||
|
||||
Security
|
||||
-------------------
|
||||
I. 2 Intro Examples
|
||||
II. Security Overview
|
||||
III. Server Security: Offense + Defense
|
||||
IV. Unix Security + POLP
|
||||
V. Example: OKWS
|
||||
VI. How to Build a Website
|
||||
|
||||
I. Intro Examples
|
||||
--------------------
|
||||
1. Apache + OpenSSL 0.9.6a (CAN 2002-0656)
|
||||
- SSL = More security!
|
||||
|
||||
unsigned int j;
|
||||
p=(unsigned char *)s->init_buf->data;
|
||||
j= *(p++);
|
||||
s->session->session_id_length=j;
|
||||
memcpy(s->session->session_id,p,j);
|
||||
|
||||
- the result: an Apache worm
|
||||
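The bug above is a missing bounds check: the length byte j comes
straight from the attacker's handshake message, while session_id is a
fixed-size buffer. A hedged, self-contained reconstruction of the
pattern and its fix; the struct, constant, and function names here are
illustrative, not OpenSSL's actual definitions.

```c
/* Toy reconstruction of the CVE pattern: a length byte read from
   untrusted input must be checked against the destination size. */
#include <assert.h>
#include <string.h>

#define SESSION_ID_MAX 32   /* fixed-size buffer, as in SSL_SESSION */

struct session {
    unsigned char id[SESSION_ID_MAX];
    unsigned int  id_length;
};

/* Returns 0 on success, -1 if the claimed length exceeds the buffer. */
int copy_session_id(struct session *s, const unsigned char *msg) {
    unsigned int j = *msg++;     /* attacker-controlled length byte */
    if (j > SESSION_ID_MAX)      /* the check the buggy code lacked */
        return -1;
    s->id_length = j;
    memcpy(s->id, msg, j);
    return 0;
}
```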
|
||||
2. SparkNotes.com 2000:
|
||||
- New profile feature that displays "public" information about users,
but a bug made e-mail addresses "public" by default.
|
||||
- New program for getting that data:
|
||||
|
||||
http://www.sparknotes.com/getprofile.cgi?id=1343
|
||||
|
||||
II. Security Overview
|
||||
----------------------
|
||||
|
||||
What Is Security?
|
||||
- Protecting your system from attack.
|
||||
|
||||
What's an attack?
|
||||
- Stealing data
|
||||
- Corrupting data
|
||||
- Controlling resources
|
||||
- DOS
|
||||
|
||||
Why attack?
|
||||
- Money
|
||||
- Blackmail / extortion
|
||||
- Vendetta
|
||||
- intellectual curiosity
|
||||
- fame
|
||||
|
||||
Security is a Big topic
|
||||
|
||||
- Server security -- today's focus. There's some machine sitting on the
|
||||
Internet somewhere, with a certain interface exposed, and attackers
|
||||
want to circumvent it.
|
||||
- Why should you trust your software?
|
||||
|
||||
- Client security
|
||||
- Clients are usually servers, so they have many of the same issues.
|
||||
- Slight simplification: people across the network cannot typically
|
||||
initiate connections.
|
||||
- Has a "fallible operator":
|
||||
- Spyware
|
||||
- Drive-by-Downloads
|
||||
|
||||
- Client security turns out to be much harder -- GUI considerations,
|
||||
look inside the browser and the applications.
|
||||
- Systems community can more easily handle server security.
|
||||
- We think mainly of servers.
|
||||
|
||||
III. Server Security: Offense and Defense
|
||||
-----------------------------------------
|
||||
- Show picture of a Web site.
|
||||
|
||||
Attacks | Defense
|
||||
----------------------------------------------------------------------------
|
||||
1. Break into DB from net | 1. FW it off
|
||||
2. Break into WS on telnet | 2. FW it off
|
||||
3. Buffer overrun in Apache | 3. Patch apache / use better lang?
|
||||
4. Buffer overrun in our code | 4. Use better lang / isolate it
|
||||
5. SQL injection | 5. Better escaping / don't interpret code.
|
||||
6. Data scraping. | 6. Use a sparse UID space.
|
||||
7. PW sniffing | 7. ???
|
||||
8. Fetch /etc/passwd and crack | 8. Don't expose /etc/passwd
|
||||
PW |
|
||||
9. Root escalation from apache | 9. No setuid programs available to Apache
|
||||
10. XSS |10. Filter JS and input HTML code.
|
||||
11. Keystroke recorded on sys- |11. Client security
|
||||
admin's desktop (planetlab) |
|
||||
12. DDOS |12. ???
|
||||
|
||||
Summary:
|
||||
- That we want private data to be available to the right people is
what makes this problem hard in the first place. Internet servers
are there for a reason.
|
||||
- Security != "just encrypt your data;" this in fact can sometimes
|
||||
make the problem worse.
|
||||
- Best to prevent break-ins from happening in the first place.
|
||||
- If they do happen, want to limit their damage (POLP).
|
||||
- Security policies are difficult to express / package up neatly.
|
||||
|
||||
IV. Design According to POLP (in Unix)
|
||||
---------------------------------------
|
||||
- Assume any piece of a system can be compromised, by either bad
|
||||
programming or malicious attack.
|
||||
- Try to limit the damage done by such a compromise (along the lines
|
||||
of the 4 attack goals).
|
||||
|
||||
<Draw a picture of a server process on Unix, w/ other processes>
|
||||
|
||||
What's the goal on Unix?
|
||||
- Keep processes from communicating that don't have to:
|
||||
- limit FS, IPC, signals, ptrace
|
||||
- Strip away unneeded privilege
|
||||
- with respect to network, FS.
|
||||
- Strip away FS access.
|
||||
|
||||
How on Unix?
|
||||
- setuid/setgid
|
||||
- system call interposition
|
||||
- chroot (away from setuid executables, /etc/passwd, /etc/ssh/..)
|
||||
|
||||
<show Code snippet>
|
||||
|
||||
How do you write chroot'ed programs?
|
||||
- What about shared libraries?
|
||||
- /etc/resolv.conf?
|
||||
- Can chroot'ed programs access the FS at all? What if they need
|
||||
to write to the FS or read from the FS?
|
||||
- Fd's are *capabilities*; can pass them to chroot'ed services,
thereby opening new files on their behalf.
|
||||
- Unforgeable - can only get them from the kernel via open/socket, etc.
|
||||
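The fd-passing mechanism alluded to above is SCM_RIGHTS on a Unix
domain socket: a privileged process can open a file and hand the
descriptor to a chroot'ed service. A hedged sketch; the helper names
`send_fd`/`recv_fd` are illustrative, the cmsg API is standard POSIX.

```c
/* Passing a file descriptor over a Unix socket with SCM_RIGHTS. */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send file descriptor fd over the Unix socket sock. */
int send_fd(int sock, int fd) {
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr h; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof u.buf };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;      /* the "capability transfer" */
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor; returns the new fd or -1. */
int recv_fd(int sock) {
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr h; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof u.buf };
    if (recvmsg(sock, &msg, 0) != 1) return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_type != SCM_RIGHTS) return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```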
|
||||
Unix Shortcomings (round 1)
|
||||
- It's bad to run as root!
|
||||
- Yet, need root for:
|
||||
- chroot
|
||||
- setuid/setgid to a lower-privileged user
|
||||
- create a new user ID
|
||||
- Still no guarantee that we've cut off all channels
|
||||
- 200 syscalls!
|
||||
- Default is to give most/all privileges.
|
||||
- Can "break out" of chroot jails?
|
||||
- Can still exploit race conditions in the kernel to escalate privileges.
|
||||
|
||||
Sidebar
|
||||
- setuid / setuid misunderstanding
|
||||
- root / root misunderstanding
|
||||
- effective vs. real vs. saved set-user-ID
|
||||
|
||||
V. OKWS
|
||||
-------
|
||||
- Taking these principles as far as possible.
|
||||
- C.f. Figure 1 From the paper..
|
||||
- Discussion of which privileges are in which processes
|
||||
|
||||
<Table of how to hack, what you get, etc...>
|
||||
|
||||
- Technical details: how to launch a new service
|
||||
- Within the launcher (running as root):
|
||||
|
||||
<on board:>
|
||||
|
||||
// receive FDs from logger, pubd, demux
|
||||
fork ();
|
||||
chroot ("/var/okws/run");
|
||||
chdir ("/coredumps/51001");
|
||||
setgid (51001);
|
||||
setuid (51001);
|
||||
exec ("login", fds ... );
|
||||
|
||||
- Note no chroot -- why not?
|
||||
- Once launched, how does a service get new connections?
|
||||
- Note the goal - minimum tampering with each other in the
|
||||
case of a compromise.
|
||||
|
||||
Shortcoming of Unix (2)
|
||||
- A lot of plumbing involved with this system. FDs flying everywhere.
|
||||
- Isolation still not fine enough. If a service gets taken over,
|
||||
can compromise all users of that service.
|
||||
|
||||
VI. Reflections on Building Websites
|
||||
---------------------------------
|
||||
- OKWS interesting "experiment"
|
||||
- Need for speed; also, good gzip support.
|
||||
- If you need compiled code, it's a good way to go.
|
||||
- RPC-like system a must for backend communication
|
||||
- Connection-pooling for free
|
||||
|
||||
Biggest difficulties:
|
||||
- Finding good C++ programmers.
|
||||
- Compile times.
|
||||
- The DB is still always the problem.
|
||||
|
||||
Hard to Find good Alternatives
|
||||
- Python / Perl - you might spend a lot of time writing C code /
|
||||
integrating with lower level languages.
|
||||
- Have to worry about DB pooling.
|
||||
- Java -- most viable, and is getting better. Scary that you can't
peer inside.
|
||||
- .Net / C#-based system might be the way to go.
|
||||
|
||||
|
||||
=======================================================================
|
||||
|
||||
Extra Material:
|
||||
|
||||
Capabilities (From the Eros Paper in SOSP 1999)
|
||||
|
||||
- "Unforgeable pair made up of an object ID and a set of authorized
|
||||
operations (an interface) on that object."
|
||||
- c.f. Dennis and van Horn. "Programming semantics for multiprogrammed
|
||||
computations," Communications of the ACM 9(3):143-154, Mar 1966.
|
||||
- Thus:
|
||||
<object ID, set of authorized OPs on that object>
|
||||
- Examples:
|
||||
"Process X can write to file at inode Y"
|
||||
"Process P can read from file at inode Z"
|
||||
- Familiar example: Unix file descriptors
|
||||
|
||||
- Why are they secure?
|
||||
- Capabilities are "unforgeable"
|
||||
- Processes can get them only through authorized interfaces
|
||||
- Capabilities are only given to processes authorized to hold them
|
||||
|
||||
- How do you get them?
|
||||
- From the kernel (e.g., open)
|
||||
- From other applications (e.g., FD passing)
|
||||
|
||||
- How do you use them?
|
||||
- read (fd), write(fd).
|
||||
|
||||
- How do you revoke them once granted?
|
||||
- In Unix, you do not.
|
||||
- In some systems, a central authority ("reference monitor") can revoke.
|
||||
|
||||
- How do you store them persistently?
|
||||
- Can have circular dependencies (unlike an FS).
|
||||
- What happens when the system starts up?
|
||||
- Revert to checkpointed state.
|
||||
- Often capability systems chose a single-level store.
|
||||
|
||||
- Capability systems, a historical perspective:
- KeyKOS, Eros, Coyotos (university research)
|
||||
- Never saw any applications
|
||||
- IBM Systems (System 38, later AS/400, later 'i Series')
|
||||
- Commercially viable
|
||||
- Problems:
|
||||
- All bets are off when a capability is sent to the wrong place.
|
||||
- Firewall analogy?
|
249
web/l-plan9.html
Normal file
|
@ -0,0 +1,249 @@
|
|||
<html>
|
||||
<head>
|
||||
<title>Plan 9</title>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<h1>Plan 9</h1>
|
||||
|
||||
<p>Required reading: Plan 9 from Bell Labs</p>
|
||||
|
||||
<h2>Background</h2>
|
||||
|
||||
<p>Computing had moved away from the ``one computing system'' model
of Multics and Unix.</p>
|
||||
|
||||
<p>Many computers (`workstations'), self-maintained, not a coherent whole.</p>
|
||||
|
||||
<p>Pike and Thompson had been batting around ideas about a system glued together
|
||||
by a single protocol as early as 1984.
|
||||
Various small experiments involving individual pieces (file server, OS, computer)
|
||||
tried throughout 1980s.</p>
|
||||
|
||||
<p>Ordered the hardware for the ``real thing'' in beginning of 1989,
|
||||
built up WORM file server, kernel, throughout that year.</p>
|
||||
|
||||
<p>Some time in early fall 1989, Pike and Thompson were
|
||||
trying to figure out a way to fit the window system in.
|
||||
On way home from dinner, both independently realized that
|
||||
needed to be able to mount a user-space file descriptor,
|
||||
not just a network address.</p>
|
||||
|
||||
<p>Around Thanksgiving 1989, spent a few days rethinking the whole
|
||||
thing, added bind, new mount, flush, and spent a weekend
|
||||
making everything work again. The protocol at that point was
|
||||
essentially identical to the 9P in the paper.</p>
|
||||
|
||||
<p>In May 1990, tried to use system as self-hosting.
|
||||
File server kept breaking, had to keep rewriting window system.
|
||||
Dozen or so users by then, mostly using terminal windows to
|
||||
connect to Unix.</p>
|
||||
|
||||
<p>Paper written and submitted to UKUUG in July 1990.</p>
|
||||
|
||||
<p>Because it was an entirely new system, could take the
|
||||
time to fix problems as they arose, <i>in the right place</i>.</p>
|
||||
|
||||
|
||||
<h2>Design Principles</h2>
|
||||
|
||||
<p>Three design principles:</p>
|
||||
|
||||
<p>
|
||||
1. Everything is a file.<br>
|
||||
2. There is a standard protocol for accessing files.<br>
|
||||
3. Private, malleable name spaces (bind, mount).
|
||||
</p>
|
||||
|
||||
<h3>Everything is a file.</h3>
|
||||
|
||||
<p>Everything is a file (more everything than Unix: networks, graphics).</p>
|
||||
|
||||
<pre>
|
||||
% ls -l /net
|
||||
% lp /dev/screen
|
||||
% cat /mnt/wsys/1/text
|
||||
</pre>
|
||||
|
||||
<h3>Standard protocol for accessing files</h3>
|
||||
|
||||
<p>9P is the only protocol the kernel knows: other protocols
|
||||
(NFS, disk file systems, etc.) are provided by user-level translators.</p>
|
||||
|
||||
<p>Only one protocol, so easy to write filters and other
|
||||
converters. <i>Iostats</i> puts itself between the kernel
|
||||
and a command.</p>
|
||||
|
||||
<pre>
|
||||
% iostats -xvdfdf /bin/ls
|
||||
</pre>
|
||||
|
||||
<h3>Private, malleable name spaces</h3>
|
||||
|
||||
<p>Each process has its own private name space that it
|
||||
can customize at will.
|
||||
(Full disclosure: can arrange groups of
|
||||
processes to run in a shared name space. Otherwise how do
|
||||
you implement <i>mount</i> and <i>bind</i>?)</p>
|
||||
|
||||
<p><i>Iostats</i> remounts the root of the name space
|
||||
with its own filter service.</p>
|
||||
|
||||
<p>The window system mounts a file system that it serves
|
||||
on <tt>/mnt/wsys</tt>.</p>
|
||||
|
||||
<p>The network is actually a kernel device (no 9P involved)
|
||||
but it still serves a file interface that other programs
|
||||
use to access the network.
|
||||
Easy to move out to user space (or replace) if necessary:
|
||||
<i>import</i> network from another machine.</p>
|
||||
|
||||
<h3>Implications</h3>
|
||||
|
||||
<p>Everything is a file + can share files => can share everything.</p>
|
||||
|
||||
<p>Per-process name spaces help move toward ``each process has its own
|
||||
private machine.''</p>
|
||||
|
||||
<p>One protocol: easy to build custom filters to add functionality
(e.g., reestablishing broken network connections).</p>
|
||||
|
||||
<h3>File representation for networks, graphics, etc.</h3>
|
||||
|
||||
<p>Unix sockets are file descriptors, but you can't use the
|
||||
usual file operations on them. Also far too much detail that
|
||||
the user doesn't care about.</p>
|
||||
|
||||
<p>In Plan 9:
|
||||
<pre>dial("tcp!plan9.bell-labs.com!http");
|
||||
</pre>
|
||||
(Protocol-independent!)</p>
|
||||
|
||||
<p>Dial more or less does:</p>
<pre>
write to /net/cs: tcp!plan9.bell-labs.com!http
read back: /net/tcp/clone 204.178.31.2!80
write to /net/tcp/clone: connect 204.178.31.2!80
read connection number: 4
open /net/tcp/4/data
</pre>
|
||||
|
||||
<p>Details don't really matter. Two important points:
|
||||
protocol-independent, and ordinary file operations
|
||||
(open, read, write).</p>
|
||||
|
||||
<p>Networks can be shared just like any other files.</p>
|
||||
|
||||
<p>Similar story for graphics, other resources.</p>
|
||||
|
||||
<h2>Conventions</h2>
|
||||
|
||||
<p>Per-process name spaces mean that even full path names are ambiguous
|
||||
(<tt>/bin/cat</tt> means different things on different machines,
|
||||
or even for different users).</p>
|
||||
|
||||
<p><i>Convention</i> binds everything together.
|
||||
On a 386, <tt>bind /386/bin /bin</tt>.</p>
|
||||
|
||||
<p>In Plan 9, always know where the resource <i>should</i> be
|
||||
(e.g., <tt>/net</tt>, <tt>/dev</tt>, <tt>/proc</tt>, etc.),
|
||||
but not which one is there.</p>
|
||||
|
||||
<p>Can break conventions: on a 386, <tt>bind /alpha/bin /bin</tt>, just won't
|
||||
have usable binaries in <tt>/bin</tt> anymore.</p>
|
||||
|
||||
<p>Object-oriented in the sense of having objects (files) that all
|
||||
present the same interface and can be substituted for one another
|
||||
to arrange the system in different ways.</p>
|
||||
|
||||
<p>Very little ``type-checking'': <tt>bind /net /proc; ps</tt>.
|
||||
Great benefit (generality) but must be careful (no safety nets).</p>
|
||||
|
||||
|
||||
<h2>Other Contributions</h2>
|
||||
|
||||
<h3>Portability</h3>
|
||||
|
||||
<p>Plan 9 still is the most portable operating system.
|
||||
Not much machine-dependent code, no fancy features
|
||||
tied to one machine's MMU, multiprocessor from the start (1989).</p>
|
||||
|
||||
<p>Many other systems are still struggling with converting to SMPs.</p>
|
||||
|
||||
<p>Has run on MIPS, Motorola 68000, Nextstation, Sparc, x86, PowerPC, Alpha, others.</p>
|
||||
|
||||
<p>All the world is not an x86.</p>
|
||||
|
||||
<h3>Alef</h3>
|
||||
|
||||
<p>New programming language: convenient, but difficult to maintain.
|
||||
Retired when author (Winterbottom) stopped working on Plan 9.</p>
|
||||
|
||||
<p>Good ideas transferred to C library plus conventions.</p>
|
||||
|
||||
<p>All the world is not C.</p>
|
||||
|
||||
<h3>UTF-8</h3>
|
||||
|
||||
<p>Thompson invented UTF-8. Pike and Thompson
|
||||
converted Plan 9 to use it over the first weekend of September 1992,
|
||||
in time for X/Open to choose it as the Unicode standard byte format
|
||||
at a meeting the next week.</p>
|
||||
|
||||
<p>UTF-8 is now the standard character encoding for Unicode on
|
||||
all systems and interoperating between systems.</p>
|
||||
|
||||
<h3>Simple, easy to modify base for experiments</h3>
|
||||
|
||||
<p>Whole system source code is available, simple, easy to
|
||||
understand and change.
|
||||
There's a reason it only took a couple days to convert to UTF-8.</p>
|
||||
|
||||
<pre>
|
||||
49343 file server kernel
|
||||
|
||||
181611 main kernel
|
||||
78521 ipaq port (small kernel)
|
||||
20027 TCP/IP stack
|
||||
15365 ipaq-specific code
|
||||
43129 portable code
|
||||
|
||||
1326778 total lines of source code
|
||||
</pre>
|
||||
|
||||
<h3>Dump file system</h3>
|
||||
|
||||
<p>Snapshot idea might well have been ``in the air'' at the time.
|
||||
(<tt>OldFiles</tt> in AFS appears to be independently derived,
|
||||
use of WORM media was common research topic.)</p>
|
||||
|
||||
<h3>Generalized Fork</h3>
|
||||
|
||||
<p>Picked up by other systems: FreeBSD, Linux.</p>
|
||||
|
||||
<h3>Authentication</h3>
|
||||
|
||||
<p>No global super-user.
|
||||
Newer, more Plan 9-like authentication described in later paper.</p>
|
||||
|
||||
<h3>New Compilers</h3>
|
||||
|
||||
<p>Much faster than gcc, simpler.</p>
|
||||
|
||||
<p>8s to build acme for Linux using gcc; 1s to build acme for Plan 9 using 8c (but running on Linux)</p>
|
||||
|
||||
<h3>IL Protocol</h3>
|
||||
|
||||
<p>Now retired.
|
||||
For better or worse, TCP has all the installed base.
|
||||
IL didn't work very well on asymmetric or high-latency links
|
||||
(e.g., cable modems).</p>
|
||||
|
||||
<h2>Idea propagation</h2>
|
||||
|
||||
<p>Many ideas have propagated out to varying degrees.</p>
|
||||
|
||||
<p>Linux even has bind and user-level file servers now (FUSE),
|
||||
but still not per-process name spaces.</p>
|
||||
|
||||
|
||||
</body>
|
202
web/l-scalablecoord.html
Normal file
|
@ -0,0 +1,202 @@
|
|||
<html>
<head>
<title>Scalable coordination</title>
</head>
<body>
|
||||
|
||||
<h1>Scalable coordination</h1>
|
||||
|
||||
<p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable
|
||||
Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991.
|
||||
|
||||
<h2>Overview</h2>
|
||||
|
||||
<p>Shared-memory machines are a bunch of CPUs sharing physical
memory. Typically each processor also maintains a cache (for
performance), which introduces the problem of keeping the caches
coherent. If processor 1 writes a memory location whose value
processor 2 has cached, then processor 2's cache must be updated in
some way. How?
|
||||
<ul>
|
||||
|
||||
<li>Bus-based schemes. Any CPU can "dance with" any memory module
equally (a "dance hall" architecture). Use "snoopy" protocols: each
CPU's cache listens to the memory bus. With a write-through
architecture, invalidate the local copy when a write is seen. Or can
have an "ownership" scheme with a write-back cache (e.g., Pentium
cache lines have MESI bits---modified, exclusive, shared, invalid).
If the E bit is set, the CPU caches the line exclusively and can
write it back. But the bus places limits on scalability.
|
||||
|
||||
<li>More scalability w. NUMA schemes (non-uniform memory access). Each
|
||||
CPU comes with fast "close" memory. Slower to access memory that is
|
||||
stored with another processor. Use a directory to keep track of who is
|
||||
caching what. For example, processor 0 is responsible for all memory
|
||||
starting with address "000", processor 1 is responsible for all memory
|
||||
starting with "001", etc.
|
||||
|
||||
<li>COMA - cache-only memory architecture. Each CPU has local RAM,
|
||||
treated as cache. Cache lines migrate around to different nodes based
|
||||
on access pattern. Data only lives in cache, no permanent memory
|
||||
location. (These machines aren't too popular any more.)
|
||||
|
||||
</ul>
|
||||
|
||||
|
||||
<h2>Scalable locks</h2>
|
||||
|
||||
<p>This paper is about cost and scalability of locking; what if you
|
||||
have 10 CPUs waiting for the same lock? For example, what would
|
||||
happen if xv6 runs on an SMP with many processors?
|
||||
|
||||
<p>What's the cost of a simple spinning acquire/release? Algorithm 1
|
||||
*without* the delays, which is like xv6's implementation of acquire
|
||||
and release (xv6 uses XCHG instead of test_and_set):
|
||||
<pre>
|
||||
each of the 10 CPUs gets the lock in turn
|
||||
meanwhile, remaining CPUs in XCHG on lock
|
||||
lock must be X in cache to run XCHG
|
||||
otherwise all might read, then all might write
|
||||
so bus is busy all the time with XCHGs!
|
||||
can we avoid constant XCHGs while lock is held?
|
||||
</pre>
|
||||
|
||||
<p>test-and-test-and-set
|
||||
<pre>
|
||||
only run expensive TSL if not locked
|
||||
spin on ordinary load instruction, so cache line is S
|
||||
acquire(l)
|
||||
while(1){
|
||||
while(l->locked != 0) { }
|
||||
if(TSL(&l->locked) == 0)
|
||||
return;
|
||||
}
|
||||
</pre>
|
||||
|
||||
<p>suppose 10 CPUs are waiting, let's count cost in total bus
|
||||
transactions
|
||||
<pre>
|
||||
CPU1 gets lock in one cycle
|
||||
sets lock's cache line to I in other CPUs
|
||||
9 CPUs each use bus once in XCHG
|
||||
then everyone has the line S, so they spin locally
|
||||
CPU1 release the lock
|
||||
CPU2 gets the lock in one cycle
|
||||
8 CPUs each use bus once...
|
||||
So 10 + 9 + 8 + ... = 50 transactions, O(n^2) in # of CPUs!
|
||||
Look at "test-and-test-and-set" in Figure 6
|
||||
</pre>
|
||||
<p> Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time?
|
||||
|
||||
<p>What is the point of the exponential backoff in Algorithm 1?
|
||||
<pre>
|
||||
Does it buy us O(n) time for n acquires?
|
||||
Is there anything wrong with it?
|
||||
may not be fair
|
||||
exponential backoff may increase delay after release
|
||||
</pre>
|
||||
|
||||
<p>What's the point of the ticket locks, Algorithm 2?
|
||||
<pre>
|
||||
one interlocked instruction to get my ticket number
|
||||
then I spin on now_serving with ordinary load
|
||||
release() just increments now_serving
|
||||
</pre>
|
||||
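The ticket lock sketched above can be rendered with C11 atomics. This
is a hedged sketch of Algorithm 2, not the paper's code; the struct
and function names are illustrative.

```c
/* Ticket lock: one interlocked fetch-add to take a number, then
   spin on now_serving with ordinary loads. */
#include <assert.h>
#include <stdatomic.h>

struct ticketlock {
    atomic_uint next_ticket;   /* take a number */
    atomic_uint now_serving;   /* whose turn it is */
};

void tl_acquire(struct ticketlock *l) {
    /* one interlocked instruction to get my ticket number */
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    /* then spin on now_serving with ordinary loads */
    while (atomic_load(&l->now_serving) != me)
        ;
}

void tl_release(struct ticketlock *l) {
    /* release just increments now_serving */
    atomic_fetch_add(&l->now_serving, 1);
}
```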
|
||||
<p>why is that good?
|
||||
<pre>
|
||||
+ fair
|
||||
+ no exponential backoff overshoot
|
||||
+ no spinning on
|
||||
</pre>
|
||||
|
||||
<p>but what's the cost, in bus transactions?
|
||||
<pre>
|
||||
while lock is held, now_serving is S in all caches
|
||||
release makes it I in all caches
|
||||
then each waiters uses a bus transaction to get new value
|
||||
so still O(n^2)
|
||||
</pre>
|
||||
|
||||
<p>What's the point of the array-based queuing locks, Algorithm 3?
|
||||
<pre>
|
||||
a lock has an array of "slots"
|
||||
waiter allocates a slot, spins on that slot
|
||||
release wakes up just next slot
|
||||
so O(n) bus transactions to get through n waiters: good!
|
||||
anderson lines in Figure 4 and 6 are flat-ish
|
||||
they only go up because lock data structures protected by simpler lock
|
||||
but O(n) space *per lock*!
|
||||
</pre>
|
||||
|
||||
<p>Algorithm 5 (MCS), the new algorithm of the paper, uses
|
||||
compare_and_swap:
|
||||
<pre>
|
||||
int compare_and_swap(addr, v1, v2) {
|
||||
int ret = 0;
|
||||
// stop all memory activity and ignore interrupts
|
||||
if (*addr == v1) {
|
||||
*addr = v2;
|
||||
ret = 1;
|
||||
}
|
||||
// resume other memory activity and take interrupts
|
||||
return ret;
|
||||
}
|
||||
</pre>
|
||||
|
||||
<p>What's the point of the MCS lock, Algorithm 5?
|
||||
<pre>
|
||||
constant space per lock, rather than O(n)
|
||||
one "qnode" per thread, used for whatever lock it's waiting for
|
||||
lock holder's qnode points to start of list
|
||||
lock variable points to end of list
|
||||
acquire adds your qnode to end of list
|
||||
then you spin on your own qnode
|
||||
release wakes up next qnode
|
||||
</pre>
|
||||
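The MCS steps above can be sketched with C11 atomics. A hedged
rendering of Algorithm 5, not the paper's exact pseudocode; names
like `mcs_lock` and `qnode` follow the paper's terminology but the
code is illustrative.

```c
/* MCS list-based queuing lock: constant space per lock, each thread
   spins only on its own qnode. */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct qnode {
    _Atomic(struct qnode *) next;
    atomic_int locked;                 /* 1 while we must wait */
};

typedef _Atomic(struct qnode *) mcs_lock;   /* points at queue tail */

void mcs_acquire(mcs_lock *l, struct qnode *me) {
    atomic_store(&me->next, NULL);
    /* swap ourselves in as the new tail */
    struct qnode *pred = atomic_exchange(l, me);
    if (pred != NULL) {                /* queue non-empty: wait */
        atomic_store(&me->locked, 1);
        atomic_store(&pred->next, me); /* link behind predecessor */
        while (atomic_load(&me->locked))   /* spin on own qnode */
            ;
    }
}

void mcs_release(mcs_lock *l, struct qnode *me) {
    struct qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* no known successor; if still the tail, just unlock */
        struct qnode *expected = me;
        if (atomic_compare_exchange_strong(l, &expected, NULL))
            return;
        /* a successor is mid-link: wait for it to appear */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, 0);    /* wake up the next qnode */
}
```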
|
||||
<h2>Wait-free or non-blocking data structures</h2>
|
||||
|
||||
<p>The previous implementations all block threads when there is
contention for a lock. Other atomic hardware operations allow one to
build wait-free data structures. For example, one can implement an
insert of an element into a shared list that doesn't block a thread.
Such versions are called wait-free.
|
||||
|
||||
<p>A linked list with locks is as follows:
|
||||
<pre>
|
||||
Lock list_lock;
|
||||
|
||||
insert(int x) {
|
||||
element *n = new Element;
|
||||
n->x = x;
|
||||
|
||||
acquire(&list_lock);
|
||||
n->next = list;
|
||||
list = n;
|
||||
release(&list_lock);
|
||||
}
|
||||
</pre>
|
||||
|
||||
<p>A wait-free implementation is as follows:
|
||||
<pre>
|
||||
insert (int x) {
|
||||
element *n = new Element;
|
||||
n->x = x;
|
||||
do {
|
||||
n->next = list;
|
||||
} while (compare_and_swap (&list, n->next, n) == 0);
|
||||
}
|
||||
</pre>
|
||||
<p>How many bus transactions with 10 CPUs inserting one element in the
|
||||
list? Could you do better?
|
||||
|
||||
<p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This
|
||||
paper by Fraser and Harris</a> compares lock-based implementations
|
||||
versus corresponding non-blocking implementations of a number of data
|
||||
structures.
|
||||
|
||||
<p>It is not possible to make every operation wait-free, and there
are times we will need an implementation of acquire and release.
Research on non-blocking data structures is active; the last word on
this topic hasn't been said yet.
|
||||
|
||||
</body>
340
web/l-schedule.html
Normal file
@ -0,0 +1,340 @@
<html>
<head>
<title>Scheduling</title>
</head>
|
||||
<body>
|
||||
|
||||
<h1>Scheduling</h1>
|
||||
|
||||
<p>Required reading: Eliminating receive livelock
|
||||
|
||||
<p>Notes based on Prof. Morris's lecture on scheduling (6.824, fall '02).
|
||||
|
||||
<h2>Overview</h2>
|
||||
|
||||
<ul>
|
||||
|
||||
<li>What is scheduling? The OS policies and mechanisms that allocate
resources to entities. A good scheduling policy ensures that the most
important entity gets the resources it needs. This topic was
popular in the days of time sharing, when there was a shortage of
resources. It seemed irrelevant in the era of PCs and workstations, when
resources were plentiful. Now the topic is back from the dead to handle
massive Internet servers with paying customers. The Internet exposes
web sites to international abuse and overload, which can lead to
resource shortages. Furthermore, some customers are more important
than others (e.g., the ones that buy a lot).
|
||||
|
||||
<li>Key problems:
|
||||
<ul>
|
||||
<li>Gap between desired policy and available mechanism.  The desired
policies often include elements that are not implementable with the
mechanisms available to the operating system.  Furthermore, often
there are many conflicting goals (low latency, high throughput, and
fairness), and the scheduler must make a trade-off between the goals.

<li>Interaction between different schedulers.  One has to take a
systems view.  Just optimizing the CPU scheduler may do little for
the overall desired policy.
|
||||
</ul>
|
||||
|
||||
<li>Resources you might want to schedule: CPU time, physical memory,
|
||||
disk and network I/O, and I/O bus bandwidth.
|
||||
|
||||
<li>Entities that you might want to give resources to: users,
|
||||
processes, threads, web requests, or MIT accounts.
|
||||
|
||||
<li>Many polices for resource to entity allocation are possible:
|
||||
strict priority, divide equally, shortest job first, minimum guarantee
|
||||
combined with admission control.
|
||||
|
||||
<li>General plan for scheduling mechanisms
|
||||
<ol>
|
||||
<li> Understand where scheduling is occurring.
|
||||
<li> Expose scheduling decisions, allow control.
|
||||
<li> Account for resource consumption, to allow intelligent control.
|
||||
</ol>
|
||||
|
||||
<li>Simple example from 6.828 kernel. The policy for scheduling
|
||||
environments is to give each one equal CPU time. The mechanism used to
|
||||
implement this policy is a clock interrupt every 10 msec and then
|
||||
selecting the next environment in a round-robin fashion.
|
||||
|
||||
<p>But this only works if processes are compute-bound. What if a
|
||||
process gives up some of its 10 ms to wait for input? Do we have to
|
||||
keep track of that and give it back?
|
||||
|
||||
<p>How long should the quantum be? is 10 msec the right answer?
|
||||
Shorter quantum will lead to better interactive performance, but
|
||||
lowers overall system throughput because we will reschedule more,
|
||||
which has overhead.
|
||||
|
||||
<p>What if the environment computes for 1 msec and sends an IPC to
|
||||
the file server environment? Shouldn't the file server get more CPU
|
||||
time because it operates on behalf of all other functions?
|
||||
|
||||
<p>Potential improvements for the 6.828 kernel: track "recent" CPU use
|
||||
(e.g., over the last second) and always run environment with least
|
||||
recent CPU use. (Still, if you sleep long enough you lose.) Other
|
||||
solution: directed yield; specify on the yield to which environment
|
||||
you are donating the remainder of the quantum (e.g., to the file
|
||||
server so that it can compute on the environment's behalf).
|
||||
|
||||
<li>Pitfall: Priority Inversion
|
||||
<pre>
|
||||
Assume policy is strict priority.
|
||||
Thread T1: low priority.
|
||||
Thread T2: medium priority.
|
||||
Thread T3: high priority.
|
||||
T1: acquire(l)
|
||||
context switch to T3
|
||||
T3: acquire(l)... must wait for T1 to release(l)...
|
||||
context switch to T2
|
||||
T2 computes for a while
|
||||
T3 is indefinitely delayed despite high priority.
|
||||
Can solve if T3 lends its priority to holder of lock it is waiting for.
|
||||
So T1 runs, not T2.
|
||||
[this is really a multiple scheduler problem.]
|
||||
[since locks schedule access to locked resource.]
|
||||
</pre>
|
||||
|
||||
<li>Pitfall: Efficiency. Efficiency often conflicts with fairness (or
|
||||
any other policy). Long time quantum for efficiency in CPU scheduling
|
||||
versus low delay. Shortest seek versus FIFO disk scheduling.
|
||||
Contiguous read-ahead vs data needed now. For example, scheduler
|
||||
swaps out my idle emacs to let gcc run faster with more phys mem.
|
||||
What happens when I type a key? These don't fit well into a "who gets
|
||||
to go next" scheduler framework. Inefficient scheduling may make
|
||||
<i>everybody</i> slower, including high priority users.
|
||||
|
||||
<li>Pitfall: Multiple Interacting Schedulers. Suppose you want your
|
||||
emacs to have priority over everything else. Give it high CPU
|
||||
priority. Does that mean nothing else will run if emacs wants to run?
|
||||
Disk scheduler might not know to favor emacs's disk I/Os. Typical
|
||||
UNIX disk scheduler favors disk efficiency, not process prio. Suppose
|
||||
emacs needs more memory. Other processes have dirty pages; emacs must
|
||||
wait. Does disk scheduler know these other processes' writes are high
|
||||
prio?
|
||||
|
||||
<li>Pitfall: Server Processes. Suppose emacs uses X windows to
|
||||
display. The X server must serve requests from many clients. Does it
|
||||
know that emacs' requests should be given priority? Does the OS know
|
||||
to raise X's priority when it is serving emacs? Similarly for DNS,
|
||||
and NFS. Does the network know to give emacs' NFS requests priority?
|
||||
|
||||
</ul>
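<p>The clock-interrupt round-robin mechanism described for the 6.828
kernel above can be sketched as follows (the names <code>envs</code>,
<code>curenv</code>, and <code>sched_next</code> are hypothetical, not
the kernel's actual identifiers):

```c
#define NENV 8
enum { ENV_FREE, ENV_RUNNABLE };

struct env { int state; };
struct env envs[NENV];
int curenv = -1;                 /* index of the running environment */

/* Invoked from the 10 msec clock interrupt: scan forward from the
 * current environment and pick the next runnable one, round robin. */
int sched_next(void) {
    for (int i = 1; i <= NENV; i++) {
        int e = (curenv + i + NENV) % NENV;
        if (envs[e].state == ENV_RUNNABLE) {
            curenv = e;
            return e;
        }
    }
    return -1;                   /* nothing is runnable */
}
```

Starting the scan one past <code>curenv</code> is what makes the policy
fair: the environment that just ran goes to the back of the rotation.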
|
||||
|
||||
<p>In short, scheduling is a system problem. There are many
|
||||
schedulers; they interact. The CPU scheduler is usually the easy
|
||||
part. The hardest part is system structure. For example, the
|
||||
<i>existence</i> of interrupts is bad for scheduling. Conflicting
|
||||
goals may limit effectiveness.
|
||||
|
||||
<h2>Case study: modern UNIX</h2>
|
||||
|
||||
<p>Goals:
|
||||
<ul>
|
||||
<li>Simplicity (e.g. avoid complex locking regimes).
|
||||
<li>Quick response to device interrupts.
|
||||
<li> Favor interactive response.
|
||||
</ul>
|
||||
|
||||
<p>UNIX has a number of execution environments. We care about
|
||||
scheduling transitions among them. Some transitions aren't possible,
|
||||
some can't be controlled.  The execution environments are:
|
||||
|
||||
<ul>
|
||||
<li>Process, user half
|
||||
<li>Process, kernel half
|
||||
<li>Soft interrupts: timer, network
|
||||
<li>Device interrupts
|
||||
</ul>
|
||||
|
||||
<p>The rules are:
|
||||
<ul>
|
||||
<li>User is pre-emptible.
|
||||
<li>Kernel half and software interrupts are not pre-emptible.
|
||||
<li>Device handlers may not make blocking calls (e.g., sleep)
|
||||
<li>Effective priorities: intr > soft intr > kernel half > user
|
||||
</ul>
|
||||
|
||||
|
||||
|
||||
<p>Rules are implemented as follows:
|
||||
|
||||
<ul>
|
||||
|
||||
<li>UNIX: Process User Half. Runs in process address space, on
|
||||
per-process stack. Interruptible. Pre-emptible: interrupt may cause
|
||||
context switch. We don't trust user processes to yield CPU.
|
||||
Voluntarily enters kernel half via system calls and faults.
|
||||
|
||||
<li>UNIX: Process Kernel Half. Runs in kernel address space, on
|
||||
per-process kernel stack. Executes system calls and faults for its
|
||||
process. Interruptible (but can defer interrupts in critical
|
||||
sections). Not pre-emptible. Only yields voluntarily, when waiting
|
||||
for an event. E.g. disk I/O done. This simplifies concurrency
|
||||
control; locks often not required. No user process runs if any kernel
|
||||
half wants to run. Many process' kernel halfs may be sleeping in the
|
||||
kernel.
|
||||
|
||||
<li>UNIX: Device Interrupts. Hardware asks CPU for an interrupt to ask
|
||||
for attention. Disk read/write completed, or network packet received.
|
||||
Runs in kernel space, on special interrupt stack. Interrupt routine
|
||||
cannot block; must return. Interrupts are interruptible. They nest
|
||||
on the one interrupt stack. Interrupts are not pre-emptible, and
|
||||
cannot really yield. The real-time clock is a device and interrupts
|
||||
every 10ms (or whatever). Process scheduling decisions can be made
|
||||
when interrupt returns (e.g. wake up the process waiting for this
|
||||
event). You want interrupt processing to be fast, since it has
|
||||
priority. Don't do any more work than you have to. You're blocking
|
||||
processes and other interrupts. Typically, an interrupt does the
|
||||
minimal work necessary to keep the device happy, and then calls wakeup
|
||||
on a thread.
|
||||
|
||||
<li>UNIX: Soft Interrupts. (Didn't exist in xv6) Used when device
|
||||
handling is expensive. But no obvious process context in which to
|
||||
run. Examples include IP forwarding, TCP input processing. Runs in
|
||||
kernel space, on interrupt stack.  Interruptible.  Not pre-emptible,
|
||||
can't really yield. Triggered by hardware interrupt. Called when
|
||||
outermost hardware interrupt returns. Periodic scheduling decisions
|
||||
are made in timer s/w interrupt. Scheduled by hardware timer
|
||||
interrupt (i.e., if current process has run long enough, switch).
|
||||
</ul>
|
||||
|
||||
<p>Is this good software structure? Let's talk about receive
|
||||
livelock.
|
||||
|
||||
<h2>Paper discussion</h2>
|
||||
|
||||
<ul>
|
||||
|
||||
<li>What is the application that the paper is addressing? IP forwarding.
|
||||
What functionality does a network interface offer to driver?
|
||||
<ul>
|
||||
<li> Read packets
|
||||
<li> Poke hardware to send packets
|
||||
<li> Interrupts when packet received/transmit complete
|
||||
<li> Buffer many input packets
|
||||
</ul>
|
||||
|
||||
<li>What devices in the 6.828 kernel are interrupt driven?  Which ones
|
||||
are polling? Is this ideal?
|
||||
|
||||
<li>Explain Figure 6-1. Why does it go up? What determines how high
|
||||
the peak is? Why does it go down? What determines how fast it goes
|
||||
down?  Answer:
|
||||
<pre>
|
||||
(fraction of packets discarded)(work invested in discarded packets)
|
||||
-------------------------------------------
|
||||
(total work CPU is capable of)
|
||||
</pre>
|
||||
|
||||
<li>Suppose I wanted to test an NFS server for livelock.
|
||||
<pre>
|
||||
Run client with this loop:
|
||||
while(1){
|
||||
send NFS READ RPC;
|
||||
wait for response;
|
||||
}
|
||||
</pre>
|
||||
What would I see? Is the NFS server probably subject to livelock?
|
||||
(No--offered load subject to feedback).
|
||||
|
||||
<li>What other problems are we trying to address?
|
||||
<ul>
|
||||
<li>Increased latency for packet delivery and forwarding (e.g., start
|
||||
disk head moving when first NFS read request comes)
|
||||
<li>Transmit starvation
|
||||
<li>User-level CPU starvation
|
||||
</ul>
|
||||
|
||||
<li>Why not tell the O/S scheduler to give interrupts lower priority?
|
||||
Non-preemptible.
|
||||
Could you fix this by making interrupts faster? (Maybe, if coupled
|
||||
with some limit on input rate.)
|
||||
|
||||
<li>Why not completely process each packet in the interrupt handler?
|
||||
(I.e. forward it?) Other parts of kernel don't expect to run at high
|
||||
interrupt-level (e.g., some packet processing code might invoke a function
|
||||
that sleeps).  Still might want an output queue.
|
||||
|
||||
<li>What about using polling instead of interrupts? Solves overload
|
||||
problem, but killer for latency.
|
||||
|
||||
<li>What's the paper's solution?
|
||||
<ul>
|
||||
<li>No IP input queue.
|
||||
<li>Input processing and device input polling in kernel thread.
|
||||
<li>Device receive interrupt just wakes up thread. And leaves
|
||||
interrupts *disabled* for that device.
|
||||
<li>Thread does all input processing, then re-enables interrupts.
|
||||
</ul>
|
||||
<p>Why does this work? What happens when packets arrive too fast?
|
||||
What happens when packets arrive slowly?
|
||||
|
||||
<li>Explain Figure 6-3.
|
||||
<ul>
|
||||
<li>Why does "Polling (no quota)" work badly? (Input still starves
|
||||
xmit complete processing.)
|
||||
<li>Why does it immediately fall to zero, rather than gradually decreasing?
|
||||
(xmit complete processing must be very cheap compared to input.)
|
||||
</ul>
|
||||
|
||||
<li>Explain Figure 6-4.
|
||||
<ul>
|
||||
|
||||
<li>Why does "Polling, no feedback" behave badly? There's a queue in
|
||||
front of screend. We can still give 100% to input thread, 0% to
|
||||
screend.
|
||||
|
||||
<li>Why does "Polling w/ feedback" behave well? Input thread yields
|
||||
when queue to screend fills.
|
||||
|
||||
<li>What if screend hangs, what about other consumers of packets?
|
||||
(e.g., can you ssh to machine to fix screend?) Fortunately screend
|
||||
is typically the only application.  Also, re-enable input after a timeout.
|
||||
|
||||
</ul>
|
||||
|
||||
<li>Why are the two solutions different?
|
||||
<ol>
|
||||
<li> Polling thread <i>with quotas</i>.
|
||||
<li> Feedback from full queue.
|
||||
</ol>
|
||||
(I believe they should have used #2 for both.)
|
||||
|
||||
<li>If we apply the proposed fixes, does the phenomenon totally go
|
||||
away? (e.g. for web server, waits for disk, &c.)
|
||||
<ul>
|
||||
<li>Can the net device throw away packets without slowing down host?
|
||||
<li>Problem: We want to drop packets for applications with big queues.
|
||||
But that requires work to determine which application a packet belongs to.
|
||||
Solution: NI-LRP (have network interface sort packets)
|
||||
</ul>
|
||||
|
||||
<li>What about the latency question?  (Look at figure 14, p. 243.)
|
||||
<ul>
|
||||
<li>1st packet looks like an improvement over non-polling. But 2nd
|
||||
packet transmitted later with polling.  Why?  (No new packets added to
|
||||
xmit buffer until xmit interrupt)
|
||||
<li>Why? In traditional BSD, to
|
||||
amortize cost of poking device. Maybe better to poke a second time
|
||||
anyway.
|
||||
</ul>
|
||||
|
||||
<li>What if processing has more complex structure?
|
||||
<ul>
|
||||
<li>Chain of processing stages with queues? Does feedback work?
|
||||
What happens when a late stage is slow?
|
||||
<li>Split at some point, multiple parallel paths?  Not so great; one
|
||||
slow path blocks all paths.
|
||||
</ul>
|
||||
|
||||
<li>Can we formulate any general principles from paper?
|
||||
<ul>
|
||||
<li>Don't spend time on new work before completing existing work.
|
||||
<li>Or give new work lower priority than partially-completed work.
|
||||
</ul>
|
||||
|
||||
</ul>
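<p>The feedback idea from the paper's solution can be sketched as a
bounded queue whose producer learns immediately when the consumer has
fallen behind, so new work is refused before any effort is invested in
it (a sketch; names are hypothetical):

```c
#define QCAP 4   /* small queue capacity, for illustration */

typedef struct { int buf[QCAP]; int head, tail, n; } queue;

/* Enqueue with feedback: returns 0 when the queue is full, telling
 * the polling input thread to yield instead of doing work that
 * would only be discarded later. */
int enqueue(queue *q, int pkt) {
    if (q->n == QCAP)
        return 0;
    q->buf[q->tail] = pkt;
    q->tail = (q->tail + 1) % QCAP;
    q->n++;
    return 1;
}

int dequeue(queue *q, int *pkt) {
    if (q->n == 0)
        return 0;
    *pkt = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->n--;
    return 1;
}

/* Offer 'offered' packets to an empty queue with no consumer running;
 * return how many the producer was told to keep.  The rest are dropped
 * early, before any processing effort is invested in them. */
int accepted_of(int offered) {
    queue q = { {0}, 0, 0, 0 };
    int ok = 0;
    for (int i = 0; i < offered; i++)
        ok += enqueue(&q, i);
    return ok;
}
```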
316
web/l-threads.html
Normal file
@ -0,0 +1,316 @@
<html>
<head>
<title>Threads, processes, and context switching</title>
</head>
|
||||
<body>
|
||||
|
||||
<h1>Threads, processes, and context switching</h1>
|
||||
|
||||
<p>Required reading: proc.c (focus on scheduler() and sched()),
|
||||
setjmp.S, and sys_fork (in sysproc.c)
|
||||
|
||||
<h2>Overview</h2>
|
||||
|
||||
|
||||
<p>Big picture: more programs than processors. How to share the
|
||||
limited number of processors among the programs?
|
||||
|
||||
<p>Observation: most programs don't need the processor continuously,
|
||||
because they frequently have to wait for input (from user, disk,
|
||||
network, etc.)
|
||||
|
||||
<p>Idea: when one program must wait, it releases the processor, and
|
||||
gives it to another program.
|
||||
|
||||
<p>Mechanism: a thread of computation, an active computation.  A
thread is an abstraction that contains the minimal state that is
necessary to stop an active computation and resume it at some point later.
|
||||
What that state is depends on the processor. On x86, it is the
|
||||
processor registers (see setjmp.S).
|
||||
|
||||
<p>Address spaces and threads: address spaces and threads are in
|
||||
principle independent concepts. One can switch from one thread to
|
||||
another thread in the same address space, or one can switch from one
|
||||
thread to another thread in another address space. Example: in xv6,
|
||||
one switches address spaces by switching segmentation registers (see
|
||||
setupsegs). Does xv6 ever switch from one thread to another in the
|
||||
same address space? (Answer: yes, xv6 switches, for example, from the
|
||||
scheduler, proc[0], to the kernel part of init, proc[1].) In the JOS
|
||||
kernel we switch from the kernel thread to a user thread, but we don't
|
||||
switch kernel space necessarily.
|
||||
|
||||
<p>Process: one address space plus one or more threads of computation.
|
||||
In xv6 all <i>user</i> programs contain one thread of computation and
|
||||
one address space, and the concepts of address space and threads of
|
||||
computation are not separated but bundled together in the concept of a
|
||||
process. When switching from the kernel program (which has multiple
|
||||
threads) to a user program, xv6 switches threads (switching from a
|
||||
kernel stack to a user stack) and address spaces (the hardware uses
|
||||
the kernel segment registers and the user segment registers).
|
||||
|
||||
<p>xv6 supports the following operations on processes:
|
||||
<ul>
|
||||
<li>fork: create a new process, which is a copy of the parent.
<li>exec: execute a program
<li>exit: terminate process
|
||||
<li>wait: wait for a process to terminate
|
||||
<li>kill: kill process
|
||||
<li>sbrk: grow the address space of a process.
|
||||
</ul>
|
||||
This interface doesn't separate threads and address spaces.  For
example, with this interface one cannot create additional threads in
the same process.  Modern Unixes provide additional primitives
(called pthreads, POSIX threads) to create additional threads in a
process and coordinate their activities.
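<p>The fork/wait pair behaves as in standard Unix; a minimal user-level
illustration (the helper name is hypothetical):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that immediately exits with 'code'; the parent waits
 * for it and returns the exit status that wait reports. */
int fork_and_collect(int code) {
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);                  /* child half of the fork */
    int status;
    waitpid(pid, &status, 0);         /* parent blocks until child exits */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```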
|
||||
|
||||
<p>Scheduling. The thread manager needs a method for deciding which
|
||||
thread to run if multiple threads are runnable. The xv6 policy is to
|
||||
run the processes round robin. Why round robin? What other methods
|
||||
can you imagine?
|
||||
|
||||
<p>Preemptive scheduling. To force a thread to release the processor
|
||||
periodically (in case the thread never calls sleep), a thread manager
|
||||
can use preemptive scheduling. The thread manager uses the clock chip
|
||||
to generate periodically a hardware interrupt, which will cause
|
||||
control to transfer to the thread manager, which then can decide to
|
||||
run another thread (e.g., see trap.c).
|
||||
|
||||
<h2>xv6 code examples</h2>
|
||||
|
||||
<p>Thread switching is implemented in xv6 using setjmp and longjmp,
|
||||
which take a jumpbuf as an argument. setjmp saves its context in a
|
||||
jumpbuf for later use by longjmp. longjmp restores the context saved
|
||||
by the last setjmp. It then causes execution to continue as if the
|
||||
call of setjmp has just returned 1.
|
||||
<ul>
|
||||
<li>setjmp saves: ebx, ecx, edx, esi, edi, esp, ebp, and eip.
|
||||
<li>longjmp restores them, and puts 1 in eax!
|
||||
</ul>
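<p>The same convention is visible with user-level setjmp/longjmp (here
used within one live stack frame, which is the only portable use; xv6's
kernel setjmp/longjmp additionally switch stacks):

```c
#include <setjmp.h>

/* setjmp returns 0 when it saves the context, and the value passed
 * to longjmp (here 1) when control comes back through it. */
int save_and_resume(void) {
    jmp_buf ctx;
    volatile int trace = 0;     /* volatile: value survives the longjmp */
    if (setjmp(ctx) == 0) {
        trace += 10;            /* first pass: context just saved */
        longjmp(ctx, 1);        /* "switch back" to the saved context */
    }
    trace += 1;                 /* second pass: setjmp "returned 1" */
    return trace;
}
```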
|
||||
|
||||
<p> Example of thread switching: proc[0] switches to scheduler:
|
||||
<ul>
|
||||
<li>1359: proc[0] calls iget, which calls sleep, which calls sched.
|
||||
<li>2261: The stack before the call to setjmp in sched is:
|
||||
<pre>
|
||||
CPU 0:
|
||||
eax: 0x10a144 1089860
|
||||
ecx: 0x6c65746e 1818588270
|
||||
edx: 0x0 0
|
||||
ebx: 0x10a0e0 1089760
|
||||
esp: 0x210ea8 2166440
|
||||
ebp: 0x210ebc 2166460
|
||||
esi: 0x107f20 1081120
|
||||
edi: 0x107740 1079104
|
||||
eip: 0x1023c9
|
||||
eflags 0x12
|
||||
cs: 0x8
|
||||
ss: 0x10
|
||||
ds: 0x10
|
||||
es: 0x10
|
||||
fs: 0x10
|
||||
gs: 0x10
|
||||
00210ea8 [00210ea8] 10111e
|
||||
00210eac [00210eac] 210ebc
|
||||
00210eb0 [00210eb0] 10239e
|
||||
00210eb4 [00210eb4] 0001
|
||||
00210eb8 [00210eb8] 10a0e0
|
||||
00210ebc [00210ebc] 210edc
|
||||
00210ec0 [00210ec0] 1024ce
|
||||
00210ec4 [00210ec4] 1010101
|
||||
00210ec8 [00210ec8] 1010101
|
||||
00210ecc [00210ecc] 1010101
|
||||
00210ed0 [00210ed0] 107740
|
||||
00210ed4 [00210ed4] 0001
|
||||
00210ed8 [00210ed8] 10cd74
|
||||
00210edc [00210edc] 210f1c
|
||||
00210ee0 [00210ee0] 100bbc
|
||||
00210ee4 [00210ee4] 107740
|
||||
</pre>
|
||||
<li>2517: stack at beginning of setjmp:
|
||||
<pre>
|
||||
CPU 0:
|
||||
eax: 0x10a144 1089860
|
||||
ecx: 0x6c65746e 1818588270
|
||||
edx: 0x0 0
|
||||
ebx: 0x10a0e0 1089760
|
||||
esp: 0x210ea0 2166432
|
||||
ebp: 0x210ebc 2166460
|
||||
esi: 0x107f20 1081120
|
||||
edi: 0x107740 1079104
|
||||
eip: 0x102848
|
||||
eflags 0x12
|
||||
cs: 0x8
|
||||
ss: 0x10
|
||||
ds: 0x10
|
||||
es: 0x10
|
||||
fs: 0x10
|
||||
gs: 0x10
|
||||
00210ea0 [00210ea0] 1023cf <--- return address (sched)
|
||||
00210ea4 [00210ea4] 10a144
|
||||
00210ea8 [00210ea8] 10111e
|
||||
00210eac [00210eac] 210ebc
|
||||
00210eb0 [00210eb0] 10239e
|
||||
00210eb4 [00210eb4] 0001
|
||||
00210eb8 [00210eb8] 10a0e0
|
||||
00210ebc [00210ebc] 210edc
|
||||
00210ec0 [00210ec0] 1024ce
|
||||
00210ec4 [00210ec4] 1010101
|
||||
00210ec8 [00210ec8] 1010101
|
||||
00210ecc [00210ecc] 1010101
|
||||
00210ed0 [00210ed0] 107740
|
||||
00210ed4 [00210ed4] 0001
|
||||
00210ed8 [00210ed8] 10cd74
|
||||
00210edc [00210edc] 210f1c
|
||||
</pre>
|
||||
<li>2519: What is saved in jmpbuf of proc[0]?
|
||||
<li>2529: return 0!
|
||||
<li>2534: What is in jmpbuf of cpu 0? The stack is as follows:
|
||||
<pre>
|
||||
CPU 0:
|
||||
eax: 0x0 0
|
||||
ecx: 0x6c65746e 1818588270
|
||||
edx: 0x108aa4 1084068
|
||||
ebx: 0x10a0e0 1089760
|
||||
esp: 0x210ea0 2166432
|
||||
ebp: 0x210ebc 2166460
|
||||
esi: 0x107f20 1081120
|
||||
edi: 0x107740 1079104
|
||||
eip: 0x10286e
|
||||
eflags 0x46
|
||||
cs: 0x8
|
||||
ss: 0x10
|
||||
ds: 0x10
|
||||
es: 0x10
|
||||
fs: 0x10
|
||||
gs: 0x10
|
||||
00210ea0 [00210ea0] 1023fe
|
||||
00210ea4 [00210ea4] 108aa4
|
||||
00210ea8 [00210ea8] 10111e
|
||||
00210eac [00210eac] 210ebc
|
||||
00210eb0 [00210eb0] 10239e
|
||||
00210eb4 [00210eb4] 0001
|
||||
00210eb8 [00210eb8] 10a0e0
|
||||
00210ebc [00210ebc] 210edc
|
||||
00210ec0 [00210ec0] 1024ce
|
||||
00210ec4 [00210ec4] 1010101
|
||||
00210ec8 [00210ec8] 1010101
|
||||
00210ecc [00210ecc] 1010101
|
||||
00210ed0 [00210ed0] 107740
|
||||
00210ed4 [00210ed4] 0001
|
||||
00210ed8 [00210ed8] 10cd74
|
||||
00210edc [00210edc] 210f1c
|
||||
</pre>
|
||||
<li>2547: return 1! stack looks as follows:
|
||||
<pre>
|
||||
CPU 0:
|
||||
eax: 0x1 1
|
||||
ecx: 0x108aa0 1084064
|
||||
edx: 0x108aa4 1084068
|
||||
ebx: 0x10074 65652
|
||||
esp: 0x108d40 1084736
|
||||
ebp: 0x108d5c 1084764
|
||||
esi: 0x10074 65652
|
||||
edi: 0xffde 65502
|
||||
eip: 0x102892
|
||||
eflags 0x6
|
||||
cs: 0x8
|
||||
ss: 0x10
|
||||
ds: 0x10
|
||||
es: 0x10
|
||||
fs: 0x10
|
||||
gs: 0x10
|
||||
00108d40 [00108d40] 10231c
|
||||
00108d44 [00108d44] 10a144
|
||||
00108d48 [00108d48] 0010
|
||||
00108d4c [00108d4c] 0021
|
||||
00108d50 [00108d50] 0000
|
||||
00108d54 [00108d54] 0000
|
||||
00108d58 [00108d58] 10a0e0
|
||||
00108d5c [00108d5c] 0000
|
||||
00108d60 [00108d60] 0001
|
||||
00108d64 [00108d64] 0000
|
||||
00108d68 [00108d68] 0000
|
||||
00108d6c [00108d6c] 0000
|
||||
00108d70 [00108d70] 0000
|
||||
00108d74 [00108d74] 0000
|
||||
00108d78 [00108d78] 0000
|
||||
00108d7c [00108d7c] 0000
|
||||
</pre>
|
||||
<li>2548: where will longjmp return? (answer: 10231c, in scheduler)
|
||||
<li>2233:Scheduler on each processor selects in a round-robin fashion the
|
||||
first runnable process. Which process will that be? (If we are
|
||||
running with one processor.) (Ans: proc[0].)
|
||||
<li>2229: what will be saved in cpu's jmpbuf?
|
||||
<li>What is in proc[0]'s jmpbuf?
|
||||
<li>2548: return 1. Stack looks as follows:
|
||||
<pre>
|
||||
CPU 0:
|
||||
eax: 0x1 1
|
||||
ecx: 0x6c65746e 1818588270
|
||||
edx: 0x0 0
|
||||
ebx: 0x10a0e0 1089760
|
||||
esp: 0x210ea0 2166432
|
||||
ebp: 0x210ebc 2166460
|
||||
esi: 0x107f20 1081120
|
||||
edi: 0x107740 1079104
|
||||
eip: 0x102892
|
||||
eflags 0x2
|
||||
cs: 0x8
|
||||
ss: 0x10
|
||||
ds: 0x10
|
||||
es: 0x10
|
||||
fs: 0x10
|
||||
gs: 0x10
|
||||
00210ea0 [00210ea0] 1023cf <--- return to sleep
|
||||
00210ea4 [00210ea4] 108aa4
|
||||
00210ea8 [00210ea8] 10111e
|
||||
00210eac [00210eac] 210ebc
|
||||
00210eb0 [00210eb0] 10239e
|
||||
00210eb4 [00210eb4] 0001
|
||||
00210eb8 [00210eb8] 10a0e0
|
||||
00210ebc [00210ebc] 210edc
|
||||
00210ec0 [00210ec0] 1024ce
|
||||
00210ec4 [00210ec4] 1010101
|
||||
00210ec8 [00210ec8] 1010101
|
||||
00210ecc [00210ecc] 1010101
|
||||
00210ed0 [00210ed0] 107740
|
||||
00210ed4 [00210ed4] 0001
|
||||
00210ed8 [00210ed8] 10cd74
|
||||
00210edc [00210edc] 210f1c
|
||||
</pre>
|
||||
</ul>
|
||||
|
||||
<p>Why switch from proc[0] to the processor stack, and then to
|
||||
proc[0]'s stack? Why not instead run the scheduler on the kernel
|
||||
stack of the last process that ran on that cpu?
|
||||
|
||||
<ul>
|
||||
|
||||
<li>If the scheduler wanted to use the process stack, then it couldn't
|
||||
have any stack variables live across process scheduling, since
|
||||
they'd be different depending on which process just stopped running.
|
||||
|
||||
<li>Suppose process p goes to sleep on CPU1, so CPU1 is idling in
|
||||
scheduler() on p's stack. Someone wakes up p. CPU2 decides to run
|
||||
p. Now p is running on its stack, and CPU1 is also running on the
|
||||
same stack. They will likely scribble on each others' local
|
||||
variables, return pointers, etc.
|
||||
|
||||
<li>The same thing happens if CPU1 tries to reuse the process's page
|
||||
tables to avoid a TLB flush. If the process gets killed and cleaned
|
||||
up by the other CPU, now the page tables are wrong. I think some OSes
|
||||
actually do this (with appropriate ref counting).
|
||||
|
||||
</ul>
|
||||
|
||||
<p>How is preemptive scheduling implemented in xv6?  Answer: see trap.c
|
||||
lines 2905 through 2917, and the implementation of yield() on sheet
|
||||
22.
|
||||
|
||||
<p>How long is a timeslice for a user process? (possibly very short;
|
||||
a very important lock is held across the context switch!)
|
||||
|
||||
</body>
462
web/l-vm.html
Normal file
@ -0,0 +1,462 @@
<html>
|
||||
<head>
|
||||
<title>Virtual Machines</title>
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<h1>Virtual Machines</h1>
|
||||
|
||||
<p>Required reading: Disco</p>
|
||||
|
||||
<h2>Overview</h2>
|
||||
|
||||
<p>What is a virtual machine? IBM definition: a fully protected and
|
||||
isolated copy of the underlying machine's hardware.</p>
|
||||
|
||||
<p>Another view is that it provides another example of a kernel API.
|
||||
In contrast to other kernel APIs (unix, microkernel, and exokernel),
|
||||
the virtual machine operating system exports as the kernel API the
|
||||
processor API (e.g., the x86 interface). Thus, each program running
|
||||
in user space sees the services offered by a processor, and each
|
||||
program sees its own processor. Of course, we don't want to make a
|
||||
system call for each instruction, and in fact one of the main
|
||||
challenges in virtual machine operating systems is to design the
|
||||
system in such a way that the physical processor executes the virtual
|
||||
processor API directly, at processor speed.
|
||||
|
||||
<p>
|
||||
Virtual machines can be useful for a number of reasons:
|
||||
<ol>
|
||||
|
||||
<li>Run multiple operating systems on single piece of hardware. For
|
||||
example, in one process, you run Linux, and in another you run
|
||||
Windows/XP.  If the kernel API is identical to the x86 (and faithfully
emulates x86 instructions, state, protection levels, page tables),
then the virtual machine operating system can
run these <i>guest</i> operating systems without modifications.
|
||||
|
||||
<ul>
|
||||
<li>Run "older" programs on the same hardware (e.g., run one x86
|
||||
virtual machine in real mode to execute old DOS apps).
|
||||
|
||||
<li>Or run applications that require a different operating system.
|
||||
</ul>
|
||||
|
||||
<li>Fault isolation: like processes on UNIX but more complete, because
|
||||
the guest operating system runs on the virtual machine in user space.
Thus, faults in the guest OS cannot affect any other software.
|
||||
|
||||
<li>Customizing the apparent hardware: virtual machine may have
|
||||
different view of hardware than is physically present.
|
||||
|
||||
<li>Simplify deployment/development of software for scalable
|
||||
processors (e.g., Disco).
|
||||
|
||||
</ol>
|
||||
</p>
|
||||
|
||||
<p>If your operating system isn't a virtual machine operating system,
|
||||
what are the alternatives? Processor simulation (e.g., bochs) or
|
||||
binary emulation (WINE). Simulation runs instructions purely in
|
||||
software and is slow (e.g., 100x slow down for bochs); virtualization
|
||||
gets out of the way whenever possible and can be efficient.
|
||||
|
||||
<p>Simulation gives portability whereas virtualization focuses on
|
||||
performance. However, this means that you need to model your hardware
|
||||
very carefully in software.  Binary emulation focuses on just getting
the system-call interface of a particular operating system right.  Binary
emulation can be hard because it is targeted towards a particular
|
||||
operating system (and even that can change between revisions).
|
||||
</p>
|
||||
|
||||
<p>To provide each process with its own virtual processor that exports
|
||||
the same API as the physical processor, what features must
|
||||
the virtual machine operating system virtualize?
|
||||
<ol>
|
||||
<li>CPU: instructions -- trap all privileged instructions</li>
|
||||
<li>Memory: address spaces -- map "physical" pages managed
|
||||
by the guest OS to <i>machine</i> pages, handle translation, etc.</li>
|
||||
<li>Devices: any I/O communication needs to be trapped and passed
|
||||
through/handled appropriately.</li>
|
||||
</ol>
|
||||
</p>
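<p>The memory item above (mapping guest "physical" pages to machine
pages) can be sketched as a lazily filled table; array names and sizes
here are hypothetical, not Disco's:

```c
#define NPAGES 16
#define NOMAP  (-1)

/* Guest "physical" page number -> host machine page number,
 * filled in lazily on first touch. */
int pmap[NPAGES];
int next_machine_page;

void pmap_init(void) {
    next_machine_page = 0;
    for (int i = 0; i < NPAGES; i++)
        pmap[i] = NOMAP;
}

int machine_page_for(int pa) {
    if (pmap[pa] == NOMAP)                  /* first reference: allocate */
        pmap[pa] = next_machine_page++;
    return pmap[pa];
}
```

Repeated references to the same guest physical page must return the
same machine page, which is why the table is consulted before
allocating.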
|
||||
The software that implements the virtualization is typically called
|
||||
the monitor, instead of the virtual machine operating system.
|
||||
|
||||
<p>Virtual machine monitors (VMM) can be implemented in two ways:
|
||||
<ol>
|
||||
<li>Run VMM directly on hardware: like Disco.</li>
|
||||
<li>Run VMM as an application (though still running as root, with
|
||||
integration into OS) on top of a <i>host</i> OS: like VMware. Provides
|
||||
additional hardware support at low development cost in
|
||||
VMM. Intercept CPU-level I/O requests and translate them into
|
||||
system calls (e.g. <code>read()</code>).</li>
|
||||
</ol>
|
||||
</p>
|
||||
|
||||
<p>The three primary functions of a virtual machine monitor are:
|
||||
<ul>
|
||||
<li>virtualize processor (CPU, memory, and devices)
|
||||
<li>dispatch events (e.g., forward page fault trap to guest OS).
|
||||
<li>allocate resources (e.g., divide real memory in some way between
|
||||
the physical memory of each guest OS).
|
||||
</ul>
|
||||
|

<h2>Virtualization in detail</h2>

<h3>Memory virtualization</h3>

<p>
Understanding memory virtualization. Let's consider the MIPS example
from the paper. Ideally, we'd be able to intercept and rewrite all
memory address references (e.g., by intercepting virtual memory
calls). Why can't we do this on the MIPS? (There are addresses that
don't go through address translation --- but we don't want the virtual
machine to directly access memory!) What does Disco do to get around
this problem? (Relink the kernel outside this address space.)
</p>

<p>
Having gotten around that problem, how do we handle things in general?
</p>
<pre>
// Disco's TLB miss handler.
// Called when a memory reference for virtual address
// 'VA' is made, but there is no VA->MA (virtual -> machine)
// mapping in the CPU's TLB.
void tlb_miss_handler (VA)
{
  // see if we have a mapping in our "shadow" tlb (which includes
  // the "main" tlb)
  tlb_entry *t = tlb_lookup (thiscpu->l2tlb, VA);
  if (t && defined (thiscpu->pmap[t->pa])) // is there a MA for this PA?
    tlbwrite (VA, thiscpu->pmap[t->pa], t->otherdata);
  else if (t)
    ; // get a machine page, copy the physical page into it, and tlbwrite
  else
    ; // trap to the virtual CPU/OS's handler
}

// Disco's procedure which emulates the MIPS
// instruction which writes to the TLB.
//
// VA -- virtual address
// PA -- physical address (NOT a MA machine address!)
// otherdata -- perms and stuff
void emulate_tlbwrite_instruction (VA, PA, otherdata)
{
  tlb_insert (thiscpu->l2tlb, VA, PA, otherdata); // cache
  if (!defined (thiscpu->pmap[PA])) { // fill in pmap dynamically
    MA = allocate_machine_page ();
    thiscpu->pmap[PA] = MA; // See 4.2.2
    thiscpu->pmapbackmap[MA] = PA;
    thiscpu->memmap[MA] = VA; // See 4.2.3 (for TLB shootdowns)
  }
  tlbwrite (VA, thiscpu->pmap[PA], otherdata);
}

// Disco's procedure which emulates the MIPS
// instruction which reads the TLB.
tlb_entry *emulate_tlbread_instruction (VA)
{
  // Must return a TLB entry that has a "physical" address;
  // this is recorded in our secondary TLB cache.
  // (We don't have to read from the hardware TLB since
  // all writes to the hardware TLB are mediated by Disco.
  // Thus we can always keep the l2tlb up to date.)
  return tlb_lookup (thiscpu->l2tlb, VA);
}
</pre>

<h3>CPU virtualization</h3>

<p>Requirements:
<ol>
<li>Results of executing non-privileged instructions in privileged and
user mode must be equivalent. (Why? Because the virtual "privileged"
system will not be running in true "privileged" mode.)
<li>There must be a way to protect the VM from the real machine. (Some
sort of memory protection/address translation. For fault isolation.)</li>
<li>There must be a way to detect and transfer control to the VMM when
the VM tries to execute a sensitive instruction (e.g. a privileged
instruction, or one that could expose the "virtualness" of the
VM.) It must be possible to emulate these instructions in
software. Architectures can be classified as completely virtualizable
(i.e. there are protection mechanisms that cause traps for all such
instructions), partly virtualizable (insufficient or incomplete trap
mechanisms), or not virtualizable at all (e.g. no MMU).
</ol>
</p>

<p>The MIPS didn't quite meet the second criterion, as discussed
above. But it does have a supervisor mode, between user mode and
kernel mode, in which any privileged instruction will trap.</p>

<p>What might the VMM trap handler look like?</p>

<pre>
void privilege_trap_handler (addr) {
  instruction, args = decode_instruction (addr)
  switch (instruction) {
  case foo:
    emulate_foo (thiscpu, args, ...);
    break;
  case bar:
    emulate_bar (thiscpu, args, ...);
    break;
  case ...:
    ...
  }
}
</pre>

<p>The <code>emulate_foo</code> bits will have to evaluate the
state of the virtual CPU and compute the appropriate "fake" answer.
</p>

<p>What sort of state is needed in order to appropriately emulate all
of these things?
<pre>
- all user registers
- CPU-specific regs (e.g. on x86, %crN, debugging, FP...)
- page tables (or tlb)
- interrupt tables
</pre>
This is needed for each virtual processor.
</p>
|
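<p>Concretely, this per-virtual-CPU state might be pictured as a C struct. This is only a sketch; the field names and sizes are illustrative and not taken from the paper:</p>

```c
#include <stdint.h>

/* Sketch of per-virtual-CPU state (illustrative names, not Disco's). */
struct vcpu {
    uint32_t regs[8];      /* all user registers */
    uint32_t cr[5];        /* CPU-specific regs, e.g. x86 %cr0..%cr4 */
    uint32_t eflags;       /* virtualized flags word */
    uint32_t cr3;          /* root of the guest's page table */
    uint32_t idt_base;     /* guest's interrupt table */
};

/* Emulation consults the virtualized flags, never the hardware ones:
   the interrupt-enable flag (FL_IF) is bit 9 of EFLAGS on the x86. */
int vcpu_interrupts_enabled(const struct vcpu *v)
{
    return (v->eflags & 0x200) != 0;
}
```

<p>The point of the accessor is that a query like "are interrupts enabled?" is answered from the virtual CPU's saved state, whatever the real EFLAGS happens to be.</p>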

<h3>Device I/O virtualization</h3>

<p>We intercept all communication to the I/O devices: reads/writes to
reserved memory addresses cause page faults into special handlers,
which emulate or pass through I/O as appropriate.
</p>

<p>
In a system like Disco, the sequence would look something like:
<ol>
<li>VM executes an instruction to access I/O.</li>
<li>Trap generated by the CPU (based on memory or privilege protection)
transfers control to the VMM.</li>
<li>VMM emulates the I/O instruction, saving information about where this
came from (for demultiplexing the async reply from the hardware later).</li>
<li>VMM reschedules a VM.</li>
</ol>
</p>

<p>
Interrupts will require some additional work:
<ol>
<li>Interrupt occurs on the real machine, transferring control to the VMM
handler.</li>
<li>VMM determines the VM that ought to receive this interrupt.</li>
<li>VMM causes a simulated interrupt to occur in the VM, and reschedules a
VM.</li>
<li>VM runs its interrupt handler, which may involve other I/O
instructions that need to be trapped.</li>
</ol>
</p>

<p>
The above can be slow! So sometimes you want the guest operating
system to be aware that it is a guest and allow it to avoid the slow
path, via special device drivers or by changing instructions that would
cause traps into memory read/write instructions.
</p>

<h2>Intel x86/vmware</h2>

<p>VMware, unlike Disco, runs as an application on a host OS and
cannot modify the guest OS. Furthermore, it must virtualize the x86
instead of the MIPS processor. Both of these differences make good design
challenges.

<p>The first challenge is that the monitor runs in user space, yet it
must dispatch traps and execute privileged instructions, which
both require kernel privileges. To address this challenge, the
monitor downloads a piece of code, a kernel module, into the host
OS. Most modern operating systems are constructed as a core kernel,
extended with downloadable kernel modules.
Privileged users can insert kernel modules at run-time.

<p>The monitor downloads a kernel module that reads the IDT, copies
it, and overwrites the hard-wired entries with addresses for stubs in
the just-downloaded kernel module. When a trap happens, the kernel
module inspects the PC, and forwards the trap either to the monitor
running in user space or to the guest OS. If the trap is caused
because a guest OS executes a privileged instruction, the monitor can
emulate that privileged instruction by asking the kernel module to
perform that instruction (perhaps after modifying the arguments to
the instruction).

<p>The second challenge is virtualizing the x86
instructions. Unfortunately, the x86 doesn't meet the three requirements for
CPU virtualization. If you run
the CPU in ring 3, <i>most</i> x86 instructions will be fine,
because most privileged instructions will result in a trap, which
can then be forwarded to vmware for emulation. For example,
consider a guest OS loading the root of a page table into CR3. This
results in a trap (the guest OS runs in user space), which is
forwarded to the monitor, which can emulate the load to CR3 as
follows:

<pre>
// addr is a physical address
void emulate_lcr3 (thiscpu, addr)
{
  thiscpu->cr3 = addr;
  Pte *fakepdir = lookup (addr, oldcr3cache);
  if (!fakepdir) {
    fakepdir = ppage_alloc ();
    store (oldcr3cache, addr, fakepdir);
    // May wish to scan through the supplied page directory to see if
    // we have to fix up anything in particular.
    // Exact settings will depend on how we want to handle
    // problem cases below and our own MM.
  }
  asm ("movl fakepdir,%cr3");
  // Must make sure our page fault handler is in sync with what we do here.
}
</pre>

<p>To virtualize the x86, the monitor must intercept any modifications
to the page table and substitute appropriate responses, and update
things like the accessed/dirty bits. The monitor can arrange for this
to happen by making all page-table pages inaccessible so that it can
emulate loads and stores to page-table pages. This setup allows the
monitor to virtualize the memory interface of the x86.</p>

<p>Unfortunately, not all instructions that must be virtualized result
in traps:
<ul>
<li><code>pushf/popf</code>: <code>FL_IF</code> is handled differently,
for example. In user mode, setting FL_IF is just ignored.</li>
<li>Anything (<code>push</code>, <code>pop</code>, <code>mov</code>)
that reads or writes from <code>%cs</code>, which contains the
privilege level.
<li>Setting the interrupt-enable bit in EFLAGS has different
semantics in user space and kernel space. In user space, it
is ignored; in kernel space, the bit is set.
<li>And some others... (in total, 17 instructions).
</ul>
These instructions are unprivileged instructions (i.e., they don't cause a
trap when executed by a guest OS) but expose physical processor state.
They could reveal details of the virtualization that should not be
revealed. For example, if the guest OS sets the interrupt-enable bit for
its virtual x86, the virtualized EFLAGS should reflect that the bit is
set, even though the guest OS is running in user space.

<p>How can we virtualize these instructions? One approach is to decode
the instruction stream that is provided by the user and look for bad
instructions. When we find them, replace them with an interrupt
(<code>INT 3</code>) that will allow the VMM to handle them
correctly. This might look something like:
</p>

<pre>
void initcode () {
  scan_for_nonvirtualizable (thiscpu, 0x7c00);
}

void scan_for_nonvirtualizable (thiscpu, startaddr) {
  addr = startaddr;
  instr = disassemble (addr);
  while (instr is not branch or bad) {
    addr += len (instr);
    instr = disassemble (addr);
  }
  // remember that we wanted to execute this instruction.
  replace (addr, "int 3");
  record (thiscpu->rewrites, addr, instr);
}

void breakpoint_handler (tf) {
  oldinstr = lookup (thiscpu->rewrites, tf->eip);
  if (oldinstr is branch) {
    newcs:neweip = evaluate branch
    scan_for_nonvirtualizable (thiscpu, newcs:neweip)
    return;
  } else { // something non virtualizable
    // dispatch to appropriate emulation
  }
}
</pre>

<p>All pages must be scanned in this way. Fortunately, most pages
are probably okay and don't really need any special handling, so after
scanning them once, we can just remember that the page is okay and let
it run natively.
</p>

<p>What if a guest OS generates instructions, writes them to memory,
and then wants to execute them? We must detect self-modifying code
(e.g. we must simulate buffer-overflow attacks correctly). When a write
occurs to a physical page that happens to be in a code segment, we must
trap the write and then rescan the affected portions of the page.</p>

<p>What about self-examining code? We need to protect it somehow ---
possibly by playing tricks with instruction/data TLB caches, or by
introducing a private segment for code (%cs) that is different from
the segment used for reads/writes (%ds).
</p>

<h2>Some Disco paper notes</h2>

<p>
Disco has some I/O-specific optimizations.
</p>
<ul>
<li>Disk reads only need to happen once and can be shared between
virtual machines via copy-on-write virtual memory tricks.</li>
<li>Network cards do not need to be fully virtualized --- communication
between VMs on the same machine doesn't need a real network card
backing it.</li>
<li>Special handling for NFS so that all VMs "share" a buffer cache.</li>
</ul>

<p>
The Disco developers clearly had access to IRIX source code.
</p>
<ul>
<li>Needed to deal with the KSEG0 segment of MIPS memory by relinking the
kernel at a different address.</li>
<li>Ensuring page alignment of network writes (for the purposes of
doing memory-map tricks).</li>
</ul>

<p>Performance?</p>
<ul>
<li>Evaluated in simulation.</li>
<li>Where are the overheads? Where do they come from?</li>
<li>Does it run better than NUMA IRIX?</li>
</ul>

<p>Premise. Are virtual machines the preferred approach to extending
operating systems? Have scalable multiprocessors materialized?</p>

<h2>Related papers</h2>

<p>John Scott Robin, Cynthia E. Irvine. <a
href="http://www.cs.nps.navy.mil/people/faculty/irvine/publications/2000/VMM-usenix00-0611.pdf">Analysis of the
Intel Pentium's Ability to Support a Secure Virtual Machine
Monitor</a>.</p>

<p>Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim. <a
href="http://www.vmware.com/resources/techresources/530">Virtualizing
I/O Devices on VMware Workstation's Hosted Virtual Machine
Monitor</a>. In Proceedings of the 2001 Usenix Technical Conference.</p>

<p>Kevin Lawton, Drew Northup. <a
href="http://savannah.nongnu.org/projects/plex86">Plex86 Virtual
Machine</a>.</p>

<p>Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris,
Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield. <a
href="http://www.cl.cam.ac.uk/netos/papers/2003-xensosp.pdf">Xen
and the Art of Virtualization</a>. SOSP 2003.</p>

<p>Keith Adams, Ole Agesen. <a
href="http://www.vmware.com/pdf/asplos235_adams.pdf">A Comparison of
Software and Hardware Techniques for x86 Virtualization</a>. ASPLOS 2006.</p>

</body>

</html>

246
web/l-xfi.html
Normal file

@ -0,0 +1,246 @@
<html>
<head>
<title>XFI</title>
</head>
<body>

<h1>XFI</h1>

<p>Required reading: XFI: software guards for system address spaces.

<h2>Introduction</h2>

<p>Problem: how to use untrusted code (an "extension") in a trusted
program?
<ul>
<li>Use an untrusted jpeg codec in a Web browser
<li>Use an untrusted driver in the kernel
</ul>

<p>What are the dangers?
<ul>
<li>No fault isolation: the extension modifies trusted code unintentionally
<li>No protection: the extension causes a security hole
<ul>
<li>Extension has a buffer-overrun problem
<li>Extension calls the trusted program's functions
<li>Extension calls a trusted program's function that it is allowed to
call, but supplies "bad" arguments
<li>Extension calls privileged hardware instructions (when extending
the kernel)
<li>Extension reads data out of the trusted program that it shouldn't.
</ul>
</ul>

<p>Possible solution approaches:
<ul>

<li>Run the extension in its own address space with minimal
privileges. Rely on hardware and operating system protection
mechanisms.

<li>Restrict the language in which the extension is written:
<ul>

<li>Packet filter language. The language is limited in its capabilities,
and it is easy to guarantee "safe" execution.

<li>Type-safe language. The language runtime and compiler guarantee "safe"
execution.
</ul>

<li>Software-based sandboxing.

</ul>

<h2>Software-based sandboxing</h2>

<p>Sandboxer. A compiler or binary rewriter sandboxes all unsafe
instructions in an extension by inserting additional instructions.
For example, every indirect store is preceded by a few instructions
that compute and check the target of the store at runtime.

<p>Verifier. When the extension is loaded into the trusted program, the
verifier checks if the extension is appropriately sandboxed (e.g.,
are all indirect stores sandboxed? does it call any privileged
instructions?). If not, the extension is rejected. If yes, the
extension is loaded and can run. As the extension runs, the
instructions that sandbox unsafe instructions check if each unsafe
instruction is used in a safe way.

<p>The verifier must be trusted, but the sandboxer need not be. We can
do without the verifier if the trusted program can establish that the
extension has been sandboxed by a trusted sandboxer.

<p>The paper refers to this setup as an instance of proof-carrying code.

<h2>Software fault isolation</h2>

<p><a href="http://citeseer.ist.psu.edu/wahbe93efficient.html">SFI</a>
by Wahbe et al. explored how to use sandboxing for fault-isolating
extensions; that is, using sandboxing to ensure that stores and jumps
stay within a specified memory range (i.e., they don't overwrite or
jump into addresses in the trusted program unchecked). They
implemented SFI for a RISC processor, which simplifies things, since
memory can be written only by store instructions (other instructions
modify registers). In addition, they assumed that there were plenty
of registers, so that a few can be dedicated to sandboxing code.

<p>The extension is loaded into a specific range (called a segment)
within the trusted application's address space. The segment is
identified by the upper bits of the addresses in the
segment. Separate code and data segments are necessary to prevent an
extension from overwriting its code.

<p>An unsafe instruction on the MIPS is an instruction that jumps or
stores to an address that cannot be statically verified to be within
the correct segment. Most control-transfer operations, such as
program-counter-relative branches, can be statically verified. Stores to
static variables often use an immediate addressing mode and can be
statically verified. Indirect jumps and indirect stores are unsafe.

<p>To sandbox those instructions the sandboxer could generate the
following code for each unsafe instruction:
<pre>
  DR0 <- target address
  R0 <- DR0 >> shift-register;  // load in R0 the segment ID of the target
  CMP R0, segment-register;     // compare it to this segment's ID
  BNE fault-isolation-error     // if not equal, branch to trusted error code
  STORE using DR0
</pre>
In this code, DR0, shift-register, and segment-register
are <i>dedicated</i>: they cannot be used by the extension code. The
verifier must check that the extension doesn't use these registers. R0
is a scratch register, but doesn't have to be dedicated. The
dedicated registers are necessary, because otherwise the extension could
load DR0 and jump to the STORE instruction directly, skipping the
check.

<p>This implementation costs 4 registers, and 4 additional instructions
for each unsafe instruction. One could do better, however:
<pre>
  DR0 <- target address & and-mask-register  // mask segment ID from target
  DR0 <- DR0 | segment-register              // insert this segment's ID
  STORE using DR0
</pre>
This code just sets the right segment-ID bits. It doesn't catch
illegal addresses; it just ensures that illegal addresses stay within
the segment, harming the extension but no other code. Even if the
extension jumps to the second instruction of this sandbox sequence,
nothing bad will happen (because DR0 will already contain the correct
segment ID).
||||
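<p>The arithmetic of this two-instruction sandbox can be sketched in C, assuming (for illustration only) a 32-bit address space whose top 8 bits are the segment ID:</p>

```c
#include <stdint.h>

#define SEG_SHIFT 24                /* segment ID lives in the top 8 bits */
#define AND_MASK  0x00FFFFFFu       /* value held in the and-mask register */

/* Force the target's segment-ID bits to this segment's ID, exactly as
   the two dedicated-register instructions do.  No comparison, no branch:
   an illegal address is silently redirected into the segment. */
uint32_t sandbox_store_addr(uint32_t target, uint32_t segment_id)
{
    uint32_t dr0 = target & AND_MASK;   /* mask segment ID from target */
    dr0 |= segment_id << SEG_SHIFT;     /* insert this segment's ID    */
    return dr0;                         /* address actually stored to  */
}
```

<p>Note that an address already inside the segment passes through unchanged, while any other address is coerced into the segment rather than rejected.</p>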
|
||||
<p>Optimizations include:
|
||||
<ul>
|
||||
<li>use guard zones for <i>store value, offset(reg)</i>
|
||||
<li>treat SP as dedicated register (sandbox code that initializes it)
|
||||
<li>etc.
|
||||
</ul>

<h2>XFI</h2>

<p>XFI extends SFI in several ways:
<ul>
<li>Handles fault isolation and protection
<li>Uses control-flow integrity (CFI) to get good performance
<li>Doesn't use dedicated registers
<li>Uses two stacks (a scoped stack and an allocation stack); only the
allocation stack can be corrupted by buffer-overrun attacks. The
scoped stack cannot be corrupted via computed memory references.
<li>Uses a binary rewriter.
<li>Works for the x86
</ul>

<p>The x86 is challenging because of its limited registers and
variable-length instructions. The SFI technique won't work with the x86
instruction set. For example, if the binary contains:
<pre>
  25 CD 80 00 00   # AND eax, 0x80CD
</pre>
and an adversary can arrange to jump to the second byte, then the
adversary can invoke a system call on Linux, because the system-call
instruction INT 0x80 has the binary representation CD 80. Thus, XFI
must control execution flow.
|
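<p>The overlap is easy to see by scanning the instruction's bytes: the 32-bit immediate 0x80CD is stored little-endian, so the byte pair CD 80 begins at offset 1 inside the 5-byte instruction. A small check (illustrative, not from the paper):</p>

```c
/* The 5 bytes of "AND eax, 0x80CD" as laid out in memory:
   opcode 0x25, then the little-endian immediate CD 80 00 00. */
static const unsigned char insn[] = {0x25, 0xCD, 0x80, 0x00, 0x00};

/* Return the offset of an embedded INT 0x80 (CD 80), or -1 if none. */
int find_int80(const unsigned char *p, int len)
{
    for (int i = 0; i + 1 < len; i++)
        if (p[i] == 0xCD && p[i + 1] == 0x80)
            return i;
    return -1;
}
```

<p>A scanner that only looks at instruction boundaries would never see the INT 0x80; an adversary who controls the jump target does.</p>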

<p>XFI policy goals:
<ul>
<li>Memory-access constraints (like SFI)
<li>Interface restrictions (the extension has fixed entry and exit points)
<li>Scoped-stack integrity (the calling stack is well formed)
<li>Simplified instruction semantics (remove dangerous instructions)
<li>System-environment integrity (ensure certain machine-model
invariants, e.g. that the x86 flags register cannot be modified)
<li>Control-flow integrity: execution must follow a static, expected
control-flow graph (enter at the beginning of basic blocks)
<li>Program-data integrity (certain global variables in the extension
cannot be accessed via computed memory addresses)
</ul>

<p>The binary rewriter inserts guards to ensure these properties. The
verifier checks that the appropriate guards are in place. The primary
mechanisms used are:
<ul>
<li>CFI guards on computed control-flow transfers (see figure 2)
<li>Two stacks
<li>Guards on computed memory accesses (see figure 3)
<li>A module header with a section that contains access permissions for
each region
<li>A binary rewriter, which performs intra-procedure analysis and
generates guards, code for stack use, and verification hints
<li>A verifier that checks specific conditions per basic block: hints
specify the verification state at the entry to each basic block, and at
the exit of a basic block the verifier checks that the final state
implies the verification state at entry to all possible successor basic
blocks. (see figure 4)
</ul>

<p>Can XFI protect against the attack discussed in the last lecture?
<pre>
unsigned int j;
p = (unsigned char *)s->init_buf->data;
j = *(p++);
s->session->session_id_length = j;
memcpy(s->session->session_id, p, j);
</pre>
Where will <i>j</i> be located?

<p>How about the following one from the paper <a href="http://research.microsoft.com/users/jpincus/beyond-stack-smashing.pdf"><i>Beyond stack smashing:
recent advances in exploiting buffer overruns</i></a>?
<pre>
void f2b(void *arg, size_t len) {
  char buf[100];
  long val = ..;
  long *ptr = ..;
  extern void (*f)();

  memcpy(buf, arg, len);
  *ptr = val;
  f();
  ...
  return;
}
</pre>
What code can <i>(*f)()</i> call? Code that the attacker inserted?
Code in libc?

<p>How about an attack that uses <i>ptr</i> in the above code to
overwrite a method's address in a class's dispatch table with the
address of a support function?

<p>How about <a href="http://research.microsoft.com/~shuochen/papers/usenix05data_attack.pdf">data-only attacks</a>? For example, the attacker
overwrites <i>pw_uid</i> in the heap with 0 before the following
code executes (when downloading /etc/passwd and then uploading it with a
modified entry).
<pre>
FILE *getdatasock( ... ) {
  seteuid(0);
  setsockopt( ... );
  ...
  seteuid(pw->pw_uid);
  ...
}
</pre>

<p>How much does XFI slow down applications? How many more
instructions are executed? (see Tables 1-4)

</body>

288
web/l1.html
Normal file

@ -0,0 +1,288 @@
<html>
<head>
<title>L1</title>
</head>
<body>

<h1>OS overview</h1>

<h2>Overview</h2>

<ul>
<li>Goal of course:

<ul>
<li>Understand operating systems in detail by designing and
implementing a minimal OS
<li>Hands-on experience with building systems ("Applying 6.033")
</ul>

<li>What is an operating system?
<ul>
<li>a piece of software that turns the hardware into something useful
<li>layered picture: hardware, OS, applications
<li>Three main functions: fault-isolate applications, abstract hardware,
manage hardware
</ul>

<li>Examples:
<ul>
<li>OS-X, Windows, Linux, *BSD, ... (desktop, server)
<li>PalmOS, Windows/CE (PDA)
<li>Symbian, JavaOS (cell phones)
<li>VxWorks, pSOS (real-time)
<li>...
</ul>

<li>OS Abstractions
<ul>
<li>processes: fork, wait, exec, exit, kill, getpid, brk, nice, sleep,
trace
<li>files: open, close, read, write, lseek, stat, sync
<li>directories: mkdir, rmdir, link, unlink, mount, umount
<li>users + security: chown, chmod, getuid, setuid
<li>interprocess communication: signals, pipe
<li>networking: socket, accept, snd, recv, connect
<li>time: gettimeofday
<li>terminal
</ul>

<li>Sample Unix system calls (mostly POSIX)
<ul>
<li> int read(int fd, void*, int)
<li> int write(int fd, void*, int)
<li> off_t lseek(int fd, off_t, int [012])
<li> int close(int fd)
<li> int fsync(int fd)
<li> int open(const char*, int flags [, int mode])
<ul>
<li> O_RDONLY, O_WRONLY, O_RDWR, O_CREAT
</ul>
<li> mode_t umask(mode_t cmask)
<li> int mkdir(char *path, mode_t mode)
<li> DIR *opendir(char *dirname)
<li> struct dirent *readdir(DIR *dirp)
<li> int closedir(DIR *dirp)
<li> int chdir(char *path)
<li> int link(char *existing, char *new)
<li> int unlink(char *path)
<li> int rename(const char*, const char*)
<li> int rmdir(char *path)
<li> int stat(char *path, struct stat *buf)
<li> int mknod(char *path, mode_t mode, dev_t dev)
<li> int fork()
<ul>
<li> returns child PID in parent, 0 in child; the only
difference
</ul>
<li> int getpid()
<li> int waitpid(int pid, int* stat, int opt)
<ul>
<li> pid==-1: any; opt==0||WNOHANG
<li> returns pid or error
</ul>
<li> void _exit(int status)
<li> int kill(int pid, int signal)
<li> int sigaction(int sig, struct sigaction *, struct sigaction *)
<li> int sleep(int sec)
<li> int execve(char* prog, char** argv, char** envp)
<li> void *sbrk(int incr)
<li> int dup2(int oldfd, int newfd)
<li> int fcntl(int fd, F_SETFD, int val)
<li> int pipe(int fds[2])
<ul>
<li> writes on fds[1] will be read on fds[0]
<li> when the last fds[1] is closed, a read on fds[0] returns EOF
<li> when the last fds[0] is closed, a write on fds[1] kills with
SIGPIPE / fails with EPIPE
</ul>
<li> int fchown(int fd, uid_t owner, gid_t group)
<li> int fchmod(int fd, mode_t mode)
<li> int socket(int domain, int type, int protocol)
<li> int accept(int socket_fd, struct sockaddr*, int* namelen)
<ul>
<li> returns a new fd
</ul>
<li> int listen(int fd, int backlog)
<li> int connect(int fd, const struct sockaddr*, int namelen)
<li> void* mmap(void* addr, size_t len, int prot, int flags, int fd,
off_t offset)
<li> int munmap(void* addr, size_t len)
<li> int gettimeofday(struct timeval*)
</ul>
</ul>

<p>See the <a href="../reference.html">reference page</a> for links to
the early Unix papers.

<h2>Class structure</h2>

<ul>
<li>Lab: minimal OS for x86 in an exokernel style (50%)
<ul>
<li>kernel interface: hardware + protection
<li>libOS implements fork, exec, pipe, ...
<li>applications: file system, shell, ...
<li>development environment: gcc, bochs
<li>lab 1 is out
</ul>

<li>Lecture structure (20%)
<ul>
<li>homework
<li>45min lecture
<li>45min case study
</ul>

<li>Two quizzes (30%)
<ul>
<li>mid-term
<li>final's exam week
</ul>

</ul>

<h2>Case study: the shell (simplified)</h2>

<ul>
<li>interactive command execution and a programming language
<li>Nice example that uses various OS abstractions. See the <a
href="../readings/ritchie74unix.pdf">Unix
paper</a> if you are unfamiliar with the shell.
<li>The final lab is a simple shell.
<li>Basic structure:
<pre>

while (1) {
  printf ("$");
  readcommand (command, args);  // parse user input
  if ((pid = fork ()) == 0) {   // child?
    exec (command, args, 0);
  } else if (pid > 0) {         // parent?
    wait (0);                   // wait for child to terminate
  } else {
    perror ("Failed to fork\n");
  }
}
</pre>
<p>The split of creating a process with a new program into fork and exec
is mostly a historical accident. See the <a
href="../readings/ritchie79evolution.html">assigned paper</a> for today.
<li>Example:
<pre>
$ ls
</pre>
<li>Why call "wait"? To wait for the child to terminate and collect
its exit status. (If the child finishes, the child becomes a zombie
until the parent calls wait.)

<li>I/O: file descriptors. A child inherits open file descriptors
from its parent. By convention:
<ul>
<li>file descriptor 0 for input (e.g., keyboard). read_command:
<pre>
read (0, buf, bufsize)
</pre>
<li>file descriptor 1 for output (e.g., terminal)
<pre>
write (1, "hello\n", strlen("hello\n"))
</pre>
<li>file descriptor 2 for error (e.g., terminal)
</ul>

<li>How does the shell implement:
<pre>
$ ls > tmp1
</pre>
Just before the exec, insert:
<pre>
close (1);
fd = open ("tmp1", O_CREAT|O_WRONLY);  // fd will be 1!
</pre>
<p>The kernel will return the first free file descriptor, 1 in this case.
<li>How does the shell implement sharing an output file:
<pre>
$ ls 2> tmp1 > tmp1
</pre>
Replace the last code with:
<pre>
close (1);
close (2);
fd1 = open ("tmp1", O_CREAT|O_WRONLY);  // fd1 will be 1!
fd2 = dup (fd1);                        // fd2 will be 2
</pre>
Both file descriptors share the same offset.
<li>How do programs communicate?
<pre>
$ sort file.txt | uniq | wc
</pre>
or
<pre>
$ sort file.txt > tmp1
$ uniq tmp1 > tmp2
$ wc tmp2
$ rm tmp1 tmp2
</pre>
or
<pre>
$ kill -9
</pre>
<li>A pipe is a one-way communication channel. Here is an example
where the parent is the writer and the child is the reader:
<pre>
int fdarray[2];
char buf[MAXBUF];
int n;

if (pipe(fdarray) < 0) panic ("error");
if ((pid = fork()) < 0) panic ("error");
else if (pid > 0) {        // parent: writer
  close(fdarray[0]);
  write(fdarray[1], "hello world\n", 12);
} else {                   // child: reader
  close(fdarray[1]);
  n = read (fdarray[0], buf, MAXBUF);
  write (1, buf, n);
}
</pre>
<li>How does the shell implement pipelines (i.e., cmd1 | cmd2 | ...)?
We want to arrange that the output of cmd1 is the input of cmd2.
The way to achieve this goal is to manipulate stdout and stdin.
<li>The shell creates a process for each command in
the pipeline, hooks up their stdin and stdout correctly,
and waits for the last process of the
pipeline to exit. A sketch of the core modifications to our shell for
setting up a pipe is:
<pre>
int fdarray[2];

if (pipe(fdarray) < 0) panic ("error");
if ((pid = fork ()) == 0) {   // child (left end of pipe)
  close (1);
  tmp = dup (fdarray[1]);     // fdarray[1] is the write end, tmp will be 1
  close (fdarray[0]);         // close read end
  close (fdarray[1]);         // close the original write descriptor
  exec (command1, args1, 0);
} else if (pid > 0) {         // parent (right end of pipe)
  close (0);
  tmp = dup (fdarray[0]);     // fdarray[0] is the read end, tmp will be 0
  close (fdarray[0]);
  close (fdarray[1]);         // close write end
  exec (command2, args2, 0);
} else {
  printf ("Unable to fork\n");
}
</pre>
<li>Why close the read end and the write end? Multiple reasons: it maintains
the convention that every process starts with 3 file descriptors, and
reading from an empty pipe blocks the reader, while reading from a pipe
whose write end is closed returns end of
file.
<li>How do you run background jobs?
<pre>
$ compute &
</pre>
<li>How does the shell implement "&", backgrounding? (Don't call wait
immediately.)
<li>More details in the shell lecture later in the term.
</ul>

</body>
</html>
245
web/l13.html
Normal file
<html>
<head>
<title>High-performance File Systems</title>
</head>
<body>
<h1>High-performance File Systems</h1>

<p>Required reading: soft updates.

<h2>Overview</h2>

<p>A key problem in designing file systems is how to obtain good
performance on file system operations while providing consistency.
By consistency, we mean that file system invariants are maintained
on disk. These invariants include, for example, that if a file is created, it
appears in its directory. If the file system data structures are
consistent, then it is possible to rebuild the file system to a
correct state after a failure.
<p>To ensure consistency of on-disk file system data structures,
modifications to the file system must respect certain rules:
<ul>

<li>Never point to a structure before it is initialized. An inode must
be initialized before a directory entry references it. A block must
be initialized before an inode references it.

<li>Never reuse a structure before nullifying all pointers to it. An
inode's pointer to a disk block must be reset before the file system can
reallocate the disk block.

<li>Never reset the last pointer to a live structure before a new
pointer is set. When renaming a file, the file system should not
remove the old name for an inode until after the new name has been
written.
</ul>
The paper calls these dependencies update dependencies.
<p>xv6 ensures these rules by writing every block synchronously, and
by ordering the writes appropriately. By synchronous, we mean
that a process waits until the current disk write has
completed before continuing with execution.
<ul>

<li>What happens if power fails after line 4776 in mknod1? Is the
inode lost forever? No: a separate program (called fsck)
can rebuild the disk structures correctly and can return the inode to
the free list.

<li>Does the order of writes in mknod1 matter? Say we wrote the
directory entry first and then wrote the allocated inode to disk.
This violates the update rules and is not a good plan: if a
failure happens after the directory write, then on recovery we have
a directory pointing to an unallocated inode, which may now be
allocated by another process for another file!

<li>Can we turn the writes (i.e., the ones invoked by iupdate and
wdir) into delayed writes without creating problems? No, because
the cache might write them back to the disk in an incorrect order.
It has no information to decide in what order to write them.

</ul>
<p>xv6 is a nice example of the tension between consistency and
performance. To get consistency, xv6 uses synchronous writes,
but these writes are slow, because they proceed at the rate of a
seek instead of the rate of the maximum data transfer. The
bandwidth to a disk is reasonably high for large transfers (around
50 MB/s), but latency is high, because of the cost of moving the
disk arm(s) (the seek latency is about 10 ms).
<p>This tension is an implementation-dependent one. The Unix API
doesn't require that writes be synchronous. Updates don't have to
appear on disk until a sync, fsync, or open with O_SYNC. Thus, in
principle, the UNIX API allows delayed writes, which are good for
performance:
<ul>
<li>Batch many small writes into a big one, written at the disk data
rate.
<li>Absorb multiple writes to the same block.
<li>Schedule writes to avoid seeks.
</ul>

<p>Thus the question: how to delay writes and achieve consistency?
The paper provides an answer.
<h2>This paper</h2>

<p>The paper surveys some of the existing techniques and introduces a
new one to achieve the goal of performance and consistency.

<p>Techniques possible:
<ul>

<li>Equip the system with NVRAM, and put the buffer cache in NVRAM.

<li>Logging. Often used in UNIX file systems for metadata updates.
LFS is an extreme version of this strategy.

<li>Flusher-enforced ordering. All writes are delayed. The flusher
is aware of dependencies between blocks, but by itself this doesn't
work, because circular dependencies must be broken by writing blocks
out synchronously.

</ul>
<p>Soft updates is the solution explored in this paper. It doesn't
require NVRAM, and performs as well as the naive strategy of keeping all
dirty blocks in main memory. Compared to logging, it is unclear whether
soft updates is better. The default BSD file system uses soft
updates, but most Linux file systems use logging.

<p>Soft updates is a sophisticated variant of flusher-enforced
ordering. Instead of maintaining dependencies at the block level, it
maintains dependencies at the file-structure level (per inode, per
directory, etc.), reducing circular dependencies. Furthermore, it
breaks any remaining circular dependencies by undoing changes before
writing the block and then redoing them after the write.
<p>Pseudocode for create:
<pre>
create (f) {
  allocate inode in block i   (assuming an inode is available)
  add i to directory data block d   (assuming d has space)
  mark d as dependent on i, and create undo/redo record
  update directory inode in block di
  mark di as dependent on d
}
</pre>
<p>Pseudocode for the flusher:
<pre>
flushblock (b)
{
  lock b;
  for all dependencies that b is relying on
    "remove" that dependency by undoing the change to b
    mark the dependency as "unrolled"
  write b
}

write_completed (b) {
  remove dependencies that depend on b
  reapply "unrolled" dependencies that b depended on
  unlock b
}
</pre>
<p>Apply the flush algorithm to the example:
<ul>
<li>A list of two dependencies: directory->inode, inode->directory.
<li>Let's say the syncer picks the directory block first
<li>Undo the directory->inode changes (i.e., unroll &lt;A,#4&gt;)
<li>Write the directory block
<li>Remove met dependencies (i.e., remove the inode->directory dependency)
<li>Perform the redo operation (i.e., redo &lt;A,#4&gt;)
<li>Select the inode block and write it
<li>Remove met dependencies (i.e., remove the directory->inode dependency)
<li>Select the directory block (it is dirty again!)
<li>Write it.
</ul>
<p>A file operation that is important for file-system consistency
is rename. Rename conceptually works as follows:
<pre>
rename (from, to)
  unlink (to);
  link (from, to);
  unlink (from);
</pre>
<p>Rename is often used by programs to make a new version of a file
the current version. Committing to a new version must happen
atomically. Unfortunately, without transaction-like support,
atomicity is impossible to guarantee, so a typical file system
provides weaker semantics for rename: if "to" already exists, an
instance of "to" will always exist, even if the system should crash in
the middle of the operation. Does the above implementation of rename
guarantee this semantics? (Answer: no.)
<p>If rename is implemented as unlink, link, unlink, then it is
difficult to guarantee even the weak semantics. Modern UNIXes provide
rename as a file system call:
<pre>
update dir block for "to" to point to from's inode   // write block
update dir block for "from" to free its entry        // write block
</pre>
<p>fsck may need to correct refcounts in the inode if the file
system fails during rename. For example, a crash after the first
write followed by fsck should set the refcount to 2, since both "from"
and "to" point at the inode.
<p>This semantics is sufficient, however, for an application to ensure
atomicity. Before the call, there is a "from" and perhaps a "to". If the
call is successful, following the call there is only a "to". If there
is a crash, there may be both a "from" and a "to", in which case the
caller knows the previous attempt failed and must retry. The
subtlety is that if you now follow the two links, the "to" name may
link to either the old file or the new file. If it links to the new
file, that means that there was a crash and you just detected that the
rename operation was composite. On the other hand, the retry
procedure can be the same for either case (do the rename again), so it
isn't necessary to discover how it failed. The function follows the
golden rule of recoverability, and it is idempotent, so it lays all
the needed groundwork for use as part of a true atomic action.
<p>With soft updates, rename becomes:
<pre>
rename (from, to) {
  i = namei(from);
  add to "to" directory data block td a reference to inode i
  mark td as dependent on block i
  update "to" directory inode in block tdi
  mark tdi as dependent on td
  remove from "from" directory data block fd the reference to inode i
  mark fd as dependent on tdi
  update "from" directory inode in block fdi
  mark fdi as dependent on fd
}
</pre>
<p>No synchronous writes!
<p>What needs to be done on recovery? (Inspect every statement in
rename and see what inconsistencies could exist on the disk; e.g.,
the inode's refcnt could be too high.) None of these inconsistencies
requires fixing before the file system can operate; they can be fixed by a
background file system repairer.
<h2>Paper discussion</h2>

<p>Do soft updates perform any useless writes? (A useless write is a
write that will be immediately overwritten.) (Answer: yes.) Fix the
syncer to be careful about which block it starts with. Fix cache
replacement to select the LRU block with no pending dependencies.

<p>Can a log-structured file system implement rename better? (Answer:
yes, since it can get the refcnts right.)

<p>Discuss all graphs.

</body>
</html>
247
web/l14.txt
Normal file
Why am I lecturing about Multics?
  Origin of many ideas in today's OSes
  Motivated UNIX design (often in opposition)
  Motivated x86 VM design
  This lecture is really "how Intel intended x86 segments to be used"

Multics background
  design started in 1965
  very few interactive time-shared systems then: CTSS
  design first, then implementation
  system stable by 1969
  so pre-dates UNIX, which started in 1969
  ambitious, many years, many programmers, MIT+GE+BTL

Multics high-level goals
  many users on same machine: "time sharing"
  perhaps commercial services sharing the machine too
  remote terminal access (but no recognizable data networks: wired or phone)
  persistent reliable file system
  encourage interaction between users
  support joint projects that share data &c
  control access to data that should not be shared

Most interesting aspect of design: memory system
  idea: eliminate memory / file distinction
  file i/o uses LD / ST instructions
  no difference between memory and disk files
  just jump to start of file to run program
  enhances sharing: no more copying files to private memory
  this seems like a really neat simplification!
GE 645 physical memory system
  24-bit phys addresses
  36-bit words
  so up to 75 megabytes of physical memory!!!
  but no-one could afford more than about a megabyte

[per-process state]
  DBR
  DS, SDW (== address space)
  KST
  stack segment
  per-segment linkage segments

[global state]
  segment content pages
  per-segment page tables
  per-segment branch in directory segment
  AST
645 segments (simplified for now, no paging or rings)
  descriptor base register (DBR) holds phys addr of descriptor segment (DS)
  DS is an array of segment descriptor words (SDW)
  SDW: phys addr, length, r/w/x, present
  CPU has pairs of registers: 18-bit offset, 18-bit segment #
    five pairs (PC, arguments, base, linkage, stack)
  early Multics limited each segment to 2^16 words
    thus there are lots of them, intended to correspond to program modules
  note: cannot directly address phys mem (18 vs 24)
  645 segments are a lot like the x86!

645 paging
  DBR and SDW actually contain phys addr of 64-entry page table
  each page is 1024 words
  PTE holds phys addr and present flag
  no permission bits, so you really need to use the segments, not like JOS
  no per-process page table, only per-segment
    so all processes using a segment share its page table and phys storage
    makes sense assuming segments tend to be shared
  paging environment doesn't change on process switch
Multics processes
  each process has its own DS
  Multics switches DBR on context switch
  different processes typically have different numbers for the same segment
how to use segments to unify memory and file system?
  don't want to have to use 18-bit seg numbers as file names
  we want to write programs using symbolic names
  names should be hierarchical (for users)
    so users can have directories and sub-directories
    and path names

Multics file system
  tree structure, directories and files
  each file and directory is a segment
  dir seg holds array of "branches"
    name, length, ACL, array of block #s, "active"
  unique ROOT directory
  path names: ROOT > A > B
  note there are no inodes, thus no i-numbers
    so "real name" for a file is the complete path name
    o/s tables have path name where unix would have i-number
    presumably makes renaming and removing active files awkward
  no hard links
how does a program refer to a different segment?
  inter-segment variables contain symbolic segment name
  A$E refers to segment A, variable/function E
  what happens when segment B calls function A$E(1, 2, 3)?

when compiling B:
  compiler actually generates *two* segments
    one holds B's instructions
    one holds B's linkage information
  initial linkage entry:
    name of segment e.g. "A"
    name of symbol e.g. "E"
    valid flag
  CALL instruction is indirect through entry i of linkage segment
  compiler marks entry i invalid
  [storage for strings "A" and "E" really in segment B, not linkage seg]

when a process is executing B:
  two segments in DS: B and a *copy* of B's linkage segment
  CPU linkage register always points to current segment's linkage segment
  call A$E is really a call indirect via linkage[i]
  faults because linkage[i] is invalid
  o/s fault handler
    looks up segment name for i ("A")
    searches path in file system for segment "A" (cwd, library dirs)
    if not already in use by some process (branch active flag and AST know):
      allocate page table and pages
      read segment A into memory
    if not already in use by *this* process (KST knows):
      find free SDW j in process DS, make it refer to A's page table
      set up r/w/x based on process's user and file ACL
      also set up copy of A's linkage segment
    search A's symbol table for "E"
    linkage[i] := j / address(E)
    restart B
  now the CALL works via linkage[i]
  and subsequent calls are fast

how does A get the correct linkage register?
  the right value cannot be embedded in A, since shared among processes
  so CALL actually goes to instructions in A's linkage segment
    load current seg# into linkage register, jump into A
    one set of these per procedure in A
all memory / file references work this way
  as if pointers were really symbolic names
  segment # is really a transparent optimization
  linking is "dynamic"
    programs contain symbolic references
    resolved only as needed -- if/when executed
  code is shared among processes
  was program data shared?
    probably most variables not shared (on stack, in private segments)
    maybe a DB would share a data segment, w/ synchronization
  file data:
    probably one at a time (locks) for read/write
    read-only is easy to share

filesystem / segment implications
  programs start slowly due to dynamic linking
  creat(), unlink(), &c are outside of this model
  store beyond end extends a segment (== appends to a file)
  no need for buffer cache! no need to copy into user space!
    but no buffer cache => ad-hoc caches e.g. active segment table
  when are dirty segments written back to disk?
    only in page eviction algorithm, when free pages are low
    database careful ordered writes? e.g. log before data blocks?
      I don't know, probably separate flush system calls
how does the shell work?
  you type a program name
  the shell just CALLs that program, as a segment!
  dynamic linking finds program segment and any library segments it needs
  the program eventually returns, e.g. with RET
  all this happens inside the shell process's address space
  no fork, no exec
  a buggy program can crash the shell! e.g. scribble on stack
  process creation was too slow to give each program its own process

how valuable is the sharing provided by segment machinery?
  is it critical to users sharing information?
  or is it just there to save memory and copying?
how does the kernel fit into all this?
  kernel is a bunch of code modules in segments (in file system)
  a process dynamically loads in the kernel segments that it uses
    so kernel segments have different numbers in different processes
    a little different from separate kernel "program" in JOS or xv6
  kernel shares process's segment# address space
    thus easy to interpret seg #s in system call arguments
  kernel segment ACLs in file system restrict write
    so mapped non-writeable into processes

how to call the kernel?
  very similar to the Intel x86
  8 rings. users at 4. core kernel at 0.
  CPU knows current execution level
  SDW has max read/write/execute levels
  call gate: lowers ring level, but only at designated entry
  stack per ring, incoming call switches stacks
  inner ring can always read arguments, write results
  problem: checking validity of arguments to system calls
    don't want user to trick kernel into reading/writing the wrong segment
    you have this problem in JOS too
    later Multics CPUs had hardware to check argument references
are Multics rings a general-purpose protected subsystem facility?
  example: protected game implementation
    protected so that users cannot cheat
    put game's code and data in ring 3
  BUT what if I don't trust the author?
    or if I've already put some other subsystem in ring 3?
    a ring has full power over itself and outer rings: you must trust
  today: user/kernel, server processes and IPC
    pro: protection among mutually suspicious subsystems
    con: no convenient sharing of address spaces
UNIX vs Multics
  UNIX was less ambitious (e.g. no unified mem/FS)
  UNIX hardware was small
  just a few programmers, all in the same room
  evolved rather than pre-planned
  quickly self-hosted, so they got experience earlier

What did UNIX inherit from MULTICS?
  a shell at user level (not built into kernel)
  a single hierarchical file system, with subdirectories
  controlled sharing of files
  written in high-level language, self-hosted development

What did UNIX reject from MULTICS?
  files look like memory
    instead, unifying idea is file descriptor and read()/write()
    memory is a totally separate resource
  dynamic linking
    instead, static linking at compile time; every binary had a copy of libraries
  segments and sharing
    instead, single linear address space per process, like xv6
    (but shared libraries brought these back, just for efficiency, in 1980s)
  hierarchical rings of protection
    instead, simpler user/kernel
    for subsystems: setuid, then client/server and IPC

The most useful sources I found for late-1960s Multics VM:
1. Bensoussan, Clingen, Daley, "The Multics Virtual Memory: Concepts
   and Design," CACM 1972 (segments, paging, naming segments, dynamic
   linking).
2. Daley and Dennis, "Virtual Memory, Processes, and Sharing in Multics,"
   SOSP 1967 (more details about dynamic linking and CPU).
3. Graham, "Protection in an Information Processing Utility,"
   CACM 1968 (brief account of rings and gates).
1412
web/l19.txt
Normal file
File diff suppressed because it is too large
494
web/l2.html
Normal file
<html>
<head>
<title>L2</title>
</head>
<body>

<h1>6.828 Lecture Notes: x86 and PC architecture</h1>

<h2>Outline</h2>
<ul>
<li>PC architecture
<li>x86 instruction set
<li>gcc calling conventions
<li>PC emulation
</ul>
<h2>PC architecture</h2>

<ul>
<li>A full PC has:
<ul>
<li>an x86 CPU with registers, execution unit, and memory management
<li>CPU chip pins include address and data signals
<li>memory
<li>disk
<li>keyboard
<li>display
<li>other resources: BIOS ROM, clock, ...
</ul>

<li>We will start with the original 16-bit 8086 CPU (1978)
<li>CPU runs instructions:
<pre>
for(;;){
  run next instruction
}
</pre>

<li>Needs work space: registers
<ul>
<li>four 16-bit data registers: AX, CX, DX, BX
<li>each in two 8-bit halves, e.g. AH and AL
<li>very fast, very few
</ul>
<li>More work space: memory
<ul>
<li>CPU sends out address on address lines (wires, one bit per wire)
<li>Data comes back on data lines
<li><i>or</i> data is written to data lines
</ul>

<li>Add address registers: pointers into memory
<ul>
<li>SP - stack pointer
<li>BP - frame base pointer
<li>SI - source index
<li>DI - destination index
</ul>

<li>Instructions are in memory too!
<ul>
<li>IP - instruction pointer (PC on PDP-11, everything else)
<li>increment after running each instruction
<li>can be modified by CALL, RET, JMP, conditional jumps
</ul>
<li>Want conditional jumps
<ul>
<li>FLAGS - various condition codes
<ul>
<li>whether last arithmetic operation overflowed
<li> ... was positive/negative
<li> ... was [not] zero
<li> ... carry/borrow on add/subtract
<li> ... overflow
<li> ... etc.
<li>whether interrupts are enabled
<li>direction of data copy instructions
</ul>
<li>JP, JN, J[N]Z, J[N]C, J[N]O ...
</ul>
<li>Still not interesting - need I/O to interact with outside world
<ul>
<li>Original PC architecture: use dedicated <i>I/O space</i>
<ul>
<li>Works same as memory accesses but set I/O signal
<li>Only 1024 I/O addresses
<li>Example: write a byte to line printer:
<pre>
#define DATA_PORT    0x378
#define STATUS_PORT  0x379
#define   BUSY 0x80
#define CONTROL_PORT 0x37A
#define   STROBE 0x01
void
lpt_putc(int c)
{
  /* wait for printer to consume previous byte */
  while((inb(STATUS_PORT) & BUSY) == 0)
    ;

  /* put the byte on the parallel lines */
  outb(DATA_PORT, c);

  /* tell the printer to look at the data */
  outb(CONTROL_PORT, STROBE);
  outb(CONTROL_PORT, 0);
}
</pre>
</ul>
<li>Memory-Mapped I/O
<ul>
<li>Use normal physical memory addresses
<ul>
<li>Gets around limited size of I/O address space
<li>No need for special instructions
<li>System controller routes to appropriate device
</ul>
<li>Works like ``magic'' memory:
<ul>
<li> <i>Addressed</i> and <i>accessed</i> like memory, but ...
<li> ... does not <i>behave</i> like memory!
<li> Reads and writes can have ``side effects''
<li> Read results can change due to external events
</ul>
</ul>
</ul>
<li>What if we want to use more than 2^16 bytes of memory?
<ul>
<li>8086 has 20-bit physical addresses, so it can have 1 MB of RAM
<li>each segment is a 2^16-byte window into physical memory
<li>virtual to physical translation: pa = va + seg*16
<li>the segment is usually implicit, from a segment register
<li>CS - code segment (for fetches via IP)
<li>SS - stack segment (for load/store via SP and BP)
<li>DS - data segment (for load/store via other registers)
<li>ES - another data segment (destination for string operations)
<li>tricky: can't use the 16-bit address of a stack variable as a pointer
<li>but a <i>far pointer</i> includes full segment:offset (16 + 16 bits)
</ul>
<li>But 8086's 16-bit addresses and data were still painfully small
<ul>
<li>80386 added support for 32-bit data and addresses (1985)
<li>boots in 16-bit mode, boot.S switches to 32-bit mode
<li>registers are 32 bits wide, called EAX rather than AX
<li>operands and addresses are also 32 bits, e.g. ADD does 32-bit arithmetic
<li>prefix 0x66 gets you 16-bit mode: MOVW is really 0x66 MOVW
<li>the .code32 in boot.S tells assembler to generate 0x66 for e.g. MOVW
<li>80386 also changed segments and added paged memory...
</ul>

</ul>
<h2>x86 Physical Memory Map</h2>

<ul>
<li>The physical address space mostly looks like ordinary RAM
<li>Except some low-memory addresses actually refer to other things
<li>Writes to VGA memory appear on the screen
<li>Reset or power-on jumps to ROM at 0x000ffff0
</ul>
<pre>
+------------------+  <- 0xFFFFFFFF (4GB)
|      32-bit      |
|  memory mapped   |
|     devices      |
|                  |
/\/\/\/\/\/\/\/\/\/\

/\/\/\/\/\/\/\/\/\/\
|                  |
|      Unused      |
|                  |
+------------------+  <- depends on amount of RAM
|                  |
|                  |
| Extended Memory  |
|                  |
|                  |
+------------------+  <- 0x00100000 (1MB)
|     BIOS ROM     |
+------------------+  <- 0x000F0000 (960KB)
|  16-bit devices, |
|  expansion ROMs  |
+------------------+  <- 0x000C0000 (768KB)
|   VGA Display    |
+------------------+  <- 0x000A0000 (640KB)
|                  |
|    Low Memory    |
|                  |
+------------------+  <- 0x00000000
</pre>
<h2>x86 Instruction Set</h2>

<ul>
<li>Two-operand instruction set
<ul>
<li>Intel syntax: <tt>op dst, src</tt>
<li>AT&amp;T (gcc/gas) syntax: <tt>op src, dst</tt>
<ul>
<li>uses b, w, l suffix on instructions to specify size of operands
</ul>
<li>Operands are registers, constant, memory via register, memory via constant
<li> Examples:
<table cellspacing=5>
<tr><td><u>AT&amp;T syntax</u> <td><u>"C"-ish equivalent</u>
<tr><td>movl %eax, %edx <td>edx = eax; <td><i>register mode</i>
<tr><td>movl $0x123, %edx <td>edx = 0x123; <td><i>immediate</i>
<tr><td>movl 0x123, %edx <td>edx = *(int32_t*)0x123; <td><i>direct</i>
<tr><td>movl (%ebx), %edx <td>edx = *(int32_t*)ebx; <td><i>indirect</i>
<tr><td>movl 4(%ebx), %edx <td>edx = *(int32_t*)(ebx+4); <td><i>displaced</i>
</table>
</ul>

<li>Instruction classes
<ul>
<li>data movement: MOV, PUSH, POP, ...
<li>arithmetic: TEST, SHL, ADD, AND, ...
<li>i/o: IN, OUT, ...
<li>control: JMP, JZ, JNZ, CALL, RET
<li>string: REP MOVSB, ...
<li>system: IRET, INT
</ul>

<li>Intel architecture manual Volume 2 is <i>the</i> reference

</ul>
<h2>gcc x86 calling conventions</h2>

<ul>
<li>x86 dictates that the stack grows down:
<table cellspacing=5>
<tr><td><u>Example instruction</u> <td><u>What it does</u>
<tr><td>pushl %eax
<td>
subl $4, %esp <br>
movl %eax, (%esp) <br>
<tr><td>popl %eax
<td>
movl (%esp), %eax <br>
addl $4, %esp <br>
<tr><td>call 0x12345
<td>
pushl %eip <sup>(*)</sup> <br>
movl $0x12345, %eip <sup>(*)</sup> <br>
<tr><td>ret
<td>
popl %eip <sup>(*)</sup>
</table>
(*) <i>Not real instructions</i>
<li>GCC dictates how the stack is used.
Contract between caller and callee on the x86:
<ul>
<li>after the call instruction:
<ul>
<li>%eip points at the first instruction of the function
<li>%esp+4 points at the first argument
<li>%esp points at the return address
</ul>
<li>after the ret instruction:
<ul>
<li>%eip contains the return address
<li>%esp points at the arguments pushed by the caller
<li>the called function may have trashed the arguments
<li>%eax contains the return value
(or trash if the function is <tt>void</tt>)
<li>%ecx, %edx may be trashed
<li>%ebp, %ebx, %esi, %edi must contain their contents from the time of the <tt>call</tt>
</ul>
<li>Terminology:
<ul>
<li>%eax, %ecx, %edx are "caller-save" registers
<li>%ebp, %ebx, %esi, %edi are "callee-save" registers
</ul>
</ul>
<li>Functions can do anything that doesn't violate the contract.
By convention, GCC does more:
<ul>
<li>each function has a stack frame marked by %ebp, %esp
<pre>
       +------------+   |
       | arg 2      |   \
       +------------+    >- previous function's stack frame
       | arg 1      |   /
       +------------+   |
       | ret %eip   |   /
       +============+
       | saved %ebp |   \
%ebp-> +------------+   |
       |            |   |
       | local      |   \
       | variables, |    >- current function's stack frame
       | etc.       |   /
       |            |   |
       |            |   |
%esp-> +------------+   /
</pre>
<li>%esp can move to make the stack frame bigger or smaller
<li>%ebp points at the saved %ebp from the previous function,
giving a chain to walk the stack
<li>function prologue:
<pre>
	pushl %ebp
	movl %esp, %ebp
</pre>
<li>function epilogue:
<pre>
	movl %ebp, %esp
	popl %ebp
</pre>
or
<pre>
	leave
</pre>
</ul>
<li>Big example:
<ul>
<li>C code
<pre>
int main(void) { return f(8)+1; }
int f(int x) { return g(x); }
int g(int x) { return x+3; }
</pre>
<li>assembler
<pre>
_main:
	<i>prologue</i>
	pushl %ebp
	movl %esp, %ebp
	<i>body</i>
	pushl $8
	call _f
	addl $1, %eax
	<i>epilogue</i>
	movl %ebp, %esp
	popl %ebp
	ret

_f:
	<i>prologue</i>
	pushl %ebp
	movl %esp, %ebp
	<i>body</i>
	pushl 8(%esp)
	call _g
	<i>epilogue</i>
	movl %ebp, %esp
	popl %ebp
	ret

_g:
	<i>prologue</i>
	pushl %ebp
	movl %esp, %ebp
	<i>save %ebx</i>
	pushl %ebx
	<i>body</i>
	movl 8(%ebp), %ebx
	addl $3, %ebx
	movl %ebx, %eax
	<i>restore %ebx</i>
	popl %ebx
	<i>epilogue</i>
	movl %ebp, %esp
	popl %ebp
	ret
</pre>
</ul>

<li>Super-small <tt>_g</tt>:
<pre>
_g:
	movl 4(%esp), %eax
	addl $3, %eax
	ret
</pre>
<li>Compiling, linking, loading:
<ul>
<li> <i>Compiler</i> takes C source code (ASCII text),
produces assembly language (also ASCII text)
<li> <i>Assembler</i> takes assembly language (ASCII text),
produces a <tt>.o</tt> file (binary, machine-readable!)
<li> <i>Linker</i> takes multiple '<tt>.o</tt>'s,
produces a single <i>program image</i> (binary)
<li> <i>Loader</i> loads the program image into memory
at run-time and starts it executing
</ul>
</ul>
<h2>PC emulation</h2>

<ul>
<li> An emulator like Bochs works by
<ul>
<li> doing exactly what a real PC would do,
<li> only implemented in software rather than hardware!
</ul>
<li> Runs as a normal process in a "host" operating system (e.g., Linux)
<li> Uses normal process storage to hold emulated hardware state:
e.g.,
<ul>
<li> Hold emulated CPU registers in global variables
<pre>
	int32_t regs[8];
	#define REG_EAX 1
	#define REG_EBX 2
	#define REG_ECX 3
	...
	int32_t eip;
	int16_t segregs[4];
	...
</pre>
<li> <tt>malloc</tt> a big chunk of (virtual) process memory
to hold the emulated PC's (physical) memory
</ul>
<li> Execute instructions by simulating them in a loop:
<pre>
	for (;;) {
		read_instruction();
		switch (decode_instruction_opcode()) {
		case OPCODE_ADD:
			int src = decode_src_reg();
			int dst = decode_dst_reg();
			regs[dst] = regs[dst] + regs[src];
			break;
		case OPCODE_SUB:
			int src = decode_src_reg();
			int dst = decode_dst_reg();
			regs[dst] = regs[dst] - regs[src];
			break;
		...
		}
		eip += instruction_length;
	}
</pre>
<li> Simulate the PC's physical memory map
by decoding emulated "physical" addresses just like a PC would:
<pre>
	#define KB 1024
	#define MB (1024*1024)

	#define LOW_MEMORY (640*KB)
	#define EXT_MEMORY (10*MB)

	uint8_t low_mem[LOW_MEMORY];
	uint8_t ext_mem[EXT_MEMORY];
	uint8_t bios_rom[64*KB];

	uint8_t read_byte(uint32_t phys_addr) {
		if (phys_addr < LOW_MEMORY)
			return low_mem[phys_addr];
		else if (phys_addr >= 960*KB && phys_addr < 1*MB)
			return bios_rom[phys_addr - 960*KB];
		else if (phys_addr >= 1*MB && phys_addr < 1*MB+EXT_MEMORY)
			return ext_mem[phys_addr-1*MB];
		else ...
	}

	void write_byte(uint32_t phys_addr, uint8_t val) {
		if (phys_addr < LOW_MEMORY)
			low_mem[phys_addr] = val;
		else if (phys_addr >= 960*KB && phys_addr < 1*MB)
			; /* ignore attempted write to ROM! */
		else if (phys_addr >= 1*MB && phys_addr < 1*MB+EXT_MEMORY)
			ext_mem[phys_addr-1*MB] = val;
		else ...
	}
</pre>
<li> Simulate I/O devices, etc., by detecting accesses to
"special" memory and I/O space and emulating the correct behavior:
e.g.,
<ul>
<li> Reads/writes to the emulated hard disk
transformed into reads/writes of a file on the host system
<li> Writes to emulated VGA display hardware
transformed into drawing into an X window
<li> Reads from the emulated PC keyboard
transformed into reads from the X input event queue
</ul>
</ul>
334
web/l3.html
Normal file
<title>L3</title>
<html>
<head>
</head>
<body>

<h1>Operating system organization</h1>

<p>Required reading: Exokernel paper.

<h2>Intro: virtualizing</h2>

<p>One way to think about an operating system interface is that it
extends the hardware instructions with a set of "instructions" that
are implemented in software. These instructions are invoked using a
system call instruction (int on the x86). In this view, a task of the
operating system is to provide each application with a <i>virtual</i>
version of the interface; that is, it provides each application with a
virtual computer.

<p>One of the challenges in an operating system is multiplexing the
physical resources between the potentially many virtual computers.
What typically complicates the multiplexing is an additional
constraint: isolate the virtual computers well from each other. That
is,
<ul>
<li> stores shouldn't be able to overwrite other applications' data
<li> jmp shouldn't be able to enter another application
<li> one virtual computer cannot hog the processor
</ul>

<p>In this lecture, we will explore at a high level how to build
virtual computers that meet these goals. In the rest of the term we
work out the details.
<h2>Virtual processors</h2>

<p>To give each application its own virtual processor, we need
to virtualize the physical processors. One way to do this is to multiplex
the physical processor over time: the operating system runs one
application for a while, then runs another application for a while, etc.
We can implement this solution as follows: when an application has run
for its share of the processor, unload the state of the physical
processor, save that state to be able to resume the application later,
load in the state for the next application, and resume it.

<p>What needs to be saved and restored? That depends on the
processor, but for the x86:
<ul>
<li>IP
<li>SP
<li>The other processor registers (eax, etc.)
</ul>

<p>To enforce that a virtual processor doesn't keep the processor, the
operating system can arrange for a periodic interrupt, and switch the
processor in the interrupt routine.

<p>To separate the memories of the applications, we may also need to save
and restore the registers that define the (virtual) memory of the
application (e.g., segment and MMU registers on the x86), which is
explained next.
<h2>Separating memories</h2>

<p>Approaches to separating memories:
<ul>
<li>Force programs to be written in a high-level, type-safe language
<li>Enforce separation using hardware support
</ul>
The approaches can be combined.

<p>Let's assume unlimited physical memory for a little while. We can
then enforce separation as follows:
<ul>

<li>Put a device (a memory management unit) between the processor and memory,
which checks each memory access against a set of domain registers.
(The domain registers are like segment registers on the x86, except
there is no computation to compute an address.)
<li>A domain register specifies a range of addresses that the
processor is allowed to access.
<li>When switching applications, switch domain registers.
</ul>
Why does this work? Loads/stores/jmps cannot touch/enter other
applications' domains.

<p>To allow for controlled sharing and separation within an application,
extend the domain registers with protection bits: read (R), write (W),
execute-only (X).

<p>How to protect the domain registers? Extend the protection bits
with a kernel-only one. When in kernel mode, the processor can change
the domain registers. As we will see in lecture 4, the x86 stores the U/K
information in the CPL (current privilege level) in the CS segment
register.

<p>To change from user to kernel, extend the hardware with special
instructions for entering a "supervisor" or "system" call, and
returning from it. On the x86, int and reti. The int instruction takes as
its argument the system call number. We can then think of the kernel
interface as the set of "instructions" that augment the instructions
implemented in hardware.
<h2>Memory management</h2>

<p>We assumed unlimited physical memory and big addresses. In
practice, an operating system must support creating, shrinking, and
growing of domains, while still allowing the addresses of an
application to be contiguous (for programming convenience). What if
we want to grow the domain of application 1 but the memory right below
and above it is in use by application 2?

<p>How? Virtual addresses and address spaces. Virtualize addresses and let
the kernel control the mapping from virtual to physical.

<p>Address spaces provide each application with the illusion that it has
a complete memory to itself. All the addresses it issues are its own
addresses (e.g., each application has an address 0).

<li> How do you give each application its own address space?
<ul>
<li> The MMU translates <i>virtual</i> addresses to <i>physical</i>
addresses using a translation table
<li> Implementation approaches for the translation table:
<ol>

<li> for each virtual address, store the physical address too: too costly.

<li> translate a set of contiguous virtual addresses at a time using
segments (segment #, base address, length)

<li> translate a fixed-size set of addresses (a page) at a time using a
page map (page # -> block #) (draw hardware page table picture).
Data structures for the page map: array, n-level tree, superpages, etc.

</ol>
<br>Some processors have both 2 and 3: the x86! (see lecture 4)
</ul>
<li> What if two applications want to share real memory? Map the pages
into multiple address spaces and have protection bits per page.

<li> How do you give an application access to a memory-mapped I/O
device? Map the physical address for the device into the application's
address space.

<li> How do you get off the ground?
<ul>
<li> when the computer starts, the MMU is disabled.
<li> the computer starts in kernel mode, with no
translation (i.e., virtual address 0 is physical address 0, and
so on)
<li> the kernel program sets up the MMU to translate kernel addresses to
physical addresses; often a kernel virtual address translates to physical address 0.
<li> enable the MMU
<br><p>Lab 2 explores this topic in detail.
</ul>
<h2>Operating system organizations</h2>

<p>A central theme in operating system design is how to organize the
operating system. It is helpful to define a couple of terms:
<ul>

<li>Kernel: the program that runs in kernel mode, in a kernel
address space.

<li>Library: code against which applications link (e.g., libc).

<li>Application: code that runs in a user-level address space.

<li>Operating system: the kernel plus all user-level system code (e.g.,
servers, libraries, etc.)

</ul>

<p>Example: trace a call to printf made by an application.

<p>There are roughly 4 operating system designs:
<ul>

<li>Monolithic design. The OS interface is the kernel interface (i.e.,
the complete operating system runs in kernel mode). This has limited
flexibility (other than loadable kernel modules) and doesn't fault-isolate
individual OS modules (e.g., the file system and process
module are both in the kernel address space). xv6 has this
organization.

<li>Microkernel design. The kernel interface provides a minimal set of
abstractions (e.g., virtual memory, IPC, and threads), and the rest of
the operating system is implemented by user applications (often called
servers).

<li>Virtual machine design. The kernel implements a virtual machine
monitor. The monitor multiplexes multiple virtual machines, which
each provide as the kernel programming interface the machine platform
(the instruction set, devices, etc.). Each virtual machine runs its
own, perhaps simple, operating system.

<li>Exokernel design. Only used in this class and discussed below.

</ul>

<p>Although monolithic operating systems are the dominant operating
system architecture for desktop and server machines, it is worthwhile
to consider alternative architectures, even if it is just to understand
operating systems better. This lecture looks at exokernels, because
that is what you will be building in the lab. xv6 is organized as a
monolithic system, which we will study in the next lectures. Later in
the term we will read papers about microkernel and virtual machine
operating systems.
<h2>Exokernels</h2>

<p>The exokernel architecture takes an end-to-end approach to
operating system design. In this design, the kernel just securely
multiplexes physical resources; any programmer can decide what the
operating system interface and its implementation are for his
application. One would expect a couple of popular APIs (e.g., UNIX)
that most applications will link against, but a programmer is always
free to replace that API, partially or completely. (Draw picture of
JOS.)

<p>Compare the UNIX interface (<a href="v6.c">v6</a> or <a
href="os10.h">OSX</a>) with the JOS exokernel-like interface:
<pre>
enum
{
	SYS_cputs = 0,
	SYS_cgetc,
	SYS_getenvid,
	SYS_env_destroy,
	SYS_page_alloc,
	SYS_page_map,
	SYS_page_unmap,
	SYS_exofork,
	SYS_env_set_status,
	SYS_env_set_trapframe,
	SYS_env_set_pgfault_upcall,
	SYS_yield,
	SYS_ipc_try_send,
	SYS_ipc_recv,
};
</pre>

<p>To illustrate the differences between these interfaces in more
detail, consider implementing the following:
<ul>

<li>A user-level thread package that deals well with blocking system
calls, page faults, etc.

<li>A high-performance web server performing optimizations across module
boundaries (e.g., file system and network stack).

</ul>

<p>How well can each kernel interface implement the above examples?
(Start with the UNIX interface and see where you run into problems.) (The
JOS kernel interface is not flexible enough: for example,
<i>ipc_receive</i> is blocking.)
<h2>Exokernel paper discussion</h2>

<p>The central challenge in an exokernel design is to provide
extensibility while providing fault isolation. This challenge breaks
down into three problems:

<ul>

<li>tracking ownership of resources;

<li>ensuring fault isolation between applications;

<li>revoking access to resources.

</ul>

<ul>

<li>How is physical memory multiplexed? The kernel tracks, for each
physical page, who has it.

<li>How is the processor multiplexed? Time slices.

<li>How is the network multiplexed? Packet filters.

<li>What is the plan for revoking resources?
<ul>

<li>Expose information so that applications can do the right thing.

<li>Ask applications politely to release resources of a given type.

<li>Forcibly make applications release resources.

</ul>

<li>What is an environment? The processor environment: it stores
sufficient information to deliver events to applications: exception
context, interrupt context, protected entry context, and addressing
context. This structure is processor specific.

<li>How does one implement a minimal protected control transfer on the
x86? Lab 4's approach to IPC has some shortcomings: what are they?
(It is essentially a polling-based solution, and the one you implement
is unfair.) What is a better way? Set up a specific handler to be
called when an environment wants to call this environment. How does
this impact scheduling of environments? (i.e., give up the time slice or
not?)

<li>How does one dispatch exceptions (e.g., page faults) to user space
on the x86? Give each environment a separate exception stack in user
space, and propagate exceptions on that stack. See page-fault handling
in lab 4.

<li>How does one implement processes in user space? The thread part of
a process is easy. The difficult part is to perform the copy of the
address space efficiently; one would like to share memory between
parent and child. This property can be achieved using copy-on-write.
The child should, however, have its own exception stack. Again,
see lab 4. <i>sfork</i> is a trivial extension of user-level <i>fork</i>.

<li>What are the examples of extensibility in this paper? (An RPC system
in which the server saves and restores registers, different page tables,
and a stride scheduler.)

</ul>

</body>
518
web/l4.html
Normal file
<title>L4</title>
<html>
<head>
</head>
<body>

<h1>Address translation and sharing using segments</h1>

<p>This lecture is about virtual memory, focusing on address
spaces. It is the first lecture in a series of lectures that uses
xv6 as a case study.

<h2>Address spaces</h2>

<ul>

<li>OS: a kernel program and user-level programs. For fault isolation
each program runs in a separate address space. The kernel address
space is like the user address spaces, except it runs in kernel mode.
A program in kernel mode can execute privileged instructions (e.g.,
writing the kernel's code segment registers).

<li>One job of the kernel is to manage address spaces (creating, growing,
deleting, and switching between them)

<ul>

<li>Each address space (including the kernel's) consists of the binary
representation of the text of the program, the data
part of the program, and the stack area.

<li>The kernel address space runs the kernel program. In a monolithic
organization the kernel manages all hardware and provides an API
to user programs.

<li>Each user address space contains a program. A user program may ask
to shrink or grow its address space.

</ul>

<li>The main operations:
<ul>
<li>Creation. Allocate physical memory to store the program. Load the
program into physical memory. Fill the address space with references to
physical memory.
<li>Growing. Allocate physical memory and add it to the address space.
<li>Shrinking. Free some of the memory in an address space.
<li>Deletion. Free all memory in an address space.
<li>Switching. Switch the processor to use another address space.
<li>Sharing. Share a part of an address space with another program.
</ul>
</ul>

<p>Two main approaches to implementing address spaces: using segments
and using page tables. Often when one uses segments, one also uses
page tables. But not the other way around; i.e., paging without
segmentation is common.
<h2>Example support for address spaces: x86</h2>

<p>For an operating system to provide address spaces and address
translation typically requires support from hardware. The translation
and checking of permissions typically must happen on each address used
by a program, and it would be too slow to check that in software (if
even possible). The division of labor is: the operating system manages
address spaces, and the hardware translates addresses and checks
permissions.

<p>PC block diagram without virtual memory support:
<ul>
<li>physical address
<li>base, IO hole, extended memory
<li>Physical address == what is on the CPU's address pins
</ul>

<p>The x86 starts out in real mode and translation is as follows:
<ul>
<li>segment*16+offset ==> physical address
<li>no protection: a program can load anything into a seg reg
</ul>

<p>The operating system can switch the x86 to protected mode, which
allows the operating system to create address spaces. Translation in
protected mode is as follows:
<ul>
<li>selector:offset (logical addr) <br>
==SEGMENTATION==>
<li>linear address <br>
==PAGING==>
<li>physical address
</ul>

<p>The next lecture covers paging; for now we focus on segmentation.
<p>Protected-mode segmentation works as follows:
<ul>
<li>protected-mode segments add 32-bit addresses and protection
<ul>
<li>wait: what's the point? the point of segments in real mode was
bigger addresses, but 32-bit mode fixes that!
</ul>
<li>a segment register holds a segment selector
<li>the selector indexes into the global descriptor table (GDT)
<li>a segment descriptor holds a 32-bit base, limit, type, protection
<li>la = va + base ; assert(va < limit);
<li>the seg register is usually implicit in the instruction
<ul>
<li>DS:REG
<ul>
<li><tt>movl $0x1, _flag</tt>
</ul>
<li>SS:ESP, SS:EBP
<ul>
<li><tt>pushl %ecx, pushl $_i</tt>
<li><tt>popl %ecx</tt>
<li><tt>movl 4(%ebp),%eax</tt>
</ul>
<li>CS:EIP
<ul>
<li>instruction fetch
</ul>
<li>String instructions: read from DS:ESI, write to ES:EDI
<ul>
<li><tt>rep movsb</tt>
</ul>
<li>Exception: far addresses
<ul>
<li><tt>ljmp $selector, $offset</tt>
</ul>
</ul>
<li>the LGDT instruction loads the CPU's GDT register
<li>you turn on protected mode by setting the PE bit in the CR0 register
<li>what happens with the next instruction? CS now has a different
meaning...

<li>How to transfer from one segment to another, perhaps with different
privileges?
<ul>
<li>The current privilege level (CPL) is in the low 2 bits of CS
<li>CPL=0 is privileged O/S, CPL=3 is user
<li>Within the same privilege level: ljmp.
<li>Transfer to a segment with more privilege: call gates.
<ul>
<li>a way for an app to jump into a segment and acquire privs
<li>CPL must be <= the descriptor's DPL in order to read or write the segment
<li>call gates can change privilege <b>and</b> switch the CS and SS
segments
<li>call gates are implemented using a special type of segment descriptor
in the GDT.
<li>interrupts are conceptually the same as call gates, but their
descriptors are stored in the IDT. We will use interrupts to transfer
control between user and kernel mode, both in JOS and xv6. We will
return to this in the lecture about interrupts and exceptions.
</ul>
</ul>

<li>What about protection?
<ul>
<li>can the o/s limit what memory an application can read or write?
<li>the app can load any selector into a seg reg...
<li>but can only mention indices into the GDT
<li>the app can't change the GDT register (requires privilege)
<li>why can't the app write the descriptors in the GDT?
<li>what about system calls? how do they transfer to the kernel?
<li>the app cannot <b>just</b> lower the CPL
</ul>
</ul>
<h2>Case study (xv6)</h2>

<p>xv6 is a reimplementation of <a href="../v6.html">Unix 6th edition</a>.
<ul>
<li>v6 is a version of the original Unix operating system for the <a href="http://www.pdp11.org/">DEC PDP11</a>
<ul>
<li>PDP-11 (1972):
<li>16-bit processor, 18-bit physical (40)
<li>UNIBUS
<li>memory-mapped I/O
<li>performance: less than 1 MIPS
<li>register-to-register transfer: 0.9 usec
<li>56k-228k (40)
<li>no paging, but some segmentation support
<li>interrupts, traps
<li>about $10K
<li>rk disk with 2 MByte of storage
<li>with cabinet the 11/40 is 400 lbs
</ul>
<li>Unix v6
<ul>
<li><a href="../reference.html">Unix papers</a>.
<li>1976; first widely available Unix outside Bell Labs
<li>Thompson and Ritchie
<li>Influenced by Multics but simpler.
<li>complete (used for real work)
<li>Multi-user, time-sharing
<li>small (43 system calls)
<li>modular (composition through pipes; one had to split programs!!)
<li>compactly written (2 programmers, 9,000 lines of code)
<li>advanced UI (shell)
<li>introduced C (derived from B)
<li>distributed with source
<li>V7 was sold by Microsoft for a couple of years under the name Xenix
</ul>
<li>Lions' commentary
<ul>
<li>suppressed because of copyright issues
<li>resurfaced in 1996
</ul>

<li>xv6 written for 6.828:
<ul>
<li>v6 reimplementation for the x86
<li>doesn't include all features of v6 (e.g., xv6 has 20 of the 43
system calls).
<li>runs on symmetric multiprocessing PCs (SMPs).
</ul>
</ul>

<p>Newer Unixes have inherited many of the conceptual ideas even though
they added paging, networking, graphics, improved performance, etc.

<p>You will need to read most of the source code multiple times. Your
goal is to explain every line to yourself.
<h3>Overview of address spaces in xv6</h3>

<p>In today's lecture we see how xv6 creates the kernel address
space and the first user address space, and switches to it. To understand
how this happens, we need to understand in detail the state on the
stack too---this may be surprising, but a thread of control and an
address space are tightly bundled in xv6, in a concept
called <i>process</i>. The kernel address space is the only address
space with multiple threads of control. We will study context
switching and process management in detail in the coming weeks; the
creation of the first user process (init) will give you a first flavor.

<p>xv6 uses only the segmentation hardware of the x86, and only in a
limited way. (In JOS you will use the page-table hardware too, which we
cover in the next lecture.) The address space layouts are as follows:
<ul>
<li>The kernel address space is set up as follows:
<pre>
the code segment runs from 0 to 2^32 and is mapped X and R (execute and read)
the data segment runs from 0 to 2^32 but is mapped W (read and write)
</pre>
<li>For each process, the layout is as follows:
<pre>
text
original data and bss
fixed-size stack
expandable heap
</pre>
The text of a process is stored in its own segment and the rest in a
data segment.
</ul>
<p>xv6 makes minimal use of the segmentation hardware available on the
x86. What other plans could you envision?

<p>In xv6, each program has a user and a kernel stack; when the
user program switches to the kernel, it switches to its kernel stack.
Its kernel stack is stored in the process's proc structure. (This is
arranged through the descriptors in the IDT, which is covered later.)

<p>xv6 assumes that there is a lot of physical memory. It assumes that
segments can be stored contiguously in physical memory and therefore
has no need for page tables.
<h3>xv6 kernel address space</h3>

<p>Let's see how xv6 creates the kernel address space by tracing xv6
from when it boots, focusing on address space management:
<ul>
<li>Where does xv6 start after the PC is powered on? start (which is
loaded at physical address 0x7c00; see lab 1).
<li>1025-1033: are we in real mode?
<ul>
<li>how big are logical addresses?
<li>how big are physical addresses?
<li>how are physical addresses calculated?
<li>what segment is being used in the subsequent code?
<li>what values are in that segment?
</ul>
<li>1068: what values are loaded in the GDT?
<ul>
<li>1097: gdtr points to gdt
<li>1094: entry 0 unused
<li>1095: entry 1 (X + R, base = 0, limit = 0xffffffff, DPL = 0)
<li>1096: entry 2 (W, base = 0, limit = 0xffffffff, DPL = 0)
<li>are we using segments in a sophisticated way? (i.e., controlled sharing)
<li>are P and S set?
<li>are addresses translated as in protected mode when lgdt completes?
</ul>
<li>1071: no, and not even here.
<li>1075: far jump, load 8 into CS. from now on we use segment-based translation.
<li>1081-1086: set up the other segment registers
<li>1087: where is the stack that is used for procedure calls?
<li>1087: cmain in the boot loader (see lab 1), which calls main0
<li>1222: main0.
<ul>
<li>the job of main0 is to set everything up so that all xv6 conventions work
<li>where is the stack? (sp = 0x7bec)
<li>what is on it?
<pre>
00007bec [00007bec] 7cda       // return address in cmain
00007bf0 [00007bf0] 0080       // callee-saved ebx
00007bf4 [00007bf4] 7369       // callee-saved esi
00007bf8 [00007bf8] 0000       // callee-saved ebp
00007bfc [00007bfc] 7c49       // return address for cmain: spin
00007c00 [00007c00] c031fcfa   // the instructions from 7c00 (start)
</pre>
</ul>
<li>1239-1240: switch to the cpu stack (important for the scheduler)
<ul>
<li>why -32?
<li>what values are in ebp and esp?
<pre>
esp: 0x108d30  1084720
ebp: 0x108d5c  1084764
</pre>
<li>what is on the stack?
<pre>
00108d30 [00108d30] 0000
00108d34 [00108d34] 0000
00108d38 [00108d38] 0000
00108d3c [00108d3c] 0000
00108d40 [00108d40] 0000
00108d44 [00108d44] 0000
00108d48 [00108d48] 0000
00108d4c [00108d4c] 0000
00108d50 [00108d50] 0000
00108d54 [00108d54] 0000
00108d58 [00108d58] 0000
00108d5c [00108d5c] 0000
00108d60 [00108d60] 0001
00108d64 [00108d64] 0001
00108d68 [00108d68] 0000
00108d6c [00108d6c] 0000
</pre>
<li>what is the 1 in 0x108d60? is it on the stack?
</ul>

<li>1242: is it safe to reference bcpu? where is it allocated?
<li>1260-1270: set up proc[0]
<ul>
<li>each process has its own stack (see struct proc).
<li>where is its stack? (see the section on physical memory
management below).
<li>what is the jmpbuf? (will be discussed in detail later)
<li>1267: why -4?
</ul>

<li>1270: necessary to be able to take interrupts (will be discussed in
detail later)

<li>1292: what process do you think scheduler() will run? we will
study later how that happens, but let's assume it runs process0 on
process0's stack.
</ul>
<h3>xv6 user address spaces</h3>

<ul>
<li>1327: process0
<ul>
<li>process 0 sets up everything to make the process conventions work out

<li>on which stack is process0 running? see 1260.

<li>1334: is the convention to release the proc_table_lock after being
scheduled? (we will discuss locks later; assume there are no other
processors for now.)

<li>1336: cwd is the current working directory.

<li>1348: first step in initializing a template trap frame: set
everything to zero. we are setting up process 0 as if it had just
entered the kernel from user space and wants to go back to user
space. (see x86.h to see which fields have the value 0.)

<li>1349: why "|3" instead of 0?

<li>1351: why set the interrupt flag in the template trapframe?

<li>1352: where will the user stack be in proc[0]'s address space?

<li>1353: makes a copy of proc0. fork() calls copyproc() to implement
forking a process. This statement in essence calls fork inside
proc0, making proc[1] a duplicate of proc[0]. proc[0], however,
does not have much in its address space: one page (see 1341).
<ul>
<li>2221: grab a lock on the proc table so that we are the only one
updating it.
<li>2116: allocate the next pid.
<li>2228: we got our entry; release the lock. from now on we are only
modifying our entry.
<li>2120-2127: copy proc[0]'s memory. proc[1]'s memory will be identical
to proc[0]'s.
<li>2130-2136: allocate a kernel stack. this stack is different from
the stack that proc[1] uses when running in user mode.
<li>2139-2140: copy the template trapframe that xv6 had set up in
proc[0].
<li>2147: where will proc[1] start running when the scheduler selects
it?
<li>2151-2155: Unix semantics: the child inherits open file descriptors
from the parent.
<li>2158: same for cwd.
</ul>
<li>1356: load a program into proc[1]'s address space. the program
loaded is the binary version of init.c (sheet 16).

<li>1374: where will proc[1] start?

<li>1377-1388: copy the binary into proc[1]'s address space. (you
will learn about the ELF format in the labs.)
<ul>
<li>can the binary for init be any size for proc[1] to work correctly?

<li>what is the layout of proc[1]'s address space? is it consistent
with the layout described on lines 1950-1954?
</ul>

<li>1357: make proc[1] runnable so that the scheduler will select it
to run. everything is set up now for proc[1] to run, "return" to
user space, and execute init.

<li>1359: proc[0] gives up the processor, which calls sleep, which
calls sched, which setjmps back to scheduler. let's peek a bit into
scheduler to see what happens next. (we will return to the
scheduler in more detail later.)
</ul>
<li>2219: this test will fail for proc[1]
<li>2226: setupsegs(p) sets up the segments for proc[1]. this call is
more interesting than the previous one, so let's see what happens:
<ul>
<li>2032-37: this is for traps and interrupts, which we will cover later.
<li>2039-49: set up a new gdt.
<li>2040: why 0x100000 + 64*1024?
<li>2045: why 3? why is the base p->mem? is p->mem physical or logical?
<li>2045-2046: how must the program for proc[1] be compiled if proc[1]
is to run successfully in user space?
<li>2052: we are still running in the kernel, but we are loading the gdt.
is this ok?
<li>why have so few user-level segments? why not separate out code,
data, stack, bss, etc.?
</ul>
<li>2227: record that proc[1] is running on the cpu
<li>2228: record that it is running instead of just runnable
<li>2229: setjmp to fork_ret.
<li>2282: which stack is proc[1] running on?
<li>2284: when scheduled, first release the proc_table_lock.
<li>2287: back into assembly.
<li>2782: where is the stack pointer pointing?
<pre>
0020dfbc [0020dfbc] 0000
0020dfc0 [0020dfc0] 0000
0020dfc4 [0020dfc4] 0000
0020dfc8 [0020dfc8] 0000
0020dfcc [0020dfcc] 0000
0020dfd0 [0020dfd0] 0000
0020dfd4 [0020dfd4] 0000
0020dfd8 [0020dfd8] 0000
0020dfdc [0020dfdc] 0023
0020dfe0 [0020dfe0] 0023
0020dfe4 [0020dfe4] 0000
0020dfe8 [0020dfe8] 0000
0020dfec [0020dfec] 0000
0020dff0 [0020dff0] 001b
0020dff4 [0020dff4] 0200
0020dff8 [0020dff8] 1000
</pre>
<li>2783: why jmp instead of call?
<li>what will iret put in eip?
<li>what is 0x1b? what will iret put in cs?
<li>after iret, what will the processor be executing?
</ul>
<h3>Managing physical memory</h3>

<p>To create an address space we must allocate physical memory, which
will be freed when an address space is deleted (e.g., when a user
program terminates). xv6 implements a first-fit memory allocator
(see kalloc.c).

<p>It maintains a list of ranges of free memory. The allocator finds
the first range that is larger than the amount of requested memory.
It splits that range in two: one range of the requested size and one
of the remainder. It returns the first range. When memory is
freed, kfree merges ranges that are adjacent in memory.

<p>Under what scenarios is a first-fit memory allocator undesirable?
<h3>Growing an address space</h3>

<p>How can a user process grow its address space? growproc.
<ul>
<li>2064: allocate a new segment of the old size plus n
<li>2067: copy the old segment into the new (ouch!)
<li>2068: and zero the rest.
<li>2071: free the old physical memory
</ul>
<p>We could do a lot better if segments didn't have to be contiguous in
physical memory. How could we arrange that? Using page tables, which
is our next topic. This is one place where page tables would be
useful, but there are others too (e.g., in fork).
</body>
210
web/l5.html
Normal file
@@ -0,0 +1,210 @@
<title>Lecture 5</title>
<html>
<head>
</head>
<body>
<h2>Address translation and sharing using page tables</h2>

<p> Reading: <a href="../readings/i386/toc.htm">80386</a> chapters 5 and 6<br>

<p> Handout: <b> x86 address translation diagram</b> -
<a href="x86_translation.ps">PS</a> -
<a href="x86_translation.eps">EPS</a> -
<a href="x86_translation.fig">xfig</a>
<br>

<p>Why do we care about x86 address translation?
<ul>
<li>It can simplify s/w structure by placing data at fixed known addresses.
<li>It can implement tricks like demand paging and copy-on-write.
<li>It can isolate programs to contain bugs.
<li>It can isolate programs to increase security.
<li>JOS uses paging a lot, and segments more than you might think.
</ul>

<p>Why aren't protected-mode segments enough?
<ul>
<li>Why did the 386 add translation using page tables as well?
<li>Isn't it enough to give each process its own segments?
</ul>
<p>Translation using page tables on the x86:
<ul>
<li>the paging hardware maps a linear address (la) to a physical address (pa)
<li>(we will often interchange "linear" and "virtual")
<li>page size is 4096 bytes, so there are 1,048,576 pages in 2^32
<li>why not just have a big array with each page #'s translation?
<ul>
<li>table[20-bit linear page #] => 20-bit phys page #
</ul>
<li>the 386 uses a 2-level mapping structure
<li>one page directory page, with 1024 page directory entries (PDEs)
<li>up to 1024 page table pages, each with 1024 page table entries (PTEs)
<li>so an la has 10 bits of directory index, 10 bits of table index, and 12 bits of offset
<li>What's in a PDE or PTE?
<ul>
<li>20-bit phys page number, present, read/write, user/supervisor
</ul>
<li>the cr3 register holds the physical address of the current page directory
<li>puzzle: what do the PDE read/write and user/supervisor flags mean?
<li>puzzle: can the supervisor read/write user pages?

<li>Here's how the MMU translates an la to a pa:
<pre>
uint
translate (uint la, bool user, bool write)
{
  uint pde, pte;

  pde = read_mem (%CR3 + 4*(la >> 22));
  access (pde, user, write);
  pte = read_mem ((pde & 0xfffff000) + 4*((la >> 12) & 0x3ff));
  access (pte, user, write);
  return (pte & 0xfffff000) + (la & 0xfff);
}

// check protection. pxe is a pte or pde.
// user is true if CPL==3
void
access (uint pxe, bool user, bool write)
{
  if (!(pxe & PG_P))
    => page fault -- page not present
  if (!(pxe & PG_U) && user)
    => page fault -- no access for user

  if (write && !(pxe & PG_W))
    if (user)
      => page fault -- not writable
    else if (!(pxe & PG_U))
      => page fault -- not writable
    else if (%CR0 & CR0_WP)
      => page fault -- not writable
}
</pre>
<li>the CPU's TLB caches vpn => ppn mappings
<li>if you change a PDE or PTE, you must flush the TLB!
<ul>
<li>by re-loading cr3
</ul>
<li>turn on paging by setting the CR0_PG bit of %cr0
</ul>
Can we use paging to limit what memory an app can read/write?
<ul>
<li>the user can't modify cr3 (requires privilege)
<li>is that enough?
<li>could the user modify the page tables? after all, they are in memory.
</ul>
<p>How we will use paging (and segments) in JOS:
<ul>
<li>use segments only to switch privilege level into/out of the kernel
<li>use paging to structure the process address space
<li>use paging to limit process memory access to its own address space
<li>below is the JOS virtual memory map
<li>why map both the kernel and the current process? why not 4GB for each?
<li>why is the kernel at the top?
<li>why map all of phys mem at the top? i.e. why multiple mappings?
<li>why map the page table a second time at VPT?
<li>why map the page table a third time at UVPT?
<li>how do we switch mappings for a different process?
</ul>
<pre>
 4 Gig -------->  +------------------------------+
                  |                              | RW/--
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                  :              .               :
                  :              .               :
                  :              .               :
                  |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| RW/--
                  |                              | RW/--
                  |   Remapped Physical Memory   | RW/--
                  |                              | RW/--
 KERNBASE ----->  +------------------------------+ 0xf0000000
                  |  Cur. Page Table (Kern. RW)  | RW/--   PTSIZE
 VPT,KSTACKTOP->  +------------------------------+ 0xefc00000     --+
                  |         Kernel Stack         | RW/--  KSTKSIZE  |
                  | - - - - - - - - - - - - - - -|                PTSIZE
                  |        Invalid Memory        | --/--            |
 ULIM --------->  +------------------------------+ 0xef800000     --+
                  |  Cur. Page Table (User R-)   | R-/R-   PTSIZE
 UVPT --------->  +------------------------------+ 0xef400000
                  |           RO PAGES           | R-/R-   PTSIZE
 UPAGES ------->  +------------------------------+ 0xef000000
                  |           RO ENVS            | R-/R-   PTSIZE
 UTOP,UENVS --->  +------------------------------+ 0xeec00000
 UXSTACKTOP -/    |     User Exception Stack     | RW/RW   PGSIZE
                  +------------------------------+ 0xeebff000
                  |         Empty Memory         | --/--   PGSIZE
 USTACKTOP ---->  +------------------------------+ 0xeebfe000
                  |      Normal User Stack       | RW/RW   PGSIZE
                  +------------------------------+ 0xeebfd000
                  |                              |
                  |                              |
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                  .                              .
                  .                              .
                  .                              .
                  |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
                  |     Program Data & Heap      |
 UTEXT -------->  +------------------------------+ 0x00800000
 PFTEMP ------->  |         Empty Memory         |         PTSIZE
                  |                              |
 UTEMP -------->  +------------------------------+ 0x00400000
                  |         Empty Memory         |         PTSIZE
 0 ------------>  +------------------------------+
</pre>
<h3>The VPT </h3>

<p>Remember how the x86 translates virtual addresses into physical ones:

<p><img src=pagetables.png>

<p>CR3 points at the page directory. The PDX part of the address
indexes into the page directory to give you a page table. The
PTX part indexes into the page table to give you a page, and then
you add the low bits in.

<p>But the processor has no concept of page directories, page tables,
and pages being anything other than plain memory. So there's nothing
that says a particular page in memory can't serve as two or three of
these at once. The processor just follows pointers:
<pre>
pd = lcr3();
pt = *(pd+4*PDX);
page = *(pt+4*PTX);
</pre>
<p>Diagrammatically, it starts at CR3, follows three arrows, and then stops.

<p>If we put a pointer into the page directory that points back to itself at
index V, as in

<p><img src=vpt.png>

<p>then when we try to translate a virtual address with PDX and PTX
equal to V, following three arrows leaves us at the page directory.
So that virtual page translates to the page holding the page directory.
In JOS, V is 0x3BD, so the virtual address of the VPD is
(0x3BD&lt;&lt;22)|(0x3BD&lt;&lt;12).

<p>Now, if we try to translate a virtual address with PDX = V but an
arbitrary PTX != V, then following three arrows from CR3 ends
one level up from usual (instead of two as in the last case),
which is to say in the page tables. So the set of virtual pages
with PDX=V forms a 4MB region whose page contents, as far
as the processor is concerned, are the page tables themselves.
In JOS, V is 0x3BD, so the virtual address of the VPT is (0x3BD&lt;&lt;22).
<p>So because of the "no-op" arrow we've cleverly inserted into
the page directory, we've mapped the pages being used as
the page directory and page table (which are normally virtually
invisible) into the virtual address space.

</body>

70
web/mkhtml
Executable file
@@ -0,0 +1,70 @@
#!/usr/bin/perl

my @lines = <>;
my $text = join('', @lines);
my $title;
if($text =~ /^\*\* (.*?)\n/m){
	$title = $1;
	$text = $` . $';
}else{
	$title = "Untitled";
}

$text =~ s/[ \t]+$//mg;
$text =~ s/^$/<br><br>/mg;
$text =~ s!\b([a-z0-9]+\.(c|s|pl|h))\b!<a href="src/$1.html">$1</a>!g;
$text =~ s!^(Lecture [0-9]+\. .*?)$!<b><i>$1</i></b>!mg;
$text =~ s!^\* (.*?)$!<h2>$1</h2>!mg;
$text =~ s!((<br>)+\n)+<h2>!\n<h2>!g;
$text =~ s!</h2>\n?((<br>)+\n)+!</h2>\n!g;
$text =~ s!((<br>)+\n)+<b>!\n<br><br><b>!g;
$text =~ s!\b\s*--\s*\b!&#8211;!g;
$text =~ s!\[([^\[\]|]+) \| ([^\[\]]+)\]!<a href="$1">$2</a>!g;
$text =~ s!\[([^ \t]+)\]!<a href="$1">$1</a>!g;

$text =~ s!``!&#8220;!g;
$text =~ s!''!&#8221;!g;

print <<EOF;
<!-- AUTOMATICALLY GENERATED: EDIT the .txt version, not the .html version -->
<html>
<head>
<title>$title</title>
<style type="text/css"><!--
body {
background-color: white;
color: black;
font-size: medium;
line-height: 1.2em;
margin-left: 0.5in;
margin-right: 0.5in;
margin-top: 0;
margin-bottom: 0;
}

h1 {
text-indent: 0in;
text-align: left;
margin-top: 2em;
font-weight: bold;
font-size: 1.4em;
}

h2 {
text-indent: 0in;
text-align: left;
margin-top: 2em;
font-weight: bold;
font-size: 1.2em;
}
--></style>
</head>
<body bgcolor=#ffffff>
<h1>$title</h1>
<br><br>
EOF
print $text;
print <<EOF;
</body>
</html>
EOF
53
web/x86-intr.html
Normal file
@@ -0,0 +1,53 @@
<title>Homework: xv6 and Interrupts and Exceptions</title>
<html>
<head>
</head>
<body>

<h1>Homework: xv6 and Interrupts and Exceptions</h1>

<p>
<b>Read</b>: xv6's trapasm.S, trap.c, syscall.c, vectors.S, and usys.S. Skim
lapic.c, ioapic.c, and picirq.c.

<p>
<b>Hand-In Procedure</b>
<p>
You are to turn in this homework during lecture. Please
write up your answers to the exercises below and hand them in to a
6.828 staff member at the beginning of the lecture.
<p>

<b>Introduction</b>

<p>Try to understand
xv6's trapasm.S, trap.c, syscall.c, vectors.S, and usys.S.
You will need to consult:

<p>Chapter 5 of <a href="../readings/ia32/IA32-3.pdf">IA-32 Intel
Architecture Software Developer's Manual, Volume 3: System Programming
Guide</a>; you can skip sections 5.7.1, 5.8.2, and 5.12.2. Be aware
that terms such as exceptions, traps, interrupts, faults and aborts
have no standard meaning.

<p>Chapter 9 of the 1987 <a href="../readings/i386/toc.htm">i386
Programmer's Reference Manual</a> also covers exception and interrupt
handling in IA32 processors.

<p><b>Assignment</b>:

In xv6, set a breakpoint at the beginning of <code>syscall()</code> to
catch the very first system call. What values are on the stack at
this point? Turn in the output of <code>print-stack 35</code> at that
breakpoint with each value labeled as to what it is (e.g.,
saved <code>%ebp</code> for <code>trap</code>,
<code>trapframe.eip</code>, etc.).
<p>
<b>This completes the homework.</b>

</body>
18
web/x86-intro.html
Normal file
@@ -0,0 +1,18 @@
<title>Homework: Intro to x86 and PC</title>
<html>
<head>
</head>
<body>

<h1>Homework: Intro to x86 and PC</h1>

<p>Today's lecture is an introduction to the x86 and the PC, the
platform for which you will write an operating system. The assigned
book is a reference for the x86 assembly programming you will be
doing.

<p><b>Assignment</b>: Make sure to do exercise 1 of lab 1 before
coming to lecture.

</body>
33
web/x86-mmu.html
Normal file
@@ -0,0 +1,33 @@
<title>Homework: x86 MMU</title>
<html>
<head>
</head>
<body>

<h1>Homework: x86 MMU</h1>

<p>Read chapters 5 and 6 of the
<a href="../readings/i386/toc.htm">Intel 80386 Reference Manual</a>.
These chapters explain
the x86 Memory Management Unit (MMU),
which we will cover in lecture today and which you need
to understand in order to do lab 2.

<p>
<b>Read</b>: bootasm.S and setupsegs() in proc.c

<p>
<b>Hand-In Procedure</b>
<p>
You are to turn in this homework during lecture. Please
write up your answers to the exercises below and hand them in to a
6.828 staff member by the beginning of lecture.
<p>

<p><b>Assignment</b>: Try to understand setupsegs() in proc.c.
What values are written into <code>gdt[SEG_UCODE]</code>
and <code>gdt[SEG_UDATA]</code> for init, the first user-space
process?
(You can use Bochs to answer this question.)

</body>
BIN
web/x86-mmu1.pdf
Normal file
Binary file not shown.
BIN
web/x86-mmu2.pdf
Normal file
Binary file not shown.
œOÛUÁPh0žE{&ðÕ2(úØIw,£’â"#¸ˆŽ""#ƒ98‡fƒ!üRIyÚƒÄDDGû…VÒ°«×ÂXXëáªÛV’ðˆè²Ì궵ñ‹ÓT|DD[2_ÿþÿºþ¿ÿÿÿÿÿ+‰‘øA§ôòήþ³Ò×DÊ2#Ñx¦šØ e Ê€r¬É€é”‚<E2809D>¾F¯Óª®©„íÕ2¬TÂÿpJ]°¡S½?M©þ’4zN,»¦oYb”HvªŸÅÿyw¥Û¥„<11>
|
||||
‰;ø%ƒ×ôôô“_¼maýÐ:2s_ê‡Z×ÿÿŠ[]Ú®ˆ†Jù0ëù\¡Þûì$”…VX„šõ·ê»«'~¾þW¿KãÖí-WÿßáÛ§É=›)aµúÿú§Dá¥A±oÁ°—ßØa/KÃT§â ½h<17>ŽéëôÈ‹/þ×ý•‹]|—{Ý »¯†ª‘×ÞÚ[k©H>’ßø<C39F>Ûö X*Ú¶a0Èâ×i}/â˜éŠ<C3A9>ŠSþÇöµÃÖÓ[Q{üDZ ÂÚ…L/a|DZé˜uñÿÿÿÿÿ+<2B><0B>á¨\!¶_0þDDHƒ<03>îDr+™ÈÙ¦_ Ò91ÉŽCn‡eøˆ‰<0C>aÊ‚ràŽ
|
||||
óauœ‚†¹GË\‚(2
|
||||
ÞA©ÄDDDDAÿ¹s$ƒA
<0A>£øˆ<C3B8>•3²†TFÁ´<C381><0C>Wþî,Ãa VÈ”Fâ5&@Ì<>ŸÿÝÿ¦Ÿÿ‚)ëÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿúÿÿÿÿÿÿÿ¿ÿÿÿþ¿ýÿÿïÿÿÿÿÿÿ×ÿÿþ¿ÿÿÿׯÿÿ»ÿýÿÚ_ùy8ÿÿÿø z!Ï'ŒÞ|ÿÛÿúzýÿý}ùnÿþÿõÿô}òýÿÍ?ÿÿýÿÿÝÿÿÿï_¿ûÿÿÿ_×ÿÿë¯ÿÿÿýãâ¾» ªßׯÿ<C2AF>ÿëÿ]ÿÿÿæ“O{ÿ÷ÿ¿õÿ4ë×ÿ×ÿÒÿÿÿÿýÿoßÿß¿X`¿¯ÿ¬œ/õïÿÿý=ð×ÿþ¯½÷V¸K¿Ö²¾¬Âë–?Ô\hqÿ•ÐM |