.. SPDX-License-Identifier: GPL-2.0

===========
Page Tables
===========

Paged virtual memory was invented along with virtual memory as a concept in
1962 on the Ferranti Atlas Computer which was the first computer with paged
virtual memory. The feature migrated to newer computers and became a de facto
feature of all Unix-like systems as time went by. In 1985 the feature was
included in the Intel 80386, which was the CPU Linux 1.0 was developed on.

Page tables map virtual addresses as seen by the CPU into physical addresses
as seen on the external memory bus.

Linux defines page tables as a hierarchy which is currently five levels in
height. The architecture code for each supported architecture will then
map this to the restrictions of the hardware.

The physical address corresponding to the virtual address is often referenced
by the underlying physical page frame. The **page frame number** or **pfn**
is the physical address of the page (as seen on the external memory bus)
divided by `PAGE_SIZE`.

Physical memory address 0 will be *pfn 0* and the highest pfn will be
the last page of physical memory the external address bus of the CPU can
address.

With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
and so on until we reach pfn 0xfffff at 0xfffff000. With 16KB pages pfns are
at 0x00004000, 0x00008000 ... 0xffffc000 and pfn goes from 0 to 0x3ffff.

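The arithmetic above can be checked with a small userspace sketch. The helper
names `addr_to_pfn()` and `pfn_to_addr()` are made up for this illustration and
are not kernel interfaces; the page size is passed in as a shift so that the
4KB and 16KB cases can be compared.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only, not kernel interfaces. */

/* Page frame number of a 32-bit physical address, for pages of (1 << shift) bytes. */
static inline uint32_t addr_to_pfn(uint32_t addr, unsigned int shift)
{
	assert(shift < 32);
	return addr >> shift;
}

/* Base address of the page with the given page frame number. */
static inline uint32_t pfn_to_addr(uint32_t pfn, unsigned int shift)
{
	assert(shift < 32);
	return pfn << shift;
}
```

With a shift of 12 the last page at 0xfffff000 has pfn 0xfffff, and with a
shift of 14 the last page at 0xffffc000 has pfn 0x3ffff.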
As you can see, with 4KB pages the page base address uses bits 12-31 of the
address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`.

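A minimal sketch of how these definitions fit together, assuming 4KB pages and
32-bit addresses. `PAGE_SHIFT` and `PAGE_SIZE` follow the text; `PAGE_MASK` is
defined here in the same style, and `page_base()` and `in_page_offset()` are
hypothetical helper names used only in this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12				/* assumed: 4KB pages */
#define PAGE_SIZE	(UINT32_C(1) << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))		/* keeps bits 12-31 */

static_assert(PAGE_SIZE == 4096, "this sketch assumes 4KB pages");

/* Hypothetical helper: base address of the page that contains addr. */
static inline uint32_t page_base(uint32_t addr)
{
	return addr & PAGE_MASK;
}

/* Hypothetical helper: byte offset of addr inside its page (bits 0-11). */
static inline uint32_t in_page_offset(uint32_t addr)
{
	return addr & (PAGE_SIZE - 1);
}
```

The base and the offset together reproduce the original address, so nothing is
lost by storing only the page base (or the pfn) in a page table.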
Over time a deeper hierarchy has been developed in response to increasing memory
sizes. When Linux was created, 4KB pages and a single page table called
`swapper_pg_dir` with 1024 entries were used, covering 4MB, which coincided with
the fact that Torvalds's first computer had 4MB of physical memory. Entries in
this single table were referred to as *PTE*:s - page table entries.

The software page table hierarchy reflects the fact that page table hardware has
become hierarchical, which in turn is done to save page table memory and
speed up mapping.

One could of course imagine a single, linear page table with an enormous number
of entries, breaking down the whole memory into single pages. Such a page table
would be very sparse, because large portions of the virtual memory usually
remain unused. By using hierarchical page tables, large holes in the virtual
address space do not waste valuable page table memory, because it will suffice
to mark large areas as unmapped at a higher level in the page table hierarchy.

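To put a number on this, the sketch below computes the size of a hypothetical
single-level table. The figures used in the note after it are assumptions
chosen for illustration (a 48-bit virtual address space, 4KB pages and 8-byte
entries), not a description of any particular architecture.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed by a hypothetical flat, single-level page table that has
 * one entry for every page of a va_bits-wide virtual address space.
 */
static inline uint64_t flat_table_bytes(unsigned int va_bits,
					unsigned int page_shift,
					uint64_t entry_size)
{
	assert(va_bits >= page_shift && va_bits - page_shift < 64);
	return (UINT64_C(1) << (va_bits - page_shift)) * entry_size;
}
```

Under these assumptions `flat_table_bytes(48, 12, 8)` is 2^39 bytes, that is
512GB for every address space, almost all of it describing unmapped pages.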
Additionally, on modern CPUs, a higher level page table entry can point directly
to a physical memory range, which allows mapping a contiguous range of several
megabytes or even gigabytes in a single high-level page table entry, taking
shortcuts in mapping virtual memory to physical memory: there is no need to
traverse deeper in the hierarchy when you find a large mapped range like this.

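As a rough illustration of the sizes involved, the sketch below computes how
much address space one entry covers at each level. It assumes 4KB pages and 512
entries (9 address bits) per table, which is a common 64-bit arrangement but by
no means universal; `entry_coverage()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumed: 4KB pages */
#define BITS_PER_LEVEL	9	/* assumed: 512 entries per table */

/*
 * Bytes of address space covered by one entry at the given level,
 * where level 0 is the lowest table, level 1 the one above it and so on.
 */
static inline uint64_t entry_coverage(unsigned int level)
{
	assert(level < 5);
	return UINT64_C(1) << (PAGE_SHIFT + BITS_PER_LEVEL * level);
}
```

With these numbers a level 1 entry covers 2MB and a level 2 entry covers 1GB,
which is where the "megabytes or even gigabytes" in the text come from.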
The page table hierarchy has now developed into this::

  +-----+
  | PGD |
  +-----+
     |
     |   +-----+
     +-->| P4D |
         +-----+
            |
            |   +-----+
            +-->| PUD |
                +-----+
                   |
                   |   +-----+
                   +-->| PMD |
                       +-----+
                          |
                          |   +-----+
                          +-->| PTE |
                              +-----+

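How a virtual address selects one entry at each level of this hierarchy can be
sketched as follows. The shift values assume 4KB pages and 9 index bits per
level, as on x86-64 with five-level paging; other architectures and
configurations use different values. The `*_SHIFT` names mirror kernel naming,
while `level_index()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 4KB pages, 9 index bits per level, five levels. */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define P4D_SHIFT	39
#define PGDIR_SHIFT	48
#define PTRS_PER_LEVEL	512	/* the same at every level in this sketch */

/*
 * Hypothetical helper: index into the table whose entries each cover
 * (1 << shift) bytes of virtual address space.
 */
static inline unsigned int level_index(uint64_t vaddr, unsigned int shift)
{
	assert(shift >= PAGE_SHIFT && shift <= PGDIR_SHIFT);
	return (vaddr >> shift) & (PTRS_PER_LEVEL - 1);
}
```

Calling `level_index()` with `PGDIR_SHIFT`, `P4D_SHIFT`, `PUD_SHIFT`,
`PMD_SHIFT` and `PAGE_SHIFT` in turn yields the five indices used on the way
down from the top-level table to the page.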
Symbols on the different levels of the page table hierarchy have the following
meaning beginning from the bottom:

- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type, each
  mapping a single page of virtual memory to a single page of physical memory.
  The architecture defines the size and contents of `pteval_t`.

  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
  upper bits being a **pfn** (page frame number), and the lower bits being some
  architecture-specific bits such as memory protection.

  The **entry** part of the name is a bit confusing because while in Linux 1.0
  this did refer to a single page table entry in the single top level page
  table, it was retrofitted to be an array of mapping elements when two-level
  page tables were first introduced, so the *pte* is the lowermost page
  *table*, not a page table *entry*.

- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy right
  above the *pte*, with `PTRS_PER_PMD` references to the *pte*:s.

- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
  the other levels to handle 4-level page tables. It is potentially unused,
  or *folded* as we will discuss later.

- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
  handle 5-level page tables after the *pud* was introduced. Now it was clear
  that we needed to replace *pgd*, *pmd*, *pud* etc. with a figure indicating the
  directory level and that we cannot go on with ad hoc names any more. This
  is only used on systems which actually have 5 levels of page tables, otherwise
  it is folded.

- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
  main page table handling the PGD for the kernel memory is still found in
  `swapper_pg_dir`, but each userspace process in the system also has its own
  memory context and thus its own *pgd*, found in `struct mm_struct` which
  in turn is referenced from each `struct task_struct`. So tasks have memory
  context in the form of a `struct mm_struct` and this in turn has a
  `pgd_t *pgd` pointer to the corresponding page global directory.

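The "pfn in the upper bits, flags in the lower bits" layout described for
`pteval_t` in the list above can be sketched like this. The bit assignments and
the `PTE_*` flag names are invented for the example, since every architecture
defines its own layout, and the helper names are hypothetical as well.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pteval_t;	/* assumed: a 64-bit entry */

#define PAGE_SHIFT	12	/* assumed: 4KB pages */
#define PTE_FLAGS_MASK	((UINT64_C(1) << PAGE_SHIFT) - 1)

/* Made-up protection bits that occupy the low bits of the entry. */
#define PTE_PRESENT	(UINT64_C(1) << 0)
#define PTE_WRITE	(UINT64_C(1) << 1)

/* Combine a page frame number and flag bits into one entry value. */
static inline pteval_t make_pteval(uint64_t pfn, uint64_t flags)
{
	assert((flags & ~PTE_FLAGS_MASK) == 0);
	return (pfn << PAGE_SHIFT) | flags;
}

/* Extract the page frame number from the upper bits. */
static inline uint64_t pteval_pfn(pteval_t val)
{
	return val >> PAGE_SHIFT;
}

/* Extract the flag bits from the lower bits. */
static inline uint64_t pteval_flags(pteval_t val)
{
	return val & PTE_FLAGS_MASK;
}
```

Because the page base address is always a multiple of the page size, its low
bits are zero, which is what leaves room for the flags in the same word.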
To repeat: each level in the page table hierarchy is an *array of pointers*, so
the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, **p4d**
contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number of
pointers on each level is architecture-defined::

        PMD
  --> +-----+           PTE
      | ptr |-------> +-----+
      | ptr |-        | ptr |-------> PAGE
      | ptr | \       | ptr |
      | ptr |  \        ...
      | ... |   \
      | ptr |    \         PTE
      +-----+     +----> +-----+
                         | ptr |-------> PAGE
                         | ptr |
                           ...

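The pointer chasing in the figure can be modelled in userspace with plain C
arrays. This is a simplified sketch, not kernel code: it has only two levels,
real entries hold physical addresses plus flag bits rather than C pointers, and
the names `pte_table`, `pmd_table` and `lookup()` as well as the tiny table
sizes are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define PTRS_PER_PTE	4	/* tiny tables, for illustration only */
#define PTRS_PER_PMD	4

/* Lowest level: each slot points to a page, or is NULL if unmapped. */
struct pte_table {
	void *page[PTRS_PER_PTE];
};

/* Level above: each slot points to a pte table, or is NULL. */
struct pmd_table {
	struct pte_table *pte[PTRS_PER_PMD];
};

/* Follow pmd -> pte -> page; NULL at either level means "not mapped". */
static inline void *lookup(const struct pmd_table *pmd,
			   unsigned int pmd_idx, unsigned int pte_idx)
{
	assert(pmd_idx < PTRS_PER_PMD && pte_idx < PTRS_PER_PTE);

	const struct pte_table *pte = pmd->pte[pmd_idx];

	if (!pte)
		return NULL;	/* everything below this slot is a hole */
	return pte->page[pte_idx];
}
```

An empty slot in the upper table stands for a whole unmapped range, and no
lower-level table needs to exist for it.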
Page Table Folding
==================

If the architecture does not use all the page table levels, they can be *folded*
which means skipped, and all operations performed on page tables will be
compile-time augmented to just skip a level when accessing the next lower
level.

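The idea can be sketched in a few lines. In this simplified model a folded
*p4d* level has exactly one entry, which is the *pgd* entry itself, so stepping
down a level is a cast that the compiler can optimise away. The types and the
`p4d_offset_folded()` name are stand-ins for illustration and should not be
read as the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t val; } pgd_t;	/* stand-in type */
typedef struct { pgd_t pgd; } p4d_t;	/* folded: wraps a pgd entry */

#define PTRS_PER_P4D	1	/* a folded level has a single entry */

/*
 * Stand-in for the "descend one level" step when the p4d is folded:
 * there is no separate table to index, so the pgd entry is reused.
 */
static inline p4d_t *p4d_offset_folded(pgd_t *pgd, uint64_t addr)
{
	(void)addr;	/* the address selects nothing at a folded level */
	assert(pgd);
	return (p4d_t *)pgd;
}
```

Code written against all five levels still calls the step for the folded
level, but it costs nothing at run time.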
Page table handling code that wishes to be architecture-neutral, such as the
virtual memory manager, will need to be written so that it traverses all five
of the current levels. This style should also be preferred for
architecture-specific code, so as to be robust to future changes.
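Why this style is robust can be illustrated with a small model in which the
walking code always visits five levels and a per-architecture description
decides how many address bits each level consumes. `struct layout` and
`split_vaddr()` are hypothetical names for this sketch; in the kernel the
folding happens at compile time rather than through a run-time table.

```c
#include <assert.h>
#include <stdint.h>

enum { PGD, P4D, PUD, PMD, PTE, NR_LEVELS };

/* Index bits used at each level; 0 marks a folded level. */
struct layout {
	unsigned int page_shift;
	unsigned int bits[NR_LEVELS];
};

/*
 * Compute the table index at every level for vaddr. The loop always
 * visits all five levels; a folded level simply yields index 0.
 */
static inline void split_vaddr(const struct layout *l, uint64_t vaddr,
			       unsigned int idx[NR_LEVELS])
{
	unsigned int shift = l->page_shift;

	for (int lvl = PTE; lvl >= PGD; lvl--) {
		assert(l->bits[lvl] < 32 && shift < 64);
		idx[lvl] = (vaddr >> shift) & ((1u << l->bits[lvl]) - 1);
		shift += l->bits[lvl];
	}
}
```

The same function handles an assumed two-level 32-bit layout, described as
`{ 12, { 10, 0, 0, 0, 10 } }`, and an assumed five-level layout, described as
`{ 12, { 9, 9, 9, 9, 9 } }`, without any change to the loop.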