Monday, January 18, 2010

Basics of the Mach-O file format

This post is part of a series on generating basic x86 Mach-O files with Ruby. The first post introduced CStruct, a Ruby class used to serialize simple struct-like objects.

Please note that the best way to learn about Mach-O properly is to read Apple's documentation on Mach-O, which is pretty good, especially when combined with the comments in /usr/include/mach-o/*.h. These posts only cover the basics necessary to generate a simple object file for linking with ld or gcc, and are not meant to be comprehensive.

Mach-O File Format Overview

A Mach-O file consists of 2 main pieces: the header and the data. The header is basically a map of the file describing what it contains and the position of everything contained in it. The data comes directly after the header and consists of a number of binary blobs of data, one after the other.

The header contains 3 types of records: the Mach header, segments, and sections. Each binary blob is described by a named section in the header. Sections are grouped into one or more named segments. The Mach header is just one part of the header and should not be confused with the entire header. It contains information about the file as a whole, and specifies the number of load commands (for now, that just means segments) as well.

Take a quick look at Figure 1 in Apple's Mach-O overview, which illustrates this quite nicely.

A very basic Mach object file consists of a header followed by a single blob of machine code. That blob could be described by a single section named __text, inside a single nameless segment. Here's a diagram showing the layout of such a file:


            ,---------------------------,
  Header    |  Mach header              |
            |    Segment 1              |
            |      Section 1 (__text)   | --,
            |---------------------------|   | 
  Data      |           blob            | <-'
            '---------------------------'      
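
To make the diagram concrete, here is a small sketch that works out where the blob starts in such a file. The sizes are those of the 32-bit structs in /usr/include/mach-o/loader.h; the constant names and the data_offset helper are my own for illustration and are not part of the code in this series.

```ruby
# Sizes in bytes of the 32-bit structs declared in <mach-o/loader.h>.
MACH_HEADER_SIZE     = 28  # struct mach_header: 7 x uint32
SEGMENT_COMMAND_SIZE = 56  # struct segment_command: 2 x uint32, char[16], 8 x uint32
SECTION_SIZE         = 68  # struct section: 2 x char[16], 9 x uint32

# Hypothetical helper: offset of the first data byte in a file with one
# segment containing `nsects` sections and no other load commands.
def data_offset(nsects)
  MACH_HEADER_SIZE + SEGMENT_COMMAND_SIZE + nsects * SECTION_SIZE
end

puts data_offset(1)  # 152
```

So in the single-section file pictured above, the blob would begin 152 bytes into the file.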

The Mach Header

The Mach header contains the architecture (cpu type), the type of file (object in our case), and the number of load commands, which for our purposes means the number of segments. There is more to it, but that's about all we care about. To see exactly what's in a Mach header, fire up a shell and type otool -h /bin/zsh (on a Mac).

Using CStruct we define the Mach header like so:
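
For readers without CStruct handy, roughly the same layout can be sketched with plain Array#pack. This is an approximation and not the code from asm/macho.rb: the field order follows struct mach_header in /usr/include/mach-o/loader.h, the constant values come from the system headers, and the pack_mach_header helper is a hypothetical name.

```ruby
MH_MAGIC             = 0xfeedface  # 32-bit Mach-O magic number
CPU_TYPE_X86         = 7
CPU_SUBTYPE_I386_ALL = 3
MH_OBJECT            = 0x1         # relocatable object file

# Hypothetical helper: serializes a 32-bit struct mach_header.
# Fields: magic, cputype, cpusubtype, filetype, ncmds, sizeofcmds, flags.
# "V" packs one unsigned 32-bit little-endian integer.
def pack_mach_header(ncmds:, sizeofcmds:, flags: 0)
  [MH_MAGIC, CPU_TYPE_X86, CPU_SUBTYPE_I386_ALL, MH_OBJECT,
   ncmds, sizeofcmds, flags].pack("V7")
end

# One load command (a segment with one section) occupying 56 + 68 bytes.
header = pack_mach_header(ncmds: 1, sizeofcmds: 56 + 68)
puts header.bytesize               # 28
puts header.unpack1("V").to_s(16)  # feedface
```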

Segments

Segments, or segment commands, specify where in memory the segment should be loaded by the OS, and the number of bytes to allocate for that segment. They also specify which bytes inside the file are part of that segment, and how many sections it contains.

One benefit to generating an object file rather than an executable is that we let the linker worry about some details. One of those details is where in memory segments will ultimately end up.

Names are optional and can be arbitrary, but the convention is to name segments with uppercase letters preceded by two underscores, e.g. __DATA or __TEXT.

The code exposes some more details about segment commands, but should be easy enough to follow.
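
As a rough stand-in using Array#pack instead of CStruct, a 32-bit segment command could be serialized like this. The field order follows struct segment_command in /usr/include/mach-o/loader.h; the pack_segment_command helper and its defaults (address 0, all protection bits set) are assumptions made for illustration.

```ruby
LC_SEGMENT = 0x1  # load command type for a 32-bit segment

# Hypothetical helper: serializes a 32-bit struct segment_command.
# Fields: cmd, cmdsize, segname[16], vmaddr, vmsize, fileoff, filesize,
# maxprot, initprot, nsects, flags.
# "a16" is a 16-byte NUL-padded name; "V" is an unsigned 32-bit LE integer.
def pack_segment_command(segname:, vmsize:, fileoff:, filesize:, nsects:,
                         vmaddr: 0, maxprot: 7, initprot: 7, flags: 0)
  cmdsize = 56 + nsects * 68  # the command size includes its section headers
  [LC_SEGMENT, cmdsize, segname, vmaddr, vmsize, fileoff, filesize,
   maxprot, initprot, nsects, flags].pack("VVa16V8")
end

# A nameless segment holding one section with 6 bytes of data at offset 152.
seg = pack_segment_command(segname: "", vmsize: 6, fileoff: 152,
                           filesize: 6, nsects: 1)
puts seg.bytesize  # 56
```

Note that cmdsize counts the section headers that follow the segment command, which is how a reader of the file knows where the next load command starts.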

Sections

All sections within a segment are described one after the other directly after each segment command. Sections define their name, address in memory, size, offset of section data within the file, and segment name. The segment name might seem redundant but in the next post we'll see why this is useful information to have in the section header.

Sections can optionally specify a map to addresses within their binary blob, called a relocation table. This is used by the linker. Since we're letting the linker work out where to place everything in memory, the addresses inside our machine code will need to be updated.

By convention sections are named with lowercase letters preceded by two underscores, e.g. __bss or __text.

Finally, the Ruby code describing section structs:
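
A comparable sketch with Array#pack, again an approximation rather than the CStruct definition from asm/macho.rb. The field order follows the 32-bit struct section in /usr/include/mach-o/loader.h; pack_section is a hypothetical helper, and the relocation fields are simply left at zero.

```ruby
# Hypothetical helper: serializes a 32-bit struct section.
# Fields: sectname[16], segname[16], addr, size, offset, align,
# reloff, nreloc, flags, reserved1, reserved2.
def pack_section(sectname:, segname:, size:, offset:,
                 addr: 0, align: 0, reloff: 0, nreloc: 0, flags: 0)
  [sectname, segname, addr, size, offset, align, reloff, nreloc,
   flags, 0, 0].pack("a16a16V9")
end

sect = pack_section(sectname: "__text", segname: "__TEXT",
                    size: 6, offset: 152)
puts sect.bytesize                  # 68
puts sect.unpack("Z16Z16").inspect  # ["__text", "__TEXT"]
```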

macho.rb

As much of the Mach-O format as we need is defined in asm/macho.rb. The Mach header, Segment commands, sections, relocation tables, and symbol table structs are all there, with a few constants as well.

I'll cover symbol tables and relocation tables in my next post.

Looking at real Mach-O files

To see the segments and sections of an object file, run otool -l /usr/lib/crt1.o (-l is for load commands). If you want to see why we stick to generating object files instead of executables, run otool -l /bin/zsh. They are complicated beasts.

If you want to see the actual data for a section, otool provides a couple of ways to do this. The first is to use otool -s <segment> <section> <file> for an arbitrary section. To see the contents of a well-known section, such as __text in the __TEXT segment, use otool -t /usr/bin/true. You can also disassemble the __text section with otool -tv /usr/bin/true.

You'll get to know otool quite well if you work with Mach-O.
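
If you'd rather poke at files from Ruby, here is a minimal sketch that classifies a file by its first four bytes. The magic numbers come from /usr/include/mach-o/loader.h and /usr/include/mach-o/fat.h; the MAGICS table and describe_magic helper are hypothetical names, and the sketch only recognizes little-endian Mach-O files plus universal binaries.

```ruby
# Well-known magic numbers, as they appear when the first four bytes of
# the file are read as a little-endian 32-bit integer.
MAGICS = {
  0xfeedface => "32-bit Mach-O",
  0xfeedfacf => "64-bit Mach-O",
  0xbebafeca => "universal (fat) binary",  # stored big-endian as 0xcafebabe
}.freeze

# Hypothetical helper: classifies a file by its first four bytes.
def describe_magic(bytes)
  return "too short" if bytes.nil? || bytes.bytesize < 4
  MAGICS.fetch(bytes.unpack1("V"), "not a little-endian Mach-O file")
end

puts describe_magic([0xfeedface].pack("V"))  # 32-bit Mach-O
# On a Mac: puts describe_magic(File.binread("/bin/zsh", 4))
```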

Take a break!

That was probably a lot to digest, and to make real sense of it you might need to read some of the official documentation.

We're close to being able to describe a minimal Mach object file that can be linked, and the resulting binary executed. By the end of the next post we'll be there.

(You can almost do that with what we know now. If you create a Mach file with a Mach header (ncmds=1), a single unnamed segment (nsects=1), and then a section named __text with a segment name of __TEXT, and some x86 machine code as the section data, you would almost have a useful Mach object file.)
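
Here is a rough sketch of that "almost useful" file, built with plain Array#pack rather than CStruct. The struct layouts follow /usr/include/mach-o/loader.h, but the variable names, the zeroed section flags, and the sample machine code are my own assumptions, and I have not run the result through a linker. Without a symbol table there is nothing for the linker to refer to yet.

```ruby
# x86 machine code for: mov eax, 0 ; ret
code = [0xb8, 0, 0, 0, 0, 0xc3].pack("C*")

# mach_header (28) + segment_command (56) + section (68) bytes of header.
header_size = 28 + 56 + 68

# magic, cputype (x86), cpusubtype, filetype (object), ncmds, sizeofcmds, flags
mach_header = [0xfeedface, 7, 3, 0x1, 1, 56 + 68, 0].pack("V7")

# cmd (LC_SEGMENT), cmdsize, segname, vmaddr, vmsize, fileoff, filesize,
# maxprot, initprot, nsects, flags
segment = [0x1, 56 + 68, "", 0, code.bytesize, header_size, code.bytesize,
           7, 7, 1, 0].pack("VVa16V8")

# sectname, segname, addr, size, offset, align, reloff, nreloc, flags,
# reserved1, reserved2
section = ["__text", "__TEXT", 0, code.bytesize, header_size,
           0, 0, 0, 0, 0, 0].pack("a16a16V9")

object_file = mach_header + segment + section + code
puts object_file.bytesize  # 158
# File.binwrite("minimal.o", object_file)  # then inspect with otool -l minimal.o
```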

Till next time, happy hacking!
