Saturday, January 23, 2010

ScrobbleShark is alive!

Thanks to a kind fellow from Last.fm, a new web app of mine called ScrobbleShark appears to be working. If you are one of the 100 people on earth who use both Grooveshark and Last.fm - and one of the 10 who also want your Grooveshark plays scrobbled to Last.fm - then check it out. It supports automatic and manual scrobbling, so you can try it out first. Just be sure to click 'pause scrobbling' after setting it up, because automatic is the default option.

It's really easy to use. You log in with your Google account, authenticate with Last.fm, enter your Grooveshark username, and you are up and running. I need some beta testers, so if you only use Grooveshark, why not join Last.fm and give it a shot?

There are apps to visualize your listening history and other fun stuff. You can also meet people who have similar tastes in music, if you want to. It's really cool! Grooveshark is a great player and Last.fm is great for the community and data mining possibilities. ScrobbleShark helps you enjoy the best of both services.

Let me know what you think if you try it.

Wednesday, January 20, 2010

A preview of Mach-O file generation

This month I got back into an x86 compiler I started last May. It lives on github.

The code is a bit of a mess but it mostly works. It generates Mach object files that are linked with gcc to produce executable binaries.

The Big Refactoring of January 2010 has come to an end and the tests pass again. Printing is broken, but it does print something, and more importantly the compiler turns test/test_huge.code into something that works.

After print is fixed I can clean up the code before implementing anything new. I wasn't sure if I'd get back into this or not and am pretty excited about it. I'm learning a lot from this project.

If you are following the Mach-O posts you might want to look at asm/machofile.rb, a library for creating Mach-O files. Using it is quite straightforward; there is an example in the #output method in asm/binary.rb.

Definitely time for bed now!

Monday, January 18, 2010

Basics of the Mach-O file format

This post is part of a series on generating basic x86 Mach-O files with Ruby. The first post introduced CStruct, a Ruby class used to serialize simple struct-like objects.

Please note that the best way to learn about Mach-O properly is to read Apple's documentation on Mach-O, which is pretty good, especially when combined with the comments in /usr/include/mach-o/*.h. These posts will only cover the basics necessary to generate a simple object file for linking with ld or gcc, and are not meant to be comprehensive.

Mach-O File Format Overview

A Mach-O file consists of 2 main pieces: the header and the data. The header is basically a map of the file describing what it contains and the position of everything contained in it. The data comes directly after the header and consists of a number of binary blobs of data, one after the other.

The header contains 3 types of records: the Mach header, segments, and sections. Each binary blob is described by a named section in the header. Sections are grouped into one or more named segments. The Mach header is just one part of the header and should not be confused with the entire header. It contains information about the file as a whole, and specifies the number of segments as well.

Take a quick look at Figure 1 in Apple's Mach-O overview, which illustrates this quite nicely.

A very basic Mach object file consists of a header followed by a single blob of machine code. That blob could be described by a single section named __text, inside a single nameless segment. Here's a diagram showing the layout of such a file:

            ,---------------------------,
  Header    |  Mach header              |
            |    Segment 1              |
            |      Section 1 (__text)   | --,
            |---------------------------|   | 
  Data      |           blob            | <-'
            '---------------------------'      
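
To put numbers on that diagram, here is a rough sketch of where the blob lands in such a file. It assumes the sizes of the 32-bit structs from /usr/include/mach-o/loader.h (the same structs defined later in this post), and the one-byte blob is just a placeholder.

```ruby
# Sizes in bytes of the 32-bit structs in <mach-o/loader.h>.
MACH_HEADER_SIZE     = 28
SEGMENT_COMMAND_SIZE = 56
SECTION_SIZE         = 68

# A single x86 `ret` instruction, standing in for real machine code.
code = "\xC3".b

# The whole header is the Mach header plus one segment plus one section.
header_size = MACH_HEADER_SIZE + SEGMENT_COMMAND_SIZE + SECTION_SIZE

# The data comes directly after the header.
blob_offset = header_size
file_size   = blob_offset + code.bytesize
```

So for this layout the section's file offset would be 152, and everything before that is header.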

The Mach Header

The Mach header contains the architecture (cpu type), the type of file (object in our case), and the number of load commands, which for our purposes means the number of segments. There is more to it, but that's about all we care about. To see exactly what's in a Mach header, fire up a shell and type otool -h /bin/zsh (on a Mac).

Using CStruct we define the Mach header like so:

# Appears at the beginning of every Mach object file.
class MachHeader < CStruct
  uint32 :magic
  int32 :cputype
  int32 :cpusubtype
  uint32 :filetype
  uint32 :ncmds
  uint32 :sizeofcmds
  uint32 :flags
end

# Values for the magic field.
MH_MAGIC = 0xfeedface # Mach magic number (file is in the reader's byte order)
MH_CIGAM = 0xcefaedfe # Byte-swapped magic (file is in the opposite byte order)

# Values for the filetype field (there are several more).
MH_OBJECT = 0x1
MH_EXECUTE = 0x2

# CPU types and subtypes (only Intel for now).
CPU_TYPE_X86 = 7
CPU_TYPE_I386 = CPU_TYPE_X86
CPU_SUBTYPE_X86_ALL = 3
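
If you don't have CStruct handy, roughly the same bytes can be produced with Array#pack alone. This is only a sketch: the local variables mirror the constants above, and the ncmds and sizeofcmds values assume a file with one segment command (56 bytes) followed by one section (68 bytes).

```ruby
mh_magic            = 0xfeedface
mh_object           = 0x1
cpu_type_x86        = 7
cpu_subtype_x86_all = 3

# Pack the seven fields in declaration order. "L<" is an unsigned and
# "l<" a signed 32-bit integer, both little-endian as on x86.
header = [
  mh_magic,             # magic
  cpu_type_x86,         # cputype
  cpu_subtype_x86_all,  # cpusubtype
  mh_object,            # filetype
  1,                    # ncmds (assumed: a single segment command)
  56 + 68,              # sizeofcmds (assumed: one segment plus one section)
  0                     # flags
].pack("L<l<l<L<L<L<L<")

header.bytesize             # 7 fields of 4 bytes each
header[0, 4].unpack("C4")   # the magic as it sits on disk, byte by byte
```

On a little-endian machine the first four bytes of the file read ce fa ed fe, which is why a hex dump of an x86 object file appears to start with the "wrong" magic.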

Segments

Segments, or segment commands, specify where in memory the segment should be loaded by the OS, and the number of bytes to allocate for that segment. They also specify which bytes inside the file are part of that segment, and how many sections it contains.

One benefit to generating an object file rather than an executable is that we let the linker worry about some details. One of those details is where in memory segments will ultimately end up.

Names are optional and can be arbitrary, but the convention is to name segments with uppercase letters preceded by two underscores, e.g. __DATA or __TEXT.

The code exposes some more details about segment commands, but should be easy enough to follow.

# Segment commands are one type of load command.
# Every load command begins with its type, cmd, followed by
# the size of the command in bytes, cmdsize.
class LoadCommand < CStruct
  uint32 :cmd
  uint32 :cmdsize
end

# Values for the cmd member of LoadCommand CStructs (incomplete).
LC_SEGMENT = 0x1
LC_SYMTAB = 0x2

class SegmentCommand < LoadCommand
  string :segname, 16
  uint32 :vmaddr
  uint32 :vmsize
  uint32 :fileoff
  uint32 :filesize
  int32 :maxprot
  int32 :initprot
  uint32 :nsects
  uint32 :flags
end

# Values for protection fields, maxprot and initprot (incomplete).
# (These are bitwise or'd together as in C.)
VM_PROT_NONE = 0x00
VM_PROT_READ = 0x01
VM_PROT_WRITE = 0x02
VM_PROT_EXECUTE = 0x04
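
As a small illustration of the protection flags and of cmdsize, here is a sketch in plain Ruby. The prot_to_s helper is made up for this example and is not part of the compiler; the 56 and 68 are the sizes of the 32-bit segment command and section structs.

```ruby
vm_prot_read    = 0x01
vm_prot_write   = 0x02
vm_prot_execute = 0x04

# Combine flags with bitwise OR, exactly as in C.
read_execute = vm_prot_read | vm_prot_execute

# Hypothetical helper: render a protection value as an "rwx" style string.
def prot_to_s(prot)
  [[0x01, "r"], [0x02, "w"], [0x04, "x"]].map { |bit, ch|
    (prot & bit).zero? ? "-" : ch
  }.join
end

prot_to_s(read_execute)   # => "r-x"

# For a segment command, cmdsize covers the command itself plus all of
# the section structs that follow it.
nsects  = 1
cmdsize = 56 + nsects * 68
```

Keeping cmdsize right matters because anything reading the file uses it to hop from one load command to the next.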

Sections

All sections within a segment are described one after the other directly after each segment command. Sections define their name, address in memory, size, offset of section data within the file, and segment name. The segment name might seem redundant but in the next post we'll see why this is useful information to have in the section header.

Sections can optionally specify a map to addresses within their binary blob, called a relocation table. This is used by the linker. Since we're letting the linker work out where to place everything in memory, the addresses inside our machine code will need to be updated, and the relocation table tells the linker where those addresses are.

By convention sections are named with lowercase letters preceded by two underscores, e.g. __bss or __text.

Finally, the Ruby code describing section structs:

class Section < CStruct
  string :sectname, 16
  string :segname, 16
  uint32 :addr
  uint32 :size
  uint32 :offset
  uint32 :align
  uint32 :reloff
  uint32 :nreloc
  uint32 :flags
  uint32 :reserved1
  uint32 :reserved2
end

# Values for the type bitfield of the flags field (mask 0xff).
# (incomplete)
S_REGULAR = 0x0
S_ZEROFILL = 0x1
S_CSTRING_LITERALS = 0x2
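
Here is a rough sketch of how the type bitfield works and what a packed section looks like, again using only Array#pack. The field values are placeholders, and S_ATTR_PURE_INSTRUCTIONS is an attribute flag taken from mach-o/loader.h rather than from the code above.

```ruby
section_type_mask = 0xff
s_regular         = 0x0

# Assumption: attribute flag from <mach-o/loader.h>, not defined in this post.
s_attr_pure_instructions = 0x80000000

# Attributes live in the high bits, the section type in the low byte.
flags        = s_attr_pure_instructions | s_regular
section_type = flags & section_type_mask

# Two 16-byte null-padded names followed by nine little-endian uint32s.
section = ["__text", "__TEXT", 0, 1, 152, 0, 0, 0, flags, 0, 0]
            .pack("a16a16L<9")

# "Z16" strips the null padding from the names when reading them back.
sectname, segname = section.unpack("Z16Z16")
```

Masking with 0xff recovers S_REGULAR no matter which attribute bits are set.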

macho.rb

As much of the Mach-O format as we need is defined in asm/macho.rb. The Mach header, Segment commands, sections, relocation tables, and symbol table structs are all there, with a few constants as well.

I'll cover symbol tables and relocation tables in my next post.

Looking at real Mach-O files

To see the segments and sections of an object file, run otool -l /usr/lib/crt1.o. The -l flag is for load commands. If you want to see why we stick to generating object files instead of executables, run otool -l /bin/zsh. They are complicated beasts.

If you want to see the actual data for a section, otool provides a couple of ways to do this. The first is to use otool -s <segment> <section> <file> for an arbitrary section. To see the contents of a well-known section, such as __text in the __TEXT segment, use otool -t /usr/bin/true. You can also disassemble the __text section with otool -tv /usr/bin/true.
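
For a feel of what otool -l is doing, here is a minimal sketch that walks the load commands of a file held in a string. The load_commands method is made up for this example. It only understands 32-bit little-endian files with the 0xfeedface magic and rejects everything else, including 64-bit and universal binaries.

```ruby
# Return [cmd, cmdsize] pairs for each load command in a 32-bit
# little-endian Mach-O image stored in a binary string.
def load_commands(data)
  magic, _cputype, _cpusubtype, _filetype, ncmds, _sizeofcmds, _flags =
    data.unpack("L<l<l<L<L<L<L<")
  unless magic == 0xfeedface
    raise ArgumentError, "not a 32-bit little-endian Mach-O file"
  end

  offset = 28   # load commands start right after the 28-byte Mach header
  Array.new(ncmds) do
    cmd, cmdsize = data[offset, 8].unpack("L<L<")
    offset += cmdsize   # cmdsize includes any sections after the command
    [cmd, cmdsize]
  end
end

# Usage, with a path of your choosing:
#   load_commands(File.binread(path))
```

A cmd of 0x1 in the result is an LC_SEGMENT and 0x2 is an LC_SYMTAB.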

You'll get to know otool quite well if you work with Mach-O.

Take a break!

That was probably a lot to digest, and to make real sense of it you might need to read some of the official documentation.

We're close to being able to describe a minimal Mach object file that can be linked, and the resulting binary executed. By the end of the next post we'll be there.

(You can almost do that with what we know now. If you create a Mach file with a Mach header (ncmds=1), a single unnamed segment (nsects=1), and then a section named __text with a segment name of __TEXT, and some x86 machine code as the section data, you would almost have a useful Mach object file.)
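
That recipe can be sketched with nothing but Array#pack. Treat it as an illustration of the layout, not as the compiler's actual output: the protection values, the single ret instruction, and the output file name are assumptions.

```ruby
code = "\xC3".b   # x86 `ret`; any machine code would do

# Mach header (28) + segment command (56) + section (68) bytes.
header_size = 28 + 56 + 68

# magic, cputype, cpusubtype, filetype (MH_OBJECT), ncmds, sizeofcmds, flags
mach_header = [0xfeedface, 7, 3, 0x1, 1, 56 + 68, 0].pack("L<l<l<L<L<L<L<")

# cmd (LC_SEGMENT), cmdsize, segname (unnamed), vmaddr, vmsize, fileoff,
# filesize, maxprot, initprot (read | write | execute), nsects, flags
segment = [0x1, 56 + 68, "", 0, code.bytesize, header_size, code.bytesize,
           0x07, 0x07, 1, 0].pack("L<L<a16L<L<L<L<l<l<L<L<")

# sectname, segname, addr, size, offset, align, reloff, nreloc, flags,
# reserved1, reserved2
section = ["__text", "__TEXT", 0, code.bytesize, header_size, 0, 0, 0, 0x0, 0, 0]
            .pack("a16a16L<9")

object_file = mach_header + segment + section + code

# File.binwrite("almost.o", object_file)   # hypothetical output path
```

The result is 153 bytes long, with the machine code sitting at offset 152 exactly as the section header claims.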

Till next time, happy hacking!

Sunday, January 17, 2010

Working with C-style structs in Ruby

This is the beginning of a series on generating Mach-O object files in Ruby. We start small by introducing some Ruby tools that are useful when working with binary data. Subsequent articles will cover a subset of the Mach-O file format, then generating Mach object files suitable for linking with ld or gcc to produce working executables. A basic knowledge of Ruby and C is assumed. You can likely wing it on the Ruby side of things if you know any similar languages.

First we need to read and write structured binary files with Ruby. Array#pack and String#unpack get the job done at a low level, but every time I use them I have to look up the documentation. It would also be nice to encapsulate serializing and deserializing into classes describing the various binary data structures. The built-in Struct class sounds promising but did not meet my needs, nor was it easily extended to meet them.
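
As a quick refresher, here is what doing it by hand looks like. The values are made up for illustration.

```ruby
# Pack a signed 32-bit int, an unsigned 32-bit int, and a 16-byte
# null-padded string. The "<" suffix forces little-endian byte order.
fields = [-1, 0xfeedface, "__text"]
binary = fields.pack("l<L<a16")

binary.bytesize   # 4 + 4 + 16 bytes

# String#unpack reverses the process. "Z16" reads 16 bytes and drops the
# null padding, whereas "a16" would keep it.
type, magic, name = binary.unpack("l<L<Z16")
```

The directive strings are terse and the two sides have to be kept in sync by hand, which is exactly the chore worth wrapping in a class.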

Meet CStruct, a class that you can use to describe a binary structure, somewhat similar to how you would do it in C. Subclassing CStruct results in a class whose instances can be serialized, and unserialized, with little effort. You can subclass descendants of CStruct to extend them with additional members. CStruct does not implement much more than is necessary for the compiler. For example there is no support for floating point. If you want to use this for more general purpose tasks be warned that it may require some work. Anything supported by Array#pack is fairly easy to add though.

First a quick example and then we'll get into the CStruct class itself. In C you may write the following to have one struct "inherit" from another:

struct Base {
    int type;
    int flags;
};

struct Frobulator {
    struct Base base; /* Frobulator begins with 2 ints for type and flags */
    unsigned int frob_limit;
};

struct Snazzifier {
    struct Base base;
    int snazz_level;
};


With CStruct in Ruby that translates to:

require 'cstruct'

class Base < CStruct
  int :type
  int :flags
end

class Frobulator < Base
  uint :frob_limit
end

class Snazzifier < Base
  int :snazz_level
end


CStructs act like Ruby's built-in Struct to a certain extent. They are instantiated the same way, by passing values to #new in the same order they are defined in the class. You can find out the size (in bytes) of a CStruct instance using the #bytesize method, or of any member using #sizeof(name).

The most important method (for us) is #serialize, which returns a binary string representing the contents of the CStruct.

(I know that CStruct.new_from_bin should be called CStruct.unserialize. You can see where my focus was when I wrote it.)

CStruct#serialize automatically creates a "pack pattern", which is an array of strings used to pack each member in turn. The pack pattern is mapped to the result of calling Array#pack on each corresponding member, and then the resulting strings are joined together. Serializing strings complicates matters so we cannot build up a pack pattern string and then serialize it in one go, but conceptually it's quite similar.
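
To make the idea concrete, here is a stripped-down sketch of the pack-pattern approach. TinyStruct is a hypothetical stand-in written for this post and is not the real CStruct: it skips unserializing, #bytesize, and #sizeof, and its method names are my own.

```ruby
# TinyStruct is a hypothetical stand-in for CStruct, not the real class.
class TinyStruct
  # Each member is [name, pack directive]. A subclass starts with a copy
  # of its parent's members, which gives the struct "inheritance".
  def self.members
    @members ||= (superclass.respond_to?(:members) ? superclass.members.dup : [])
  end

  def self.int32(name)
    add_member(name, "l<")
  end

  def self.uint32(name)
    add_member(name, "L<")
  end

  def self.string(name, size)
    add_member(name, "a#{size}")
  end

  def self.add_member(name, directive)
    members << [name, directive]
    attr_accessor name
  end

  # Values are given in the order the members were declared.
  def initialize(*values)
    self.class.members.zip(values) { |(name, _), value| send("#{name}=", value) }
  end

  # The pack pattern is an array of directives, one per member.
  def pack_pattern
    self.class.members.map { |_, directive| directive }
  end

  # Pack each member with its own directive, then join the pieces.
  def serialize
    self.class.members.map { |name, directive| [send(name)].pack(directive) }.join
  end
end

class Base < TinyStruct
  int32 :type
  int32 :flags
end

class Frobulator < Base
  uint32 :frob_limit
end

frob  = Frobulator.new(1, 0, 42)
bytes = frob.serialize   # type, flags, then frob_limit
```

Packing member by member is a little wasteful compared to one big Array#pack call, but it keeps each member's handling in one place.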

Unserializing is the same process in reverse, and was mainly added for completeness and testing purposes.

That's about all you need to know to use CStruct. The code needs some work but I decided to just go with what I have already so I can get on with the more interesting and fun tasks.

Next in this series: Basics of the Mach-O file format