Tunnelling TCP connections into iOS on QEMU

By Lev Aronsky (@levaronsky)
March 29, 2020



Thanks to the fantastic work by Jonathan Afek, it is possible to boot an iOS image with QEMU. The current state of the project allows execution of arbitrary binaries, including bash (with the I/O happening over an emulated serial port).

While controlling the OS over a serial shell is great for a PoC, we wanted a more robust solution. The first thing that comes to mind when controlling remote systems is SSH: it allows connection multiplexing, file transfers, and more. However, to reap all those benefits, we had to establish a TCP connection with iOS. And since our QEMU system does not emulate a network card, this proved a challenging task.

This post will describe the solution we developed to enable communications with an emulated iOS system.

Complicated Alternatives

When running a regular Linux OS under QEMU, one can use the virt machine with an emulated network card. Unfortunately, with iOS, it’s not a simple matter of copying the emulated network card code into our n66 machine: iOS doesn’t have the appropriate drivers that would utilize the card, not to mention the additional hardware that would need to be emulated, such as the IO bus.

Therefore, in order to add a proper network interface to our iOS emulation, we had 2 options:

  1. Take the I/O bus and network card code from virt, and develop relevant drivers for iOS that would communicate with this emulated hardware. The main challenge with this approach stems from the lack of driver development tools for iOS.
  2. Utilize the existing drivers in iOS, and develop emulated hardware that the drivers will communicate with. This requires a thorough reverse engineering effort directed at the IO and network drivers in iOS. Given the lack of symbols, and the vast amount of functionality of the real hardware that would have to be emulated, this approach would probably prove too expensive, as well.

We wanted a solution with fewer potential complications, and drew inspiration from the realm of virtualization.

QEMU Guest Services

Virtualization 101

In virtualization software (such as VMware Workstation, VirtualBox, etc.), it is common to include a software suite that can be installed in the guest OS (the nomenclature differs between providers: VMware calls it VMware Tools, VirtualBox calls it Guest Additions). The purpose of that software is to enrich the capabilities of the guest OS by providing a direct communication channel with the host. Enhancements include features like clipboard sharing, drag-and-drop file copying, and more. What enables the implementation of the above features is a special virtualization opcode that is used for direct communications from the guest to the host. The opcode differs from architecture to architecture, and sometimes from manufacturer to manufacturer: Intel uses vmcall, AMD uses vmmcall, and ARM uses hvc.

The QEMU Call Opcode

Since our QEMU system that executes an iOS image is akin to a hypervisor, we chose to take a similar approach, and define an opcode that can be used by the guest (iOS) to call out to the host (QEMU) for arbitrary functionality (We call this a QEMU Call). We wanted to keep the changes to the core code of QEMU to a minimum - therefore, we preferred not to introduce a new opcode. Overriding the functionality of hvc was also an option we wanted to avoid. However, QEMU supports definition of custom system registers, with user-defined implementations. This suited us perfectly, and provided us with a great location to introduce callbacks that will be executed when the guest (iOS) needs a service from QEMU, the host:

static const ARMCPRegInfo n66_cp_reginfo[] = {
    // Apple-specific registers
    N66_CPREG_DEF(ARM64_REG_HID11, 3, 0, 15, 13, 0, PL1_RW),
    N66_CPREG_DEF(ARM64_REG_HID3, 3, 0, 15, 3, 0, PL1_RW),
    N66_CPREG_DEF(ARM64_REG_HID5, 3, 0, 15, 5, 0, PL1_RW),
    N66_CPREG_DEF(ARM64_REG_HID4, 3, 0, 15, 4, 0, PL1_RW),
    N66_CPREG_DEF(ARM64_REG_HID8, 3, 0, 15, 8, 0, PL1_RW),
    N66_CPREG_DEF(ARM64_REG_HID7, 3, 0, 15, 7, 0, PL1_RW),
    N66_CPREG_DEF(ARM64_REG_LSU_ERR_STS, 3, 3, 15, 0, 0, PL1_RW),
    N66_CPREG_DEF(PMC0, 3, 2, 15, 0, 0, PL1_RW),
    N66_CPREG_DEF(PMC1, 3, 2, 15, 1, 0, PL1_RW),
    N66_CPREG_DEF(PMCR1, 3, 1, 15, 1, 0, PL1_RW),
    N66_CPREG_DEF(PMSR, 3, 1, 15, 13, 0, PL1_RW),

    // Aleph-specific registers for communicating with QEMU

    { .cp = CP_REG_ARM64_SYSREG_CP, .name = "REG_QEMU_CALL",
      .opc0 = 3, .opc1 = 3, .crn = 15, .crm = 15, .opc2 = 0,
      .access = PL0_RW, .type = ARM_CP_IO, .state = ARM_CP_STATE_AA64,
      .readfn = qemu_call_status,
      .writefn = qemu_call },

    REGINFO_SENTINEL
};


In the above snippet, there is a new system register definition, named REG_QEMU_CALL. Similar to the Apple-specific registers defined before, we define our new custom register as an instance of QEMU’s ARMCPRegInfo struct.

The fields opc0, opc1, crn, crm, and opc2 are used to identify a system register, and differentiate it from the rest. They are needed to construct the opcodes (mrs/msr) for accessing the register. There are several restrictions on the choice of those fields, but the main priority is to choose a unique combination that does not clash with an existing system register.

The access field instructs QEMU about the access restrictions to the register based on the current PL (Privilege Level). Our register sets access to PL0_RW, making it readable and writable at PL0/EL0 and above.

Finally, the fields readfn and writefn define the callbacks that will be executed upon reading and writing the register, respectively. While reading the register might be useful in the future, for our purposes we only need write access for now. Therefore, readfn is a stub:

uint64_t qemu_call_status(CPUARMState *env, const ARMCPRegInfo *ri)
{
    return 0;
}


When a userspace application needs to perform an action that’s implemented by the operating system (for example, file system access, network access, etc.), it performs a system call. This is very similar to the way a guest operating system performs a call to the hypervisor (and, in fact, the hypervisor call functionality is engineered to resemble system calls). Both a system call and a hypervisor call usually identify the required functionality by a number passed in the first register (that’s the system call number, or the hypercall number). Additional arguments are usually passed either in other registers, or in memory pointed by the registers.

We decided to follow a similar convention for our QEMU call, with a small change. To minimize the need for inline assembly (which is usually required for making system and hypervisor calls, since the arguments have to be stored in specific registers), we chose to store all the arguments (including the number of the QEMU call) in memory. We defined the following struct to make working with the data easier:

typedef struct __attribute__((packed)) {
    // Request
    qemu_call_number_t call_number;
    union {
        // File Descriptors API
        qc_close_args_t close;
        qc_fcntl_args_t fcntl;
        // Socket API
        qc_socket_args_t socket;
        qc_accept_args_t accept;
        qc_bind_args_t bind;
        qc_connect_args_t connect;
        qc_listen_args_t listen;
        qc_recv_args_t recv;
        qc_send_args_t send;
    } args;

    // Response
    int64_t retval;
    int64_t error;
} qemu_call_t;

The struct begins with the call_number field, which identifies the requested functionality (qemu_call_number_t is an enum). The arguments to the QEMU call follow. Since each call number comes with a corresponding set of arguments, they appear in a union - there won’t be a situation where two or more types of arguments should be accessed concurrently. The rest of the structure contains the retval and error fields, used to signal the caller with the results of the QEMU call.

But how would the handler of the QEMU Call know where to look for that data? Since we implement QEMU calls via (write) access to REG_QEMU_CALL, we can simply use the address of the QEMU call data as the value written. That way, when the writefn callback is executed, we simply read the data from the written address:

void qemu_call(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
{
    CPUState *cpu = qemu_get_cpu(0);
    qemu_call_t qcall;

    // Read the request
    cpu_memory_rw_debug(cpu, value, (uint8_t*) &qcall, sizeof(qcall), 0);

At this point, we have the data for the QEMU call in qcall, and can parse it to choose the correct functionality:

switch (qcall.call_number) {
    // File Descriptors
    case QC_CLOSE:
        qcall.retval = qc_handle_close(cpu, qcall.args.close.fd);
        break;
    // ... more cases ...
    default:
        // TODO: handle unknown call numbers
        break;
}

After handling the call, we populate the retval and error fields of qcall, and complete the callback. Control returns to the guest, which resumes execution from the instruction following the write access to REG_QEMU_CALL. At that point, since the functionality has been provided and the status is already in memory, the guest can simply read the returned data and act accordingly.
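The full round trip (the guest writes an address, QEMU reads the request, dispatches it, and writes the response back) can be modeled in miniature. Below is a self-contained sketch; the struct layout and names here are simplified stand-ins for the real definitions, and guest memory is simulated by a flat byte array:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

// Toy model of the QEMU-call round trip described above.
typedef enum { QC_CLOSE = 1 } toy_call_number_t;

typedef struct __attribute__((packed)) {
    uint32_t call_number;
    int32_t  fd;        /* stand-in for the args union */
    int64_t  retval;
    int64_t  error;
} toy_qemu_call_t;

static uint8_t guest_mem[4096];  /* simulated guest memory */

// Model of the writefn callback: read the request at the written
// address, handle it, and write the response back to the same address.
static void toy_qemu_call(uint64_t value)
{
    toy_qemu_call_t qcall;
    memcpy(&qcall, &guest_mem[value], sizeof(qcall));   /* read request */

    switch (qcall.call_number) {
    case QC_CLOSE:
        qcall.retval = 0;       /* pretend the close succeeded */
        qcall.error = 0;
        break;
    default:
        qcall.retval = -1;
        qcall.error = ENOSYS;   /* unknown call number */
        break;
    }

    memcpy(&guest_mem[value], &qcall, sizeof(qcall));   /* write response */
}
```

The key point the sketch captures is that a single register write carries both directions of the exchange: the request is read from, and the response written to, the same guest address.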

Implementing System APIs

To make programming with QEMU calls as straightforward as possible, we wanted to avoid exposing complex functionality. Instead, we chose to match each call with a POSIX system call. By taking this approach, we gain 2 important benefits:

  1. On the QEMU side, the implementation of each call mostly boils down to making the appropriate system call, with some state data stored locally.
  2. On the guest side, it’s even more straightforward: for each implemented QEMU call, we have a wrapper that matches the underlying system call in signature. It simply populates a qemu_call_t structure, executes the QEMU call opcode, and reads the return value and/or error from the result.

Following are a couple of examples.

Sockets API (QEMU side)

Each system call we implement (socket, accept, bind, connect, listen, recv, and send) has a matching handler function, executed based on the QEMU call number parsed from the request. The arguments (taken from the request) match the underlying system call APIs, with the addition of the cpu argument (used for accessing the guest memory). These are the handler declarations:

int32_t qc_handle_socket(CPUState *cpu, int32_t domain, int32_t type,
                         int32_t protocol);
int32_t qc_handle_accept(CPUState *cpu, int32_t sckt, struct sockaddr *addr,
                         socklen_t *addrlen);
int32_t qc_handle_bind(CPUState *cpu, int32_t sckt, struct sockaddr *addr,
                       socklen_t addrlen);
int32_t qc_handle_connect(CPUState *cpu, int32_t sckt, struct sockaddr *addr,
                          socklen_t addrlen);
int32_t qc_handle_listen(CPUState *cpu, int32_t sckt, int32_t backlog);
int32_t qc_handle_recv(CPUState *cpu, int32_t sckt, void *buffer,
                       size_t length, int32_t flags);
int32_t qc_handle_send(CPUState *cpu, int32_t sckt, void *buffer,
                       size_t length, int32_t flags);

Following is the code of the socket handler:

int32_t qc_handle_socket(CPUState *cpu, int32_t domain, int32_t type,
                         int32_t protocol)
{
    int retval = find_free_socket();

    if (retval < 0) {
        guest_svcs_errno = ENOTSOCK;
    } else if ((guest_svcs_fds[retval] = socket(domain, type, protocol)) < 0) {
        retval = -1;
        guest_svcs_errno = errno;
    }

    return retval;
}

The implementation is rather straightforward:

  1. find_free_socket looks for an unoccupied (i.e., containing -1) cell in a local integer array representing QEMU file descriptors.
  2. If no unoccupied cells were found, ENOTSOCK is set as the error, and the function completes.
  3. Otherwise, socket is called with the passed arguments.
  4. If the call is successful, the result (the allocated file descriptor) is stored in the file descriptor array (in the spot found in step 1).
  5. If the call fails, the matching error number is set.
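The file descriptor table that backs this logic can be sketched in a few lines (the names mirror the handler above, but the table size and helpers here are illustrative, not the project's actual values):

```c
#include <assert.h>

#define MAX_FDS 16

/* QEMU-side file descriptor table: each cell holds a host file
 * descriptor, and -1 marks an unoccupied cell. */
static int guest_svcs_fds[MAX_FDS];

static void fd_table_init(void)
{
    for (int i = 0; i < MAX_FDS; i++)
        guest_svcs_fds[i] = -1;
}

/* Stand-in for find_free_socket: return the index of the first
 * unoccupied cell, or -1 when the table is full. */
static int find_free_socket(void)
{
    for (int i = 0; i < MAX_FDS; i++) {
        if (guest_svcs_fds[i] == -1)
            return i;
    }
    return -1;
}
```

The index into this table, rather than the host file descriptor itself, is what gets returned to the guest, which keeps the guest isolated from QEMU's real descriptor space.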

Another example - the code of the send handler:

int32_t qc_handle_send(CPUState *cpu, int32_t sckt, void *g_buffer,
                       size_t length, int32_t flags)
{
    VERIFY_FD(sckt);

    uint8_t buffer[MAX_BUF_SIZE];

    int retval = -1;

    if (length > MAX_BUF_SIZE) {
        guest_svcs_errno = ENOMEM;
    } else {
        cpu_memory_rw_debug(cpu, (target_ulong) g_buffer, buffer, length, 0);

        if ((retval = send(guest_svcs_fds[sckt], buffer, length, flags)) < 0) {
            guest_svcs_errno = errno;
        }
    }

    return retval;
}

This implementation is rather simple, and similar to socket in structure:

  1. VERIFY_FD makes sure the passed socket number (a file descriptor, really) is valid. The number is actually an index into the file descriptors array, and VERIFY_FD simply makes sure the matching cell in the array is valid (i.e., not set to -1).
  2. For simplicity, the buffer for transferring the data from the guest is statically allocated, and we simply make sure the sent data length doesn’t exceed that limit.
  3. The data to be sent is copied from the guest into the locally allocated buffer.
  4. send is called with the passed arguments (and the pointer to the local buffer). Its return value is used as the return value of the QEMU call, and in case of an error, the error number is set, too.

Sockets API (Guest side)

Following is the function that performs the actual QEMU calls (i.e., writes to the REG_QEMU_CALL system register):

void qemu_call(qemu_call_t *qcall)
{
    // Pass the address of the qemu_call_t structure in x0.
    asm volatile ("mov x0, %[addr]"::[addr] "r" (qcall));
    // msr s3_3_c15_c15_0, x0 (write to REG_QEMU_CALL), encoded manually:
    asm volatile (".byte 0x00");
    asm volatile (".byte 0xff");
    asm volatile (".byte 0x1b");
    asm volatile (".byte 0xd5");
}

It simply puts the address of the allocated qemu_call_t structure in x0, and moves x0 to REG_QEMU_CALL (this instruction cannot be written in standard assembly, and therefore has to be manually encoded as 4 bytes).
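Those four bytes are the little-endian encoding of msr s3_3_c15_c15_0, x0, matching the opc0/opc1/crn/crm/opc2 values from the register definition earlier. The encoding can be sanity-checked with a few lines of C (msr_encoding is a helper written for this post, following the MSR (register) instruction layout, where op0 = 2 + o0):

```c
#include <assert.h>
#include <stdint.h>

/* Build an AArch64 MSR (register) instruction word:
 * 0xD5100000 | o0<<19 | op1<<16 | CRn<<12 | CRm<<8 | op2<<5 | Rt. */
static uint32_t msr_encoding(uint32_t o0, uint32_t op1, uint32_t crn,
                             uint32_t crm, uint32_t op2, uint32_t rt)
{
    return 0xD5100000u | (o0 << 19) | (op1 << 16) | (crn << 12) |
           (crm << 8) | (op2 << 5) | rt;
}
```

For o0=1 (op0=3), op1=3, CRn=15, CRm=15, op2=0, Rt=x0, this yields 0xD51BFF00, i.e. the bytes 00 ff 1b d5 in little-endian order, exactly as emitted above.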

While the header files for QEMU calls are partially shared between QEMU and guest development (to keep the structures, like qemu_call_t, in sync), the implemented methods are different. So instead of handlers for QEMU calls, there are simply APIs that match the underlying system calls in name:

int qc_socket(int domain, int type, int protocol);
int qc_accept(int sckt, struct sockaddr *addr, socklen_t *addrlen);
int qc_bind(int sckt, const struct sockaddr *addr, socklen_t addrlen);
int qc_connect(int sckt, const struct sockaddr *addr, socklen_t addrlen);
int qc_listen(int sckt, int backlog);
ssize_t qc_recv(int sckt, void *buffer, size_t length, int flags);
ssize_t qc_send(int sckt, const void *buffer, size_t length, int flags);

All of the above functions are implemented as wrappers around the following function, that simply executes qemu_call with the passed data, sets the error number, and returns the return value:

static int qemu_sckt_call(qemu_call_t *qcall)
{
    qemu_call(qcall);

    guest_svcs_errno = qcall->error;
    return (int) qcall->retval;
}

For example, following are the implementations of socket and send:

int qc_socket(int domain, int type, int protocol)
{
    qemu_call_t qcall = {
        .call_number = QC_SOCKET,
        .args.socket.domain = domain,
        .args.socket.type = type,
        .args.socket.protocol = protocol,
    };

    return qemu_sckt_call(&qcall);
}

ssize_t qc_send(int sckt, const void *buffer, size_t length, int flags)
{
    qemu_call_t qcall = {
        .call_number = QC_SEND,
        .args.send.socket = sckt,
        .args.send.buffer = (void *) buffer,
        .args.send.length = length,
        .args.send.flags = flags,
    };

    return qemu_sckt_call(&qcall);
}

TCP Tunnel

With the socket API described above (as well as a couple of additional functions for managing file descriptors, namely close and fcntl), creating TCP tunnels becomes rather trivial. The algorithm for setting up a tunnel that listens on the QEMU host and forwards all connections to a port on the guest is as follows:

  1. Create a TCP socket on the host, bind it to a port, and listen to incoming connections on that socket. These are performed via QEMU calls corresponding to the socket, bind, and listen system calls. The result of those calls is a socket on the host, listening on the required port. Additionally, an fcntl QEMU call is used to mark the socket as non-blocking.
  2. At this point, we enter an infinite loop, and keep polling the socket via accept QEMU calls. Once a connection (on the host side) is made, the call to accept will return a new socket that identifies that connection.
  3. We instantiate a new socket connection to the target port on the guest (for our main purpose, that’ll be the connection to our SSH server on iOS). This is done via a normal connect system call (not a QEMU call).
  4. We move into another infinite loop, that keeps polling the 2 sockets (the socket connected to the guest port, and the socket connected to the host), by calling recv on each of the sockets sequentially. Whenever data is received from one of the sockets, it’s sent to the other socket, as well. The send and recv used are normal system calls when accessing the guest socket, and QEMU calls when accessing the host socket.
  5. Finally, when one of the connections is closed, we close the other connection as well (via a close system/QEMU call), and go back to accepting additional connections.
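The data-forwarding pass of step 4 can be sketched as follows. This is a hypothetical, simplified single-direction helper written for this post; both sockets use regular POSIX calls here, whereas in the real tunnel the host-facing socket would use the QEMU-call variants (qc_recv/qc_send):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Move any pending bytes from one socket to the other.
 * Returns 1 if data was forwarded, 0 if nothing was pending (the
 * source socket is non-blocking), and -1 when the connection should
 * be torn down. */
static int pump_once(int from, int to, char *buf, size_t cap)
{
    ssize_t n = recv(from, buf, cap, 0);
    if (n == 0)
        return -1;                          /* peer closed the connection */
    if (n < 0)
        return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;

    for (ssize_t off = 0; off < n; ) {      /* forward everything we read */
        ssize_t m = send(to, buf + off, n - off, 0);
        if (m < 0)
            return -1;
        off += m;
    }
    return 1;
}
```

The tunnel's inner loop then just alternates pump_once(guest_sock, host_sock, ...) and pump_once(host_sock, guest_sock, ...) until either returns -1.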

The above algorithm is rudimentary, and supports a single connection at a time. It can be enhanced by forking on each accept, to allow multiple simultaneous connections.

Of course, it’s possible to forward connections from the guest to the outside world, as well. This is done by instantiating the listening socket (from step 1) with normal system calls, and the second connection socket (from step 3) with a QEMU call.

SSH Server

With the ability to forward arbitrary TCP ports from the host machine to iOS, we can run an SSH server on iOS, and use it as a shell for controlling the system. This lets us open multiple shell connections simultaneously (compared to a single shell over serial), as well as transfer files easily while the system is online.

Luckily, iosbinpack that we copied to our iOS image earlier, includes a basic SSH server named Dropbear. By running it, and forwarding a port on the host to port 22 in iOS, we can use a regular SSH client and connect to our iOS system remotely:

$ /iosbinpack64/usr/local/bin/dropbear --shell /iosbinpack64/bin/bash -R -E
$ /bin/tunnel 2222:

Note that dropbear requires write access to the filesystem, in order to generate its SSH keys (alternatively, keys can be generated offline on the host, copied over to the image, and dropbear can be directed to use those generated keys). Therefore, we recommend using our updated instructions (that include a disk image mounted with write access).

Future Work

While basic network connectivity is achieved via SSH, our solution still has some downsides that have to be fixed, as well as missing features that have to be developed:

  1. Performance: currently, our iOS system is very basic, and doesn’t support interrupts. Therefore, our TCP tunnel can’t just wait for a packet to arrive from the host while idling; instead, it has to actively poll the host for new data. Of course, to avoid an atrocious waste of CPU cycles, we can sleep between polls, but that creates a tradeoff between data latency and CPU usage. Once interrupts are implemented, it’ll be possible to idle while waiting for data from the host. This will considerably lower the CPU usage of our tunnel, without hurting performance.
  2. Currently, the tunnel supports a single connection at a time. While SSH, through port forwarding, can take care of multiplexing additional connections, adding support for multiple simultaneous connections should make our tunnel more robust, versatile, and allow it to function without being dependent on SSH.
  3. Using an SSH server to control iOS is great, but as mentioned in the beginning of the post, the ultimate goal is to set up a VPN connection that will provide full-featured network access to iOS, alleviating the need to set up TCP tunnels for each connection. Potentially, this will also provide support for additional protocols, such as UDP.