Recursive Complexity

Thoughts and musings of a programmer and wanna be entrepreneur

Linux Socket Calls and What Happens Internally

This post is the result of my own curiosity and my need to find out how data sent by user space applications like ping and telnet reaches a suitable network device on the system, which then sends it across the network. The material in this post concerns only the path data takes from user space to kernel space and finally to the network device, and nothing beyond that. Although most of this material was obtained by digging into the Linux kernel source code, the logic might apply to Windows and other operating systems as well.

A short note on Linux system calls


In Linux, most of the heavy lifting for system calls is handled by the GNU C library, glibc, which contains wrappers for them. What the user calls is actually a library function and not the system call itself. The advantage is that user programs do not have to worry about changes in the system call interface: it’s the library that has to change, not the program. Each system call has a unique identifier associated with it (__NR_syscall). The generic list can be found in /usr/include/asm-generic/unistd.h on a Linux installation with the kernel headers present; x86 uses its own tables in asm/unistd_32.h and asm/unistd_64.h. That list can be a little misleading for the socket calls, though, as the next paragraph explains. To make a system call, the identifier is placed in the EAX register (RAX on 64-bit x86), the parameters are placed in other registers, and the assembly instruction syscall or sysenter is executed. This transfers control to the kernel, which executes the appropriate kernel function based on the identifier. On older systems, software interrupt 0x80 was issued to transfer control to the kernel during a system call. It has since been replaced by the instructions above, and the mechanism keeps changing as more efficient ways of doing the same thing are discovered.
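To make the wrapper-versus-system-call distinction concrete, here is a minimal sketch, assuming Linux with glibc. It fetches the process ID twice: once through the ordinary glibc wrapper, and once by handing the call number to glibc’s generic syscall() function. The two function names are made up for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid: the system call number */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* getpid(), syscall() */

/* Ask for the process ID through the regular glibc wrapper. */
pid_t pid_via_wrapper(void)
{
    return getpid();
}

/* Ask for the process ID by passing the call number to syscall(2).
 * glibc loads the number and arguments into registers and executes
 * the architecture's system call instruction on our behalf. */
pid_t pid_via_raw_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both functions should return the same value when called from the same process, since they end up in the same kernel function.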

The system calls behind the socket APIs are tricky to trace because, on 32-bit architectures, there is no direct mapping between the glibc syscall wrapper and the kernel function that gets called. All socket system calls are multiplexed through a single kernel function called sys_socketcall, defined in net/socket.c (as are all the other socket system calls). Its first parameter is a sub-function identifier that selects the function that actually needs to be called. As far as I know, this function acts as the entry point for all socket system calls on 32-bit architectures. On 64-bit architectures this is not the case, and each glibc library function invokes the relevant system call directly.
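The two entry paths can be sketched from user space as follows. This is an illustration under assumptions, not what glibc literally does: it uses the multiplexed call when the build architecture’s headers define SYS_socketcall, and the direct call otherwise. The name raw_socket and the constant SOCKETCALL_SOCKET are made up here; the kernel headers call the same sub-function number SYS_SOCKET.

```c
#define _GNU_SOURCE
#include <sys/socket.h>    /* AF_INET, SOCK_STREAM */
#include <sys/syscall.h>   /* SYS_socketcall and/or SYS_socket */
#include <unistd.h>        /* syscall() */

/* Sub-function number for socket() inside the socketcall multiplexer.
 * Hypothetical local name; the kernel headers define it as SYS_SOCKET. */
#define SOCKETCALL_SOCKET 1

/* Create a socket without going through the glibc socket() wrapper. */
int raw_socket(int domain, int type, int protocol)
{
#if defined(SYS_socketcall)
    /* Multiplexed path: one system call, the first argument selects the
     * sub-function and the second points to the packed arguments. */
    unsigned long args[3] = {
        (unsigned long)domain, (unsigned long)type, (unsigned long)protocol
    };
    return (int)syscall(SYS_socketcall, SOCKETCALL_SOCKET, args);
#elif defined(SYS_socket)
    /* Direct path: socket() has a system call number of its own. */
    return (int)syscall(SYS_socket, domain, type, protocol);
#else
#error "no known socket system call on this architecture"
#endif
}
```

Either way the call ends up in the same kernel code, so raw_socket(AF_INET, SOCK_STREAM, 0) should behave like socket(AF_INET, SOCK_STREAM, 0).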

Socket APIs behind the scenes


We will take a look at two functions, socket() and connect(), that should lay the foundation and make it easy to follow what happens when other functions are called.

int socket(int domain, int type, int protocol);

The very first thing an application does to communicate across the network is create a socket. Three parameters are passed to this function: the domain of the socket, which is also the protocol family (AF_INET, AF_INET6, AF_UNIX etc.); the type of the socket (SOCK_STREAM, SOCK_DGRAM, SOCK_RAW etc.); and the protocol to use on the socket (generally 0, or a protocol number supported by the communication domain/protocol family). On successful completion, the function returns a file descriptor that describes the newly created socket (in Linux everything, or almost everything, is a file). The glibc library function socket invokes the sys_socketcall kernel function on 32-bit architectures, or the sys_socket kernel function directly on 64-bit architectures. Either way, sys_socket is the final destination of the socket system call.
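As a quick illustration of the three parameters, the sketch below creates one TCP and one UDP socket over IPv4. The helper names are made up for this example; socket() and the constants are the standard ones.

```c
#include <netinet/in.h>    /* IPPROTO_TCP */
#include <stdio.h>         /* perror() */
#include <sys/socket.h>    /* socket(), AF_INET, SOCK_STREAM, SOCK_DGRAM */

/* Create an IPv4 stream socket and name the protocol explicitly.
 * Returns the file descriptor, or -1 on failure. */
int open_tcp_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (fd < 0)
        perror("socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)");
    return fd;
}

/* Create an IPv4 datagram socket and pass 0 as the protocol, which
 * leaves the choice of the default protocol for this type to the kernel. */
int open_udp_socket(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        perror("socket(AF_INET, SOCK_DGRAM, 0)");
    return fd;
}
```

The caller owns the returned descriptor and should close() it when done.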

Now let’s see what happens in the sys_socket kernel function. Based on the protocol family/domain, it chooses the net_proto_family structure that describes this protocol. The pointers to these structures are stored in the global array net_families[] in the file net/socket.c. Every protocol driver registers its protocol with the kernel when it loads, and the kernel updates this array to reflect support for the protocol. After obtaining a pointer to the appropriate protocol handler, sys_socket calls the create function stored in the structure. For AF_INET/PF_INET this is the inet_create function in the file net/ipv4/af_inet.c. Each socket type can support multiple protocols; the list for a given type is found in the inetsw array, indexed by the socket type passed to the socket system call. inet_create iterates over this list, matching the protocol passed to the system call against the protocol supported by each item. For the AF_INET family, inetsw_array[] contains all the socket type and protocol combinations it supports. If there is a match, the socket’s operations are set to the operations supported by the handler, the socket’s protocol is set to the protocol of the handler, and control returns to sys_socket.
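The matching step can be modelled in user space with a small table. This is a deliberately simplified sketch and not the kernel’s code: struct proto_entry, proto_table and lookup_entry are names invented here, the table holds only three combinations, and details such as permission checks and module loading are left out.

```c
#include <netinet/in.h>    /* IPPROTO_IP, IPPROTO_TCP, IPPROTO_UDP */
#include <stddef.h>        /* size_t, NULL */
#include <sys/socket.h>    /* SOCK_STREAM, SOCK_DGRAM, SOCK_RAW */

/* Hypothetical stand-in for one entry of the kernel's inetsw_array[]. */
struct proto_entry {
    int type;              /* socket type this entry serves */
    int protocol;          /* protocol number; IPPROTO_IP (0) is a wildcard */
    const char *handler;   /* label standing in for the ops/proto pointers */
};

static const struct proto_entry proto_table[] = {
    { SOCK_STREAM, IPPROTO_TCP, "tcp" },
    { SOCK_DGRAM,  IPPROTO_UDP, "udp" },
    { SOCK_RAW,    IPPROTO_IP,  "raw" },
};

/* Return the entry matching (type, protocol), or NULL if the
 * combination is not supported. */
const struct proto_entry *lookup_entry(int type, int protocol)
{
    size_t count = sizeof(proto_table) / sizeof(proto_table[0]);

    for (size_t i = 0; i < count; i++) {
        const struct proto_entry *entry = &proto_table[i];

        if (entry->type != type)
            continue;
        if (protocol == entry->protocol)   /* exact match */
            return entry;
        if (protocol == IPPROTO_IP)        /* caller passed 0: use the default */
            return entry;
        if (entry->protocol == IPPROTO_IP) /* wildcard entry accepts any protocol */
            return entry;
    }
    return NULL;
}
```

In this model, asking for SOCK_STREAM with protocol 0 finds the TCP entry, while SOCK_STREAM with IPPROTO_UDP finds nothing, which corresponds to socket() failing for an unsupported type and protocol pair.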

Next, sys_socket calls sock_map_fd to allocate a file descriptor and make it represent the socket. This function in turn calls sock_alloc_file, which gets an unused file descriptor, creates a file, and assigns the created socket as the private data of that file. It also assigns the socket_file_ops structure as the file operations. Control then returns to sys_socket, which returns the created file descriptor to user space.
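Because the descriptor is backed by an ordinary file object, generic file calls work on it. A small sketch, assuming a POSIX system: fstat() on the descriptor reports a file of type socket. The helper name fd_is_socket is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/socket.h>    /* socket(), AF_INET, SOCK_STREAM */
#include <sys/stat.h>      /* fstat(), struct stat, S_ISSOCK */

/* Return 1 if fd refers to a socket, 0 if it refers to some other
 * kind of file, and -1 if fstat() fails (for example, on a bad fd). */
int fd_is_socket(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return S_ISSOCK(st.st_mode) ? 1 : 0;
}
```

Calling fd_is_socket() on a descriptor returned by socket() should give 1, and the same descriptor can be released with plain close(), just like any other file.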

int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);

The connect function establishes a connection to the peer identified by the sockaddr structure pointer passed as the second parameter. The first parameter is the socket file descriptor obtained by calling the socket function. The last parameter is the size of the sockaddr structure. The system call behind connect is sys_connect, defined in the file net/socket.c. The very first thing sys_connect does is look up the file descriptor that represents the socket, using the function sockfd_lookup_light. This function searches the kernel file descriptor table for the file that represents the socket. The private data of that file contains a pointer to the socket structure, as mentioned above in the description of the socket system call, and this pointer is returned to sys_connect.

Then the sockaddr is copied from user space into kernel space, and the connect function of the socket operations is called. For the AF_INET family and the SOCK_STREAM type, this is inet_stream_ops.connect (inet_stream_connect). The socket structure contains a pointer to a sock structure, which is the network layer representation of the socket. Using this pointer, a call is made to sk->sk_prot->connect, which for IPv4 stream sockets is tcp_prot.connect, that is, tcp_v4_connect. These pointers were already assigned at socket creation time (see the explanation of socket creation above). tcp_v4_connect calls ip_route_connect, which consults the kernel routing table to figure out the device that should be used for the connection. Using this information, it tries to establish a connection to the peer, and the result is returned to the user.
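From user space, the whole path above is triggered by a single connect() call. The sketch below is one way to try it without a remote peer, assuming the loopback interface is up: it starts a listener on 127.0.0.1 with a kernel-chosen port and connects a second socket to it. All three function names are invented for this example.

```c
#include <netinet/in.h>    /* struct sockaddr_in, INADDR_LOOPBACK, htonl() */
#include <string.h>        /* memset() */
#include <sys/socket.h>    /* socket(), bind(), listen(), connect() */
#include <unistd.h>        /* close() */

/* Start a TCP listener on 127.0.0.1 and let the kernel pick the port.
 * On success, returns the listening fd and stores the bound address
 * (including the chosen port) in *addr. Returns -1 on failure. */
int listen_on_loopback(struct sockaddr_in *addr)
{
    socklen_t len = sizeof(*addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0;    /* 0 asks the kernel for any free port */

    if (bind(fd, (struct sockaddr *)addr, len) < 0 ||
        listen(fd, 1) < 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Create a TCP socket and connect it to addr.
 * Returns the connected fd, or -1 on failure. */
int connect_to(const struct sockaddr_in *addr)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Run the round trip once. Returns 0 if connect() succeeded, -1 otherwise. */
int loopback_connect_demo(void)
{
    struct sockaddr_in addr;
    int server = listen_on_loopback(&addr);
    int client;

    if (server < 0)
        return -1;
    client = connect_to(&addr);
    if (client >= 0)
        close(client);
    close(server);
    return client >= 0 ? 0 : -1;
}
```

The listener never calls accept() here; the kernel completes the TCP handshake for a listening socket on its own and queues the connection, which is enough for connect() to return.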

The above description is just an overview of how socket creation and connection establishment work inside the Linux kernel. This is a complex mechanism with a lot of protocols and drivers involved, and it takes patience and constant effort to understand because there is very little documentation. Digging through the kernel source is probably the best way to figure things out, as with many other things related to the kernel.

Written by Vivek S

June 4, 2012 at 9:54 am

Posted in Tech
