IC221: Systems Programming (SP15)


Lec 26: Socket Addressing and Client Socket Programming

1 Sockets

In systems programming a socket is much like a file; in fact, it is a file descriptor that happens to write to the network device rather than to a file on disc. As long as you think of it that way, you're in pretty good shape.

The system call to open a socket is socket():

int socket(int domain, int type, int protocol);

The arguments can be described as follows:

  • domain is the addressing domain of the socket, which for our purposes is an internet socket, or AF_INET
  • type is the type of socket describing which transport protocol we are using. HTTP traffic is over TCP, so that would be SOCK_STREAM. If we wanted to open a UDP socket, the keyword would be SOCK_DGRAM
  • protocol is additional information about the protocol of the socket, but we will not need to use this option, so we'll set it to 0.

The return value of socket() is an integer file descriptor on success and -1 on failure. All further socket operations, such as connect(), read(), write(), bind(), and accept(), require that integer file descriptor.
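As a minimal sketch of the call described above, the helpers below open an IPv4 socket of a requested type and then ask the kernel which type a descriptor has. The names open_inet_socket() and socket_type() are made up for this illustration; socket(), getsockopt(), and SO_TYPE are the standard interfaces.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* hypothetical helper: open an IPv4 socket of the given type
 * (SOCK_STREAM for TCP, SOCK_DGRAM for UDP); returns fd or -1 */
int open_inet_socket(int type){
  int fd = socket(AF_INET, type, 0);
  if(fd < 0){
    perror("socket");
  }
  return fd;
}

/* hypothetical helper: report the type of an open socket, or -1 */
int socket_type(int fd){
  int type;
  socklen_t len = sizeof(type);
  if(getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) < 0){
    return -1;
  }
  return type;
}
```

A descriptor returned by open_inet_socket(SOCK_STREAM) must eventually be released with close(), just like a file.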

2 Addressing

Before we can dive into socket programming, we have to deal with network addressing. Unfortunately, in C, addressing of sockets can be frustrating because, depending on the protocol, an address can be of different lengths. On the surface, this might not seem like an issue, but it is HUGE for C programs because we deal with fixed-size buffers. The difference between a 4-byte and an 8-byte or 16-byte address is substantial.

At the same time, we also want our programs, even with different address lengths, to look the same, use roughly the same structure types, and function in the same way. The result of these needs is that addressing of sockets in C can be complicated, requiring many forms of casting until, finally, you reach the type you're looking for. In between "… these are not the types you're looking for."

To simplify our discussion, we'll only be covering addressing for IPv4, which uses a 4-byte address space. The address family constant for IPv4 is AF_INET, which you will see used throughout.

2.1 Glossary of Structures and Functions

There will be many new structures and functions we'll use in this lesson. Below is a quick overview of each of the structures you need with a brief description; a more detailed discussion follows.

  • AF_INET : The address family value for IPv4 addresses.
  • struct in_addr : type for storing an internet address; it has a single member s_addr, which is an unsigned integer that stores the 4 bytes of an IPv4 address.
  • struct sockaddr : a generic socket address structure that is used for all addressing types. Often, we'll cast these to a struct sockaddr_in since we are only concerned with IPv4 addresses.
  • struct sockaddr_in : a specific socket address to store IP and port information. It has 3 members we care about:
    • short sin_family : the address family, which should be AF_INET.
    • unsigned short sin_port : the port number in network byte order, converted using htons()
    • struct in_addr sin_addr : a struct in_addr to store the IP address itself.
  • struct addrinfo : a structure used by getaddrinfo() to store address information and hints. It has the following relevant members:
    • int ai_family : the address family, which should be AF_INET for us
    • struct sockaddr *ai_addr : a pointer to the socket address returned after a query

Here is a glossary of library/system calls we will use and their purpose:

  • inet_ntoa() convert a struct in_addr to a dotted quad string: "Network-to-ASCII"
  • inet_aton() convert a string of an IP address as a dotted quad into a struct in_addr: "ASCII-to-Network"
  • getaddrinfo() convert a domain name into an IP address, stored in the ai_addr member as a struct sockaddr in the struct addrinfo result
  • htons() convert a short stored in host byte order into network byte order: "Host-to-Network-Short"
  • ntohs() convert a short stored in network byte order to host byte order: "Network-to-Host-Short"

3 Storing an IP address in struct in_addr

The structure that stores a 32-bit IPv4 address is struct in_addr:

//uint32_t is the same as unsigned int
typedef uint32_t in_addr_t;
struct in_addr
{
  in_addr_t s_addr;
};

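The in_addr structure has one member, s_addr, whose type is uint32_t, a 32-bit unsigned integer (the same as unsigned int on most systems). The four bytes are stored in network byte order, so the first byte in memory is the first number of the dotted quad. Let's look at an example of using the in_addr structure to store an IP address: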

#include <stdio.h>
#include <stdlib.h>
#include <netinet/in.h> //for struct in_addr
#include <arpa/inet.h>  //for inet_ntoa()

int main(){

  //in_addr struct has a single member s_addr
  struct in_addr addr;
  unsigned char * ip;

  //have ip point to s_addr
  ip = (unsigned char *) &(addr.s_addr);

  //set the bytes for "10.4.32.41"
  ip[0]=10;
  ip[1]=4;
  ip[2]=32;
  ip[3]=41;

  //print it out
  printf("Hello %s\n", inet_ntoa(addr));

}

Recall, from our conversations on types and casting, that this line:

ip = (unsigned char *) &(addr.s_addr);

sets the pointer ip to reference the address. Then we can set the address byte-by-byte since ip is an unsigned char pointer. At the end, we can print out the address in dotted quad notation using the function inet_ntoa(), which stands for network-to-ASCII.
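To make the byte layout concrete, here is a small sketch that reads an address back out in two ways: as the four bytes in memory order, and as a single host-order integer. The helper names addr_octets() and addr_as_host_int() are assumptions for this example, not library functions; ntohl() is the 32-bit relative of the ntohs() function listed in the glossary.

```c
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* hypothetical helper: copy the 4 bytes of an IPv4 address, in the
 * order they sit in memory (network byte order), into out[0..3] */
void addr_octets(struct in_addr addr, unsigned char out[4]){
  memcpy(out, &addr.s_addr, 4);
}

/* hypothetical helper: the same address as a host-order integer,
 * with the first octet in the most significant byte */
uint32_t addr_as_host_int(struct in_addr addr){
  return ntohl(addr.s_addr);
}
```

For the address 10.4.32.41, out[0] is 10 and out[3] is 41 regardless of the host's own byte order, and addr_as_host_int() returns 0x0A042029.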

This is clearly a very cumbersome way to set addresses, so instead we can just write the address in dotted quad notation and have the library do the conversion for us. To do this, we use inet_aton(), which stands for ASCII-to-network and converts a dotted quad string into a struct in_addr.

#include <stdio.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <arpa/inet.h>


int main(){

  //in_addr struct has a single member s_addr
  struct in_addr addr;

  //Convert the IP dotted quad into struct in_addr
  inet_aton("10.4.32.41", &(addr));

  printf("Hello %s\n", inet_ntoa(addr));

}
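The example above ignores the return value of inet_aton(). As a hedged sketch of how the check might look, the wrapper below (the name parse_ipv4() is hypothetical) relies on the documented behavior that inet_aton() returns nonzero for a valid address and 0 for an invalid one.

```c
#include <stdio.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* hypothetical helper: parse a dotted quad, returning 0 on success
 * and -1 if the string is not a valid IPv4 address */
int parse_ipv4(const char *text, struct in_addr *out){
  /* inet_aton() returns nonzero when the string is valid, 0 otherwise */
  if(inet_aton(text, out) == 0){
    fprintf(stderr, "invalid IPv4 address: %s\n", text);
    return -1;
  }
  return 0;
}
```

A caller would test the result before using the address, for example by exiting when parse_ipv4() returns -1.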

3.1 Storing IP Addresses in struct sockaddr and struct sockaddr_in

The next level of addressing is to combine an IP address with port information. Since we are concerned with internet addressing, we will use the struct sockaddr_in:

struct sockaddr_in {
    short            sin_family;   //address family, set to AF_INET
    unsigned short   sin_port;     //the port in network byte order
    struct in_addr   sin_addr;     //the inet address
};

This is just one kind of socket address, but a socket can be used for a variety of things, not just internet communication. As a result, there is also a generic socket address type called struct sockaddr, without the _in suffix. Know that whenever you see a pointer typed as struct sockaddr * that refers to an IPv4 address, you can cast it to struct sockaddr_in * and vice versa. On Linux the two types are the same size, that is, they occupy the same number of bytes in memory (the real sockaddr_in carries a few bytes of padding not shown above). The cast just changes how those bytes are interpreted.

In the following code, for example:

#include <stdio.h>
#include <stdlib.h>

#include <netinet/in.h>
#include <arpa/inet.h>


int main(){


  //use a generic socket address to store everything
  struct sockaddr saddr;

  //cast generic socket to an inet socket
  struct sockaddr_in * saddr_in = (struct sockaddr_in *) &saddr;

  //Convert IP address into inet address stored in sockaddr
  inet_aton("10.4.32.41", &(saddr_in->sin_addr));

  //print out IP address
  printf("Hello %s\n", inet_ntoa(saddr_in->sin_addr));

}

We declare a generic socket address on the stack, but it is cast to an inet socket address to access the sin_addr member.
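The cast also works in the other direction, which is the form most socket calls need. The sketch below fills in a struct sockaddr_in and then reads it back through a generic struct sockaddr pointer, checking the address family before casting. Both helper names, fill_ipv4() and format_addr(), are invented for this illustration.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* hypothetical helper: fill in an IPv4 socket address; 0 on success */
int fill_ipv4(struct sockaddr_in *sin, const char *ip, unsigned short port){
  memset(sin, 0, sizeof(*sin));       /* clear padding and unused fields */
  sin->sin_family = AF_INET;
  sin->sin_port = htons(port);        /* port goes in network byte order */
  return inet_aton(ip, &sin->sin_addr) ? 0 : -1;
}

/* hypothetical helper: format a generic address as "ip:port";
 * the cast is only performed after checking the address family */
int format_addr(const struct sockaddr *sa, char *buf, size_t len){
  if(sa->sa_family != AF_INET){
    return -1;                        /* not an IPv4 address */
  }
  const struct sockaddr_in *sin = (const struct sockaddr_in *) sa;
  snprintf(buf, len, "%s:%d", inet_ntoa(sin->sin_addr), ntohs(sin->sin_port));
  return 0;
}
```

Checking sa_family first is a defensive choice: the generic type says nothing about which specific address structure is actually behind the pointer.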

4 Resolving Domain Names

The next part of addressing is the conversion of a domain name into an IP address. IP addresses are not completely unusable for humans, but they are not the preferred way to reference a remote host. Instead, we use the domain name. For example, when we go to www.usna.edu, that domain name must be resolved into an IP address.

4.1 Converting a Domain Name into an IP address

The resolving protocol is called DNS, or Domain Name System, and it is made available to us through the getaddrinfo() library function. Here is the function declaration:

int getaddrinfo(const char *node, const char *service,
                      const struct addrinfo *hints,
                      struct addrinfo **res);

Here is a description of the arguments:

  • node : a string of the address/domain name you wish to be resolved
  • service : name of the service for the domain you're interested in, set to NULL for our usage
  • hints : an addrinfo of "hints" describing the kinds of address information we are interested in for the domain
  • res : a list of addrinfo structures will be allocated, and the pointer referenced by res will be set to point to it

On error, getaddrinfo() returns a non-zero error code. This code is not an errno value, so instead of perror() we use gai_strerror() to convert it into a message. To catch errors, you will use the following style:

if( (s = getaddrinfo(hostname, NULL, &hints, &result)) != 0){
  fprintf(stderr, "getaddrinfo: %s\n",gai_strerror(s));  // <---- converts the error into a message
  exit(1);
}

The next part of using getaddrinfo() properly is understanding the struct addrinfo, which has the following members:

struct addrinfo {
    int              ai_flags;
    int              ai_family;
    int              ai_socktype;
    int              ai_protocol;
    socklen_t        ai_addrlen;
    struct sockaddr *ai_addr;
    char            *ai_canonname;
    struct addrinfo *ai_next;
};

For our purposes, since we are only concerned with IPv4 internet addressing, we only need to focus on two fields: ai_family and ai_addr.

  • ai_family : indicates the address family of this address, should always be set to AF_INET
  • ai_addr : a socket address storing the resolved IP address
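Although we mostly look at the first result, the ai_next member links the results into a list. As a sketch (the function name print_ipv4_results() is hypothetical), the loop below walks that list and prints each address. Because the hints leave ai_socktype at 0, the same address may appear more than once, one entry per socket type.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

/* hypothetical helper: print every IPv4 result for host and return
 * how many there were, or -1 if the lookup failed */
int print_ipv4_results(const char *host){
  struct addrinfo hints, *result, *cur;
  int s, count = 0;

  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_INET;          /* IPv4 only */

  if( (s = getaddrinfo(host, NULL, &hints, &result)) != 0){
    fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(s));
    return -1;
  }

  /* ai_next links the results together; the list ends with NULL */
  for(cur = result; cur != NULL; cur = cur->ai_next){
    struct sockaddr_in *sin = (struct sockaddr_in *) cur->ai_addr;
    printf("%s (socktype %d)\n", inet_ntoa(sin->sin_addr), cur->ai_socktype);
    count++;
  }

  freeaddrinfo(result);               /* release the whole list */
  return count;
}
```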

One peculiar aspect of a getaddrinfo() call, beyond retrieving the results, is that you also supply hints about the kinds of addresses you are interested in. We are only interested in AF_INET address types, so we can always declare and use the hints option like so:

struct addrinfo hints;     

//zero out the hints structure (look up memset for details)
memset(&hints,0,sizeof(struct addrinfo));  

//set the ai_family field per our needs, AF_INET
hints.ai_family = AF_INET;

Now we have enough to write a hello-world program for resolving a domain name.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

int main(){

  char hostname[]="www.usna.edu";  //the hostname we are looking up

  struct addrinfo *result;    //to store results
  struct addrinfo hints;      //to indicate information we want

  struct sockaddr_in *saddr;  //to reference address

  int s; //for error checking

  memset(&hints,0,sizeof(struct addrinfo));  //zero out hints
  hints.ai_family = AF_INET; //we only want IPv4 addresses

  //Convert the hostname to an address
  if( (s = getaddrinfo(hostname, NULL, &hints, &result)) != 0){
    fprintf(stderr, "getaddrinfo: %s\n",gai_strerror(s));
    exit(1);
  }

  //convert generic socket address to inet socket address
  saddr = (struct sockaddr_in *) result->ai_addr;

  //print the address
  printf("Hello %s\n", inet_ntoa(saddr->sin_addr));

  //free the addrinfo struct
  freeaddrinfo(result);
}

One thing to be careful about when dealing with getaddrinfo() is that the resulting address is stored in struct sockaddr but we are only using struct sockaddr_in. Fortunately, since we only hinted towards AF_INET, we know that the sockaddr is actually a sockaddr_in and so we can cast:

struct sockaddr_in *saddr;  //to reference address

//convert generic socket address to inet socket address
saddr = (struct sockaddr_in *) result->ai_addr;

Once we've cast the data to the right type, we can treat it like a sockaddr_in and get to the underlying in_addr:

//print the address
printf("Hello %s\n", inet_ntoa(saddr->sin_addr));

Last, but not least: getaddrinfo() allocates new memory to store the resulting addrinfo. It must be freed with freeaddrinfo():

//free the addrinfo struct
freeaddrinfo(result);
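One consequence of freeing the result is that any pointer into it, such as saddr, is no longer valid afterwards. A possible way to handle this, sketched below under the assumption that we only want the first IPv4 result, is to copy the address into storage the caller owns before freeing. The name resolve_ipv4() is made up for this example.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

/* hypothetical helper: resolve host to its first IPv4 address and copy
 * it into caller-owned storage; returns 0 on success, -1 on failure */
int resolve_ipv4(const char *host, struct sockaddr_in *out){
  struct addrinfo hints, *result;
  int s;

  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_INET;          /* IPv4 only */

  if( (s = getaddrinfo(host, NULL, &hints, &result)) != 0){
    fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(s));
    return -1;
  }

  /* copy the address out, since result is invalid after freeaddrinfo() */
  memcpy(out, result->ai_addr, sizeof(*out));
  freeaddrinfo(result);
  return 0;
}
```

After resolve_ipv4() returns 0, the caller can keep using the struct sockaddr_in for as long as it likes, since it no longer refers to memory owned by getaddrinfo().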

4.2 Resolving IP addresses to IP addresses?

One nice thing about getaddrinfo() is that it can take a domain name or an IP address. If it finds that you've provided an IP address, it will not perform a lookup and will simply return that address, already set in the ai_addr member of the result. For example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

int main(){

  char hostname[]="10.4.32.41";  //<--- Not a hostname, but an IP address

  struct addrinfo *result;    //to store results
  struct addrinfo hints;      //to indicate information we want

  struct sockaddr_in *saddr;  //to reference address

  int s; //for error checking

  memset(&hints,0,sizeof(struct addrinfo));  //zero out hints
  hints.ai_family = AF_INET; //we only want IPv4 addresses

  //Convert the hostname to an address
  if( (s = getaddrinfo(hostname, NULL, &hints, &result)) != 0){
    fprintf(stderr, "getaddrinfo: %s\n",gai_strerror(s));
    exit(1);
  }

  //convert generic socket address to inet socket address
  saddr = (struct sockaddr_in *) result->ai_addr;

  //print the address
  printf("Hello %s\n", inet_ntoa(saddr->sin_addr));  //<---- will print that IP address here

  //free the addrinfo struct
  freeaddrinfo(result);
}
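If a program wants to accept only literal addresses and never trigger a lookup, the standard AI_NUMERICHOST flag can be set in the hints. This flag is not used elsewhere in these notes, and the helper name is_ipv4_literal() is hypothetical; treat this as a sketch of one possible use.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

/* hypothetical helper: return 1 if text is a literal IPv4 address,
 * 0 otherwise; AI_NUMERICHOST tells getaddrinfo() not to do a lookup */
int is_ipv4_literal(const char *text){
  struct addrinfo hints, *result;

  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_INET;          /* IPv4 only */
  hints.ai_flags = AI_NUMERICHOST;    /* numeric addresses only */

  if(getaddrinfo(text, NULL, &hints, &result) != 0){
    return 0;                         /* a name, or not an address at all */
  }
  freeaddrinfo(result);
  return 1;
}
```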

4.3 Network Byte Order and Ports

The last part of the puzzle for addressing is the port number. Recall the sockaddr_in structure:

struct sockaddr_in {
    short            sin_family;   //address family, set to AF_INET
    unsigned short   sin_port;     //the port in network byte order
    struct in_addr   sin_addr;     //the inet address
};

You see that there is a member, sin_port, and your instinct would be to set that directly. For example, if we wanted to contact www.usna.edu on port 80, after doing getaddrinfo(), we'd do something like this:

//convert generic socket address to inet socket address
saddr = (struct sockaddr_in *) result->ai_addr;

saddr->sin_port = 80; //<-- setting port in host order!

It turns out that this DOES NOT work, and it doesn't because of a fundamental problem with data representation. Let's consider the number 80 as it is represented in bits:

pow: 128  64  32  16   8   4   2   1
     ----------------------------------      
bits:  0   1   0   1   0   0   0   0

The above makes sense: 80 in decimal is 01010000 in binary. If we look more closely, we see that the bit furthest to the left is most significant, or has the highest value, 128. If you think a bit more about it, this seems like an arbitrary choice. Couldn't the bit furthest to the right be most significant instead? Then we would have the following representation:

pow:   1   2   4   8  16  32  64 128
     ----------------------------------
bits:  0   0   0   0   1   0   1   0

Now we find that 80 in decimal is written 00001010. While this might seem a bit awkward, it really is no better nor worse than the other choice. In fact, in the early days of computing, there was a holy war about which order multi-byte values should be stored in.

The term used for this ordering is endianness. In practice it applies to the order of the bytes within a multi-byte value rather than to the bits within a byte, and there are two camps: big endian and little endian.

  • Little Endian : The least significant byte is stored in the smallest address.
  • Big Endian : The most significant byte is stored in the smallest address.

In most modern computing systems, little endian is king, but back when the internet was designed and the equipment put in place, it was not clear what data representation should be preferred. As such, the network is big endian, or more precisely routing information is stored in network byte order while data on the host is stored in host byte order.
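To see which camp a particular machine belongs to, one can store a known two-byte value and look at which byte comes first in memory. The function below is a minimal sketch of that idea; the name host_is_little_endian() is an assumption for this example.

```c
#include <stdint.h>
#include <string.h>

/* hypothetical helper: return 1 on a little endian host, 0 on big endian */
int host_is_little_endian(void){
  uint16_t value = 0x0102;            /* most significant byte is 0x01 */
  unsigned char bytes[2];
  memcpy(bytes, &value, sizeof(value));
  /* little endian stores the least significant byte (0x02) first */
  return bytes[0] == 0x02;
}
```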

To facilitate this use, all Unix systems implement a set of conversion functions:

  • htons() : convert data from host order to network order for a short (or two-byte) type
  • ntohs() : convert data from network order to host order for a short (or two-byte) type

There are also htonl() and ntohl() for 32-bit (long) values. On big endian machines host and network order are the same, so these functions do nothing, but it is good practice to always convert anyway so that the code is portable.

Now, finally, returning to the setting of the port number, we see that we must actually do it like this:

//convert generic socket address to inet socket address
saddr = (struct sockaddr_in *) result->ai_addr;

saddr->sin_port = htons(80); //<-- setting port in network byte order!!!
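The effect of the conversion can be checked by looking at the two bytes of the port value directly. The sketch below uses two made-up helpers, port_bytes() and port_as_seen_on_network(), to show what is stored and what a peer would read if the conversion were skipped.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* hypothetical helper: copy the two bytes of a sin_port value, in the
 * order they sit in memory, into out[0..1] */
void port_bytes(uint16_t sin_port, unsigned char out[2]){
  memcpy(out, &sin_port, 2);
}

/* hypothetical helper: the port number the network side would read if
 * raw were stored in sin_port exactly as given */
uint16_t port_as_seen_on_network(uint16_t raw){
  return ntohs(raw);
}
```

After htons(80), the bytes in memory are 0x00 followed by 0x50 on any host. On a little endian host, storing 80 without conversion would instead be read as port 20480 (0x5000), which is why the unconverted assignment fails.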

5 Socket

In systems programming a socket is much like a file; in fact, it is a file descriptor that happens to write to the network device rather than to a file on disc. As long as you think of it that way, you're in pretty good shape.

The system call to open a socket is socket():

int socket(int domain, int type, int protocol);

The arguments can be described as follows:

  • domain is the addressing domain of the socket, which for our purposes is an internet socket, or AF_INET
  • type is the type of socket describing which transport protocol we are using. HTTP traffic is over TCP, so that would be SOCK_STREAM. If we wanted to open a UDP socket, the keyword would be SOCK_DGRAM
  • protocol is additional information about the protocol of the socket, but we will not need to use this option, so we'll set it to 0.

The return value of socket() is an integer file descriptor on success and -1 on failure. All further socket operations, such as connect(), read(), and write(), require that integer file descriptor. Today we will focus just on the client-side operations of a socket.

5.1 Client Sockets

A client socket's goal is to connect to a foreign address where a server socket is listening. Visually, we think of this process like so:

Figure 1: Client Socket Life Cycle

A new socket is opened using the socket() system call, but the socket being open doesn't mean it is connected to anything. To do that, you use the connect() system call, which takes as input the address to connect to, an IP-port pair. Once connected to the remote server, the client can read and write to that socket to take part in an application layer protocol. The connect() system call takes the following arguments:

int connect(int socket, const struct sockaddr *address, socklen_t address_len);

Generally, given a socket socket and a socket address address, connect() tries to connect the socket to that foreign address. Note that the address parameter is a generic socket address (struct sockaddr), which is not necessarily an IP socket address (struct sockaddr_in), so you'll need to cast:

int sock;
struct sockaddr_in saddr_in;

//fill in the address for usna.edu
saddr_in.sin_family = AF_INET;
inet_aton("10.4.32.41", &(saddr_in.sin_addr));
saddr_in.sin_port = htons(80);

//open a socket
if( (sock = socket(AF_INET, SOCK_STREAM, 0))  < 0){
  perror("socket");
  exit(1);
}

//connect socket to the server
if(connect(sock, (struct sockaddr *) &saddr_in, sizeof(struct sockaddr_in)) < 0){
  perror("connect");
  exit(1);
}
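The steps above can be gathered into one function. The sketch below is one possible arrangement, not the only one: the name connect_ipv4() is hypothetical, and returning -1 instead of calling exit() is a design choice made here so that the caller decides how to handle failure.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* hypothetical helper: open a TCP socket and connect it to ip:port;
 * returns the connected descriptor, or -1 on failure */
int connect_ipv4(const char *ip, unsigned short port){
  struct sockaddr_in saddr_in;
  int sock;

  memset(&saddr_in, 0, sizeof(saddr_in));
  saddr_in.sin_family = AF_INET;
  saddr_in.sin_port = htons(port);    /* network byte order */
  if(inet_aton(ip, &saddr_in.sin_addr) == 0){
    fprintf(stderr, "invalid address: %s\n", ip);
    return -1;
  }

  if( (sock = socket(AF_INET, SOCK_STREAM, 0)) < 0){
    perror("socket");
    return -1;
  }

  if(connect(sock, (struct sockaddr *) &saddr_in, sizeof(saddr_in)) < 0){
    perror("connect");
    close(sock);                      /* don't leak the descriptor */
    return -1;
  }
  return sock;
}
```

Note that the socket is closed when connect() fails; otherwise each failed attempt would leave an open file descriptor behind.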

5.2 Socket I/O

Sockets are file descriptors, and so we use the same interface to read and write from them as we did for other kinds of file descriptors, like open files and pipes. That interface is read() and write(), and it should be very familiar to you by now.

//read from socket and write to stdout
while( (n = read(sock, buf, BUF_SIZE)) > 0){
   if( write(1, buf, n) < 0){  //1 is the file descriptor for stdout
       perror("write");
       exit(1);
   }
}

if( n < 0 ){
  perror("read");
  exit(1);
}
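One detail worth keeping in mind with sockets is that write() is allowed to write fewer bytes than requested. A common way to deal with this, sketched below with the hypothetical name write_all(), is to loop until everything has been written.

```c
#include <errno.h>
#include <unistd.h>

/* hypothetical helper: keep calling write() until all len bytes are
 * written; returns 0 on success, -1 on error (errno is set by write) */
int write_all(int fd, const char *buf, size_t len){
  size_t done = 0;
  while(done < len){
    ssize_t n = write(fd, buf + done, len - done);
    if(n < 0){
      if(errno == EINTR){
        continue;                     /* interrupted by a signal, retry */
      }
      return -1;
    }
    done += (size_t) n;               /* advance past what was written */
  }
  return 0;
}
```

Since it only relies on write(), the same helper works on any file descriptor, whether it refers to a socket, a pipe, or a file.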

5.3 Putting it all together

A typical client program we might want to write is one that can connect to a web server and download the web page, i.e., the HTML. Here is such a program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

int main(){

  char hostname[]="usna.edu";    //the hostname we are looking up
  short port=80;                 //the port we are connecting on

  struct addrinfo *result;       //to store results
  struct addrinfo hints;         //to indicate information we want

  struct sockaddr_in *saddr_in;  //socket internet address

  int s,n;                       //for error checking

  int sock;                      //socket file descriptor

  char request[]="GET /index.html\r\n"; //the GET request, ended by CR LF

  char response[4096];           //read in 4096 byte chunks


  //setup our hints
  memset(&hints,0,sizeof(struct addrinfo));  //zero out hints
  hints.ai_family = AF_INET; //we only want IPv4 addresses

  //Convert the hostname to an address
  if( (s = getaddrinfo(hostname, NULL, &hints, &result)) != 0){
    fprintf(stderr, "getaddrinfo: %s\n",gai_strerror(s));
    exit(1);
  }

  //convert generic socket address to inet socket address
  saddr_in = (struct sockaddr_in *) result->ai_addr;

  //set the port in network byte order
  saddr_in->sin_port = htons(port);

  //open a socket
  if( (sock = socket(AF_INET, SOCK_STREAM, 0))  < 0){
    perror("socket");
    exit(1);
  }

  //connect to the server
  if(connect(sock, (struct sockaddr *) saddr_in, sizeof(*saddr_in)) < 0){
    perror("connect");
    exit(1);
  }

  //send the request
  if(write(sock,request,strlen(request)) < 0){
    perror("write");
    exit(1);
  }

  //read the response until EOF
  while( (n = read(sock, response, 4096)) > 0){

    //write response to stdout
    if(write(1, response, n) < 0){
      perror("write");
      exit(1);
    }
  }

  if (n<0){
    perror("read");
  }

  //free the address list and close the socket
  freeaddrinfo(result);
  close(sock);

  return 0; //success
}

Following the client program, we can see that first we resolve the hostname into an IP address, storing that information in the socket address saddr_in. Next we open a new TCP socket, connect to the server, send the request, and then read all the data from the server, printing the output to the terminal.
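The program above sends a request consisting only of a GET line. Some servers may expect a fuller request, so as a sketch, assuming the server accepts an HTTP/1.0 request line with a Host header, the request string could be built as follows. The helper name build_request() is hypothetical.

```c
#include <stdio.h>

/* hypothetical helper: build an HTTP/1.0 GET request with a Host header
 * into buf; returns the request length, or -1 if buf is too small */
int build_request(char *buf, size_t len, const char *host, const char *path){
  /* each header line ends with CR LF, and a blank line ends the request */
  int n = snprintf(buf, len, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
  if(n < 0 || (size_t) n >= len){
    return -1;                        /* error, or output was truncated */
  }
  return n;
}
```

The returned length can be passed straight to write() in place of strlen(request).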