From Socket API to Kernel
===

# Socket API
- Berkeley sockets is an application programming interface (API) for Internet sockets and Unix domain sockets, used for inter-process communication (IPC).
- It originated with the 4.2BSD Unix released in 1983.
- A socket is an abstract representation (handle) for the local endpoint of a network communication path.
- The Berkeley sockets API represents it as a file descriptor (file handle), following the Unix philosophy of providing a common interface for input and output to streams of data.
- It evolved with little modification from a de facto standard into a component of the POSIX specification.
- Also known as BSD sockets, acknowledging the first implementation in the Berkeley Software Distribution.
- ![](https://i.imgur.com/gMT9CIj.png)
    - Source : [Berkeley sockets](https://en.wikipedia.org/wiki/Berkeley_sockets)
- Different protocols are accessed through the same set of APIs

# socket()
```
NAME
       socket - create an endpoint for communication

SYNOPSIS
       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int socket(int domain, int type, int protocol);
```
- The **domain** argument specifies a communication domain; this selects the protocol family which will be used for communication. These families are defined in `<sys/socket.h>`.
    - AF_INET for IPv4, AF_INET6 for IPv6
    - AF_UNIX for local communication (Unix domain socket)
- The socket has the indicated **type**, which specifies the communication semantics.
    - SOCK_STREAM : sequenced, reliable, two-way, connection-based byte streams.
    - SOCK_DGRAM : datagrams (connectionless, unreliable messages of a fixed maximum length)
- The **protocol** specifies a particular protocol to be used with the socket.
- On success, a file descriptor for the new socket is returned.
- A socket is treated as a file, like many other things in Unix. The file descriptor is used as the handle for the socket in user space.
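To make the "socket is a file descriptor" point concrete, here is a minimal user-space sketch. It assumes a Linux/POSIX system; the helper name `socket_fd_is_socket` is hypothetical and not part of any API. It creates an AF_INET socket and uses `fstat()`, a generic file call, to confirm the descriptor refers to a socket.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes S_ISSOCK under strict -std modes */
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create an AF_INET socket of the given type and
 * report whether the returned descriptor refers to a socket.
 * Returns 1 if it does, 0 if it does not, -1 if socket() or fstat() failed. */
int socket_fd_is_socket(int type)
{
    struct stat st;
    int result;
    int fd = socket(AF_INET, type, 0);   /* protocol 0: default for this type */

    if (fd < 0)
        return -1;

    /* fstat() accepts any file descriptor, sockets included. */
    if (fstat(fd, &st) < 0)
        result = -1;
    else
        result = S_ISSOCK(st.st_mode) ? 1 : 0;

    close(fd);                           /* released like any other descriptor */
    return result;
}
```

Calling `socket_fd_is_socket(SOCK_STREAM)` or `socket_fd_is_socket(SOCK_DGRAM)` should return 1 wherever IPv4 sockets are available.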
- Linux VFS (Virtual File System)
    - struct file
        - const struct file_operations *f_op;
        - [static const struct file_operations socket_file_ops](http://lxr.free-electrons.com/source/net/socket.c?v=3.2#L135)
    - [sock_alloc_file(socket *, file * *, int) : int](http://lxr.free-electrons.com/source/net/socket.c?v=3.2#L349)
        - sock_map_fd(socket *, int) : int
            - sys_socket(int, int, int) : long int
    - [container_of macro](http://lxr.free-electrons.com/source/include/linux/kernel.h?v=3.2#L659)
- socket system call
    - [SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)](http://lxr.free-electrons.com/source/net/socket.c?v=3.2#L1302)
        - [__sock_create()](http://lxr.free-electrons.com/source/net/socket.c?v=3.2#L1176)
            - const struct net_proto_family *pf
            - security_socket_create()
            - pf = rcu_dereference(net_families[family]);
                - the net_families array entry for AF_INET is initialized in inet_init() during startup
            - err = pf->create(net, sock, protocol, kern); **// calling protocol-specific create()**
                - [struct net_proto_family inet_family_ops for AF_INET](http://lxr.free-electrons.com/source/net/ipv4/af_inet.c?v=3.2#L993)
                - [inet_create()](http://lxr.free-electrons.com/source/net/ipv4/af_inet.c?v=3.2#L265)
                    - struct inet_protosw *answer;
                    - iterate through inetsw[sock->type] to find the appropriate answer
                        - [struct inet_protosw inetsw_array](http://lxr.free-electrons.com/source/net/ipv4/af_inet.c?v=3.2#L1002)
                            - initialized in inet_init()
                    - ops = answer->ops; **// install socket operations**
                    - sk = [sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot);](http://lxr.free-electrons.com/source/net/core/sock.c?v=3.2#L1119)
                        - sk --> struct tcp_sock (struct proto tcp_prot.obj_size)
                        - [struct proto tcp_prot](http://lxr.free-electrons.com/source/net/ipv4/tcp_ipv4.c?v=3.2#L2597)
                        - [struct tcp_sock](http://lxr.free-electrons.com/source/include/linux/tcp.h?v=3.2#L294)
- the sock-series struct is very interesting
    - it's pretty much like class inheritance in C++
    - [struct tcp_sock](http://lxr.free-electrons.com/source/include/linux/tcp.h?v=3.2#L294) --> struct inet_connection_sock --> struct inet_sock --> struct sock (as the base struct/class)
    - [Type punning](https://en.wikipedia.org/wiki/Type_punning)
    - another example is the sockaddr-series struct

# bind()
```
NAME
       bind - bind a name to a socket

SYNOPSIS
       #include <sys/socket.h>

       int bind(int socket, const struct sockaddr *address,
              socklen_t address_len);
```
- The bind() function shall assign a local socket address `address` to a socket identified by descriptor `socket` that has no local socket address assigned.
- **socket** Specifies the file descriptor of the socket to be bound.
- **address** Points to a sockaddr structure containing the address to be bound to the socket. The length and format of the address depend on the address family of the socket.
- **address_len** Specifies the length of the sockaddr structure pointed to by the address argument.
- Upon successful completion, bind() shall return 0; otherwise, -1 shall be returned and errno set to indicate the error.

```
/* Structure describing a generic socket address.  */
struct sockaddr
{
    __SOCKADDR_COMMON (sa_);    /* Common data: address family and length.  */
    char sa_data[14];           /* Address data.  */
};

/* Structure describing an Internet socket address.  */
struct sockaddr_in
{
    __SOCKADDR_COMMON (sin_);
    in_port_t sin_port;         /* Port number.  */
    struct in_addr sin_addr;    /* Internet address.  */

    /* Pad to size of `struct sockaddr'.  */
    unsigned char sin_zero[sizeof (struct sockaddr)
                           - __SOCKADDR_COMMON_SIZE
                           - sizeof (in_port_t)
                           - sizeof (struct in_addr)];
};
```
- bind system call
    - [SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)](http://lxr.free-electrons.com/source/net/socket.c?v=3.2#L1424)
        - look up struct socket object from fd (fd --> file --> socket)
            - return file->private_data; /* set in sock_map_fd */
        - calling PF-specific bind()
            - sock->ops->bind()
            - for AF_INET, it's [inet_bind()](http://lxr.free-electrons.com/source/net/ipv4/af_inet.c?v=3.2#L450)
                - calling protocol-specific bind() if there is one
                - calling protocol-specific get_port()
                    - [inet_csk_get_port()](http://lxr.free-electrons.com/source/net/ipv4/inet_connection_sock.c?v=3.2#L91)

# connect()
```
NAME
       connect - initiate a connection on a socket

SYNOPSIS
       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int connect(int sockfd, const struct sockaddr *addr,
                   socklen_t addrlen);
```
- The connect() system call connects the socket referred to by the file descriptor **sockfd** to the address specified by **addr**. The **addrlen** argument specifies the size of addr.
- Generally, connection-based protocol sockets may successfully connect() only once; connectionless protocol sockets may use connect() multiple times to change their association.
- If the connection or binding succeeds, zero is returned. On error, -1 is returned, and errno is set appropriately.
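The bind() and connect() behaviour described above can be tried on loopback without any peer process. The following is a minimal sketch, assuming Linux with a working loopback interface; `bind_udp_loopback` and `udp_connect_and_get_peer_port` are hypothetical helper names. It binds UDP sockets to kernel-chosen ports and then shows that a connectionless socket may call connect() more than once to change its association.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind a UDP socket to 127.0.0.1 on an ephemeral port.
 * Returns the descriptor and stores the chosen port (host byte order) in *port. */
int bind_udp_loopback(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);            /* 0 asks the kernel to pick a port */

    /* sockaddr_in is passed through a generic struct sockaddr pointer. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Hypothetical helper: connect() a UDP socket to 127.0.0.1:port and return
 * the peer port the kernel now reports via getpeername(), or -1 on error. */
int udp_connect_and_get_peer_port(int fd, unsigned short port)
{
    struct sockaddr_in peer;
    socklen_t len = sizeof(peer);

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    peer.sin_port = htons(port);

    /* For UDP this only records the default destination; no packet is sent. */
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0)
        return -1;

    memset(&peer, 0, sizeof(peer));
    if (getpeername(fd, (struct sockaddr *)&peer, &len) < 0)
        return -1;
    return ntohs(peer.sin_port);
}
```

Calling `udp_connect_and_get_peer_port()` twice on the same descriptor with two different bound ports should report the second port after the second call, which is the "change their association" case from the man page.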
- For a TCP socket, connect() blocks by default while the three-way handshake is performed; on a non-blocking socket it returns -1 with errno set to EINPROGRESS and the handshake continues in the background.
- connect system call
    - [SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr, int, addrlen)](http://lxr.free-electrons.com/source/net/socket.c?v=3.2#L1578)
        - err = sock->ops->connect()
            - AF_INET / SOCK_DGRAM : [inet_dgram_connect()](http://lxr.free-electrons.com/source/net/ipv4/af_inet.c?v=3.2#L542)
                - calling protocol-specific connect() : [ip4_datagram_connect()](http://lxr.free-electrons.com/source/net/ipv4/datagram.c?v=3.2#L23)
                    - setting port, address, routing etc.
            - AF_INET / SOCK_STREAM : [inet_stream_connect()](http://lxr.free-electrons.com/source/net/ipv4/af_inet.c?v=3.2#L585)
                - if SS_UNCONNECTED, calling protocol-specific connect() : [tcp_v4_connect()](http://lxr.free-electrons.com/source/net/ipv4/tcp_ipv4.c?v=3.2#L148)
                    - SYN packet will be sent
                - blocking
                - non-blocking
- [struct sk_buff](http://lxr.free-electrons.com/source/include/linux/skbuff.h?v=3.2#L372)
- tcp_tw_recycle
    - [Coping with the TCP TIME-WAIT state on busy Linux servers](https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux)

# listen()
```
NAME
       listen - listen for connections on a socket

SYNOPSIS
       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int listen(int sockfd, int backlog);
```
- listen() marks the socket referred to by sockfd as a passive socket, that is, as a socket that will be used to accept incoming connection requests using accept(2).
- The **sockfd** argument is a file descriptor that refers to a socket of type SOCK_STREAM or SOCK_SEQPACKET.
- The **backlog** argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds.
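The effect of listen() marking a socket as passive can be observed from user space. Below is a minimal sketch, assuming Linux, where the `SO_ACCEPTCONN` socket option reports whether a socket is listening; `tcp_bound_loopback` and `is_listening` are hypothetical helper names, and the backlog value used in the usage note is arbitrary.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: TCP socket bound to 127.0.0.1 on an ephemeral port,
 * not yet listening. Returns the descriptor or -1 on error. */
int tcp_bound_loopback(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);            /* let the kernel choose the port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: 1 if fd is a listening (passive) socket, 0 if not,
 * -1 if getsockopt() failed. */
int is_listening(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

On a descriptor from `tcp_bound_loopback()`, `is_listening()` should return 0 before `listen(fd, 16)` and 1 afterwards.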
- listen system call
    - [SYSCALL_DEFINE2(listen, int, fd, int, backlog)](http://lxr.free-electrons.com/source/net/socket.c?v=3.2#L1453)
        - calling PF-specific listen [inet_listen()](http://lxr.free-electrons.com/source/net/ipv4/af_inet.c?v=3.2#L193)
            - [inet_csk_listen_start()](http://lxr.free-electrons.com/source/net/ipv4/inet_connection_sock.c?v=3.2#L650)
    - net.ipv4.tcp_max_syn_backlog

# accept()
```
NAME
       accept, accept4 - accept a connection on a socket

SYNOPSIS
       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <sys/socket.h>

       int accept4(int sockfd, struct sockaddr *addr,
                   socklen_t *addrlen, int flags);
```
- The accept() function shall extract the first connection on the queue of pending connections, create a new socket with the same socket type, protocol, and address family as the specified socket, and allocate a new file descriptor for that socket.
- **sockfd** Specifies a socket that was created with socket(), has been bound to an address with bind(), and has issued a successful call to listen().
- If the listen queue is empty of connection requests and O_NONBLOCK is not set on the file descriptor for the socket, accept() shall block until a connection is present. If the listen() queue is empty of connection requests and O_NONBLOCK is set on the file descriptor for the socket, accept() shall fail and set errno to [EAGAIN] or [EWOULDBLOCK].
- If **flags** is 0, then accept4() is the same as accept(). The following values can be bitwise ORed in flags to obtain different behavior:
    - SOCK_NONBLOCK
    - SOCK_CLOEXEC
- On success, these system calls return a nonnegative integer that is a descriptor for the accepted socket.
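The blocking rules for accept() quoted above can be exercised in a single process on loopback. This is a minimal sketch under the assumption of a Linux system with loopback enabled; `nonblocking_listener`, `connect_loopback`, and `accept_when_ready` are hypothetical helper names, and the backlog of 8 is arbitrary.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: listening TCP socket on 127.0.0.1 with O_NONBLOCK set.
 * Stores the kernel-chosen port (host byte order) in *port. */
int nonblocking_listener(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int flags;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0 ||
        listen(fd, 8) < 0 ||
        (flags = fcntl(fd, F_GETFL, 0)) < 0 ||
        fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Hypothetical helper: blocking client connect() to 127.0.0.1:port. */
int connect_loopback(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: wait up to timeout_ms for a pending connection,
 * then accept it. Returns the new descriptor or -1. */
int accept_when_ready(int lfd, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd = lfd;
    pfd.events = POLLIN;                 /* readable means a connection is queued */
    pfd.revents = 0;

    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;
    return accept(lfd, NULL, NULL);
}
```

With an empty queue, `accept()` on the listener from `nonblocking_listener()` should fail with EAGAIN or EWOULDBLOCK; after `connect_loopback()` succeeds, `accept_when_ready()` should return a new descriptor distinct from the listening one.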
- accept system call
    - [SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr, int __user *, upeer_addrlen, int, flags)](http://lxr.free-electrons.com/source/net/socket.c?v=3.2#L1486)
        - [inet_csk_accept()](http://lxr.free-electrons.com/ident?v=3.2;i=inet_csk_accept)
            - long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);
                - setsockopt()
                    - SO_RCVTIMEO
                - MAX_SCHEDULE_TIMEOUT
- [Byte Order](http://www.bruceblinn.com/linuxinfo/ByteOrder.html)
    - "since it is not possible to predict the type of system at either end of the network, network protocols must define the byte order that is used for multi-byte values in their headers. This is called the network byte order, and for TCP/IP, it is big endian."
    - __be16 / __be32
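The byte-order rule quoted above is what `htons()` and `htonl()` implement in user space, and the kernel's `__be16` / `__be32` types mark values already stored this way. A minimal sketch follows; `port_to_wire` and `addr_to_wire` are hypothetical helper names. It copies out the in-memory bytes after conversion so the big-endian layout is visible on any host.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert a 16-bit port to network byte order and
 * copy its in-memory representation into out[0..1]. */
void port_to_wire(uint16_t host_port, unsigned char out[2])
{
    uint16_t be = htons(host_port);      /* host order -> big endian */

    memcpy(out, &be, sizeof(be));
}

/* Hypothetical helper: same for a 32-bit IPv4 address. */
void addr_to_wire(uint32_t host_addr, unsigned char out[4])
{
    uint32_t be = htonl(host_addr);

    memcpy(out, &be, sizeof(be));
}
```

For port 8080 (0x1F90) the two bytes come out as 0x1F, 0x90, and for 0x7F000001 the four bytes are 127, 0, 0, 1, regardless of the host's own endianness.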