contributed by < yaohwang99 >
`socket()`, `bind()`, and `listen()` create the socket and bind it to the server address. `accept()` blocks the process until a client connects.

The epoll API monitors multiple file descriptors to see whether I/O is possible on any of them.
The central concept of the epoll API is the epoll instance, an in-kernel data structure which, from a user-space perspective, can be considered a container for two lists: the interest list and the ready list.
`epoll_create()` creates a new epoll instance and returns a fd referring to that instance. `epoll_ctl()` adds items to the interest list. `epoll_wait()` waits for I/O events (fetching items from the ready list), blocking the calling thread.

`epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listener, &ev)` links `ev` with `listener` and adds it to the interest list. `ev` describes `listener` as readable and edge-triggered. The above code initializes the epoll instance by adding `listener` to the interest list.
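A minimal user-space sketch of this setup (the port number, backlog, and the `setup_epoll()` helper name are illustrative, not taken from the original code):

```c
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* Create a non-blocking listening socket and register it with a new
 * epoll instance.  Returns the epoll fd; *out_listener gets the socket. */
static int setup_epoll(int *out_listener)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(listener, F_SETFL, fcntl(listener, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = htonl(INADDR_ANY),
        .sin_port = htons(12345),          /* illustrative port */
    };
    if (bind(listener, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(listener, 128) < 0) {
        perror("bind/listen");
        exit(EXIT_FAILURE);
    }

    int epoll_fd = epoll_create1(0);       /* new epoll instance */
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET,       /* readable, edge-triggered */
        .data.fd = listener,
    };
    /* link ev with listener and add it to the interest list */
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listener, &ev);

    *out_listener = listener;
    return epoll_fd;
}
```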
`epoll_wait()` writes all the ready events to `events[]` and returns the number of ready events.

Iterate through the ready events. If the event comes from `listener`, accept a new client. Note that `listener` is set to non-blocking, so if `accept()` fails the thread can continue to handle other events. The program also keeps track of the client list with `push_back_client()`.
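A sketch of that loop, building on the setup above (the helpers `push_back_client()` and `handle_client()` are placeholders for the real handling code):

```c
static void event_loop(int epoll_fd, int listener)
{
    struct epoll_event events[64];

    for (;;) {
        /* block until at least one fd in the interest list is ready;
         * the ready events are written into events[] */
        int n = epoll_wait(epoll_fd, events, 64, -1);

        for (int i = 0; i < n; i++) {
            if (events[i].data.fd != listener) {
                /* handle_client(events[i].data.fd);  serve the request */
                continue;
            }
            /* listener is non-blocking: accept() fails with EAGAIN
             * instead of blocking when no connection is pending */
            int client = accept(listener, NULL, NULL);
            if (client < 0)
                continue;

            struct epoll_event ev = {
                .events = EPOLLIN | EPOLLET,
                .data.fd = client,
            };
            epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client, &ev);
            /* push_back_client(client);  keep track of the client list */
        }
    }
}
```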
Test program by Manistein
From the test program we can see that the reader may fail to read all the data in the buffer, yet in edge-triggered mode the fd will not be reported as ready at the next call of `epoll_wait()`, so the remaining data in the buffer is never read.
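This is the classic edge-triggered pitfall: an event is delivered only when new data arrives, not while old data is still buffered. A common remedy (not taken from the test program itself) is to keep reading until `read()` reports `EAGAIN`:

```c
#include <errno.h>
#include <unistd.h>

/* Drain everything currently buffered for a non-blocking fd.  With
 * EPOLLET, stopping before EAGAIN means the leftover bytes never
 * trigger another epoll_wait() event. */
static void drain_fd(int fd)
{
    char buf[4096];

    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len > 0)
            continue;                        /* process len bytes here */
        if (len == 0)
            break;                           /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                           /* buffer fully drained */
        break;                               /* real error */
    }
}
```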
Create `MAX_THREAD` threads.

Each thread executes `bench_worker()`:
`pthread_cond_wait(&worker_wait, &worker_lock)` releases `worker_lock` and blocks until `pthread_cond_broadcast(&worker_wait)` is called by `bench()`. This step makes sure every thread is ready before connecting to the server.

After the connection finishes its read and write, the measured time is stored in `time_res[]`, which is protected by the mutex lock.
At line 13, the program calls `pthread_cond_broadcast()` to unblock every thread that is blocked on `worker_wait`.

At line 25, the program outputs the average response time computed from each thread's result.
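The start-line synchronization described above looks roughly like this (a sketch; the `ready` flag and the loop around `pthread_cond_wait()` are my additions to guard against spurious wakeups and are not necessarily in the original benchmark):

```c
#include <pthread.h>

static pthread_mutex_t worker_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t worker_wait = PTHREAD_COND_INITIALIZER;
static int ready;

static void *bench_worker(void *arg)
{
    pthread_mutex_lock(&worker_lock);
    while (!ready)
        /* atomically releases worker_lock and blocks this thread */
        pthread_cond_wait(&worker_wait, &worker_lock);
    pthread_mutex_unlock(&worker_lock);

    /* ... connect to the server, do the read/write round trip,
     * then store the elapsed time in time_res[] under a mutex ... */
    return NULL;
}

static void bench(void)
{
    pthread_mutex_lock(&worker_lock);
    ready = 1;
    /* wake every thread blocked on worker_wait at once */
    pthread_cond_broadcast(&worker_wait);
    pthread_mutex_unlock(&worker_lock);
}
```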
Concurrency Managed Work Queue
In the original wq implementation, a multi threaded (MT) wq had one worker thread per CPU and a single threaded (ST) wq had one worker thread system-wide. A single MT wq needed to keep around the same number of workers as the number of CPUs. The kernel grew a lot of MT wq users over the years and with the number of CPU cores continuously rising, some systems saturated the default 32k PID space just booting up.
Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with focus on the following goals:
CMWQ design:
At line 28, a new work queue is allocated with `WQ_UNBOUND`. The last argument `0` is `@max_active`, which determines the maximum number of execution contexts per CPU that can be assigned to the work items of a wq. For example, with `@max_active` of 16, at most 16 work items of the wq can be executing at the same time per CPU.

For an unbound wq, the limit is the higher of 512 and `4 * num_possible_cpus()`. These values are chosen sufficiently high such that they are not the limiting factor while providing protection in runaway cases.
The number of active work items of a wq is usually regulated by the users of the wq, more specifically, by how many work items the users may queue at the same time. Unless there is a specific need for throttling the number of active work items, specifying ‘0’ is recommended.
At line 29 of the above code, a kernel thread is created to run `echo_server_daemon()`, which creates new work items and inserts them into the workqueue.
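A rough sketch of those two lines (identifiers such as `echo_wq`, `start_server()`, and the thread name are illustrative):

```c
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/workqueue.h>

int echo_server_daemon(void *arg);   /* accept loop, defined elsewhere */

static struct workqueue_struct *echo_wq;
static struct task_struct *echo_thread;

static int start_server(void *param)
{
    /* WQ_UNBOUND: work items are not tied to a particular CPU;
     * max_active = 0: use the default limit described above */
    echo_wq = alloc_workqueue("echo_wq", WQ_UNBOUND, 0);
    if (!echo_wq)
        return -ENOMEM;

    /* run echo_server_daemon() in the background; it accepts clients
     * and queues one work item per connection onto echo_wq */
    echo_thread = kthread_run(echo_server_daemon, param, "echo_server");
    if (IS_ERR(echo_thread)) {
        destroy_workqueue(echo_wq);
        return PTR_ERR(echo_thread);
    }
    return 0;
}
```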
A socket is stored as a structure in the kernel:
The in-kernel server is created by the following steps:
1. `sock_create()` creates a socket.
2. `kernel_bind()` binds the socket to the server address.
3. `kernel_listen()` sets the socket to listen mode.
4. `kernel_accept()` creates a new socket for each client; that socket is then assigned to a work item and inserted into the work queue.
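A sketch of those steps using the in-kernel socket API (port, backlog, and error handling simplified):

```c
#include <linux/in.h>
#include <linux/net.h>
#include <net/sock.h>

static int open_listen_socket(struct socket **res)
{
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = htonl(INADDR_ANY),
        .sin_port = htons(12345),               /* illustrative port */
    };
    struct socket *sock;
    int err;

    err = sock_create(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
    if (err < 0)
        return err;

    err = kernel_bind(sock, (struct sockaddr *) &addr, sizeof(addr));
    if (err < 0)
        goto bail;

    err = kernel_listen(sock, 128);
    if (err < 0)
        goto bail;

    *res = sock;
    return 0;

bail:
    sock_release(sock);
    return err;
}

/* In the daemon loop, each accepted client becomes one work item:
 *     struct socket *client;
 *     if (kernel_accept(listen_sock, &client, 0) == 0)
 *         ... wrap client in a work item and queue it on the workqueue ...
 */
```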
From The Kernel Module Programming Guide:
4.5 Passing Command Line Arguments to a Module
Modules can take command line arguments, but not with the argc/argv you might be used to.
To allow arguments to be passed to your module, declare the variables that will take the values of the command line arguments as global and then use the module_param() macro (defined in include/linux/moduleparam.h) to set the mechanism up. At runtime, insmod will fill the variables with any command line arguments that are given, like insmod mymodule.ko myvariable=5. The variable declarations and macros should be placed at the beginning of the module for clarity. The example code should clear up my admittedly lousy explanation.
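For example, a module parameter for a port number could be declared like this (the name and default value are made up for illustration):

```c
#include <linux/module.h>
#include <linux/moduleparam.h>

static unsigned short port = 12345;              /* default value */
module_param(port, ushort, 0444);                /* world-readable in sysfs */
MODULE_PARM_DESC(port, "TCP port the server listens on");

/* Loaded with:  sudo insmod mymodule.ko port=8081 */
```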
Linux Application Performance introduction
process-based vs. thread-based vs. event-based
`select()`:
`nfds` is the highest-numbered file descriptor in any of the three sets, plus 1. A 32 * 32 bit map (1024 bits) records whether each fd is set.

`poll()`:
`nfds` specifies the number of `struct pollfd` entries passed in, so the monitored set is not limited by a fixed-size bitmap.

`epoll()`:
`select` and `poll` need to traverse the whole fd list to check whether any of them is set, while `epoll` only needs to check whether the ready list is non-empty. `select` or `poll` performs better when only very few fds are in the set.
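For comparison, a minimal `select()` call looks like this (the helper name is illustrative); the bitmap must be rebuilt and rescanned on every call:

```c
#include <sys/select.h>

/* Wait until sockfd becomes readable using select(). */
static int wait_readable(int sockfd)
{
    fd_set readfds;

    FD_ZERO(&readfds);             /* clear the fixed-size bitmap */
    FD_SET(sockfd, &readfds);      /* set the bit for sockfd */

    /* nfds is the highest-numbered fd in any set, plus 1 */
    int ready = select(sockfd + 1, &readfds, NULL, NULL, NULL);

    return ready > 0 && FD_ISSET(sockfd, &readfds);
}
```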
The reason that `pselect()` or `ppoll()` is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is needed to prevent race conditions.
(Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.)
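Roughly the pattern from the pselect(2) manual page (the signal and the flag variable are illustrative):

```c
#include <signal.h>
#include <sys/select.h>

static volatile sig_atomic_t got_signal;    /* set by the signal handler */

static int wait_fd_or_signal(int fd)
{
    sigset_t blockset, origmask;
    fd_set readfds;

    /* block SIGINT so it cannot slip in between the flag test below
     * and the call to pselect() */
    sigemptyset(&blockset);
    sigaddset(&blockset, SIGINT);
    sigprocmask(SIG_BLOCK, &blockset, &origmask);

    if (got_signal) {
        /* handle the signal that has already arrived ... */
    }

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    /* pselect() installs origmask (SIGINT unblocked) atomically for the
     * duration of the wait, closing the race window */
    return pselect(fd + 1, &readfds, NULL, NULL, NULL, &origmask);
}
```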
Similar to ktcp, the module creates a kernel thread to run `http_server_daemon()` in the background.

In `open_listen_socket()`, several options are set on the `sock`:
- `SO_REUSEADDR`: Reuse of local addresses is supported.
- `SO_SNDBUF`: Send buffer size.
- `SO_RCVBUF`: Receive buffer size.
- `TCP_NODELAY`: Applications that require lower latency on every packet sent must be run on sockets with `TCP_NODELAY` enabled.
- Remove `TCP_CORK`: When the logical packet has been built in the kernel by the various components in the application, tell TCP to remove the cork. TCP will send the accumulated logical packet right away, without waiting for any further packets from the application.
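A sketch of how two of these options can be set from kernel space; note that `kernel_setsockopt()` shown here was removed in Linux 5.8 in favor of per-option helpers such as `sock_set_reuseaddr()` and `tcp_sock_set_nodelay()`, so the exact calls depend on the kernel version:

```c
#include <linux/net.h>
#include <linux/tcp.h>
#include <net/sock.h>

/* Pre-5.8 style sketch using kernel_setsockopt(). */
static int tune_socket(struct socket *sock)
{
    int opt = 1;
    int err;

    /* allow the local address to be reused right after a restart */
    err = kernel_setsockopt(sock, SOL_SOCKET, SO_REUSEADDR,
                            (char *) &opt, sizeof(opt));
    if (err < 0)
        return err;

    /* disable Nagle's algorithm: send small packets immediately */
    return kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY,
                             (char *) &opt, sizeof(opt));
}
```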
`hstress` uses both multithreading and epoll to increase the number of requests per second.

The number of requests/sec is also affected by the number of threads and the concurrency level; by tweaking these parameters, we can see different results with the same server.
1. Define a new structure for binding the socket to the work item.
2. Allocate a workqueue.
3. Modify the worker function's argument to match the prototype used for a work item.
The key point is to use the `container_of()` macro to access `struct http_work *hwork`, and to remember to release the memory.
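A sketch of what that looks like, assuming a `struct http_work` that embeds the `work_struct` (field names are illustrative):

```c
#include <linux/net.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <net/sock.h>

/* binds the client socket to a work item */
struct http_work {
    struct socket *socket;
    struct work_struct work;
};

/* the work function must take a struct work_struct *, so container_of()
 * recovers the surrounding struct http_work from the embedded member */
static void http_server_worker(struct work_struct *work)
{
    struct http_work *hwork = container_of(work, struct http_work, work);

    /* ... read the request from hwork->socket and send the response ... */

    kernel_sock_shutdown(hwork->socket, SHUT_RDWR);
    sock_release(hwork->socket);
    kfree(hwork);                /* release the memory for the work item */
}

/* Queueing side (per accepted client):
 *     struct http_work *hwork = kmalloc(sizeof(*hwork), GFP_KERNEL);
 *     hwork->socket = client_sock;
 *     INIT_WORK(&hwork->work, http_server_worker);
 *     queue_work(http_wq, &hwork->work);
 */
```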
Result: