Friday, June 28, 2019

Memory Understanding and Disruptor in Multithreading Environment

Disruptor is an open-source Java library written by the LMAX Exchange team, built on an understanding of Mechanical Sympathy: making hardware and software work together so that a multi-threaded task completes as fast as possible.

The major bottleneck these days is not CPU speed but memory access speed.
To understand this better, you first need to understand how the hardware works.

Memory


1- Each CPU core has its own caches, except the L3 cache.
2- The caches are small; they are measured in megabytes, kilobytes, or even bytes,
but they are very fast to read from and write to, so they are important for completing a task in a short time.
3- The L3 cache, and higher-level caches, are shared between the cores and are not split.
4- Caches are generally sized in powers of two: 4, 8, 16 etc.


When a processor wants to access a memory address, it first checks L1. If the data is not there, it is a cache miss, which means the processor then checks L2. The same goes for L2 and L3, and the processor can end up reaching out to DRAM, which is about 60 times slower than L1.


Latency from CPU to...           Approx. number of CPU cycles   Approx. time in nanoseconds
Main memory                                                     ~60-80ns
QPI transit (between sockets)                                   ~20ns
L3 cache                         ~40-45 cycles                  ~15ns
L2 cache                         ~10 cycles                     ~3ns
L1 cache                         ~3-4 cycles                    ~1ns
Register                         1 cycle

Caches generally use SRAM (Static RAM), which is much faster but costlier than the DRAM (Dynamic RAM) used for main memory.




CPU caches transfer data in fixed-size blocks called cache lines. Modern cache lines are 64 bytes, so the CPU fetches 64 bytes from memory into the cache at a time and writes the same number of bytes back to memory.

This means that if you have an array of 4-byte integers, whose values lie next to each other in memory, the CPU will fetch 16 of those integers at once.

The processor will read or write an entire cache line when any location in the 64 byte region is read or written.

This is also the reason why, if you create a 2-D matrix, its row-major traversal will be faster than its column-major traversal: a row is stored in consecutive memory locations and is thus fetched a cache line at a time.
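As a rough illustration, here is a minimal Java sketch (the class name TraversalOrder is just for this example) that traverses the same 2-D array in both orders; on typical hardware the row-major loop is noticeably faster because it walks consecutive addresses that arrive together in cache lines.

public class TraversalOrder {
    public static void main(String[] args) {
        int n = 4096;
        int[][] matrix = new int[n][n];
        long sum = 0;

        long start = System.nanoTime();
        for (int i = 0; i < n; i++)       // row-major: consecutive addresses, good cache-line reuse
            for (int j = 0; j < n; j++)
                sum += matrix[i][j];
        System.out.println("row-major:    " + (System.nanoTime() - start) / 1_000_000 + " ms");

        start = System.nanoTime();
        for (int j = 0; j < n; j++)       // column-major: jumps a whole row (n ints) per step
            for (int i = 0; i < n; i++)
                sum += matrix[i][j];
        System.out.println("column-major: " + (System.nanoTime() - start) / 1_000_000 + " ms (" + sum + ")");
    }
}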

A cache is divided into a number of sets, and each set can hold a number of cache lines.

Cache Invalidation 

What happens when the same data sits in two different cache locations?
For example, a variable x might exist in the L1 cache of Core 1 and also in the L1 cache of Core 2.
What will happen when you update that variable?

When multiple threads access the same value at the same time (a data race) and at least one of them is a write, every write to a memory location in one cache invalidates all the other caches' references to that location, and the data is fetched again from memory by those cores if required. However, since the unit of caching is the cache line, the whole line is invalidated rather than only the required bytes.

Let's take an example with two variables x and y accessed by two threads scheduled on two different cores. The first thread modifies x and the second modifies y a few nanoseconds later.



If both variables are part of the same cache line (hashing their addresses gives the same cache line), the second core will see its whole line marked as invalid even though the program was only interested in y and not x. As a result, the second core will fetch a more recent copy of this line from somewhere else (L2, L3 or DRAM). This problem is known as false sharing.



This creates new problems for multithreaded systems when two or more threads try to simultaneously modify bytes falling on the same cache line, and no compiler warning is going to tell you that you just wrote code that is going to be very inefficient for concurrent access.

Avoid cache line sharing between threads (false sharing) by Cache Line Padding

You'll see that the Disruptor eliminates this problem, at least for architectures with a cache line size of 64 bytes or less, by padding either side of the hot field with 7 longs, ensuring the ring buffer's sequence number is never in a cache line with anything else.
public long p1, p2, p3, p4, p5, p6, p7;      // cache line padding
private volatile long cursor = INITIAL_CURSOR_VALUE;
public long p8, p9, p10, p11, p12, p13, p14; // cache line padding
So there's no false sharing, no unintended contention with any other variables, no needless cache misses.

It's worth doing this on your Entry classes too - if you have different consumers writing to different fields, you're going to need to make sure there's no false sharing between those fields, as sketched below.
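A minimal sketch of what such padding could look like on your own entry class (the class and field names here are illustrative, not from the Disruptor source): each hot field is surrounded by 7 longs so that two consumers writing to different fields never land on the same 64-byte cache line.

public class PaddedEntry {
    public long p1, p2, p3, p4, p5, p6, p7;        // padding before fieldA
    private volatile long fieldA;                  // written by consumer A
    public long p8, p9, p10, p11, p12, p13, p14;   // padding between fieldA and fieldB
    private volatile long fieldB;                  // written by consumer B
    public long p15, p16, p17, p18, p19, p20, p21; // padding after fieldB

    public void setFieldA(long value) { fieldA = value; }
    public long getFieldA()           { return fieldA; }

    public void setFieldB(long value) { fieldB = value; }
    public long getFieldB()           { return fieldB; }
}

Note that the JVM is free to reorder object fields, so padding like this is a best-effort layout rather than a hard guarantee.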


Out of Order Execution

Out of Order (OoO) execution means that when a program runs it doesn't matter if its instructions are re-ordered, provided the same end result is achieved.

For performance reasons, most modern CPUs employ optimizations that may result in out-of-order execution.


CPU cores contain multiple execution units. For example, a modern Intel CPU contains 6 execution units which can do a combination of arithmetic, conditional logic, and memory manipulation.  Each execution unit can do some combination of these tasks.  These execution units can operate in parallel allowing instructions to be executed in parallel.


The CPU, and the compiler, are free to do whatever they see fit to improve performance.



In a multithreaded program you need to ensure that the instructions on either side of the barrier appear in the correct program order when observed from another CPU, and that the memory is made visible by ensuring the data is propagated to the cache sub-system.
This can be done by using a memory barrier.
Example:

int x = 0, y = 0, r1 = 0, r2 = 0;

public void Thread1()
{
    y = 1;  // Store y
    r1 = x; // Load x
}

public void Thread2()
{
    x = 1;  // Store x
    r2 = y; // Load y
}

When Thread1 and Thread2 start, they can execute in different orders or run at the same time, which leads to different results depending on the execution order.

A processor can also defer its store operations in a buffer, and this causes operations to appear out of order. If both stores to x and y are delayed, then the load of x in Thread1 will read 0, and the load of y in Thread2 will read 0 too!

To fix the problem, add a memory barrier between the store and load operations in each thread (Thread.MemoryBarrier() in C#; in Java you can declare x and y volatile or issue a full fence), as sketched below.
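For Java, a minimal sketch of the fix could look like the following (StoreLoadExample is a made-up class name; VarHandle.fullFence() requires Java 9 or later). With the fences in place the outcome r1 == 0 and r2 == 0 can no longer be observed.

import java.lang.invoke.VarHandle;

public class StoreLoadExample {
    static int x = 0, y = 0, r1 = 0, r2 = 0;

    static void thread1() {
        y = 1;                  // Store y
        VarHandle.fullFence();  // full barrier: the store to y is globally visible before x is loaded
        r1 = x;                 // Load x
    }

    static void thread2() {
        x = 1;                  // Store x
        VarHandle.fullFence();  // full barrier: the store to x is globally visible before y is loaded
        r2 = y;                 // Load y
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(StoreLoadExample::thread1);
        Thread t2 = new Thread(StoreLoadExample::thread2);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println("r1=" + r1 + ", r2=" + r2); // never "r1=0, r2=0" with the fences
    }
}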

Memory Fences

A memory barrier, also known as a membar, memory fence or fence instruction,
is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.

There are different kinds of memory barriers:
1- Store memory barrier (write memory barrier): All the STORE operations that appear before the memory barrier will appear to happen before all the STORE operations that appear after it (the equivalent CPU instruction is SFENCE).
The memory barrier simply ensures that all writes to the protected data will be visible to other CPUs or devices once the write that releases the lock is visible.
The store buffer is flushed to the cache for the CPU on which the barrier is issued.
This makes the program state visible to other CPUs so they can act on it if necessary.
2- Load memory barrier (read memory barrier): All the LOAD operations that appear before the barrier will appear to happen before all the LOAD operations that appear after it (the equivalent CPU instruction is LFENCE).
3- Full memory barrier: All the LOAD and STORE operations that appear before the barrier will appear to happen before all the LOAD and STORE operations that appear after it (the equivalent CPU instruction is MFENCE).

When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.

Example
The following two-processor program gives an example of how such out-of-order execution can affect program behavior:
Initially, memory locations x and f both hold the value 0. The program running on processor #1 loops while the value of f is zero, then it prints the value of x. The program running on processor #2 stores the value 42 into x and then stores the value 1 into f. Pseudo-code for the two program fragments is shown below. The steps of the program correspond to individual processor instructions.
Processor #1:

 while (f == 0);
 // Memory fence required here
 print x;

Processor #2:

 x = 42;
 // Memory fence required here
 f = 1;
One might expect the print statement to always print the number "42"; however, if processor #2's store operations are executed out-of-order, it is possible for f to be updated before x, and the print statement might therefore print 0. Similarly, processor #1's load operations may be executed out-of-order, and it is possible for x to be read before f is checked, so again the print statement might print an unexpected value. For most programs neither of these situations is acceptable. A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f. Another can be inserted before processor #1's access to x to ensure the value of x is not read prior to seeing the change in the value of f.
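In Java, a minimal sketch of the same idea (the class and field names are illustrative): declaring f volatile makes the JVM emit the required barriers, so a thread that sees f == 1 is guaranteed to also see x == 42.

public class Publish {
    static int x = 0;           // plain data
    static volatile int f = 0;  // the volatile flag plays the role of the memory fence

    public static void main(String[] args) {
        Thread reader = new Thread(() -> {
            while (f == 0) { }       // spin until the flag becomes visible
            System.out.println(x);   // always prints 42
        });
        Thread writer = new Thread(() -> {
            x = 42;  // plain store, ordered before the volatile store below
            f = 1;   // volatile store publishes x together with f
        });
        reader.start();
        writer.start();
    }
}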

The other thing a memory barrier does is force an update of the various CPU caches - for example, a write barrier will flush all the data that was written before the barrier out to cache, so any other thread that tries to read that data will get the most up-to-date version regardless of which core or which socket it is executing on.

If your field is volatile, the Java Memory Model inserts a write barrier instruction after you write to it, and a read barrier instruction before you read from it.



This means if you write to a volatile field, you know that:
  1. Any thread accessing that field after the point at which you wrote to it will get the updated value 
  2. Anything you did before you wrote that field is guaranteed to have happened and any updated data values will also be visible, because the memory barrier flushed all earlier writes to the cache.

Volatile

Volatile variables require less coding to use than synchronized blocks and often have less runtime overhead because no locks are involved, but they can only be used for a subset of the things that synchronized can do.
Locks offer two primary features: mutual exclusion and visibility. 
Mutual exclusion means that only one thread at a time may hold a given lock.
Visibility is ensuring that changes made to shared data prior to releasing a lock are made visible to other threads.
Volatile variables share the visibility features of synchronized, but none of the atomicity features.
If you have a field that you need to first read and only then write based on the value you just read, volatile is not enough; instead, you can use AtomicReference, AtomicInteger, or synchronization.
Volatile alone is not strong enough to implement a counter, a mutex, or any class that has invariants that relate multiple variables (such as "start <= end").
You might prefer to use volatile variables instead of locks for one of two principal reasons: simplicity or scalability.
volatile variables (unlike locks) cannot cause a thread to block, so they are less likely to cause scalability problems.
Volatile can be used to provide thread safety, but only in a very restricted set of cases.

Conditions for correct use of volatile

You can use volatile variables instead of locks only under a restricted set of circumstances. 
Both of the following criteria must be met for volatile variables to provide the desired thread-safety:
- Writes to the variable do not depend on its current value.
- The variable does not participate in invariants with other variables.

While the increment operation (x++) may look like a single operation, it is really a compound read-modify-write sequence of operations that must execute atomically.
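A minimal sketch of the difference (the class name is illustrative): two threads incrementing a volatile int can lose updates, while AtomicInteger performs the whole read-modify-write atomically.

import java.util.concurrent.atomic.AtomicInteger;

public class CounterExample {
    static volatile int volatileCounter = 0;                     // visibility only, not atomicity
    static final AtomicInteger atomicCounter = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                volatileCounter++;                // compound read-modify-write: updates can be lost
                atomicCounter.incrementAndGet();  // atomic read-modify-write
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();

        System.out.println("volatile counter: " + volatileCounter);     // usually less than 200000
        System.out.println("atomic counter:   " + atomicCounter.get()); // always 200000
    }
}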

Disruptor

In 2010, the LMAX Exchange team surprised everyone with how fast their system could be while doing all the business logic on a single thread.

Even though a single thread was an important concept in their solution, the Disruptor (which is not part of the business logic) works in a multithreaded environment.

The Disruptor is a high-performance library for passing messages between threads.
It is based on a ring buffer.
The ring buffer's cursor is volatile, as you can see in the padding code above, and that is one of the reasons we can get away with implementing the Disruptor without locking.
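For reference, a minimal sketch of typical Disruptor 3.x usage (the LongEvent class and the handler here are illustrative, not taken from the downloadable example below): a producer publishes long values into the ring buffer and an EventHandler consumes them on a thread managed by the Disruptor.

import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import java.util.concurrent.Executors;

public class DisruptorSketch {
    // Mutable event object that lives in the ring buffer and is reused.
    static class LongEvent {
        long value;
    }

    public static void main(String[] args) throws Exception {
        int bufferSize = 1024; // must be a power of two

        Disruptor<LongEvent> disruptor = new Disruptor<>(
                LongEvent::new,                    // event factory pre-allocates the entries
                bufferSize,
                Executors.defaultThreadFactory());

        // Consumer: runs on its own thread managed by the Disruptor.
        disruptor.handleEventsWith((EventHandler<LongEvent>) (event, sequence, endOfBatch) ->
                System.out.println("consumed " + event.value + " at sequence " + sequence));

        disruptor.start();

        // Producer: claims a slot, fills it, and publishes the sequence (the cursor).
        RingBuffer<LongEvent> ringBuffer = disruptor.getRingBuffer();
        for (long i = 0; i < 10; i++) {
            long value = i;
            ringBuffer.publishEvent((event, sequence) -> event.value = value);
        }

        Thread.sleep(100);   // give the consumer time to drain the buffer
        disruptor.shutdown();
    }
}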


Download Disruptor Example





Reference: https://en.wikipedia.org/wiki/Memory_barrier
