C - a general question about solving problems with parallelisation
I have a general question about programming parallel algorithms in C. Say our job is to implement some matrix algorithm with MPI and/or OpenMP. There are situations, such as false sharing in OpenMP, or in MPI where the communication depends on the matrix dimensions (a matrix divided among the processes column-cyclically), that cause problems. Would it be a good and common practice to try to resolve these situations by, for example, transposing the matrix, since that would reduce the necessary communication or avoid the false-sharing problem? Afterwards you would undo the transposition, of course, assuming the resulting speedup outweighs its cost. I suspect this is not very clever and is more of a lazy way out, but I am curious to read some opinions on it.
Let's take the first question first: when can it make sense to transpose? The answer is: it depends, and you can estimate in advance whether it will improve things or not.
Transposing the matrix and transposing it back costs roughly 2 * (a fast pass through memory) + 2 * (a slow pass through memory), where in the multicore case those are literal memory operations, and in the distributed case they are network communications: you read the matrix in the fast direction and write it out in the slow direction. (You can do somewhat better by transposing one cache-sized block at a time: read a block in the fast direction, transpose it within the cache, and write it out contiguously.) Essentially, the round trip costs 4 passes through memory.
Whether that is a win depends on how many times you would otherwise access the array. If you would sweep the whole non-transposed array 4 times with memory accesses in the "wrong" direction, you clearly win by doing the two transposes. If you would only go through the non-transposed array once in the wrong direction, you almost certainly will not win by transposing.
As to the bigger question, @Elexandre is absolutely right here: trying to roll your own linear algebra routines is madness. There can be factors of 40 (say) between a naive and a highly tuned implementation of, e.g., the GEMM operation. These things are hugely memory-bandwidth limited, which in the parallel case means network limited; you are far better off using a tuned parallel linear algebra library, or a full solver environment such as