MPI Hands-On – Version 2.3 – January 2017
J. Chergui, I. Dupays, D. Girou, P.-F. Lavallée, D. Lecas, P. Wautelet
INSTITUT DU DÉVELOPPEMENT ET DES RESSOURCES EN INFORMATIQUE SCIENTIFIQUE (IDRIS)
MPI Hands-On – List of the exercises

1 – MPI Hands-On – Exercise 1: MPI Environment
2 – MPI Hands-On – Exercise 2: Ping-pong
3 – MPI Hands-On – Exercise 3: Collective communications and reductions
4 – MPI Hands-On – Exercise 4: Matrix transpose
5 – MPI Hands-On – Exercise 5: Matrix-matrix product
6 – MPI Hands-On – Exercise 6: Communicators
7 – MPI Hands-On – Exercise 7: Read an MPI-IO file
8 – MPI Hands-On – Exercise 8: Poisson's equation

1 – MPI Hands-On – Exercise 1: MPI Environment

All the processes print a different message, depending on whether their rank is odd or even. For example, for the odd-ranked processes, the message will be:

    I am the odd-ranked process, my rank is M

For the even-ranked processes:

    I am the even-ranked process, my rank is N

Remark: You can use the Fortran intrinsic function mod to test whether the rank is even or odd. mod(n,m) returns the remainder of n divided by m; for example, mod(rank,2) == 0 is true exactly when rank is even.

2 – MPI Hands-On – Exercise 2: Ping-pong

Point-to-point communications: a ping-pong between two processes.

In the first sub-exercise, we will do only a ping (sending a message from process 0 to process 1). In the second sub-exercise, after the ping we will do a pong (process 1 sends back the message received from process 0). In the last sub-exercise, we will do a ping-pong with different message sizes. This means:

1. Send a message of 1000 reals from process 0 to process 1 (this is only a ping).
2. Create a ping-pong version where process 1 sends back the message received from process 0, and measure the communication time with the MPI_WTIME() function.
3. Create a version where the message size varies in a loop and which measures communication durations and bandwidths.

Remarks:

Random numbers uniformly distributed in the range [0., 1.[ are generated by calling the Fortran random_number subroutine:

    call random_number(variable)

where variable can be a scalar or an array.

The time measurements can be done like this:

    time_begin = MPI_WTIME()
    ...
    time_end = MPI_WTIME()
    print '("... in ",f8.6," seconds.")', time_end - time_begin
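As an illustration of sub-exercise 2, here is a minimal sketch of the timed ping-pong. The names nb_values, values and tag are hypothetical, not taken from the exercise skeleton:

    program ping_pong
      use mpi
      implicit none
      integer, parameter :: nb_values = 1000, tag = 99
      real, dimension(nb_values) :: values
      integer :: rank, code
      integer, dimension(MPI_STATUS_SIZE) :: status
      double precision :: time_begin, time_end

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      if (rank == 0) then
        call random_number(values)
        time_begin = MPI_WTIME()
        ! Ping: send to process 1, then wait for the pong
        call MPI_SEND(values, nb_values, MPI_REAL, 1, tag, MPI_COMM_WORLD, code)
        call MPI_RECV(values, nb_values, MPI_REAL, 1, tag, MPI_COMM_WORLD, status, code)
        time_end = MPI_WTIME()
        print '("Ping-pong of ",i6," reals in ",f8.6," seconds.")', &
              nb_values, time_end - time_begin
      else if (rank == 1) then
        ! Pong: receive from process 0 and send the same message back
        call MPI_RECV(values, nb_values, MPI_REAL, 0, tag, MPI_COMM_WORLD, status, code)
        call MPI_SEND(values, nb_values, MPI_REAL, 0, tag, MPI_COMM_WORLD, code)
      end if

      call MPI_FINALIZE(code)
    end program ping_pong

For sub-exercise 3, the same pattern is wrapped in a loop over message sizes; the bandwidth is obtained by dividing the message size by half of the measured round-trip time.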
3 – MPI Hands-On – Exercise 3: Collective communications and reductions

The aim of this exercise is to compute π by numerical integration:

    π = ∫₀¹ 4/(1+x²) dx

We use the rectangle method (midpoint rule). Let f(x) = 4/(1+x²) be the function to integrate; nbblock is the number of discretization points and width = 1/nbblock is the discretization step and the width of all the rectangles.

A sequential version is available in the pi.f90 source file. You have to write the parallel MPI version in this file (a sketch of the reduction step is given below, after Exercise 4).

4 – MPI Hands-On – Exercise 4: Matrix transpose

The goal of this exercise is to practice with derived datatypes. A is a matrix with 5 lines and 4 columns defined on process 0. Process 0 sends its matrix A to process 1, transposing it during the send.

    Process 0              Process 1
     1.  6. 11. 16.         1.  2.  3.  4.  5.
     2.  7. 12. 17.         6.  7.  8.  9. 10.
     3.  8. 13. 18.        11. 12. 13. 14. 15.
     4.  9. 14. 19.        16. 17. 18. 19. 20.
     5. 10. 15. 20.

Figure 1 : Matrix transpose

To do this, we need to create two derived datatypes: a datatype type_line and a datatype type_transpose (a possible construction is sketched below).
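For Exercise 3, a minimal sketch of the parallel version, assuming a round-robin distribution of the rectangles; the distribution scheme and the variable names are an assumption, not the content of pi.f90:

    program pi_mpi
      use mpi
      implicit none
      integer, parameter :: nbblock = 3000000
      integer :: rank, nb_procs, i, code
      double precision :: width, x, partial_pi, pi

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)

      width = 1.d0 / nbblock
      partial_pi = 0.d0
      ! Round-robin distribution: process "rank" handles rectangles
      ! rank+1, rank+1+nb_procs, rank+1+2*nb_procs, ...
      do i = rank + 1, nbblock, nb_procs
        x = width * (i - 0.5d0)                       ! midpoint of rectangle i
        partial_pi = partial_pi + width * 4.d0 / (1.d0 + x*x)
      end do

      ! Sum the partial results on process 0
      call MPI_REDUCE(partial_pi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                      MPI_COMM_WORLD, code)
      if (rank == 0) print '("Computed pi = ",f16.12)', pi

      call MPI_FINALIZE(code)
    end program pi_mpi

MPI_ALLREDUCE could be used instead of MPI_REDUCE if every process needed the result.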
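For Exercise 4, here is one possible construction of the two datatypes: type_line describes one line of A (4 reals separated by a stride of 5), and type_transpose is type_line with its extent resized to a single real, so that sending 5 consecutive "lines" traverses A in row-major order; a plain contiguous receive then yields the transpose. This is a sketch of one approach; the provided skeleton may organize it differently. Run with 2 processes.

    program transpose
      use mpi
      implicit none
      integer, parameter :: nb_lines = 5, nb_columns = 4
      real, dimension(nb_lines, nb_columns) :: a
      real, dimension(nb_columns, nb_lines) :: at
      integer :: rank, code, size_real, type_line, type_transpose, i
      integer, dimension(MPI_STATUS_SIZE) :: status
      integer(kind=MPI_ADDRESS_KIND) :: lb, extent

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      ! One line of A: nb_columns elements separated by a stride of nb_lines reals
      call MPI_TYPE_VECTOR(nb_columns, 1, nb_lines, MPI_REAL, type_line, code)
      ! Shrink the extent to one real so consecutive lines start one element apart
      call MPI_TYPE_SIZE(MPI_REAL, size_real, code)
      lb = 0
      extent = size_real
      call MPI_TYPE_CREATE_RESIZED(type_line, lb, extent, type_transpose, code)
      call MPI_TYPE_COMMIT(type_transpose, code)

      if (rank == 0) then
        a = reshape([(real(i), i = 1, nb_lines*nb_columns)], shape(a))
        ! Sending nb_lines resized lines traverses A in row-major order
        call MPI_SEND(a, nb_lines, type_transpose, 1, 99, MPI_COMM_WORLD, code)
      else if (rank == 1) then
        ! The contiguous receive stores the transposed matrix
        call MPI_RECV(at, nb_lines*nb_columns, MPI_REAL, 0, 99, &
                      MPI_COMM_WORLD, status, code)
        do i = 1, nb_columns
          print '(5f6.1)', at(i,:)
        end do
      end if

      call MPI_FINALIZE(code)
    end program transpose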
5 – MPI Hands-On – Exercise 5: Matrix-matrix product

Collective communications: matrix-matrix product C = A × B.

The matrices are square and their size is a multiple of the number of processes. Matrices A and B are defined on process 0. Process 0 sends a horizontal slice of matrix A and a vertical slice of matrix B to each process. Each process then calculates its diagonal block of matrix C. To calculate the non-diagonal blocks, each process sends its own slice of A to the other processes (see Figure 2). At the end, process 0 gathers and verifies the results.

Figure 2 : Distributed matrix product

The algorithm that may seem the most immediate and the easiest to program, in which each process sends its slice of matrix A to each of the others, does not perform well, because the communication pattern is not well balanced. This is easy to see when doing performance measurements and graphically representing the collected traces. See the files produit_matrices_v1_n3200_p4.slog2, produit_matrices_v1_n6400_p8.slog2 and produit_matrices_v1_n6400_p16.slog2, using the jumpshot tool of MPE (Multi-Processing Environment).

Figure 3 : Parallel matrix product on 4 processes, for a matrix size of 3200 (first algorithm)

Figure 4 : Parallel matrix product on 16 processes, for a matrix size of 6400 (first algorithm)

By changing the algorithm so that the slices are shifted from process to process, we obtain a perfect balance between calculations and communications, and a speedup of 2 compared to the naive algorithm (a sketch of this shifting step is given below, after Exercise 7). See the figure produced from the file produit_matrices_v2_n6400_p16.slog2.

Figure 5 : Parallel matrix product on 16 processes, for a matrix size of 6400 (second algorithm)

6 – MPI Hands-On – Exercise 6: Communicators

Using the Cartesian topology defined below, subdivide it into 2 communicators, one per line, by calling MPI_COMM_SPLIT() (see the sketch below, after Exercise 7). In the figure, the array v(:) = 1, 2, 3, 4, defined on the first process of each line, is distributed so that each process of the line gets one value w.

Figure 6 : Subdivision of a 2D topology and communication using the obtained 1D topology

7 – MPI Hands-On – Exercise 7: Read an MPI-IO file

We have a binary file data.dat containing 484 integer values. With 4 processes, the exercise consists of reading the first 121 values on process 0, the next 121 on process 1, and so on. We will use 4 different methods:

- Read via explicit offsets, in individual mode
- Read via shared file pointers, in collective mode
- Read via individual file pointers, in individual mode
- Read via shared file pointers, in individual mode

To compile and execute the code, use make; to verify the results, use make verification, which runs a visualisation program corresponding to the four cases.
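For Exercise 5, the distinctive step of the second algorithm is the ring shift of the A slices. A minimal self-contained sketch of that step follows; the slice content is a stand-in and the block product is only indicated by a comment:

    program ring_shift
      use mpi
      implicit none
      integer, parameter :: tag = 7
      integer :: rank, nb_procs, rank_prev, rank_next, step, code
      integer, dimension(MPI_STATUS_SIZE) :: status
      double precision, dimension(4) :: slice   ! stand-in for a slice of A

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      rank_prev = mod(rank - 1 + nb_procs, nb_procs)
      rank_next = mod(rank + 1, nb_procs)

      slice = rank   ! each process starts with its own slice
      do step = 1, nb_procs - 1
        ! Pass the current slice along the ring and receive the next one in place
        call MPI_SENDRECV_REPLACE(slice, size(slice), MPI_DOUBLE_PRECISION, &
                                  rank_prev, tag, rank_next, tag, &
                                  MPI_COMM_WORLD, status, code)
        ! Here the block product with the received slice would be accumulated into C
      end do
      call MPI_FINALIZE(code)
    end program ring_shift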
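For Exercise 6, a minimal sketch of the split; the 4×2 layout and the scatter of v are read off Figure 6, but the ordering of dims is an assumption. Run with 8 processes:

    program split_comm
      use mpi
      implicit none
      integer :: code, rank, comm_2d, comm_line, rank_in_line
      integer, dimension(2) :: dims, coords
      logical, dimension(2) :: periods
      real :: w
      real, dimension(4) :: v

      call MPI_INIT(code)
      dims = [4, 2]                 ! 4 processes per line, 2 lines (assumed layout)
      periods = .false.
      call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods, .false., comm_2d, code)
      call MPI_COMM_RANK(comm_2d, rank, code)
      call MPI_CART_COORDS(comm_2d, rank, 2, coords, code)

      ! Processes with the same second coordinate (same line) share a communicator
      call MPI_COMM_SPLIT(comm_2d, coords(2), rank, comm_line, code)
      call MPI_COMM_RANK(comm_line, rank_in_line, code)

      ! Example use of the line communicator: scatter v from its first process
      v = [1., 2., 3., 4.]          ! only significant on the root of each line
      call MPI_SCATTER(v, 1, MPI_REAL, w, 1, MPI_REAL, 0, comm_line, code)
      print '("line ",i1,", rank ",i1,": w = ",f3.1)', coords(2), rank_in_line, w

      call MPI_FINALIZE(code)
    end program split_comm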
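For Exercise 7, the first method (read via explicit offsets, in individual mode) could look like this sketch; the variable names are hypothetical. Run with 4 processes:

    program read_offsets
      use mpi
      implicit none
      integer, parameter :: nb_per_proc = 121    ! 484 values over 4 processes
      integer, dimension(nb_per_proc) :: values
      integer :: rank, fh, code, size_int
      integer(kind=MPI_OFFSET_KIND) :: offset
      integer, dimension(MPI_STATUS_SIZE) :: status

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      call MPI_FILE_OPEN(MPI_COMM_WORLD, "data.dat", MPI_MODE_RDONLY, &
                         MPI_INFO_NULL, fh, code)

      ! Each process reads its own 121 values at an explicit offset, in bytes
      call MPI_TYPE_SIZE(MPI_INTEGER, size_int, code)
      offset = int(rank, MPI_OFFSET_KIND) * nb_per_proc * size_int
      call MPI_FILE_READ_AT(fh, offset, values, nb_per_proc, MPI_INTEGER, &
                            status, code)

      call MPI_FILE_CLOSE(fh, code)
      call MPI_FINALIZE(code)
    end program read_offsets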
8 – MPI Hands-On – Exercise 8: Poisson's equation

Resolution of the following Poisson equation:

    ∂²u/∂x² + ∂²u/∂y² = f(x,y)   in [0,1]×[0,1]
    u(x,y) = 0                   on the boundaries
    f(x,y) = 2(x² − x + y² − y)

We will solve this equation with a domain decomposition method:

- The equation is discretized on the domain with a finite difference method.
- The obtained system is solved with a Jacobi solver.
- The global domain is split into sub-domains.

The exact solution is known: uexact(x,y) = x y (x − 1) (y − 1).

To discretize the equation, we define a grid with a set of points (x_i, y_j):

    x_i = i hx   for i = 0, ..., ntx+1
    y_j = j hy   for j = 0, ..., nty+1
    hx = 1/(ntx+1)   x-wise step
    hy = 1/(nty+1)   y-wise step
    ntx : number of x-wise interior points
    nty : number of y-wise interior points

In total, there are ntx+2 points in the x direction and nty+2 points in the y direction.

Let u(i,j) be the estimated solution at position x_i = i hx and y_j = j hy. The Jacobi solver consists of computing

    u(i,j)^(n+1) = c0 * ( c1 * ( u(i+1,j)^n + u(i-1,j)^n )
                        + c2 * ( u(i,j+1)^n + u(i,j-1)^n ) − f(i,j) )

with

    c0 = hx² hy² / ( 2 (hx² + hy²) ),   c1 = 1/hx²,   c2 = 1/hy²

In parallel, the interface values of the sub-domains must be exchanged between the neighbours. We use ghost cells as receive buffers.

Figure 7 : Exchange of the points on the interfaces with the N, S, E and W neighbours
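The exchange step (Figure 7) could look like the following minimal sketch, assuming the local array u(sx-1:ex+1, sy-1:ey+1) of Figure 8 below, double precision values, and neighbour ranks already obtained (for example with MPI_CART_SHIFT; a missing neighbour is MPI_PROC_NULL, which MPI_SENDRECV accepts). The datatype names type_line and type_column follow the exercise, but their exact definitions here are an assumption:

    subroutine exchange_interfaces(u, sx, ex, sy, ey, comm_2d, &
                                   rank_prev_x, rank_next_x, rank_prev_y, rank_next_y)
      use mpi
      implicit none
      integer, intent(in) :: sx, ex, sy, ey, comm_2d
      integer, intent(in) :: rank_prev_x, rank_next_x, rank_prev_y, rank_next_y
      double precision, dimension(sx-1:ex+1, sy-1:ey+1), intent(inout) :: u
      integer, parameter :: tag = 100
      integer :: type_line, type_column, code
      integer, dimension(MPI_STATUS_SIZE) :: status

      ! type_column: one constant-y set of points u(sx:ex, j), contiguous in memory
      call MPI_TYPE_CONTIGUOUS(ex - sx + 1, MPI_DOUBLE_PRECISION, type_column, code)
      call MPI_TYPE_COMMIT(type_column, code)
      ! type_line: one constant-x set of points u(i, sy:ey), strided by the
      ! leading dimension of u
      call MPI_TYPE_VECTOR(ey - sy + 1, 1, ex - sx + 3, MPI_DOUBLE_PRECISION, &
                           type_line, code)
      call MPI_TYPE_COMMIT(type_line, code)

      ! x-wise: send the first interior line, receive into the opposite ghost line
      call MPI_SENDRECV(u(sx, sy),   1, type_line, rank_prev_x, tag, &
                        u(ex+1, sy), 1, type_line, rank_next_x, tag, &
                        comm_2d, status, code)
      ! ... and symmetrically in the other direction
      call MPI_SENDRECV(u(ex, sy),   1, type_line, rank_next_x, tag, &
                        u(sx-1, sy), 1, type_line, rank_prev_x, tag, &
                        comm_2d, status, code)
      ! y-wise: same pattern with type_column
      call MPI_SENDRECV(u(sx, sy),   1, type_column, rank_prev_y, tag, &
                        u(sx, ey+1), 1, type_column, rank_next_y, tag, &
                        comm_2d, status, code)
      call MPI_SENDRECV(u(sx, ey),   1, type_column, rank_next_y, tag, &
                        u(sx, sy-1), 1, type_column, rank_prev_y, tag, &
                        comm_2d, status, code)

      call MPI_TYPE_FREE(type_line, code)
      call MPI_TYPE_FREE(type_column, code)
    end subroutine exchange_interfaces

In the real code, the two datatypes would of course be created once at initialisation rather than at every iteration.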
Figure 8 : Numbering of the points in the different sub-domains — the interior points of a sub-domain are u(sx:ex, sy:ey); the ghost points lie at indexes sx−1, ex+1, sy−1 and ey+1

Figure 9 : Process rank numbering in the sub-domains

Figure 10 : Writing the global matrix u in a file

You need to:

- define a view, to see only the owned part of the global matrix u;
- define a type, in order to write the local part of the matrix u (without the interfaces);
- apply the view to the file;
- write using only one call.

(A sketch of these four steps is given at the end of this exercise.)

The parallel program performs the following steps:

- Initialisation of the MPI environment.
- Creation of the 2D Cartesian topology.
- Determination of the array indexes for each sub-domain.
- Determination of the 4 neighbour processes of each sub-domain.
- Creation of two derived datatypes, type_line and type_column.
- Exchange of the values on the interfaces with the other sub-domains.
- Computation of the global error. When the global error is lower than a specified value (machine precision, for example), we consider that we have reached the exact solution.
- Collection of the global matrix u (the same as the one obtained in the sequential version) into an MPI-IO file data.dat.

A skeleton of the parallel version is proposed: it consists of a main program (poisson.f90) and several subroutines. All the modifications have to be done in the module_parallel_mpi.f90 file. To compile and execute the code, use make; to verify the results, use make verification, which runs a program that reads the data.dat file and compares it with the sequential version.
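As an illustration of the four MPI-IO steps above, here is a minimal sketch based on MPI_TYPE_CREATE_SUBARRAY; it assumes the file stores the ntx × nty interior of the global matrix, and the provided skeleton may build the view differently:

    subroutine write_global_u(u, sx, ex, sy, ey, ntx, nty, comm_2d)
      use mpi
      implicit none
      integer, intent(in) :: sx, ex, sy, ey, ntx, nty, comm_2d
      double precision, dimension(sx-1:ex+1, sy-1:ey+1), intent(in) :: u
      integer :: type_local, type_global, fh, code
      integer, dimension(2) :: shape_local, shape_interior, shape_global, start
      integer(kind=MPI_OFFSET_KIND) :: disp
      integer, dimension(MPI_STATUS_SIZE) :: status

      ! Type describing the interior of the local array (the ghost cells are skipped)
      shape_local    = [ex - sx + 3, ey - sy + 3]
      shape_interior = [ex - sx + 1, ey - sy + 1]
      start          = [1, 1]                       ! zero-based, skips one ghost layer
      call MPI_TYPE_CREATE_SUBARRAY(2, shape_local, shape_interior, start, &
           MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, type_local, code)
      call MPI_TYPE_COMMIT(type_local, code)

      ! View: the sub-domain's block inside the ntx x nty global matrix
      shape_global = [ntx, nty]
      start        = [sx - 1, sy - 1]               ! zero-based global position
      call MPI_TYPE_CREATE_SUBARRAY(2, shape_global, shape_interior, start, &
           MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, type_global, code)
      call MPI_TYPE_COMMIT(type_global, code)

      call MPI_FILE_OPEN(comm_2d, "data.dat", MPI_MODE_WRONLY + MPI_MODE_CREATE, &
           MPI_INFO_NULL, fh, code)
      disp = 0
      call MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION, type_global, &
           "native", MPI_INFO_NULL, code)
      ! One collective call writes the whole distributed matrix
      call MPI_FILE_WRITE_ALL(fh, u, 1, type_local, status, code)
      call MPI_FILE_CLOSE(fh, code)

      call MPI_TYPE_FREE(type_local, code)
      call MPI_TYPE_FREE(type_global, code)
    end subroutine write_global_u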