Parallel Programming: for Multicore and Cluster Systems - P45
Innovations in hardware architecture, like hyper-threading or multicore processors, mean that parallel computing resources are available for inexpensive desktop computers. In only a few years, many standard software products will be based on concepts of parallel programming implemented on such hardware, and the range of applications will be much broader than that of scientific computing, up to now the main application area for parallel computing.

Conjugate Gradient Method

... processor performs the arithmetic operations locally, and the vector $x_{k+1}$ results in a blockwise distribution.

(4) The axpy-operation $g_{k+1} = g_k + \alpha_k w_k$ is computed analogously to computation step (3), and the result vector $g_{k+1}$ is distributed in a blockwise way.

(5) The scalar product $\gamma_{k+1} = g_{k+1}^T g_{k+1}$ is computed analogously to computation step (2). The resulting scalar value $\beta_k$ is computed by the root processor of a single-accumulation operation and then broadcast to all other processors.

(6) The axpy-operation $d_{k+1} = -g_{k+1} + \beta_k d_k$ is computed analogously to computation step (3). The result vector $d_{k+1}$ has a blockwise distribution.

Parallel Execution Time

The parallel execution time of one iteration step of the CG method is the sum of the parallel execution times of the basic operations involved. We derive the parallel execution time for $p$ processors, where $n$ is the system size. It is assumed that $n$ is a multiple of $p$. The parallel execution time of one axpy-operation is given by

\[ T_{axpy} = 2 \cdot \frac{n}{p} \cdot t_{op} , \]

since each processor computes $n/p$ components and the computation of each component needs one multiplication and one addition. As in earlier sections, the time for one arithmetic operation is denoted by $t_{op}$. The parallel execution time of a scalar product is

\[ T_{scal\_prod} = \left( 2 \cdot \frac{n}{p} - 1 \right) \cdot t_{op} + T_{acc}(+)(p, 1) + T_{sb}(p, 1) , \]

where $T_{acc}(op)(p, m)$ denotes the communication time of a single-accumulation operation with reduction operation $op$ on $p$ processors and message size $m$.
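The blockwise scalar product used in steps (2) and (5) can be sketched as a small sequential simulation. This is only an illustration under stated assumptions, not a message-passing program: the function name `blockwise_scalar_product`, the loop over simulated processor ranks, and the per-processor operation counter are all hypothetical and introduced here for clarity. The final `sum` stands in for the single-accumulation at the root, and returning the value stands in for the broadcast.

```python
def blockwise_scalar_product(x, y, p):
    """Simulate a scalar product of x and y distributed blockwise over p processors.

    Returns the scalar product and, per simulated processor, the number of
    local arithmetic operations. Assumes len(x) is a positive multiple of p.
    """
    n = len(x)
    if n == 0 or len(y) != n or n % p != 0:
        raise ValueError("need equal-length vectors with n > 0 and n a multiple of p")

    block = n // p
    partial_sums = []
    local_op_counts = []

    for rank in range(p):
        lo, hi = rank * block, (rank + 1) * block
        # n/p multiplications on this simulated processor
        products = [x[i] * y[i] for i in range(lo, hi)]
        # n/p - 1 additions to combine the local products
        local = products[0]
        for value in products[1:]:
            local += value
        partial_sums.append(local)
        local_op_counts.append(block + (block - 1))

    # Stand-in for the single-accumulation (sum reduction) at the root;
    # the returned value plays the role of the broadcast result.
    total = sum(partial_sums)
    return total, local_op_counts
```

For example, `blockwise_scalar_product([1, 2, 3, 4], [5, 6, 7, 8], 2)` splits each vector into two blocks of two components, so each simulated processor performs two multiplications and one addition before the partial sums are combined.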
The computation of the local scalar products with $n/p$ components requires $n/p$ multiplications and $n/p - 1$ additions. The distribution of the result of the parallel scalar product, which is a scalar value, i.e., has size 1, needs the time of a single-broadcast operation $T_{sb}(p, 1)$. The matrix-vector multiplication needs time

\[ T_{math\_vec\_mult} = 2 \cdot \frac{n^2}{p} \cdot t_{op} , \]

since each processor computes $n/p$ scalar products. The total computation time of the CG method is

\[ T_{CG} = T_{mb}(p, n/p) + T_{math\_vec\_mult} + 2 \cdot T_{scal\_prod} + 3 \cdot T_{axpy} . \]
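The cost model for one CG iteration can be evaluated with a short helper. This is a minimal sketch under stated assumptions: the function name `cg_iteration_time` is hypothetical, the communication times are passed in as caller-supplied functions of `(p, m)` because their concrete form depends on the network, and the matrix-vector term is taken as 2 * n^2 / p arithmetic operations (each processor computing n/p scalar products of length n).

```python
def cg_iteration_time(n, p, t_op, t_acc, t_sb, t_mb):
    """Estimate the parallel time of one CG iteration step.

    n     : system size (assumed to be a multiple of p)
    p     : number of processors
    t_op  : time for one arithmetic operation
    t_acc : function (p, m) -> time of a single-accumulation (sum reduction)
    t_sb  : function (p, m) -> time of a single-broadcast
    t_mb  : function (p, m) -> time of a multi-broadcast
    The three communication functions are hypothetical placeholders.
    """
    if n % p != 0:
        raise ValueError("n is assumed to be a multiple of p")

    local = n // p  # number of vector components per processor

    # One multiplication and one addition per local component
    t_axpy = 2 * local * t_op

    # Local products and additions, then accumulate and broadcast a scalar
    t_scal_prod = (2 * local - 1) * t_op + t_acc(p, 1) + t_sb(p, 1)

    # Assumed form: n/p scalar products of length n per processor
    t_mat_vec = 2 * n * local * t_op

    # Multi-broadcast of blocks of size n/p, one matrix-vector product,
    # two scalar products, and three axpy-operations
    return t_mb(p, local) + t_mat_vec + 2 * t_scal_prod + 3 * t_axpy
```

With all communication costs set to zero, for instance `cg_iteration_time(8, 2, 1, zero, zero, zero)` where `zero = lambda p, m: 0`, the function returns only the arithmetic part of the model, which makes it easy to see how much of the estimate comes from computation and how much from communication.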