国内优秀的响应式网站,做视频网站视频,网站建设工期及预算,做网站一般要多少钱一、Floyd-Warshall算法
介绍
Floyd-Warshall算法#xff08;英语#xff1a;Floyd-Warshall algorithm#xff09;#xff0c;中文亦称弗洛伊德算法或佛洛依德算法#xff0c;是解决任意两点间的最短路径的一种算法#xff0c;可以正确处理有向图或负权#xff08;但…一、Floyd-Warshall算法
介绍
Floyd-Warshall算法英语Floyd-Warshall algorithm中文亦称弗洛伊德算法或佛洛依德算法是解决任意两点间的最短路径的一种算法可以正确处理有向图或负权但不可存在负权回路的最短路径问题同时也被用于计算有向图的闭包传递。
原理
其本质为动态规划给定有向图图 G ( V , E ) G (V, E) G(V,E)其中 V ( v e r t i c e s ) V(vertices) V(vertices)为顶点数 E ( e d g e s ) E(edges) E(edges)为边数并给出初始权重矩阵 w [ i ] [ j ] w[i][j] w[i][j]表示顶点 i → j i \rightarrow j i→j的权重其表达式为 w i , j { weight of edge ( i , j ) if ( i , j ) ∈ E ; ∞ if ( i , j ) ∉ E . \left.w_{i,j}\left\{\begin{array}{ll}\text{weight of edge}\left(i,j\right)\text{if}\left(i,j\right)\in E;\\\infty\text{if}\left(i,j\right)\notin E.\end{array}\right.\right. wi,j{weight of edge(i,j)∞if(i,j)∈E;if(i,j)∈/E. 即对于 i → j i \rightarrow j i→j未连通的边通常设置为一个无穷大的数 I N F INF INF对于动态规划算法需要定义状态 D i , j , k D_{i,j,k} Di,j,k:从 i i i到 j j j只以( 1.. k 1..k 1..k)集合中的节点为中间节点的最短路径的长度则可分为以下2种情况讨论 如果最短路经过点 k k k D i , j , k D i , k , k − 1 D k , j , k − 1 . D_{i,j,k}D_{i,k,k-1}D_{k,j,k-1}. Di,j,kDi,k,k−1Dk,j,k−1. 若最短路径不经过点 k k k: D i , j , k D i , j , k − 1 。 D_{i,j,k}D_{i,j,k-1\text{ 。 }} Di,j,kDi,j,k−1 。
若不能理解 k − 1 k - 1 k−1的含义则可理解为下一层 k k k的状态需要上一层 k − 1 k - 1 k−1推导出(因为要逐个枚举中间节点例如 D 1 , 3 D 1 , 2 D 2 , 3 D_{1,3} D_{1,2} D_{2,3} D1,3D1,2D2,3那么需要保证 D 1 , 2 , D 2 , 3 D_{1,2},D_{2,3} D1,2,D2,3是对应的最短距离才能导致 D 1 , 3 D_{1,3} D1,3是1号节点到3号节点的最短距离)即第 k k k层状态依赖于第 k − 1 k-1 k−1层状态故不可对 k k k层循环做并行化处理最后可以得到状态转移方程: D i , j , k min ( D i , j , k − 1 , D i , k , k − 1 D k , j , k − 1 ) D_{i,j,k}\min(D_{i,j,k-1},D_{i,k,k-1}D_{k,j,k-1}) Di,j,kmin(Di,j,k−1,Di,k,k−1Dk,j,k−1) 在实际算法中为了节约空间可以直接在原来空间上进行迭代这样空间可降至二维。
分析
时间复杂度 O ( V 3 ) O(V^3) O(V3)其中 V V V是点集对于 i , j i,j i,j两层for循环可使用OpenMP优化到线性空间复杂度 O ( V 2 ) O(V^2) O(V2)
二、CPU-GPU并行化Floyd-APSP算法
为了求到全部的最短路径不仅需要计算最短路径距离矩阵 D D D还需要计算最短路径构造矩阵 C C C。其中 C C C矩阵的定义为如果在顶点 i i i和顶点 j j j之间至少存在一条最短路径则 C i , j C_{i,j} Ci,j表示最短路径上编号最高的中间顶点否则为undefined (NULL)。构造矩阵的初值都是未定义的用数学表示如下 c i , j ( k ) { NULL i f k 0 ; k i f k ≥ 1 a n d d i , j ( k − 1 ) d i , k ( k − 1 ) d k , j ( k − 1 ) ; c i , j ( k − 1 ) otherwise. , \left.c_{i,j}^{(k)}\left\{\begin{array}{ll}\text{NULL}\mathrm{if~}k0;\\k\mathrm{if~}k\geq1\mathrm{~and~}d_{i,j}^{(k-1)}d_{i,k}^{(k-1)}d_{k,j}^{(k-1)};\\c_{i,j}^{(k-1)}\text{otherwise.}\end{array}\right.\right., ci,j(k)⎩ ⎨ ⎧NULLkci,j(k−1)if k0;if k≥1 and di,j(k−1)di,k(k−1)dk,j(k−1);otherwise., 其中 C i , j k − 1 C_{i,j}^{k-1} Ci,jk−1与上相同由于下一层受到上一层的制约为递推关系。
Algorithm1: Floyd-Warshall
Floyd-Warshall算法用于计算最短路径距离矩阵 D i , j D_{i,j} Di,j和最短路径构造矩阵 C i , j C_{i,j} Ci,j Algorithm2:
输出一对顶点 ( i , j ) (i, j) (i,j)之间最短路径的中间顶点的递归过程 可以想象为二叉树一边是往左子树遍历一边是往右子树遍历即左根右的中序遍历。
分块联合算法
该算法是为在CPU-GPU混合系统上实现高GPU利用率的快速APSP解决方案而设计的。
在分块联合算法中将 n × n n \times n n×n的距离矩阵 D i , j D_{i,j} Di,j和构造矩阵 C i , j C_{i,j} Ci,j划分为 b × b b \times b b×b的子矩阵的分块其中 b b b为分块因子为以下问题讨论方便假设 n % b 0 n \% b 0 n%b0即 n n n能整除 b b b并在每个块内有定义 A I , J a [ i , j ] A_{I, J} a[i, j] AI,Ja[i,j]来标识块索引为 ( I , J ) (I,J) (I,J)的子矩阵用数学符号表示为 1 ≤ I , J ≤ [ n b ] , 1 ≤ i , j ≤ b 1 \leq I, J \leq [\frac{n}{b}] , \\ 1 \leq i,j \leq b 1≤I,J≤[bn],1≤i,j≤b 如下图所示展现了 n 12 n 12 n12矩阵的示例其中 b 3 b3 b3 Algorithm3
针对APSP问题的分块联合算法 将该算法划分4个阶段为
首先将 n × n n \times n n×n的矩阵分解为长度为 [ n b ] × [ n b ] [\frac{n}{b}] \times [\frac{n}{b}] [bn]×[bn]的以 b × b b \times b b×b的矩阵并外层枚举节点 ( K , K ) (K, K) (K,K)其中 1 ≤ K ≤ [ n b ] 1 \leq K \leq [\frac{n}{b}] 1≤K≤[bn]并在子矩阵 b × b b \times b b×b中使用Floyd-WarShall方法求解 D K , K , C K , K D_{K, K}, C_{K,K} DK,K,CK,K。对节点 ( K , K ) (K, K) (K,K)所在的第 K K K列进行MMA即矩阵乘法加法操作对节点 ( K , K ) (K, K) (K,K)所在的第 K K K行进行MMA即矩阵乘法加法操作对于除以上涉及到的剩余节点
Algorithm4
APSP子问题的阻塞联合算法区别在于算法4的4-16行运行在GPU中算法4的合并操作17-20运行在CPU中。 Algorithm5 子分块联合APSP的矩阵-矩阵乘-加算法 代数中的MMA算法可以扩展为同时计算路径矩阵 D i , j D_{i,j} Di,j和构造矩阵 C i , j C_{i,j} Ci,j z i , j ← min ( z i , j , ∑ k 1 b x i , k y k , j ) z_{i,j} \leftarrow \min(z_{i,j}, \sum_{k1}^b x_{i,k}y_{k,j}) zi,j←min(zi,j,k1∑bxi,kyk,j)
其中 z i , j z_{i,j} zi,j为 b × b b \times b b×b的子矩阵。 Algorithm6 阶段2-阶段4可以使用矩阵乘法更新在本问题中就是极小加代数。极小加代数的乘法和加法是分离执行的极小加操作(MINPLUS)是运行在GPU中矩阵加(MMA)运行在CPU中。 这个操作减少了 Z C ZC ZC从CPU到GPU的数据传输也就允许了CPU和GPU之间的高速通信。 Code
以下代码均来自https://github.com/EricLu1218/Parallel_Programming
模拟GPU的串行
#include stdio.h
#include stdlib.hconst int INF ((1 30) - 1);
const int V 50010;
void input(char *inFileName);
void output(char *outFileName);void block_FW(int B);
int ceil(int a, int b);
void cal(int B, int Round, int block_start_x, int block_start_y, int block_width, int block_height);int n, m;
static int Dist[V][V];int main(int argc, char *argv[])
{input(argv[1]);int B 512;block_FW(B);output(argv[2]);return 0;
}void input(char *infile)
{FILE *file fopen(infile, rb);fread(n, sizeof(int), 1, file);fread(m, sizeof(int), 1, file);for (int i 0; i n; i){for (int j 0; j n; j){if (i j){Dist[i][j] 0;}else{Dist[i][j] INF;}}}int pair[3];for (int i 0; i m; i){fread(pair, sizeof(int), 3, file);Dist[pair[0]][pair[1]] pair[2];}fclose(file);
}void output(char *outFileName)
{FILE *outfile fopen(outFileName, w);for (int i 0; i n; i){for (int j 0; j n; j){if (Dist[i][j] INF)Dist[i][j] INF;}fwrite(Dist[i], sizeof(int), n, outfile);}fclose(outfile);
}int ceil(int a, int b) { return (a b - 1) / b; }void block_FW(int B)
{int round ceil(n, B);for (int r 0; r round; r){printf(%d %d\n, r, round);fflush(stdout);/* Phase 1 */cal(B, r, r, r, 1, 1);/* Phase 2 */cal(B, r, r, 0, r, 1);cal(B, r, r, r 1, round - r - 1, 1);cal(B, r, 0, r, 1, r);cal(B, r, r 1, r, 1, round - r - 1);/* Phase 3 */cal(B, r, 0, 0, r, r);cal(B, r, 0, r 1, round - r - 1, r);cal(B, r, r 1, 0, r, round - r - 1);cal(B, r, r 1, r 1, round - r - 1, round - r - 1);}
}void cal(int B, int Round, int block_start_x, int block_start_y, int block_width, int block_height)
{int block_end_x block_start_x block_height;int block_end_y block_start_y block_width;for (int b_i block_start_x; b_i block_end_x; b_i){for (int b_j block_start_y; b_j block_end_y; b_j){// To calculate B*B elements in the block (b_i, b_j)// For each block, it need to compute B timesfor (int k Round * B; k (Round 1) * B k n; k){// To calculate original index of elements in the block (b_i, b_j)// For instance, original index of (0,0) in block (1,2) is (2,5) for V6,B2int block_internal_start_x b_i * B;int block_internal_end_x (b_i 1) * B;int block_internal_start_y b_j * B;int block_internal_end_y (b_j 1) * B;if (block_internal_end_x n)block_internal_end_x n;if (block_internal_end_y n)block_internal_end_y n;for (int i block_internal_start_x; i block_internal_end_x; i){for (int j block_internal_start_y; j block_internal_end_y; j){if (Dist[i][k] Dist[k][j] Dist[i][j]){Dist[i][j] Dist[i][k] Dist[k][j];}}}}}}
}
单GPU的CUDA代码
#include math.h
#include stdio.h
#include stdlib.h
#include time.hconst int INF (1 30) - 1;
int vertex_num, edge_num, matrix_size;
int *dist;double cal_time(struct timespec start, struct timespec end)
{struct timespec temp;if ((end.tv_nsec - start.tv_nsec) 0){temp.tv_sec end.tv_sec - start.tv_sec - 1;temp.tv_nsec 1000000000 end.tv_nsec - start.tv_nsec;}else{temp.tv_sec end.tv_sec - start.tv_sec;temp.tv_nsec end.tv_nsec - start.tv_nsec;}return temp.tv_sec (double)temp.tv_nsec / 1000000000.0;
}__device__ __host__ size_t index_convert(int i, int j, int row_size)
{return i * row_size j;
}void input(char *input_file_path, int block_factor)
{FILE *input_file fopen(input_file_path, rb);fread(vertex_num, sizeof(int), 1, input_file);fread(edge_num, sizeof(int), 1, input_file);matrix_size ceil((double)vertex_num / (double)block_factor) * block_factor;cudaMallocHost((void **)dist, matrix_size * matrix_size * sizeof(int));for (int i 0; i matrix_size; i){for (int j 0; j matrix_size; j){if (i ! j)dist[index_convert(i, j, matrix_size)] INF;else if (i vertex_num)dist[index_convert(i, j, matrix_size)] 0;elsedist[index_convert(i, j, matrix_size)] INF;}}int data[3];for (int i 0; i edge_num; i){fread(data, sizeof(int), 3, input_file);dist[index_convert(data[0], data[1], matrix_size)] data[2];}fclose(input_file);
}void output(char *output_file_path)
{FILE *output_file fopen(output_file_path, w);for (int i 0; i vertex_num; i){fwrite(dist[index_convert(i, 0, matrix_size)], sizeof(int), vertex_num, output_file);}fclose(output_file);
}__constant__ int size[3]; //matrix size, block_factor, grid_size__global__ void phase1(int *d_dist, int round)
{__shared__ int share[4 * 1024];int i threadIdx.y;int j threadIdx.x;int i_offset size[1] * round;int j_offset size[1] * round;share[index_convert(j, i, size[1])] d_dist[index_convert(i_offset i, j_offset j, size[0])];
#pragma unroll 32for (int k 0; k size[1]; k){__syncthreads();if (share[index_convert(j, i, size[1])] share[index_convert(j, k, size[1])] share[index_convert(k, i, size[1])])share[index_convert(j, i, size[1])] share[index_convert(j, k, size[1])] share[index_convert(k, i, size[1])];}d_dist[index_convert(i_offset i, j_offset j, size[0])] share[index_convert(j, i, size[1])];
}__global__ void phase2(int *d_dist, int round)
{__shared__ int share[3 * 4 * 1024];int i threadIdx.y;int j threadIdx.x;int i_offset, j_offset;if (blockIdx.x 0){i_offset size[1] * ((round blockIdx.y 1) % size[2]);j_offset size[1] * round;share[index_convert(i, j, size[1])] d_dist[index_convert(i_offset i, j_offset j, size[0])];share[index_convert(i size[1], j, size[1])] share[index_convert(i, j, size[1])];share[index_convert(i 2 * size[1], j, size[1])] d_dist[index_convert(j_offset i, j_offset j, size[0])];}else{i_offset size[1] * round;j_offset size[1] * ((round blockIdx.y 1) % size[2]);share[index_convert(i, j, size[1])] d_dist[index_convert(i_offset i, j_offset j, size[0])];share[index_convert(i size[1], j, size[1])] d_dist[index_convert(i_offset i, i_offset j, size[0])];share[index_convert(i 2 * size[1], j, size[1])] share[index_convert(i, j, size[1])];}#pragma unroll 32for (int k 0; k size[1]; k){__syncthreads();if (share[index_convert(i, j, size[1])] share[index_convert(i size[1], k, size[1])] share[index_convert(k 2 * size[1], j, size[1])])share[index_convert(i, j, size[1])] share[index_convert(i size[1], k, size[1])] share[index_convert(k 2 * size[1], j, size[1])];}d_dist[index_convert(i_offset i, j_offset j, size[0])] share[index_convert(i, j, size[1])];
}__global__ void phase3(int *d_dist, int round)
{__shared__ int share[3 * 4 * 1024];int i threadIdx.y;int j threadIdx.x;int i_offset size[1] * ((round blockIdx.y 1) % size[2]);int j_offset size[1] * ((round blockIdx.x 1) % size[2]);int r_offset size[1] * round;share[index_convert(i, j, size[1])] d_dist[index_convert(i_offset i, j_offset j, size[0])];share[index_convert(i size[1], j, size[1])] d_dist[index_convert(i_offset i, r_offset j, size[0])];share[index_convert(i 2 * size[1], j, size[1])] d_dist[index_convert(r_offset i, j_offset j, size[0])];
#pragma unroll 32for (int k 0; k size[1]; k){__syncthreads();if (share[index_convert(i, j, size[1])] share[index_convert(i size[1], k, size[1])] share[index_convert(k 2 * size[1], j, size[1])])share[index_convert(i, j, size[1])] share[index_convert(i size[1], k, size[1])] share[index_convert(k 2 * size[1], j, size[1])];}d_dist[index_convert(i_offset i, j_offset j, size[0])] share[index_convert(i, j, size[1])];
}int main(int argc, char **argv)
{double total_time, bfd_time;timespec total_time1, total_time2, bfd_time1, bfd_time2;clock_gettime(CLOCK_MONOTONIC, total_time1);cudaSetDevice(0); // 设置运行的为第0块GPUint block_factor 32;if (argc 4)block_factor atoi(argv[3]);input(argv[1], block_factor); // 读取数据并初始化distint grid_size matrix_size / block_factor; // 划分后的网格大小N [n / b]int size_info[3] {matrix_size, block_factor, grid_size}; // n, b, N [n / b]cudaMemcpyToSymbol(size, size_info, 3 * sizeof(int)); // 将矩阵大小、块大小和网格大小的信息传递给CUDA设备int *d_dist;clock_gettime(CLOCK_MONOTONIC, bfd_time1);cudaMalloc(d_dist, (size_t)sizeof(int) * matrix_size * matrix_size); // 在GPU上分配内存// 在GPU上分配和复制内存将距离矩阵dist从主机CPU内存拷贝到设备GPU内存cudaMemcpy(d_dist, dist, (size_t)sizeof(int) * matrix_size * matrix_size, cudaMemcpyHostToDevice);// 定义了CUDA的线程块和网格的维度dim3 block(block_factor, block_factor); // (b, b)dim3 grid2(2, grid_size - 1); // (2, N - 1)dim3 grid3(grid_size - 1, grid_size - 1); // (N - 1, N - 1)for (int r 0; r grid_size; r){phase11, block(d_dist, r);phase2grid2, block(d_dist, r);phase3grid3, block(d_dist, r);}cudaMemcpy(dist, d_dist, (size_t)sizeof(int) * matrix_size * matrix_size, cudaMemcpyDeviceToHost);clock_gettime(CLOCK_MONOTONIC, bfd_time2);output(argv[2]);cudaFree(d_dist);cudaFree(dist);clock_gettime(CLOCK_MONOTONIC, total_time2);bfd_time cal_time(bfd_time1, bfd_time2);total_time cal_time(total_time1, total_time2);printf( vertex: %d\n, vertex_num);printf( I/O time: %.5f\n, total_time - bfd_time);printf( cal time: %.5f\n, bfd_time);printf( runtime: %.5f\n, total_time);return 0;
}2个GPU代码
#include math.h
#include omp.h
#include stdio.h
#include stdlib.h
#include time.hconst int INF (1 30) - 1;
int vertex_num, edge_num, matrix_size;
int *dist;double cal_time(struct timespec start, struct timespec end)
{struct timespec temp;if ((end.tv_nsec - start.tv_nsec) 0){temp.tv_sec end.tv_sec - start.tv_sec - 1;temp.tv_nsec 1000000000 end.tv_nsec - start.tv_nsec;}else{temp.tv_sec end.tv_sec - start.tv_sec;temp.tv_nsec end.tv_nsec - start.tv_nsec;}return temp.tv_sec (double)temp.tv_nsec / 1000000000.0;
}__device__ __host__ size_t index_convert(int i, int j, int row_size)
{return i * row_size j;
}void input(char *input_file_path, int block_factor)
{FILE *input_file fopen(input_file_path, rb);fread(vertex_num, sizeof(int), 1, input_file);fread(edge_num, sizeof(int), 1, input_file);matrix_size ceil((double)vertex_num / (double)block_factor) * block_factor;cudaMallocHost((void **)dist, matrix_size * matrix_size * sizeof(int));for (int i 0; i matrix_size; i){for (int j 0; j matrix_size; j){if (i ! j)dist[index_convert(i, j, matrix_size)] INF;else if (i vertex_num)dist[index_convert(i, j, matrix_size)] 0;elsedist[index_convert(i, j, matrix_size)] INF;}}int data[3];for (int i 0; i edge_num; i){fread(data, sizeof(int), 3, input_file);dist[index_convert(data[0], data[1], matrix_size)] data[2];}fclose(input_file);
}void output(char *output_file_path)
{FILE *output_file fopen(output_file_path, w);for (int i 0; i vertex_num; i){fwrite(dist[index_convert(i, 0, matrix_size)], sizeof(int), vertex_num, output_file);}fclose(output_file);
}__constant__ int size[3]; //matrix size, block_factor, grid_size__global__ void phase1(int *d_dist, int round)
{__shared__ int pivot[1024];int i threadIdx.y;int j threadIdx.x;int i_offset 32 * round;int j_offset 32 * round;pivot[index_convert(i, j, 32)] d_dist[index_convert(i_offset i, j_offset j, size[0])];
#pragma unroll 32for (int k 0; k 32; k){__syncthreads();if (pivot[index_convert(i, j, 32)] pivot[index_convert(i, k, 32)] pivot[index_convert(k, j, 32)])pivot[index_convert(i, j, 32)] pivot[index_convert(i, k, 32)] pivot[index_convert(k, j, 32)];}d_dist[index_convert(i_offset i, j_offset j, size[0])] pivot[index_convert(i, j, 32)];
}__global__ void phase2(int *d_dist, int round)
{__shared__ int self[1024], pivot[1024];int i threadIdx.y;int j threadIdx.x;int i_offset, j_offset;if (blockIdx.x 0 blockIdx.y ! round){i_offset 32 * blockIdx.y;j_offset 32 * round;self[index_convert(i, j, 32)] d_dist[index_convert(i_offset i, j_offset j, size[0])];pivot[index_convert(i, j, 32)] d_dist[index_convert(j_offset i, j_offset j, size[0])];
#pragma unroll 32for (int k 0; k 32; k){__syncthreads();if (self[index_convert(i, j, 32)] self[index_convert(i, k, 32)] pivot[index_convert(k, j, 32)])self[index_convert(i, j, 32)] self[index_convert(i, k, 32)] pivot[index_convert(k, j, 32)];}d_dist[index_convert(i_offset i, j_offset j, size[0])] self[index_convert(i, j, 32)];}else if (blockIdx.y ! round){i_offset 32 * round;j_offset 32 * blockIdx.y;self[index_convert(i, j, 32)] d_dist[index_convert(i_offset i, j_offset j, size[0])];pivot[index_convert(i, j, 32)] d_dist[index_convert(i_offset i, i_offset j, size[0])];
#pragma unroll 32for (int k 0; k 32; k){__syncthreads();if (self[index_convert(i, j, 32)] pivot[index_convert(i, k, 32)] self[index_convert(k, j, 32)])self[index_convert(i, j, 32)] pivot[index_convert(i, k, 32)] self[index_convert(k, j, 32)];}d_dist[index_convert(i_offset i, j_offset j, size[0])] self[index_convert(i, j, 32)];}
}__global__ void phase3(int *d_dist, int round, int grid_offset)
{__shared__ int col[1024], row[1024];int self;int block_i grid_offset blockIdx.y;int block_j blockIdx.x;if (block_i round || block_j round)return;int i threadIdx.y;int j threadIdx.x;int i_offset 32 * block_i;int j_offset 32 * block_j;int r_offset 32 * round;self d_dist[index_convert(i_offset i, j_offset j, size[0])];col[index_convert(i, j, 32)] d_dist[index_convert(i_offset i, r_offset j, size[0])];row[index_convert(i, j, 32)] d_dist[index_convert(r_offset i, j_offset j, size[0])];#pragma unroll 32for (int k 0; k 32; k){__syncthreads();if (self col[index_convert(i, k, 32)] row[index_convert(k, j, 32)])self col[index_convert(i, k, 32)] row[index_convert(k, j, 32)];}d_dist[index_convert(i_offset i, j_offset j, size[0])] self;
}int main(int argc, char **argv)
{const int block_factor 32, device_num 2;input(argv[1], block_factor);int grid_size matrix_size / block_factor;int *d_dist[2];
#pragma omp parallel num_threads(device_num){int device_id omp_get_thread_num();cudaSetDevice(device_id);int size_info[3] {matrix_size, block_factor, grid_size};cudaMemcpyToSymbol(size, size_info, 3 * sizeof(int));int grid_partition grid_size / device_num;int grid_offset device_id * grid_partition;int grid_count grid_partition;if (device_id device_num - 1)grid_count grid_size % device_num;size_t grid_start grid_offset * block_factor * matrix_size;cudaMalloc((d_dist[device_id]), (size_t)sizeof(int) * matrix_size * matrix_size);
#pragma omp barriercudaMemcpy((d_dist[device_id][grid_start]), (dist[grid_start]),(size_t)sizeof(int) * block_factor * grid_count * matrix_size, cudaMemcpyHostToDevice);dim3 block(block_factor, block_factor);dim3 grid2(2, grid_size);dim3 grid3(grid_size, grid_count);for (int r 0; r grid_size; r){if (grid_offset r r grid_offset grid_count){size_t copy_start r * block_factor * matrix_size;if (device_id 0)cudaMemcpy((d_dist[1][copy_start]), (d_dist[0][copy_start]),(size_t)sizeof(int) * block_factor * matrix_size, cudaMemcpyDeviceToDevice);elsecudaMemcpy((d_dist[0][copy_start]), (d_dist[1][copy_start]),(size_t)sizeof(int) * block_factor * matrix_size, cudaMemcpyDeviceToDevice);}
#pragma omp barrierphase11, block(d_dist[device_id], r);phase2grid2, block(d_dist[device_id], r);phase3grid3, block(d_dist[device_id], r, grid_offset);}cudaMemcpy((dist[grid_start]), (d_dist[device_id][grid_start]),(size_t)sizeof(int) * block_factor * grid_count * matrix_size, cudaMemcpyDeviceToHost);cudaFree(d_dist[omp_get_thread_num()]);
#pragma omp barrier}output(argv[2]);cudaFree(dist);return 0;
}Reference
https://zh.wikipedia.org/wiki/Floyd-Warshall%E7%AE%97%E6%B3%95Blocked United Algorithm for the All-Pairs Shortest Paths Problem on Hybrid CPU-GPU Systemshttps://github.com/EricLu1218/Parallel_Programming