f# - Floyd Warshall using Alea GPU

I have been trying to write a parallel Floyd-Warshall algorithm in F# with Alea GPU, basing it on the CUDA code another user posted here:

The Floyd-Warshall algorithm in CUDA

I wrote the following simple implementation:

type FWModule<'T>(target:GPUModuleTarget, tileDim:int) =
    inherit GPUModule(target)

    [<Kernel;ReflectedDefinition>]
    member this.FloydWKernel (width:int) (k:int) (data:deviceptr<float>) =

        let col = blockIdx.x * blockDim.x + threadIdx.x
        let row = blockIdx.y

        if col >= width then () //out of bounds
        let index = width * row + col
        let best = __shared__.Variable<float>()

        if threadIdx.x = 0 then best := data.[width*row+k]
        __syncthreads()

        let tmp = data.[k*width+col]

        let candidate = !best + tmp
        data.[index] <- min data.[index] candidate

    member this.LaunchParams width =
        let blockdim = dim3(tileDim)
        let griddim = dim3(divup width tileDim, width)
        LaunchParam(griddim, blockdim)

    member this.FloydW (width:int) (k:int) (data:deviceptr<float>) =
        let lp = this.LaunchParams width
        this.GPULaunch <@ this.FloydWKernel @> lp width k data

    member this.FW(size:int, A:float[]) =
        use deviceArr = this.GPUWorker.Malloc(A)
        for k in 0 .. size-1 do
            this.FloydW size k deviceArr.Ptr
        deviceArr.Gather()

let tileDim = 256

let apsp = new FWModule<float>(GPUModuleTarget.DefaultWorker, tileDim)

However, when the following lines are run in fsi,
let m = [|0.0     ; 5.0     ; 9.0     ; infinity;
          infinity; 0.0     ; 1.0     ; infinity;
          infinity; infinity; 0.0     ; 2.0;
          infinity; 3.0     ; infinity; 0.0|];;

apsp.FW (4,m);;

the output is
[|0.0; 5.0; 6.0; 8.0;
  4.0; 0.0; 1.0; 3.0;
  3.0; 3.0; 0.0; 1.0;
  1.0; 1.0; 1.0; 0.0|]

which is not what it should be, given that the usual iterative, sequential floydwarshall
let floydwarshall (l:int, mat:float[]) =
    let a = Array.copy mat
    for k in 0 .. (l-1) do
        for i in 0 .. (l-1) do
            for j in 0 .. (l-1) do
                a.[i*l+j] <- min a.[i*l+j] (a.[i*l+k] + a.[k*l+j])
    a

gives me
floydwarshall (4,m);;
[|0.0     ; 5.0; 6.0; 8.0;
  infinity; 0.0; 1.0; 3.0;
  infinity; 5.0; 0.0; 2.0;
  infinity; 3.0; 4.0; 0.0|]

My question is: what is going on?

Best Answer

Here is some source code from our sample gallery, which you can find at http://www.quantalea.com/gallery.

This is the single-stage algorithm. It is not the fastest, but it is fairly easy to understand.

public static class FloydWarshallSingleStage
{
    const int BlockWidth = 16;

    /// <summary>
    /// Kernel for parallel Floyd Warshall algorithm on GPU.
    /// </summary>
    /// <param name="u">Index of the intermediate vertex u over which the paths [v1, v2] are relaxed</param>
    /// <param name="d">Matrix of shortest paths d(G)</param>
    /// <param name="p">Matrix of predecessors p(G)</param>
    public static void KernelSingleStage(int u, int[,] d, int[,] p)
    {
        var n = d.GetLength(0);
        var v1 = blockDim.y * blockIdx.y + threadIdx.y;
        var v2 = blockDim.x * blockIdx.x + threadIdx.x;

        if (v1 < n && v2 < n)
        {
            var newPath = d[v1, u] + d[u, v2];
            var oldPath = d[v1, v2];
            if (oldPath > newPath)
            {
                d[v1, v2] = newPath;
                p[v1, v2] = p[u, v2];
            }
        }
    }

    [GpuManaged]
    public static void Run(Gpu gpu, int[,] d, int[,] p)
    {
        var n = d.GetLength(0);
        var gridDim = new dim3((n - 1) / BlockWidth + 1, (n - 1) / BlockWidth + 1, 1);
        var blockDim = new dim3(BlockWidth, BlockWidth, 1);
        var lp = new LaunchParam(gridDim, blockDim);
        for (var u = 0; u < n; u++)
        {
            gpu.Launch(KernelSingleStage, lp, u, d, p);
        }
    }
}
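
For reference, a minimal driver for this class could look like the following sketch, applied to the 4-vertex example graph from the question. It is only an illustration under a few assumptions: Gpu.Default is Alea's default GPU instance, while the Inf sentinel, the predecessor initialization and the SingleStageDemo wrapper are illustrative choices and not part of the gallery sample.

using System;
using Alea;

public static class SingleStageDemo
{
    // Large-but-safe "no edge" sentinel: Inf + Inf still fits into an int.
    const int Inf = 1061109567;

    public static void Main()
    {
        const int n = 4;
        // Distance matrix of the 4-vertex example graph from the question (Inf = no edge).
        var d = new[,]
        {
            { 0,   5,   9,   Inf },
            { Inf, 0,   1,   Inf },
            { Inf, Inf, 0,   2   },
            { Inf, 3,   Inf, 0   },
        };
        // Predecessor matrix: p[i, j] = i where an edge i -> j exists, -1 otherwise.
        var p = new int[n, n];
        for (var i = 0; i < n; i++)
            for (var j = 0; j < n; j++)
                p[i, j] = (i != j && d[i, j] < Inf) ? i : -1;

        FloydWarshallSingleStage.Run(Gpu.Default, d, p);

        // d now holds all-pairs shortest path lengths, p the predecessor of j on the path i -> j.
        for (var i = 0; i < n; i++)
        {
            for (var j = 0; j < n; j++)
                Console.Write(d[i, j] >= Inf ? "   inf" : $"{d[i, j],6}");
            Console.WriteLine();
        }
    }
}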

The multi-stage version is more involved and the listing is longer. Here I paste a version that uses automatic memory management, which simplifies the code considerably but has some performance implications. The multi-stage version uses three kernels to do the work and uses tiling to improve memory access.
public class FloydWarshallMultiStage
{
    private const int None = -1;
    private const int Inf = 1061109567;

    //[GpuParam]
    //private readonly Constant<int> BlockSize;
    //[GpuParam]
    //private readonly Constant<int> ThreadSize;
    //[GpuParam]
    //private readonly Constant<int> VirtualBlockSize;

    private const int BlockSize = 16;
    private const int ThreadSize = 2;
    private const int VirtualBlockSize = BlockSize*ThreadSize;

    public FloydWarshallMultiStage(int blockSize, int threadSize)
    {
        //BlockSize = new Constant<int>(blockSize);
        //ThreadSize = new Constant<int>(threadSize);
        //VirtualBlockSize = new Constant<int>(blockSize * threadSize);
    }

    /// <summary>
    /// Kernel for parallel Floyd Warshall algorithm on GPU computing the independent blocks.
    /// </summary>
    /// <param name="block">Index of the block for which the paths [v1, v2] are relaxed</param>
    /// <param name="n">Number of vertices in the graph G:=(V,E), n := |V(G)|</param>
    /// <param name="pitch">Width to get to the next row, in number of ints</param>
    /// <param name="d">Matrix of shortest paths d(G)</param>
    /// <param name="p">Matrix of predecessors p(G)</param>
    public void KernelPhaseOne(int block, int n, int pitch, int[,] d, int[,] p)
    {
        var newPred = 0;

        var tx = threadIdx.x;
        var ty = threadIdx.y;

        var v1 = VirtualBlockSize*block + ty;
        var v2 = VirtualBlockSize*block + tx;

        var primaryD = __shared__.Array2D<int>(VirtualBlockSize, VirtualBlockSize);
        var primaryP = __shared__.Array2D<int>(VirtualBlockSize, VirtualBlockSize);

        if (v1 < n && v2 < n)
        {
            primaryD[ty, tx] = d[v1, v2];
            primaryP[ty, tx] = p[v1, v2];
            newPred = primaryP[ty, tx];
        }
        else
        {
            primaryD[ty, tx] = Inf;
            primaryP[ty, tx] = None;
        }

        DeviceFunction.SyncThreads();

        for (var i = 0; i < VirtualBlockSize; i++)
        {
            var newPath = primaryD[ty, i] + primaryD[i, tx];

            DeviceFunction.SyncThreads();

            if (newPath < primaryD[ty, tx])
            {
                primaryD[ty, tx] = newPath;
                newPred = primaryP[i, tx];
            }

            DeviceFunction.SyncThreads();

            primaryP[ty, tx] = newPred;
        }

        if (v1 < n && v2 < n)
        {
            d[v1, v2] = primaryD[ty, tx];
            p[v1, v2] = primaryP[ty, tx];
        }
    }

    /// <summary>
    /// Kernel for parallel Floyd Warshall algorithm on GPU to compute the blocks that depend on a single independent block.
    /// </summary>
    /// <param name="block">Index of the block for which the paths [v1, v2] are relaxed</param>
    /// <param name="n">Number of vertices in the graph G:=(V,E), n := |V(G)|</param>
    /// <param name="pitch">Width to get to the next row, in number of ints</param>
    /// <param name="d">Matrix of shortest paths d(G)</param>
    /// <param name="p">Matrix of predecessors p(G)</param>
    public void KernelPhaseTwo(int block, int n, int pitch, int[,] d, int[,] p)
    {
        if (blockIdx.x == block) return;

        var newPath = 0;
        var newPred = 0;

        var tx = threadIdx.x;
        var ty = threadIdx.y;

        var v1 = VirtualBlockSize*block + ty;
        var v2 = VirtualBlockSize*block + tx;

        var primaryD = __shared__.Array2D<int>(VirtualBlockSize, VirtualBlockSize);
        var currentD = __shared__.Array2D<int>(VirtualBlockSize, VirtualBlockSize);
        var primaryP = __shared__.Array2D<int>(VirtualBlockSize, VirtualBlockSize);
        var currentP = __shared__.Array2D<int>(VirtualBlockSize, VirtualBlockSize);

        if (v1 < n && v2 < n)
        {
            primaryD[ty, tx] = d[v1, v2];
            primaryP[ty, tx] = p[v1, v2];
        }
        else
        {
            primaryD[ty, tx] = Inf;
            primaryP[ty, tx] = None;
        }

        // load i-aligned singly dependent blocks
        if (blockIdx.y == 0)
        {
            v1 = VirtualBlockSize*block + ty;
            v2 = VirtualBlockSize*blockIdx.x + tx;
        }
        // load j-aligned singly dependent blocks
        else
        {
            v1 = VirtualBlockSize*blockIdx.x + ty;
            v2 = VirtualBlockSize*block + tx;
        }

        if (v1 < n && v2 < n)
        {
            currentD[ty, tx] = d[v1, v2];
            currentP[ty, tx] = p[v1, v2];
            newPred = currentP[ty, tx];
        }
        else
        {
            currentD[ty, tx] = Inf;
            currentP[ty, tx] = None;
        }

        DeviceFunction.SyncThreads();

        // compute i-aligned singly dependent blocks
        if (blockIdx.y == 0)
        {
            for (var i = 0; i < VirtualBlockSize; i++)
            {
                newPath = primaryD[ty, i] + currentD[i, tx];

                DeviceFunction.SyncThreads();

                if (newPath < currentD[ty, tx])
                {
                    currentD[ty, tx] = newPath;
                    newPred = currentP[i, tx];
                }

                DeviceFunction.SyncThreads();

                currentP[ty, tx] = newPred;
            }
        }
        // compute j-aligned singly dependent blocks
        else
        {
            for (var i = 0; i < VirtualBlockSize; i++)
            {
                newPath = currentD[ty, i] + primaryD[i, tx];

                DeviceFunction.SyncThreads();

                if (newPath < currentD[ty, tx])
                {
                    currentD[ty, tx] = newPath;
                    currentP[ty, tx] = primaryP[i, tx];
                }

                DeviceFunction.SyncThreads();
            }
        }

        if (v1 < n && v2 < n)
        {
            d[v1, v2] = currentD[ty, tx];
            p[v1, v2] = currentP[ty, tx];
        }
    }

    /// <summary>
    /// Kernel for parallel Floyd Warshall algorithm on GPU to compute the dependent blocks, based on the singly dependent blocks.
    /// </summary>
    /// <param name="block">Index of the block for which the paths [v1, v2] are relaxed</param>
    /// <param name="n">Number of vertices in the graph G:=(V,E), n := |V(G)|</param>
    /// <param name="pitch">Width to get to the next row, in number of ints</param>
    /// <param name="d">Matrix of shortest paths d(G)</param>
    /// <param name="p">Matrix of predecessors p(G)</param>
    public void KernelPhaseThree(int block, int n, int pitch, int[,] d, int[,] p)
    {
        if (blockIdx.x == block || blockIdx.y == block) return;

        var tx = threadIdx.x*ThreadSize;
        var ty = threadIdx.y*ThreadSize;

        var v1 = blockDim.y*blockIdx.y*ThreadSize + ty;
        var v2 = blockDim.x*blockIdx.x*ThreadSize + tx;

        var primaryRowD = __shared__.Array2D<int>(BlockSize*ThreadSize, BlockSize*ThreadSize);
        var primaryColD = __shared__.Array2D<int>(BlockSize*ThreadSize, BlockSize*ThreadSize);
        var primaryRowP = __shared__.Array2D<int>(BlockSize*ThreadSize, BlockSize*ThreadSize);

        var v1Row = BlockSize*block*ThreadSize + ty;
        var v2Col = BlockSize*block*ThreadSize + tx;

        // load data for virtual block
        for (var i = 0; i < ThreadSize; i++)
        {
            for (var j = 0; j < ThreadSize; j++)
            {
                var idx = tx + j;
                var idy = ty + i;

                if (v1Row + i < n && v2 + j < n)
                {
                    primaryRowD[idy, idx] = d[v1Row + i, v2 + j];
                    primaryRowP[idy, idx] = p[v1Row + i, v2 + j];
                }
                else
                {
                    primaryRowD[idy, idx] = Inf;
                    primaryRowP[idy, idx] = None;
                }

                if (v1 + i < n && v2Col + j < n)
                {
                    primaryColD[idy, idx] = d[v1 + i, v2Col + j];
                }
                else
                {
                    primaryColD[idy, idx] = Inf;
                }
            }
        }

        DeviceFunction.SyncThreads();

        // compute data for virtual block
        for (var i = 0; i < ThreadSize; i++)
        {
            for (var j = 0; j < ThreadSize; j++)
            {
                if (v1 + i < n && v2 + j < n)
                {
                    var path = d[v1 + i, v2 + j];
                    var predecessor = p[v1 + i, v2 + j];

                    var idy = ty + i;
                    var idx = tx + j;

                    for (var k = 0; k < BlockSize*ThreadSize; k++)
                    {
                        var newPath = primaryColD[idy, k] + primaryRowD[k, idx];
                        if (path > newPath)
                        {
                            path = newPath;
                            predecessor = primaryRowP[k, idx];
                        }
                    }

                    d[v1 + i, v2 + j] = path;
                    p[v1 + i, v2 + j] = predecessor;
                }
            }
        }
    }

    /// <summary>
    /// Parallel multi-stage Floyd Warshall algorithm on GPU.
    /// </summary>
    /// <param name="gpu">The GPU on which the kernels should run</param>
    /// <param name="d">Matrix of shortest paths d(G)</param>
    /// <param name="p">Matrix of predecessors p(G)</param>
    /// <param name="verbose">If true, print the launch configuration of each phase</param>
    public void Run(Gpu gpu, int[,] d, int[,] p, bool verbose = false)
    {
        var n = d.GetLength(0);
        var gridDim1 = new dim3(1, 1, 1);
        var gridDim2 = new dim3((n - 1)/VirtualBlockSize + 1, 2, 1);
        var gridDim3 = new dim3((n - 1)/VirtualBlockSize + 1, (n - 1)/VirtualBlockSize + 1, 1);

        var blockDim1 = new dim3(VirtualBlockSize, VirtualBlockSize, 1);
        var blockDim2 = new dim3(VirtualBlockSize, VirtualBlockSize, 1);
        var blockDim3 = new dim3(BlockSize, BlockSize, 1);

        var numOfBlock = (n - 1)/VirtualBlockSize + 1;
        var pitchInt = n;

        if (verbose)
        {
            Console.WriteLine($"|V| {n}");
            Console.WriteLine($"Phase 1: grid dim {gridDim1} block dim {blockDim1}");
            Console.WriteLine($"Phase 2: grid dim {gridDim2} block dim {blockDim2}");
            Console.WriteLine($"Phase 3: grid dim {gridDim3} block dim {blockDim3}");
        }

        for (var block = 0; block < numOfBlock; block++)
        {
            gpu.Launch(KernelPhaseOne, new LaunchParam(gridDim1, blockDim1), block, n, pitchInt, d, p);
            gpu.Launch(KernelPhaseTwo, new LaunchParam(gridDim2, blockDim2), block, n, pitchInt, d, p);
            gpu.Launch(KernelPhaseThree, new LaunchParam(gridDim3, blockDim3), block, n, pitchInt, d, p);
        }
    }
}
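
If you want to try the multi-stage class on the same data, a hypothetical call site (reusing the d and p matrices built in the single-stage sketch above) could look like the two lines below; note that the constructor arguments are effectively ignored in this listing because BlockSize and ThreadSize are compiled in as constants.

// Hypothetical usage; d and p are initialized as in the single-stage sketch above.
var fw = new FloydWarshallMultiStage(blockSize: 16, threadSize: 2);
fw.Run(Gpu.Default, d, p, verbose: true);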

For the version with explicit memory management, it is best to download the samples from http://www.quantalea.com/gallery and search for Floyd-Warshall.

I hope that answers the question.

The implementation is based on the following paper:

Ben Lund and Justin W. Smith, A Multi-Stage CUDA Kernel for Floyd-Warshall, 2010.

https://arxiv.org/abs/1001.4108

Regarding "f# - Floyd Warshall using Alea GPU", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36482098/
