fortran - OpenBLAS slower than the intrinsic function dot_product

Reposted by: 行者123  Updated: 2023-12-02 10:26:30

I need to compute a dot product in Fortran. I can use either the intrinsic function dot_product from Fortran or ddot from OpenBLAS. The problem is that ddot is slower. Here is my code:

Using BLAS:

program VectorBLAS
! time VectorBlas.e = 0.30s
implicit none
double precision, dimension(3)  :: b
double precision                :: result
double precision, external      :: ddot
integer, parameter              :: LargeInt_K = selected_int_kind (18)
integer (kind=LargeInt_K)        :: I

DO I = 1, 10000000
   b(:) = 3
   result = ddot(3, b, 1, b, 1)
END DO
end program VectorBLAS

Using dot_product:

program VectorModule
! time VectorModule.e = 0.19s
implicit none
double precision, dimension (3)  :: b
double precision                 :: result
integer, parameter              :: LargeInt_K = selected_int_kind (18)
integer (kind=LargeInt_K)        :: I

DO I = 1, 10000000
  b(:) = 3
  result = dot_product(b, b)
END DO
end program VectorModule

Both programs were compiled with:

gfortran file_name.f90 -lblas -o file_name.e

What am I doing wrong? Isn't BLAS supposed to be faster?

Best answer

Although BLAS (especially an optimized implementation) is generally faster for larger arrays, the built-in functions are faster for small sizes.

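This is especially evident from the linked source code of ddot, where extra work is spent on additional functionality (for example, different increments). For small array lengths, the work done there outweighs the performance gain from the optimizations.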
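To make that overhead concrete, here is a minimal sketch of the kind of bookkeeping a generic, BLAS-style dot product performs before it multiplies anything. The module `dot_sketch` and the function `sketch_ddot` are hypothetical names invented for this illustration; this is not the OpenBLAS implementation, which is considerably more elaborate, only a simplified stand-in with the same argument list as ddot.

```fortran
module dot_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Hypothetical, simplified stand-in for a generic BLAS-style dot product.
  ! It mirrors the ddot argument list (n, x, incx, y, incy).
  function sketch_ddot(n, x, incx, y, incy) result(s)
    integer, intent(in)      :: n, incx, incy
    real(real64), intent(in) :: x(*), y(*)
    real(real64)             :: s
    integer :: i, ix, iy

    s = 0.0_real64
    if (n <= 0) return                     ! argument check on every call

    if (incx == 1 .and. incy == 1) then    ! branch: contiguous fast path
      do i = 1, n
        s = s + x(i)*y(i)
      end do
    else                                   ! branch: general increments
      ix = 1
      iy = 1
      if (incx < 0) ix = (1 - n)*incx + 1  ! start at the far end for negative strides
      if (incy < 0) iy = (1 - n)*incy + 1
      do i = 1, n
        s = s + x(ix)*y(iy)
        ix = ix + incx
        iy = iy + incy
      end do
    end if
  end function sketch_ddot
end module dot_sketch

program compare_small
  use, intrinsic :: iso_fortran_env, only: real64
  use dot_sketch, only: sketch_ddot
  implicit none
  real(real64) :: b(3), r1, r2

  b = 3.0_real64
  r1 = sketch_ddot(3, b, 1, b, 1)   ! function call, checks and branches for 3 multiplies
  r2 = dot_product(b, b)            ! the compiler is free to expand this inline
  if (r1 /= r2) error stop 'results differ'
  print *, r1, r2
end program compare_small
```

For a length-3 vector the call, the argument check and the stride branch are a sizeable share of the total work, whereas the intrinsic can be expanded by the compiler at the call site. A real optimized library adds further setup on top of this, which only pays off once the vectors are long enough.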

If you make the vectors larger, the optimized version should be faster.

Here is an example to illustrate this:

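program test
  use, intrinsic :: ISO_Fortran_env, only: REAL64
  implicit none
  integer                   :: t1, t2, rate, ttot1, ttot2, i
  real(REAL64), allocatable :: a(:),b(:),c(:)
  real(REAL64), external    :: ddot

  allocate( a(100000), b(100000), c(100000) )
  ! Clock ticks per second, used to convert tick counts to seconds
  call system_clock(count_rate=rate)

  ttot1 = 0 ; ttot2 = 0
  do i=1,1000
    ! Fresh random input for every repetition (not included in the timings)
    call random_number(a)
    call random_number(b)

    ! Time the intrinsic. c is an array, so the scalar result is assigned
    ! to every element; that cost is included in both timings.
    call system_clock(t1)
    c = dot_product(a,b)
    call system_clock(t2)
    ttot1 = ttot1 + t2 - t1

    ! Time the BLAS routine on the same data
    call system_clock(t1)
    c = ddot(100000,a,1,b,1)
    call system_clock(t2)
    ttot2 = ttot2 + t2 - t1
  enddo
  ! Accumulated time in seconds over all 1000 repetitions
  print *,'dot_product: ', real(ttot1)/real(rate)
  print *,'BLAS, ddot:  ', real(ttot2)/real(rate)
end program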

The BLAS routine is quite a bit faster here:

OMP_NUM_THREADS=1 ./a.out 
 dot_product:   0.145999998    
 BLAS, ddot:    0.100000001
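To see where the two approaches cross over on a particular machine, one option is to sweep the vector length instead of fixing it. The following is a rough sketch rather than a rigorous benchmark: the program name `dot_sweep`, the list of sizes and the repetition count are arbitrary choices made for this illustration, and it assumes the program is linked against a BLAS library that provides ddot, as in the compile command from the question.

```fortran
program dot_sweep
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
  real(real64), external    :: ddot        ! provided by the linked BLAS library
  real(real64), allocatable :: a(:), b(:)
  real(real64)              :: s1, s2, sink
  integer(int64)            :: t0, t1, t2, rate
  integer                   :: k, n, rep, reps

  call system_clock(count_rate=rate)       ! 64-bit counters give finer resolution
  sink = 0.0_real64
  n = 4
  do k = 1, 7                              ! n = 4, 32, 256, ..., 1048576
    allocate(a(n), b(n))
    call random_number(a)
    call random_number(b)
    reps = max(1, 20000000 / n)            ! keep the total work roughly constant

    call system_clock(t0)
    do rep = 1, reps
      a(1) = real(rep, real64)             ! change the input so the call cannot be hoisted
      s1 = dot_product(a, b)
      sink = sink + s1
    end do
    call system_clock(t1)
    do rep = 1, reps
      a(1) = real(rep, real64)             ! same input sequence as the first loop
      s2 = ddot(n, a, 1, b, 1)
      sink = sink + s2
    end do
    call system_clock(t2)

    ! Both loops end on identical input, so the last results should agree
    ! up to rounding differences caused by the summation order.
    if (abs(s1 - s2) > 1.0e-8_real64 * max(1.0_real64, abs(s1))) error stop 'mismatch'

    print '(a,i0,a,es10.3,a,es10.3)', 'n=', n, &
          '  dot_product [s]: ', real(t1 - t0, real64) / real(rate, real64), &
          '  ddot [s]: ',        real(t2 - t1, real64) / real(rate, real64)

    deallocate(a, b)
    n = n * 8
  end do
  print *, 'checksum: ', sink              ! use the results so they are not optimized away
end program dot_sweep
```

The absolute numbers and the crossover point depend on the compiler flags, on which BLAS library the linker actually picks up, and on the hardware, so the output is best read as a trend across sizes.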

The original question, "OpenBLAS slower than intrinsic function dot_product", can be found on Stack Overflow: https://stackoverflow.com/questions/36221809/
