
c++ - How can I parallelize randomly copying the rows of one matrix into another matrix in memory?


I have a matrix, call it small_matrix, consisting of about 100,000 rows and 128 columns, stored as a flat array (I use it for CUDA computations, so space needs to be conserved). I have a much larger matrix, call it large_matrix, with 10x the number of rows and the same row length as small_matrix, and I want to populate its rows with the rows of small_matrix. However, the population process is not 1:1. There is a map array that maps each row of large_matrix to a row of small_matrix; a single row of small_matrix can map to multiple rows of large_matrix. We can assume the map array is randomly generated. Additionally, there is a small chance (say 1%) that a row of large_matrix will hold random values instead of actual values.
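
In other words, ignoring parallelism for a moment, the fill step is conceptually the serial loop sketched below (the -1 sentinel for "no source row", the 0.5 placeholder, and the name expand_matrix_serial are just for illustration; they mirror the full program further down):

#include <cstring>

// Serial reference for the fill step (sketch): both matrices are flat,
// row-major float arrays; map[i] selects the small_matrix row that fills
// row i of large_matrix, and -1 marks a row that gets placeholder values.
void expand_matrix_serial(long long large_rows, int row_length,
                          const long long* map,
                          const float* small_matrix, float* large_matrix) {
    for (long long i = 0; i < large_rows; i++) {
        long long sml = map[i];
        if (sml == -1) {
            for (int j = 0; j < row_length; j++)
                large_matrix[i * row_length + j] = 0.5f;   // placeholder row
        } else {
            std::memcpy(large_matrix + i * row_length,
                        small_matrix + sml * row_length,
                        row_length * sizeof(float));
        }
    }
}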

I am trying to optimize this process using OpenMP parallelism in C++, but I can't seem to manage it: every attempt I have made so far only increases the runtime as more threads are used instead of reducing it. Here is the code for the problem; the function I am trying to optimize is expand_matrix:

#include <stdio.h>
#include <omp.h>
#include <random>
#include <stdlib.h>
#include <cstddef>
#include <ctime>
#include <cstring>
using namespace std;

// Portable aligned allocation helpers (MSVC vs. POSIX).
inline void* aligned_malloc(size_t size, size_t align) {
    void *result;
#ifdef _MSC_VER
    result = _aligned_malloc(size, align);
#else
    if (posix_memalign(&result, align, size)) result = 0;
#endif
    return result;
}

inline void aligned_free(void *ptr) {
#ifdef _MSC_VER
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}

void expand_matrix(int num_rows_in_large_matrix, int row_length, long long* map,
                   float* small_matrix, float* large_matrix, const int num_threads);


int main() {
    int row_length = 128;
    long long small_matrix_rows = 100000;
    long long large_matrix_rows = 1000000;
    long long *map = new long long[large_matrix_rows];
    float *small_matrix = (float*)aligned_malloc(small_matrix_rows*128*sizeof(float), 128);
    float *large_matrix = (float*)aligned_malloc(large_matrix_rows*128*sizeof(float), 128);

    minstd_rand gen(std::random_device{}()); //NOTE: Valgrind will give an error saying: vex amd64->IR: unhandled instruction bytes: 0xF 0xC7 0xF0 0x89 0x6 0xF 0x42 0xC1 :: look: https://bugs.launchpad.net/ubuntu/+source/valgrind/+bug/
    uniform_real_distribution<double> values_dist(0, 1);
    uniform_int_distribution<long long> map_dist(0, small_matrix_rows - 1);

    // Fill small_matrix with random values in [-0.5, 0.5).
    for (long long i = 0; i < small_matrix_rows*row_length; i++) {
        small_matrix[i] = values_dist(gen) - 0.5;
    }
    // ~99% of large_matrix rows map to a random small_matrix row;
    // the remaining rows are marked with -1 and get placeholder values later.
    for (long long i = 0; i < large_matrix_rows; i++) {
        if (values_dist(gen) < 0.99)
            map[i] = map_dist(gen);
        else
            map[i] = -1;
    }

    clock_t start, end;
    int num_threads = 4;
    printf("Populated matrix and generated map\n");
    start = clock();
    expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads);
    end = clock();
    printf("Time to expand using %d threads = %f\n", num_threads, double(end-start)/CLOCKS_PER_SEC);

    aligned_free(large_matrix);
    aligned_free(small_matrix);
    delete[] map;
    return 0;
}



void expand_matrix(int num_rows_in_large_matrix, int row_length, long long* map,
                   float* small_matrix, float* large_matrix, const int num_threads) {

#pragma omp parallel num_threads(num_threads)
    {
#pragma omp for schedule(guided, 4)
        for (int i = 0; i < num_rows_in_large_matrix; i++) {
            long long sml = map[i];
            if (sml == -1) {
                // No source row: fill with placeholder values.
                for (int j = 0; j < row_length; j++)
                    large_matrix[i * row_length + j] = 0.5;
            } else {
                // Copy the mapped row from small_matrix.
                memcpy(large_matrix + i*row_length, small_matrix + sml*row_length,
                       row_length*sizeof(float));
            }
        }
    }
}

Here are some runtimes:

Time to expand using 1 threads = 0.402949
Time to expand using 2 threads = 0.530361
Time to expand using 4 threads = 0.608085
Time to expand using 8 threads = 0.667806
Time to expand using 16 threads = 0.999886

I have made sure the matrices are aligned in memory, and I have tried using non-temporal instructions for the copy, but I am stumped. I don't know where else to look. Any help is greatly appreciated.
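
(The non-temporal copy was along the lines of the sketch below; stream_copy_row is an illustrative helper name rather than the exact code that was run. It assumes 16-byte-aligned rows whose length is a multiple of 4 floats, which holds here since the matrices are 128-byte aligned and each row is 128 floats.)

#include <xmmintrin.h>   // _mm_load_ps, _mm_stream_ps, _mm_sfence

// Sketch of a non-temporal row copy: streaming stores write the destination
// row without pulling it into the cache hierarchy.
static void stream_copy_row(float* dst, const float* src, int row_length) {
    for (int j = 0; j < row_length; j += 4) {
        __m128 v = _mm_load_ps(src + j);   // aligned load from the source row
        _mm_stream_ps(dst + j, v);         // non-temporal store into the destination row
    }
}
// After all rows have been written, _mm_sfence() should be issued once to
// order the streaming stores.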

Some hardware information:

CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K

I am using Ubuntu 16.04 and gcc version 5.5.0 20171010 (Ubuntu 5.5.0-12ubuntu1~16.04).
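
The exact compile command is not shown above; a plausible invocation for this kind of test, assuming the source file is named expand.cpp, is:

g++ -O3 -fopenmp expand.cpp -o expand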
