gpt4 book ai didi

c++ - 计算序列似然的更快方法?

转载 作者:搜寻专家 更新时间:2023-10-31 02:16:12 25 4
gpt4 key购买 nike

这是我上一个问题的第二个问题
Faster way to do multi dimensional matrix addition?在遵循@Peter Cordes 的建议后,我将我的代码矢量化,现在速度提高了 50 倍。然后我再次执行 gprof 并发现此功能占用了大部分时间。

Each sample counts as 0.01 seconds.  %   cumulative   self              self     total            time   seconds   seconds    calls  Ts/call  Ts/call  name     69.97      1.53     1.53                             cal_score(int, std::string, int const*, int, double)
double cal_score(int l, string seq, const int *__restrict__ pw,int cluster,double alpha)
{
const int cols =4;
const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
double score = 0;
char s;
string alphabet="ACGT";
int count=0;
for(int k=0;k<cols;k++)
count=count+pwcluster[k];

for (int i = 0; i < l; i++){
long row_offset = cols*i;
s=seq[i];
//#pragma omp simd
for(int k=0;k<cols;k++) {
if (s==alphabet[k])
score=score+log( ( pwcluster[row_offset+k]+alpha )/(count+4*alpha) );
}
}
return score;
}

我是第一次做代码优化,所以不知道如何进行。那么有什么办法可以更好地编写此功能。所以我可以获得更快的速度。输入 seq 是长度为 l 的字符“ACGT”的序列。pw 是大小为 2*l*4 或 [p][q][r] 的一维数组,簇是 p。

最佳答案

这是重写它的另一种方法。这会使用查找表而不是搜索来翻译字符串,并且调用 log 的次数减少了 10 倍。

这还将 seq 更改为通过引用传递的 const char*,而不是通过值传递的 std::string。 (这将复制整个字符串)。

unsigned char transTable[128];

void InitTransTable(){
memset(transTable, 0, sizeof(transTable));
transTable['A'] = 0;
transTable['C'] = 1;
transTable['G'] = 2;
transTable['T'] = 3;
}

static int tslen = 0; // static instead of global lets the compiler keep tseq in a register inside the loop
static unsigned char* tseq = NULL; // reusable buffer for translations. Not thread-safe

double cal_score(
int l
, const unsigned char* seq // if you want to pass a std::string, do it by const &, not by value
, const int *__restrict__ pw
, int cluster
, double alpha
)
{
int i, j, k;
// make sure tseq is big enough
if (tseq == NULL){
tslen = std::max(4096, l+1024);
tseq = new unsigned char[tslen];
memset(tseq, 0, tslen);
} else if (l > tslen-1){
delete tseq;
tslen = l + 4096;
tseq = new unsigned char[tslen];
memset(tseq, 0, tslen);
}
// translate seq into tseq
// (decrementing i so the beginning of tseq will be hot in cache when we're done)
for (i = l; --i >= 0;) tseq[i] = transTable[seq[i]];

const int cols = 4;
const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
double score = 0;
// count up pwcluster
int count=0;
for(k = 0; k < cols; k++) count += pwcluster[k];

double count4alpha = (count + 4*alpha);
long row_offset = 0;
for (i = 0; i < l;){
double product = 1;
for (j = 0; j < 10 && i < l; j++, i++, row_offset += cols){
k = tseq[i];
product *= (pwcluster[row_offset + k] + alpha) / count4alpha;
}
score += log(product);
}
return score;
}

compiles to fairly good code , 但如果没有 -ffast-math 则不能用乘法代替除法。

它不会自动矢量化,因为我们只加载 pwcluster 的每四个元素之一。

关于c++ - 计算序列似然的更快方法?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/37254001/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com