cuda - How to calculate the required heap size for dynamic memory allocation in a kernel?
I am facing a problem: if I set the CUDA heap size to the total amount of memory I need to allocate in the kernel, the heap is still not big enough to serve the allocations.
This minimal example represents my use case:
#include <stdio.h>

#define narr 8

__global__ void heaptest(int n){
    double* arr[narr];
    __shared__ double* arrs[narr];

    // One thread per block allocates all arrays on the device heap.
    if(threadIdx.x == 0){
        for(int i = 0; i < narr; i++){
            arrs[i] = (double*) malloc(sizeof(double) * n);
            if(arrs[i] == NULL)
                printf("block %d, array %d null\n", blockIdx.x, i);
        }
    }
    __syncthreads();

    // Every thread copies the shared pointers into its local array.
    for(int i = 0; i < narr; i++){
        arr[i] = arrs[i];
    }
}

// Exact number of bytes one block requests from the heap.
size_t getheapsizeperblock(int n){
    return sizeof(double) * n * narr;
}

int main(){
    int n = 4000 * 18;
    int nblocks = 1;

    size_t myheapsize = getheapsizeperblock(n) * nblocks;
    printf("set heap size %lu\n", myheapsize);
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, myheapsize);

    size_t a;
    cudaDeviceGetLimit(&a, cudaLimitMallocHeapSize);
    printf("heap size %lu\n", a);

    heaptest<<<nblocks, 128>>>(n);
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}

I compile with nvcc v8.0.61:
nvcc -arch=sm_60 heaptest.cu -o heaptest
The program output is:

set heap size 4608000
heap size 4653056
block 0, array 7 null

So, even though the heap size is larger than the required size, it is still not large enough. How do I correctly calculate the required size in this case?
You may not be able to compute the exact size required for your application's heap, since you don't have control over CUDA's memory manager. Just as CPU memory allocations go through the OS's memory manager, CUDA has its own memory manager. When you allocate multiple arrays on the heap, you have no guarantee that they will all fit in the heap size you set, because there may be some overhead.
To exemplify, I made a small modification to your code to print the memory address returned by malloc:
printf("block %d, array %d %p\n", blockIdx.x, i, arrs[i]);

Here's what I get on a GTX 1070:

block 0, array 0 0x102059a8d20
block 0, array 1 0x10205600120
block 0, array 2 0x1020568f280
block 0, array 3 0x10205738520
block 0, array 4 0x102057c7680
block 0, array 5 0x10205870920
block 0, array 6 0x102058ffa80
block 0, array 7 (nil)

The first thing to note is that the memory locations are not (always) contiguous or increasing (e.g., array 0 > array 6 > ... > array 1), but that's not important for us. Also, if you subtract the memory addresses in decreasing order, you do not get the size passed to malloc(), which in this case is sizeof(double) * n, or 576000 bytes. For example:
0x1020568f280 - 0x10205600120 = 586080 bytes (array 1)
0x10205738520 - 0x1020568f280 = 692896 bytes (array 2)
Since these blocks are not spaced in memory according to the block size passed to malloc(), you can verify that there are indeed memory chunks that cannot be used to allocate blocks of 576000 bytes. Between arrays 1 and 2 you have 10080 unused bytes, and between arrays 2 and 3, 116896 bytes (that's more than 20% of the block size!).
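The subtraction above can be automated on the host. The following is only a sketch: `gapsBetweenAllocations` is a hypothetical helper name, not part of CUDA, and it assumes the printed addresses have been copied into a vector by hand. It sorts the addresses and reports how many bytes beyond the requested size separate each pair of neighbours. It says nothing about what those bytes hold (alignment padding, allocator metadata, or free space).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: given the addresses returned by device-side malloc()
// and the size requested for each allocation, return the number of bytes
// between neighbouring allocations that exceed the requested size.
std::vector<std::uint64_t> gapsBetweenAllocations(std::vector<std::uint64_t> addrs,
                                                  std::uint64_t requested) {
    std::sort(addrs.begin(), addrs.end());  // addresses are not returned in order
    std::vector<std::uint64_t> gaps;
    for (std::size_t i = 1; i < addrs.size(); ++i) {
        std::uint64_t stride = addrs[i] - addrs[i - 1];
        // Bytes between the two start addresses beyond what was asked for.
        gaps.push_back(stride > requested ? stride - requested : 0);
    }
    return gaps;
}
```

Called with the addresses of arrays 1, 2 and 3 from the output above and requested = 576000, it returns 10080 and 116896, the same figures as the manual subtraction.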
What I would do is avoid allocating memory dynamically on the heap, and instead allocate it during host-code execution. If you need dynamic allocation for some reason, I suggest setting the heap size with an overhead margin and testing beforehand to see what is enough. You should at least expect some overhead for heap allocation. It shouldn't be too big, so maybe start by allocating 10% extra and go from there, if necessary.
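As a rough sketch of the margin idea, the heap size can be computed from the exact payload plus a percentage. `heapSizeWithMargin` and its parameter names are hypothetical, and the 10% used below is only the starting guess suggested above, not a value known to be sufficient. The result would then be passed to cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) before launching the kernel.

```cpp
#include <cstddef>

// Hypothetical helper: exact payload requested by all blocks, padded by
// marginPercent to leave room for the allocator's overhead.
// Integer arithmetic avoids floating-point rounding surprises.
std::size_t heapSizeWithMargin(std::size_t bytesPerArray,
                               std::size_t arraysPerBlock,
                               std::size_t nBlocks,
                               std::size_t marginPercent) {
    std::size_t payload = bytesPerArray * arraysPerBlock * nBlocks;
    return payload + payload * marginPercent / 100;
}
```

For the example in the question (576000 bytes per array, 8 arrays, 1 block), a 10% margin gives 5068800 bytes instead of 4608000. If malloc() still returns NULL in the kernel, increase the margin and test again.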