Why can't I get the right sum of a 1D array with Numba (CUDA Python)?
I am trying to use CUDA Python with Numba. The code below calculates the sum of a 1D array, but I don't understand why I get three values as the result rather than one.

Environment: Python 3.5, Numba, CUDA 8.0
    import os, sys, time
    import pandas as pd
    import numpy as np
    from numba import cuda, float32

    os.environ['NUMBAPRO_NVVM'] = r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
    os.environ['NUMBAPRO_LIBDEVICE'] = r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'

    bpg = (1, 1)
    tpb = (1, 3)

    @cuda.jit
    def calcu_sum(d, t):
        ty = cuda.threadIdx.y
        bh = cuda.blockDim.y
        index_i = ty
        L = len(d)
        su = 0
        while index_i < L:
            su += d[index_i]
            index_i += bh
        print('su:', su)
        t[0, 0] = su
        print('t:', t[0, 0])

    d = np.array([0.42487645, 0.41607881, 0.42027071, 0.43751907, 0.43512794,
                  0.43656972, 0.43940639, 0.43864551, 0.43447691, 0.43120232],
                 dtype=np.float32)
    t = np.empty([1, 1])
    print('d: ', d)
    stream = cuda.stream()
    with stream.auto_synchronize():
        dd = cuda.to_device(d, stream)
        dt = cuda.to_device(t, stream)
        calcu_sum[bpg, tpb, stream](dd, dt)
The output is:

    d:  [ 0.42487645  0.41607881  0.42027071  0.43751907  0.43512794  0.43656972
      0.43940639  0.43864551  0.43447691  0.43120232]
    su: 1.733004
    su: 1.289852
    su: 1.291317
    t: 1.733004
    t: 1.289852
    t: 1.291317
Why is the output "1.733004 1.289852 1.291317" rather than "4.31417383"? Note that 1.733004 + 1.289852 + 1.291317 = 4.314173.
I'm new to Numba and have read the Numba documentation, but I don't know how to do this. Can anyone give me advice?
The reason you don't get the sum you expect is because you haven't written code to produce that sum.
The basic CUDA programming model (whether you use CUDA C, Fortran or Python as the language) is that you write kernel code which is executed by each thread. You have written code which has each thread read and sum part of the input array. You have not written any code which has those threads share and sum their individual partial sums into a final sum.
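To see where the three printed values come from, the strided per-thread sums can be reproduced on the host with plain NumPy. This is only an illustrative sketch of what each of the question's three threads computes, not GPU code; `bh` here stands in for the `blockDim.y` of 3 implied by `tpb = (1, 3)`:

```python
import numpy as np

# The input array from the question
d = np.array([0.42487645, 0.41607881, 0.42027071, 0.43751907, 0.43512794,
              0.43656972, 0.43940639, 0.43864551, 0.43447691, 0.43120232],
             dtype=np.float32)

bh = 3  # stands in for blockDim.y: three threads, as in tpb = (1, 3)

# Thread ty sums the strided slice d[ty], d[ty + bh], d[ty + 2*bh], ...
partials = [float(d[ty::bh].sum()) for ty in range(bh)]
print(partials)       # three per-thread partial sums, matching the 'su:' lines
print(sum(partials))  # the combining step the kernel never performs
```

Each thread's `su` is one of these partial sums; because all three threads write their own `su` into the same `t[0,0]`, you see three values instead of their combined total.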
There is an extremely well described algorithm for doing this -- it is called a parallel reduction. You can find an introduction to the algorithm in a PDF which ships in the examples of every version of the CUDA toolkit, or download a presentation about it here. You can also read about a more modern version of the algorithm which uses newer CUDA features (warp shuffle instructions and atomic transactions) here.
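To make the idea concrete before reading those references, here is a host-side Python sketch (ordinary sequential code, not a kernel) of the classic tree-based reduction: the stride doubles each round and the number of active positions halves, until element 0 holds the total. On a GPU, each iteration of the inner loop would be performed by a separate thread, with a barrier between rounds:

```python
def tree_reduce(vals):
    """Host-side simulation of a tree-based parallel reduction.

    At each round, position i accumulates position i + stride, and the
    stride doubles, so after about log2(n) rounds vals[0] holds the total.
    On the GPU, the inner loop's iterations run as parallel threads,
    separated by a barrier (cuda.syncthreads()) between rounds.
    """
    vals = list(vals)
    n = len(vals)
    stride = 1
    while stride < n:
        for i in range(0, n, 2 * stride):
            if i + stride < n:
                vals[i] += vals[i + stride]
        stride *= 2
    return vals[0]

print(tree_reduce([1, 2, 3, 4, 5]))  # 15
```

The simple loop in the answer's kernel below, where thread 0 serially adds all the partial sums, is the crudest form of this combining step; the tree version is what the linked material builds towards.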
After you have studied the reduction algorithm, you will need to adapt the standard CUDA C kernel code into the Numba Python kernel dialect. At the bare minimum, something like this:
    tpb = (1, 3)

    @cuda.jit
    def calcu_sum(d, t):
        ty = cuda.threadIdx.y
        bh = cuda.blockDim.y
        index_i = ty
        sbuf = cuda.shared.array(tpb, float32)

        L = len(d)
        su = 0
        while index_i < L:
            su += d[index_i]
            index_i += bh
        print('su:', su)

        # Stage each thread's partial sum in shared memory, then have
        # thread 0 combine them after all threads have reached the barrier
        sbuf[0, ty] = su
        cuda.syncthreads()

        if ty == 0:
            t[0, 0] = 0
            for i in range(0, bh):
                t[0, 0] += sbuf[0, i]
            print('t:', t[0, 0])
This will do what you want, although it is still a long way from an optimal parallel shared memory reduction, as you will see when you read the material the links point to.