c++ - Where to define CUDA kernels in a program with multiple source files -
See the question in bold below.
I have a functional C++ program that I want to rewrite using CUDA. I have acquired a reasonable understanding of how to use CUDA from various NVIDIA tutorials and a Udacity course. My problem is this: the examples in such tutorials use programs with a very simple structure. Usually it's a single `.cu` file that contains various kernel definitions, followed by a `main()` function that does some stuff, allocates device memory, and runs the kernels. While these simple examples helped me understand how to use CUDA, they don't help me understand how to integrate CUDA code into a more complex program containing classes. My question is about how to structure a CUDA program.
Let's get concrete:
I have a serial particle filter program consisting of the following source files:

- `main.cpp` — runs the main program
- `particle_filter.h` and `particle_filter.cpp` — contain a class that holds the logic of the particle filter
- some other header files that are irrelevant to the question
A lot of the computation happening in the particle filter class is a perfect use case for parallelization. Inside many of the methods of the particle filter class, I want to replace loops with kernel calls.
My question is: **where should the definitions of the kernels go?**
Thanks for any help!
As per a comment below, here is the code of one method defined in `particle_filter.cpp`. The method initializes the particle filter object. I want to replace the `for` loop inside this method with a kernel call. Where do I define the kernel? Does the kernel definition become a method of the class? Or should I define the kernel elsewhere? Should all kernels be defined within the same source file, or in separate ones? I know it's up to me, but what are the best practices here?
```cpp
void ParticleFilter::init(double x, double y, double theta, double std[]) {
    // Set the number of particles
    num_particles = 100;

    // Declare the random generator
    default_random_engine gen;

    // Extract the standard deviations for x, y, and theta
    double std_x = std[0];
    double std_y = std[1];
    double std_theta = std[2];

    // Create normal distributions for x, y and theta
    normal_distribution<double> dist_x(x, std_x);
    normal_distribution<double> dist_y(y, std_y);
    normal_distribution<double> dist_theta(theta, std_theta);

    // Create a vector to contain `num_particles` particles
    particles = vector<Particle>(num_particles);

    // Create a vector to contain the weight of each particle
    weights = vector<double>(num_particles);

    // Loop over the particle vector and initialize each particle to the
    // initial (x, y, theta) position passed in as arguments, with added
    // Gaussian noise, and a weight of 1
    for (int i = 0; i < num_particles; i++) {
        particles[i].id = i;                  // set the particle's id
        particles[i].x = dist_x(gen);         // random x position drawn from `dist_x`
        particles[i].y = dist_y(gen);         // random y position drawn from `dist_y`
        particles[i].theta = dist_theta(gen); // random orientation drawn from `dist_theta`
        particles[i].weight = 1.0;            // set the initial weight of every particle to 1
    }

    is_initialized = true;
}
```
The program you describe is still fairly simple (which is why I'm able to venture an answer at all... one that mostly ignores your actual code).

What I think you need to do is the following:
1. Determine which parts of your program involve a lot of parallelizable work (irrespective of how it's structured now; think of the entirety of the work to be done, as an abstraction).
2. Determine whether your data fits entirely in GPU global memory.
   - 2.1 If it does, the initializations might be relevant on the GPU as well.
   - 2.2 If it doesn't, it's less worthwhile to initialize things on the GPU; but if the initialization on the CPU still needs to be effective and multithreaded, then maybe the GPU is relevant anyway for simplicity's sake.
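To check point 2 at runtime, the CUDA Runtime API's `cudaMemGetInfo()` reports free and total global memory on the current device. A minimal sketch, where the working-set estimate (100 particles of five doubles each) is a hypothetical stand-in for your real sizes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    // Hypothetical estimate of the particle filter's working set:
    // num_particles * (id, x, y, theta, weight), plus slack as needed.
    size_t needed_bytes = 100 * (5 * sizeof(double));

    std::printf("GPU global memory: %zu MiB free of %zu MiB total\n",
                free_bytes >> 20, total_bytes >> 20);
    if (needed_bytes <= free_bytes) {
        std::printf("Data fits; initializing on the GPU is an option.\n");
    } else {
        std::printf("Data does not fit; consider streaming, or init on the CPU.\n");
    }
    return 0;
}
```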
3. Have a `.cu` file for each kernel (and possibly a `.cuh` for device-side functions the kernel calls), or perhaps for each closely-related group of kernels. (Also remember: the same functionality for different types = a single templated kernel in a single file.)
4. Have some wrapper/bridge/whatever piece of code, with a purely-C++ header and a CUDA implementation, which launches your stuff (you have to cross from normal C++ into CUDA somewhere). For now, use something other people have implemented, or write something simple yourself. You may need such a wrapper to exist per-kernel, or one central one.
5. Your `main.cpp` includes the wrapper header(s) and launches the kernels using them; this works because you link the CUDA-compiled and host-compiler-compiled code together.
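A minimal sketch of how points 3–5 fit together. All file and function names here (`particle_init.cuh`, `launch_init_particles`, and so on) are my own assumptions, not from the original post, and the Gaussian noise from the asker's loop is omitted for brevity:

```cuda
// ---- particle_init.cuh : purely-C++ header (no CUDA syntax) ----
// main.cpp can include this and be compiled by the host compiler.
#pragma once

struct Particle { int id; double x, y, theta, weight; };

// Declared here; defined in the nvcc-compiled .cu file below.
void launch_init_particles(Particle* d_particles, int num_particles,
                           double x, double y, double theta);

// ---- particle_init.cu : compiled by nvcc ----
#include <cuda_runtime.h>

__global__ void init_particles_kernel(Particle* particles, int n,
                                      double x, double y, double theta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    particles[i].id     = i;
    particles[i].x      = x;      // Gaussian noise omitted in this sketch
    particles[i].y      = y;
    particles[i].theta  = theta;
    particles[i].weight = 1.0;
}

void launch_init_particles(Particle* d_particles, int num_particles,
                           double x, double y, double theta) {
    const int threads = 256;
    const int blocks  = (num_particles + threads - 1) / threads;
    init_particles_kernel<<<blocks, threads>>>(d_particles, num_particles,
                                               x, y, theta);
    cudaDeviceSynchronize();
}

// ---- main.cpp : compiled by the host compiler, linked with the .cu ----
// #include "particle_init.cuh"
// ...
// launch_init_particles(d_particles, 100, x, y, theta);
```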
I hope I'm making sense even without concrete examples.

Partial concrete example

The trickiest point in the advice above is no. 4. Here's what I use in my own code to "bridge" normally-compiled and nvcc-compiled code:
- A launch mechanism, which compiles differently with a regular compiler and with nvcc.
- A kernel wrapper, which uses the launch mechanism and is usable from regular C++ code.
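For the asker's `init()` method specifically, the per-particle Gaussian draws can also move onto the device via cuRAND's device API. A hedged sketch of what the bridged kernel itself might look like (kernel and parameter names are my own, not from the post):

```cuda
#include <curand_kernel.h>

struct Particle { int id; double x, y, theta, weight; };

// One thread per particle: seed a per-thread cuRAND state, then draw the
// three normally-distributed values that the serial loop drew on the host.
__global__ void init_particles(Particle* particles, int n,
                               double x, double y, double theta,
                               double std_x, double std_y, double std_theta,
                               unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    curandState state;
    curand_init(seed, /*subsequence=*/i, /*offset=*/0, &state);

    particles[i].id     = i;
    particles[i].x      = x     + std_x     * curand_normal_double(&state);
    particles[i].y      = y     + std_y     * curand_normal_double(&state);
    particles[i].theta  = theta + std_theta * curand_normal_double(&state);
    particles[i].weight = 1.0;
}
```

Using the subsequence argument of `curand_init()` per thread keeps the random streams independent across particles without needing per-thread seeds.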
Note that the first file is part of a lightweight, modern-C++-ish CUDA Runtime API wrapper library of mine, which makes host-side coding more convenient in several ways.