c++ - Where to define CUDA kernels in a program with multiple source files -
See the question in bold below.
I have a functional C++ program that I want to rewrite using CUDA. I have acquired a reasonable understanding of how to use CUDA from various NVIDIA tutorials and a Udacity course. My problem is this: the examples in such tutorials use programs with a very simple structure. Usually it's a single `.cu` file that contains various kernel definitions, followed by a `main()` function that does some stuff, allocates device memory, and runs the kernels. While these simple examples helped me understand how to use CUDA, they don't help me understand how to integrate CUDA code into a more complex program containing classes. My question is about how to structure a CUDA program.
Let's get concrete:
I have a serial particle filter program consisting of the following source files:

- `main.cpp` — runs the main program
- `particle_filter.h` and `particle_filter.cpp` — contain a class that holds the logic of the particle filter
- some other header files that are irrelevant to the question
A lot of the computation happening in the particle filter class is a perfect use case for parallelization. Inside many of the methods of the particle filter class, I want to replace loops with kernel calls.
My question is: **where should the definitions of the kernels go?**
Thanks for any help!
As per a comment below, here is the code of one method defined in `particle_filter.cpp`. The method initializes the particle filter object. I want to replace the `for` loop inside this method with a kernel call. Where do I define the kernel? Does the kernel definition become a method of the class? Or should I define the kernel elsewhere? Should all kernels be defined within the same source file, or in separate ones? I know it's up to me, but what are the best practices here?
```cpp
void ParticleFilter::init(double x, double y, double theta, double std[]) {
    // Set the number of particles
    num_particles = 100;

    // Declare the random generator
    default_random_engine gen;

    // Extract the standard deviations for x, y, and theta
    double std_x = std[0];
    double std_y = std[1];
    double std_theta = std[2];

    // Create normal distributions for x, y and theta
    normal_distribution<double> dist_x(x, std_x);
    normal_distribution<double> dist_y(y, std_y);
    normal_distribution<double> dist_theta(theta, std_theta);

    // Create a vector to contain `num_particles` particles
    particles = vector<Particle>(num_particles);

    // Create a vector to contain the weight of each particle
    weights = vector<double>(num_particles);

    // Loop over the particle vector and initialize each particle to the
    // initial (x, y, theta) position passed in as arguments, with added
    // Gaussian noise, and a weight of 1
    for (int i = 0; i < num_particles; i++) {
        particles[i].id = i;                  // set the particle's id
        particles[i].x = dist_x(gen);         // random x position drawn from `dist_x`
        particles[i].y = dist_y(gen);         // random y position drawn from `dist_y`
        particles[i].theta = dist_theta(gen); // random orientation drawn from `dist_theta`
        particles[i].weight = 1.0;            // set the initial weight of every particle to 1
    }

    is_initialized = true;
}
```
The program you describe is still fairly simple (which is why I'm able to venture an answer at all... one that mostly ignores your actual code).

What I think you need to do is the following:
1. Determine which parts of your program involve a lot of parallelizable work (irrespective of how it's structured now; think of the entirety of the work to be done, as an abstraction).
2. Determine whether your data fits entirely in GPU global memory.
   - 2.1 If it does, the initializations might be relevant on the GPU as well.
   - 2.2 If it doesn't, it's less worthwhile to initialize things on the GPU; but if the initialization on the CPU still needs to be effective and multithreaded, then maybe the GPU is relevant anyway for simplicity's sake.
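To check point 2 at runtime, the CUDA Runtime API's `cudaMemGetInfo()` reports free and total global memory on the current device. A minimal sketch, where the working-set estimate (100 particles of five doubles each) is a hypothetical stand-in for your real sizes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    // Hypothetical estimate of the particle filter's working set:
    // num_particles * (id, x, y, theta, weight), plus slack as needed.
    size_t needed_bytes = 100 * (5 * sizeof(double));

    std::printf("GPU global memory: %zu MiB free of %zu MiB total\n",
                free_bytes >> 20, total_bytes >> 20);
    if (needed_bytes <= free_bytes) {
        std::printf("Data fits; initializing on the GPU is an option.\n");
    } else {
        std::printf("Data does not fit; consider streaming, or init on the CPU.\n");
    }
    return 0;
}
```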
3. Have a `.cu` file for each kernel (and possibly a `.cuh` for device-side functions the kernel calls), or perhaps for each closely-related group of kernels. (Also remember: the same functionality for different types = a single templated kernel in a single file.)
4. Have some wrapper/bridge/whatever piece of code, with a purely-C++ header and a CUDA implementation, which launches your stuff (you have to cross from normal C++ into CUDA somewhere). For now, use something other people have implemented, or write something simple yourself. You may need such a wrapper to exist per-kernel, or one central one.
5. Your `main.cpp` includes the wrapper header(s) and launches the kernels using them; this works because you link the CUDA-compiled and host-compiler-compiled code together.
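A minimal sketch of how points 3–5 fit together. All file and function names here (`particle_init.cuh`, `launch_init_particles`, and so on) are my own assumptions, not from the original post, and the Gaussian noise from the asker's loop is omitted for brevity:

```cuda
// ---- particle_init.cuh : purely-C++ header (no CUDA syntax) ----
// main.cpp can include this and be compiled by the host compiler.
#pragma once

struct Particle { int id; double x, y, theta, weight; };

// Declared here; defined in the nvcc-compiled .cu file below.
void launch_init_particles(Particle* d_particles, int num_particles,
                           double x, double y, double theta);

// ---- particle_init.cu : compiled by nvcc ----
#include <cuda_runtime.h>

__global__ void init_particles_kernel(Particle* particles, int n,
                                      double x, double y, double theta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    particles[i].id     = i;
    particles[i].x      = x;      // Gaussian noise omitted in this sketch
    particles[i].y      = y;
    particles[i].theta  = theta;
    particles[i].weight = 1.0;
}

void launch_init_particles(Particle* d_particles, int num_particles,
                           double x, double y, double theta) {
    const int threads = 256;
    const int blocks  = (num_particles + threads - 1) / threads;
    init_particles_kernel<<<blocks, threads>>>(d_particles, num_particles,
                                               x, y, theta);
    cudaDeviceSynchronize();
}

// ---- main.cpp : compiled by the host compiler, linked with the .cu ----
// #include "particle_init.cuh"
// ...
// launch_init_particles(d_particles, 100, x, y, theta);
```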
I hope I'm making sense even without concrete examples.

Partial concrete example

The trickiest point in the advice above is no. 4. Here's what I use in my own code to "bridge" normally-compiled and nvcc-compiled code:
- A launch mechanism, which compiles differently with a regular compiler and with nvcc.
- A kernel wrapper, which uses the launch mechanism and is usable from regular C++ code.
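For the asker's `init()` method specifically, the per-particle Gaussian draws can also move onto the device via cuRAND's device API. A hedged sketch of what the bridged kernel itself might look like (kernel and parameter names are my own, not from the post):

```cuda
#include <curand_kernel.h>

struct Particle { int id; double x, y, theta, weight; };

// One thread per particle: seed a per-thread cuRAND state, then draw the
// three normally-distributed values that the serial loop drew on the host.
__global__ void init_particles(Particle* particles, int n,
                               double x, double y, double theta,
                               double std_x, double std_y, double std_theta,
                               unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    curandState state;
    curand_init(seed, /*subsequence=*/i, /*offset=*/0, &state);

    particles[i].id     = i;
    particles[i].x      = x     + std_x     * curand_normal_double(&state);
    particles[i].y      = y     + std_y     * curand_normal_double(&state);
    particles[i].theta  = theta + std_theta * curand_normal_double(&state);
    particles[i].weight = 1.0;
}
```

Using the subsequence argument of `curand_init()` per thread keeps the random streams independent across particles without needing per-thread seeds.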
Note that the first file is part of a lightweight, modern-C++-ish CUDA Runtime API wrapper library of mine, which makes host-side coding more convenient in several ways.