gpu programming - Errors in Polynomial fitting problem on CUDA -


I tried to use CUDA to run some simple loops on the device, but it seems I find it difficult to understand how CUDA works. I'm getting 0 from every function call when I use the CUDA kernel together with the normal C code. Original code:

  double evaluate(int D, double tmp[], long *nfeval)
  {
      /* Polynomial fitting problem */
      int i, j;
      const int M = 60;
      double px, x = -1, dx = (double)M, result = 0;

      (*nfeval)++;
      dx = 2 / dx;
      for (i = 0; i <= M; i++)
      {
          px = tmp[0];
          for (j = 1; j < D; j++)
          {
              px = x * px + tmp[j];
          }
          if (px < -1 || px > 1)
              result += (1 - px) * (1 - px);
          x += dx;
      }

      px = tmp[0];
      for (j = 1; j < D; j++)
          px = 1.2 * px + tmp[j];
      px = px - 72.661;
      if (px < 0)
          result += px * px;

      px = tmp[0];
      for (j = 1; j < D; j++)
          px = -1.2 * px + tmp[j];
      px = px - 72.661;
      if (px < 0)
          result += px * px;

      return result;
  }

First I wanted to move the loop onto CUDA:

  double evaluate_gpu(int D, double tmp[], long *nfeval)
  {
      /* Polynomial fitting problem */
      int j;
      const int M = 60;
      double px, dx = (double)M, result = 0;

      (*nfeval)++;
      dx = 2 / dx;

      int N = M;
      double *device_tmp = NULL;
      size_t size_tmp = sizeof tmp;
      cudaMalloc((void **)&device_tmp, size_tmp);
      cudaMemcpy(device_tmp, tmp, size_tmp, cudaMemcpyHostToDevice);

      int block_size = 4;
      int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
      cEvaluate<<<n_blocks, block_size>>>(device_tmp, result, D);
      // cudaMemcpy(result, device_result, size_result, cudaMemcpyDeviceToHost);

      px = tmp[0];
      for (j = 1; j < D; j++)
          px = 1.2 * px + tmp[j];
      px = px - 72.661;
      if (px < 0)
          result += px * px;

      px = tmp[0];
      for (j = 1; j < D; j++)
          px = -1.2 * px + tmp[j];
      px = px - 72.661;
      if (px < 0)
          result += px * px;

      return result;
  }

where the device function looks like this:

  __global__ void cEvaluate_temp(double *tmp, double result, int d)
  {
      int M = 60;
      double px;
      double x = -1;
      double dx = (double)M;
      int j;
      dx = 2 / dx;
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < 60)   // <==> if (idx < M)
      {
          px = tmp[0];
          for (j = 1; j < d; j++)
          {
              px = x * px + tmp[j];
          }
          if (px < -1 || px > 1)
          {
              __syncthreads();
              result += (1 - px) * (1 - px);   // +=
          }
          x += dx;
      }
  }

I know that I have not narrowed the problem down precisely, and it appears that I have more than one.

I do not know when I have to copy a variable to the device explicitly, and when it gets copied 'automatically'. Also, I am using CUDA 3.2 and have problems with emulation (I would like to use printf): when I build with nvcc with emu=1, printf compiles without error, but I also get no output.
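For what it's worth, here is a sketch of the explicit transfer pattern (the names `device_result` and `run` are mine, not from the original code): nothing is copied automatically for data behind pointers; scalar kernel arguments passed by value are copied to the device at launch, but writes to them never come back to the host. To get a result out, allocate device memory for it and cudaMemcpy it back:

```cuda
// Hypothetical sketch of explicit host <-> device transfers for one result.
// Kernel parameters passed by value (like d, or the old 'double result') are
// copied in at launch time, but writes to them are lost; output must travel
// through a device pointer.
__global__ void cEvaluate(double *tmp, double *result, int d)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0)
        *result = tmp[0] + d;   // toy computation, just to show the pattern
}

void run(double *tmp, int d, int n)
{
    double *device_tmp = NULL, *device_result = NULL;
    double result = 0;

    cudaMalloc((void **)&device_tmp, n * sizeof(double));   // not sizeof tmp!
    cudaMemcpy(device_tmp, tmp, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&device_result, sizeof(double));

    cEvaluate<<<1, 64>>>(device_tmp, device_result, d);

    // cudaMemcpy blocks until the kernel has finished, then copies back.
    cudaMemcpy(&result, device_result, sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(device_tmp);
    cudaFree(device_result);
}
```

Note that in evaluate_gpu above, `sizeof tmp` yields the size of a pointer (an array parameter decays to `double *`), so only 4 or 8 bytes get copied; the element count has to be passed in explicitly.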

Below is the simplest version of the device function I have tested. Can anyone explain what happens to the result value after it is incremented in parallel? I think I should use device shared memory and synchronization to do sth like "+=".

  __global__ void cEvaluate(double *tmp, double result, int d)
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < 60)   // <==> if (idx < m)
      {
          result += 1;
          printf("res = %f", result);   // -deviceemu, emu=1
      }
  }

No, the variable result is not shared across the threads.

What I recommend is to have an array of values in shared memory, one result per thread; compute each value there, and then reduce it to a single value.

  __global__ void cEvaluate_temp(double *tmp, double *global_result, int d)
  {
      int M = 60;
      double px;
      double x = -1;
      double dx = (double)M;
      int j;
      dx = 2 / dx;
      int idx = blockIdx.x * blockDim.x + threadIdx.x;

      __shared__ double result[BLOCK_SIZE];   // one partial result per thread

      if (idx >= 60)
          return;

      result[threadIdx.x] = 0;
      px = tmp[0];
      for (j = 1; j < d; j++)
      {
          px = x * px + tmp[j];
      }
      if (px < -1 || px > 1)
      {
          result[threadIdx.x] += (1 - px) * (1 - px);
      }
      x += dx;

      __syncthreads();
      if (threadIdx.x == 0)
      {
          double total_result = 0;
          for (j = 0; j < blockDim.x; j++)
              total_result += result[j];
          global_result[0] = total_result;
      }
  }

In addition to this, you need a cudaMemcpy after the kernel invocation to bring the result back. The kernel launch is asynchronous and needs a synchronizing call.

Also, use an error-check function on each CUDA API invocation.
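A common pattern (the macro name here is mine, not from the original answer) is a small wrapper that checks the return code of every call:

```cuda
// Minimal error-check wrapper sketch. Every CUDA runtime call returns a
// cudaError_t; check it immediately and print a readable message.
#include <stdio.h>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
        }                                                           \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void **)&device_tmp, n * sizeof(double)));
//   cEvaluate_temp<<<n_blocks, block_size>>>(device_tmp, device_result, d);
//   CUDA_CHECK(cudaGetLastError());   // catches kernel launch errors
```

Kernel launches themselves return nothing, which is why cudaGetLastError() right after the launch is the way to catch launch failures.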
