High-performance calculations and saving of thread identifiers


By : Timmy
Date : September 15 2020, 01:00 PM
If your output is relatively dense (i.e. a lot of indices and relatively few zeros), then the stream compaction approach suggested in the comments is a good solution. There are a lot of ready-to-go stream compaction implementations which you can probably adapt to your purposes.
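For the dense case, Thrust's compaction primitives are one such ready-to-go option. The sketch below is only illustrative and is not part of the original answer: it uses thrust::copy_if with a counting iterator to collect the indices of matching elements, with an arbitrary "value is a multiple of 10000" predicate chosen to mirror the example further down. Note how the output buffer has to be sized for the worst case, which is exactly the memory waste discussed next.
code :
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Illustrative predicate: keep indices whose input value is a multiple of 10000
struct is_multiple_of_10000
{
    __host__ __device__
    bool operator()(int v) const { return v % 10000 == 0; }
};

int main()
{
    const int N = 1 << 20;
    thrust::device_vector<int> input(N);
    thrust::sequence(input.begin(), input.end(), 1);   // 1, 2, 3, ...

    // Worst case every element matches, so the output is sized like the input
    thrust::device_vector<int> indices(N);

    // copy_if with a stencil: copy index i when input[i] satisfies the predicate
    auto end = thrust::copy_if(thrust::counting_iterator<int>(0),
                               thrust::counting_iterator<int>(N),
                               input.begin(),           // stencil
                               indices.begin(),
                               is_multiple_of_10000());
    indices.resize(end - indices.begin());              // compacted index list
    return 0;
}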
If your output is sparse, so that you need to save relatively few indices for a lot of inputs, then stream compaction isn't such a great solution because it will waste a lot of GPU memory. In that case (and if you can roughly estimate an upper bound on the number of output indices), something like this works better:
code :
template <typename T>
struct Array 
{
    T*   p;     // output buffer
    int  Nmax;  // largest index in p that may be written
    int* next;  // shared counter holding the next free slot

    Array() = default;

    __host__ __device__ 
    Array(T* _p, int _Nmax, int* _next) : p(_p), Nmax(_Nmax), next(_next) {};

    // Atomically reserve a slot and store val there.
    // Returns the slot used, or -1 if the buffer is already full.
    __device__
    int append(T& val)
    {
        int pos = atomicAdd(next, 1);
        if (pos > Nmax) {
            atomicExch(next, Nmax);  // clamp the counter so it can't run away
            return -1;
        } else {           
            p[pos] = val;
            return pos;
        }
    };
};
$ cat append.cu 

#include <iostream>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>

namespace AppendArray
{
    template <typename T>
    struct Array 
    {
        T*  p;
        int Nmax;
        int* next;  

        Array() = default;

        __host__ __device__ 
        Array(T* _p, int _Nmax, int* _next) : p(_p), Nmax(_Nmax), next(_next) {};

        __device__
        int append(T& val)
        {
            int pos = atomicAdd(next, 1);
            if (pos > Nmax) {
                atomicExch(next, Nmax);
                return -1;
            } else {           
                p[pos] = val;
                return pos;
            }
        };
    };
}

__global__ 
void kernelfind(int* input, int N, AppendArray::Array<int> indices)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    for(; idx < N; idx += gridDim.x*blockDim.x) {
        if (input[idx] % 10000 == 0) {
            if (indices.append(idx) < 0) return;
        }
    }
}

int main()
{
    const int Ninputs =  1 << 20;
    thrust::device_vector<int> inputs(Ninputs);
    thrust::counting_iterator<int> vals(1);
    thrust::copy(vals, vals + Ninputs, inputs.begin());
    int* d_input = thrust::raw_pointer_cast(inputs.data());

    int Nindices =  Ninputs >> 12;
    thrust::device_vector<int> indices(Nindices);
    int* d_indices = thrust::raw_pointer_cast(indices.data());

    int* pos; cudaMallocManaged(&pos, sizeof(int)); *pos = 0;

    AppendArray::Array<int> index(d_indices, Nindices-1, pos);

    int gridsize, blocksize;
    cudaOccupancyMaxPotentialBlockSize(&gridsize, &blocksize, kernelfind, 0, 0);

    kernelfind<<<gridsize, blocksize>>>(d_input, Ninputs, index);
    cudaDeviceSynchronize();

    for(int i = 0; i < *pos; ++i) {
        int idx = indices[i];
        std::cout << i << " " << idx << "  " << inputs[idx] << std::endl;   
    }
    return 0;
}

$ nvcc -std=c++11 -arch=sm_52 -o append append.cu

$ ./append
0 9999  10000
1 19999  20000
2 29999  30000
3 39999  40000
4 49999  50000
5 69999  70000
6 79999  80000
7 59999  60000
8 89999  90000
9 109999  110000
10 99999  100000
11 119999  120000
12 139999  140000
13 129999  130000
14 149999  150000
15 159999  160000
16 169999  170000
17 189999  190000
18 179999  180000
19 199999  200000
20 209999  210000
21 219999  220000
22 239999  240000
23 249999  250000
24 229999  230000
25 279999  280000
26 269999  270000
27 259999  260000
28 319999  320000
29 329999  330000
30 289999  290000
31 299999  300000
32 339999  340000
33 349999  350000
34 309999  310000
35 359999  360000
36 379999  380000
37 399999  400000
38 409999  410000
39 369999  370000
40 429999  430000
41 419999  420000
42 389999  390000
43 439999  440000
44 459999  460000
45 489999  490000
46 479999  480000
47 449999  450000
48 509999  510000
49 539999  540000
50 469999  470000
51 499999  500000
52 569999  570000
53 549999  550000
54 519999  520000
55 589999  590000
56 529999  530000
57 559999  560000
58 619999  620000
59 579999  580000
60 629999  630000
61 669999  670000
62 599999  600000
63 609999  610000
64 699999  700000
65 639999  640000
66 649999  650000
67 719999  720000
68 659999  660000
69 679999  680000
70 749999  750000
71 709999  710000
72 689999  690000
73 729999  730000
74 779999  780000
75 799999  800000
76 809999  810000
77 739999  740000
78 849999  850000
79 759999  760000
80 829999  830000
81 789999  790000
82 769999  770000
83 859999  860000
84 889999  890000
85 879999  880000
86 819999  820000
87 929999  930000
88 869999  870000
89 839999  840000
90 909999  910000
91 939999  940000
92 969999  970000
93 899999  900000
94 979999  980000
95 959999  960000
96 949999  950000
97 1019999  1020000
98 1009999  1010000
99 989999  990000
100 1029999  1030000
101 919999  920000
102 1039999  1040000
103 999999  1000000
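
Note that the saved indices do not come out in ascending order: each thread reserves its slot with atomicAdd in whatever order the blocks happen to execute, so the ordering is non-deterministic from run to run. If you need the list sorted, sort the first *pos entries afterwards (for example with thrust::sort).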



High Performance Math Library for Vector And Matrix Calculations


By : user3049938
Date : March 29 2020, 07:55 AM
Edit: Sorry, I don't have enough reputation to add comments anywhere, but I just noticed you mentioned this is for games. Unless you are writing your engine in software without any hardware acceleration, the benefit you will get from the matrix manipulations (assuming they are for TCL) is about nil compared to the rest of your code.

High performance calculations with Ruby?


By : user3315820
Date : March 29 2020, 07:55 AM
There's one option you didn't include: it's rather easy to extend Ruby in C, so if you have a limited set of operations that need to be fast, you can write a C extension and then use it from Ruby.
http://people.apache.org/~rooneg/talks/ruby-extensions/ruby-extensions.html

linux high performance message passing between threads in c++


By : Mohamed Ebrahim Atia
Date : March 29 2020, 07:55 AM
The most efficient method that I can think of would use one linked list, one mutex, and one condition variable, along the lines of the sketch below.
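The code that originally accompanied this answer is not reproduced here. The following is a minimal sketch of the idea, assuming a std::list as the linked list and a std::condition_variable to wake the consumer; the BlockingQueue name and its push/pop interface are illustrative, not from the original answer.
code :
#include <condition_variable>
#include <iostream>
#include <list>
#include <mutex>
#include <thread>

// One mutex, one condition variable, one linked list shared between
// producer and consumer threads.
template <typename T>
class BlockingQueue
{
    std::list<T> items;
    std::mutex m;
    std::condition_variable cv;

public:
    void push(T value)
    {
        {
            std::lock_guard<std::mutex> lock(m);
            items.push_back(std::move(value));
        }
        cv.notify_one();   // wake a waiting consumer
    }

    T pop()
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !items.empty(); });
        T value = std::move(items.front());
        items.pop_front();
        return value;
    }
};

int main()
{
    BlockingQueue<int> queue;
    std::thread producer([&] { for (int i = 0; i < 5; ++i) queue.push(i); });
    for (int i = 0; i < 5; ++i)
        std::cout << queue.pop() << "\n";
    producer.join();
    return 0;
}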

High performance unique timestamp id for multiple threads in Haskell


By : Vinay_123456
Date : March 29 2020, 07:55 AM
You can use atomicModifyIORef to implement an atomic counter. With GHC, it's implemented using atomic operations, not locks.
code :
import Data.IORef
import System.IO.Unsafe

counter :: IORef Int
counter = unsafePerformIO $ newIORef 0
{-# NOINLINE counter #-}  -- keep GHC from inlining/duplicating the global IORef

getUnique :: IO Int
getUnique = atomicModifyIORef counter $ \x -> let y = x + 1 in (y, y)

Performance degradation for high numbers of threads in Rust


By : Surgstorm
Date : March 29 2020, 07:55 AM
If all your threads spend all their time hammering on your lock-free data structure, then yes, you'll get contention once you have enough threads. With enough writers they'll contend for the same cache line in the table more often. (Also, the time spent in the PRNG possibly doesn't hide contention for shared bandwidth to cache or DRAM.)
Instead of just a plateau, you may start hitting more CAS retries and the like, including any contention back-off mechanism. Threads will also suffer cache misses and even memory-order mis-speculation pipeline clears from some atomic reads; not everything will be atomic RMWs or writes.
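The contention effect described above is not specific to Rust. As a purely illustrative C++ sketch (not from the original answer), comparing many threads doing relaxed fetch_add on one shared atomic against threads updating their own padded counters makes the cache-line contention visible; the 64-byte line size used for padding is an assumption about typical hardware.
code :
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Pad each counter out to (roughly) its own cache line, assuming 64-byte lines.
struct PaddedCounter { std::atomic<long> value{0}; char pad[56]; };

long run_ms(int nthreads, bool use_shared)
{
    std::atomic<long> shared_counter{0};
    std::vector<PaddedCounter> local(nthreads);
    std::vector<std::thread> workers;

    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            for (int i = 0; i < 1000000; ++i) {
                if (use_shared)
                    shared_counter.fetch_add(1, std::memory_order_relaxed);
                else
                    local[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    for (auto& w : workers) w.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - start).count();
}

int main()
{
    for (int n : {1, 4, 16}) {
        std::cout << n << " threads, shared counter:  " << run_ms(n, true)  << " ms\n";
        std::cout << n << " threads, padded counters: " << run_ms(n, false) << " ms\n";
    }
    return 0;
}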