This patch removes much of the crufty code that was needed to support arbitrarily sized vector reductions. The plan going forward is to fix vector_length at a size such that vector loops don't require any synchronization after the loop has terminated. On nvptx targets, vector_length = warp_sz, which is currently 32 threads. I'll follow up with another patch that adds support for tree reductions in vector loops at a later date.
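For illustration only, here is a minimal sketch of the kind of user code this affects: an OpenACC vector loop with a sum reduction. The function name sum_vector is hypothetical and not part of the patch, and the explicit vector_length(32) clause is an assumption that simply mirrors the nvptx warp size mentioned above. When compiled without OpenACC support, the pragma is ignored and the loop runs serially with the same result.

```c
/* Hypothetical example, not taken from the patch: sums N ints with an
   OpenACC vector reduction.  vector_length(32) is assumed here to match
   the nvptx warp size described above.  */
int
sum_vector (const int *a, int n)
{
  int sum = 0;
  int i;

  /* Each vector lane accumulates a partial sum; the partial sums are
     combined into SUM when the loop finishes.  */
#pragma acc parallel loop vector vector_length(32) reduction(+:sum) copyin(a[0:n])
  for (i = 0; i < n; i++)
    sum += a[i];

  return sum;
}
```

Because the vector length equals one warp in this sketch, all lanes taking part in the reduction belong to the same warp, which is the situation the patch relies on to avoid synchronization after the loop.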
This patch has been applied to gomp-4_0-branch.

Cesar