[Bug libstdc++/57916] New: Improve std::sort partitioning by explicitly employing the pivot

2013-07-16 Thread alexey.tourbin at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57916

Bug ID: 57916
   Summary: Improve std::sort partitioning by explicitly employing
the pivot
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexey.tourbin at gmail dot com

Created attachment 30515
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30515&action=edit
Move the median back to its pivot position. Exclude the pivot from further
partitioning.

In the original SGI code, partitioning does not use the concept of a pivot
explicitly.  The routine __unguarded_partition(first, last, pivot) only cuts
the sequence into two halves: [ <= pivot | >= pivot].  Note that the actual
pivot element used for partitioning can end up anywhere within the sequence,
not necessarily adjacent to the cut point.  (Also, note that the name
"__unguarded" indicates that certain boundary checks are omitted, which places
some restrictions on how the pivot can be selected.)
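
For reference, here is a condensed sketch of that scheme (identifier names
simplified; this is not the verbatim libstdc++ source):

#include <algorithm>

template <typename Iter, typename T>
Iter unguarded_partition(Iter first, Iter last, const T& pivot)
{
    for (;;) {
        while (*first < pivot)      // no "first != last" check: "unguarded";
            ++first;                // a sentinel >= pivot must exist on the right
        --last;
        while (pivot < *last)       // likewise relies on a sentinel <= pivot
            --last;                 // on the left
        if (!(first < last))
            return first;           // the cut point; the pivot element itself
                                    // may end up on either side of it
        std::iter_swap(first, last);
        ++first;
    }
}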

In 2009, Chris Jefferson introduced a new helper function,
__unguarded_partition_pivot(first, last), which calls __unguarded_partition
using the median of 3 elements as the pivot.  The median is actually placed
into the first element of the sequence and excluded from partitioning:
[ pivot | <= pivot | >= pivot ].  (Since this pivot selection scheme
satisfies the restrictions placed by __unguarded_partition, the "__unguarded"
part ought to be dropped from the name of this helper function.)
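
A matching sketch of the 2009 scheme (again with simplified names, reusing
unguarded_partition from above; move_median_first imitates the libstdc++
median-of-3 helper):

// Swap the median of *a, *b, *c into *a.
template <typename Iter>
void move_median_first(Iter a, Iter b, Iter c)
{
    if (*a < *b) {
        if (*b < *c)
            std::iter_swap(a, b);   // a < b < c: b is the median
        else if (*a < *c)
            std::iter_swap(a, c);   // a < c <= b: c is the median
                                    // else c <= a < b: a is already the median
    } else if (*a < *c) {
        // b <= a < c: a is already the median
    } else if (*b < *c) {
        std::iter_swap(a, c);       // b < c <= a: c is the median
    } else {
        std::iter_swap(a, b);       // c <= b <= a: b is the median
    }
}

template <typename Iter>
Iter partition_pivot(Iter first, Iter last)
{
    Iter mid = first + (last - first) / 2;
    move_median_first(first, mid, last - 1);    // park the pivot at *first
    // Partition everything after the parked pivot; the layout on
    // return is [ pivot | <= pivot | >= pivot ].
    return unguarded_partition(first + 1, last, *first);
}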

However, when partitioning is finished, this helper routine does NOT move the
pivot into its final position: [ <= pivot | pivot | >= pivot ].  Although the
resulting code is arguably still correct, it departs from common practice and
misses an optimization opportunity.  Indeed, by moving the pivot into its
final position, we can exclude it from further partitioning stages.  Although
the change is small, it propagates throughout the partitioning tree and
produces a cumulative effect.  Measurements (see below) show that this change
improves std::sort running time by 2% to 3%.
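
Schematically, the fix amounts to the following (simplified names again;
the exact code is in the attached patch):

const long sort_threshold = 16;     // short runs are left to insertion sort

template <typename Iter, typename Size>
void introsort_loop(Iter first, Iter last, Size depth_limit)
{
    while (last - first > sort_threshold) {
        if (depth_limit == 0)
            return;                 // the real code falls back to heapsort
        --depth_limit;
        Iter cut = partition_pivot(first, last);
        std::iter_swap(first, cut - 1); // pivot into its final position:
                                        // [ <= pivot | pivot | >= pivot ]
        introsort_loop(cut, last, depth_limit);
        last = cut - 1;             // exclude the pivot from the left part
    }
}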

The same technique of excluding the pivot can also be applied to the
__introselect helper function (used to implement the nth_element routine).
In addition, we can check whether the pivot returned by partitioning is
exactly the nth element, and return immediately in this case.  Although an
improvement in running time, if any, is hard to measure, there is an
important special case: when nth_element is used to select the middle element
of an already sorted sequence, only one partitioning stage will be executed.
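
In introselect form the same idea looks roughly like this (schematic,
simplified names):

template <typename Iter, typename Size>
void introselect(Iter first, Iter nth, Iter last, Size depth_limit)
{
    while (last - first > 3) {
        if (depth_limit == 0)
            return;                 // the real code falls back to a
                                    // heap-based selection
        --depth_limit;
        Iter cut = partition_pivot(first, last);
        std::iter_swap(first, cut - 1); // pivot into its final position
        if (cut - 1 == nth)
            return;                 // the pivot IS the nth element
        if (nth < cut - 1)
            last = cut - 1;         // nth lies in the left part
        else
            first = cut;            // nth lies in the right part
    }
    // the remaining few elements are handled by insertion sort (omitted)
}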

I use the following program to evaluate the effect on the running time of
std::sort.

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#define N (64 << 20)
int a[N];
int main()
{
    for (size_t i = 0; i < N; i++)
        a[i] = rand();
    std::clock_t start = std::clock();
    std::sort(a, a + N);
    std::cout << (clock() - start) / (double) CLOCKS_PER_SEC << std::endl;
    return 0;
}

(before the change)
$ g++ -O2 test-sort.cc && ./a.out
9.48

(after the change)
$ g++ -I. -O2 test-sort.cc && ./a.out
9.29

Callgrind annotations also indicate an improvement: the number of instruction
reads attributable to sorting has actually dropped by about 10%!

(before this change)
16,344,046,167  PROGRAM TOTALS
10,249,743,434  ???:void std::__introsort_loop<int*, long>(int*, int*, long)'2
 2,263,595,692  ???:main
 1,742,665,662  /usr/src/debug/glibc-2.17-alt5/stdlib/random_r.c:random_r
 1,140,850,688  /usr/src/debug/glibc-2.17-alt5/stdlib/random.c:random
   677,198,672  ???:void std::__introsort_loop<int*, long>(int*, int*, long)
   268,435,456  /usr/src/debug/glibc-2.17-alt5/stdlib/rand.c:rand

(after the change)
14,976,731,729  PROGRAM TOTALS
 9,180,509,971  ???:void std::__introsort_loop<int*, long>(int*, int*, long)'2
 2,110,233,226  ???:main
 1,742,665,662  /usr/src/debug/glibc-2.17-alt5/stdlib/random_r.c:random_r
 1,140,850,688  /usr/src/debug/glibc-2.17-alt5/stdlib/random.c:random
   532,670,993  ???:void std::__introsort_loop<int*, long>(int*, int*, long)
   268,435,456  /usr/src/debug/glibc-2.17-alt5/stdlib/rand.c:rand


[Bug libstdc++/57916] Improve std::sort partitioning by explicitly employing the pivot

2013-07-17 Thread alexey.tourbin at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57916

--- Comment #2 from Alexey Tourbin ---
> Thanks for the patch, do you have a GCC copyright assignment in place?

I don't have a copyright assignment, but I'm willing to undergo the procedure
(and perhaps to sign an assignment for all future changes).  So how do I
proceed?


[Bug tree-optimization/57999] New: Missed constant propagation into trampolines

2013-07-27 Thread alexey.tourbin at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57999

Bug ID: 57999
   Summary: Missed constant propagation into trampolines
   Product: gcc
   Version: 4.8.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexey.tourbin at gmail dot com

Created attachment 30562
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30562&action=edit
mergesort.c

The attached source file is a simplistic mergesort implementation (which
actually works, except for the deliberate assertion failure).  Its basic
structure is as follows.

static inline void msort(void *b, size_t n, const size_t s,
                         int (*cmp)(), void *t)
{
    auto void msort_rec(char *b, size_t n);
    void msort_rec(char *b, size_t n)
    {
        /* Copying will be inlined */
        assert(__builtin_constant_p(s));
        ...
        /* Recursively sort a1 and a2 */
        msort_rec(b1, n1);
        msort_rec(b2, n2);
        ...
        /* Merge */
        ...
        if (cmp(b1, b2) <= 0) {
            n1--;
            memcpy(tmp, b1, s);
            b1 += s;
            ...
        }
    }
    msort_rec(b, n);
}

When the size of the element (size_t s) is a compile-time constant, the
resulting code would benefit greatly from inlining the per-element memcpy()
calls.  However, gcc fails to propagate the constant size into the
trampoline: the assertion actually fails, and with the assertion removed the
resulting code is roughly 2 times slower.
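
For contrast, here is a minimal standalone sketch (mine, not from the
attached file; copy_elem is a hypothetical name) of the propagation the
assertion tests for.  With an ordinary inline function, gcc -O2 inlines the
call, the constant size reaches the memcpy, and __builtin_constant_p(s)
folds to true; it is this step that fails once a trampoline is involved.

#include <assert.h>
#include <string.h>

static inline void copy_elem(char *dst, const char *src, size_t s)
{
    /* Holds after inlining propagates the caller's constant size. */
    assert(__builtin_constant_p(s));
    memcpy(dst, src, s);        /* expanded inline when s is constant */
}

int main(void)
{
    char a[8] = "abcdefg", b[8];
    copy_elem(b, a, sizeof a);  /* sizeof a is a compile-time constant */
    return b[0] != 'a';
}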