On 16/07/19 18:40 +0200, François Dumont wrote:
Hi

    I eventually spent much more time working on the inplace_merge performance bench.

    And the results do not confirm the theory:

Before patch:
inplace_merge.cc    bench 1 / 1 memory    243r  227u  17s  1216mem  5pf
inplace_merge.cc    bench 1 / 4 memory    297r  278u  18s   480mem  0pf
inplace_merge.cc    bench 1 /64 memory    373r  354u  18s   480mem  0pf
inplace_merge.cc    bench 0 / 1 memory     12r   11u   0s   480mem  0pf

After the patch to reduce memory allocation:
inplace_merge.cc    bench 1 / 1 memory    245r  227u  18s  1216mem  0pf
inplace_merge.cc    bench 1 / 4 memory    292r  273u  19s   480mem  0pf
inplace_merge.cc    bench 1 /64 memory    373r  356u  17s   480mem  0pf
inplace_merge.cc    bench 0 / 1 memory     11r   11u   0s   480mem  0pf

With the __len1 > __len2 condition change:
inplace_merge.cc    bench 1 / 1 memory    244r  225u  20s  1216mem  0pf
inplace_merge.cc    bench 1 / 4 memory    379r  361u  17s   480mem  0pf
inplace_merge.cc    bench 1 /64 memory    640r  625u  16s   480mem  0pf
inplace_merge.cc    bench 0 / 1 memory     11r   11u   0s   480mem  0pf

When there is no memory limit the results are identical, of course. But as soon as memory is limited, performance starts to degrade with the changed condition on __len1 vs __len2.
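
For reference, the effect can be reproduced outside the testsuite with a plain std::chrono timing of std::inplace_merge on two pre-sorted halves; the following is only a minimal stand-in for the attached bench, with arbitrary sizes and data:

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
  const std::size_t n = 1 << 20;
  std::vector<int> v(n);
  // First half: even values, second half: odd values, both already sorted.
  for (std::size_t i = 0; i < n / 2; ++i)
    v[i] = 2 * int(i);
  for (std::size_t i = n / 2; i < n; ++i)
    v[i] = 2 * int(i - n / 2) + 1;

  const auto start = std::chrono::steady_clock::now();
  std::inplace_merge(v.begin(), v.begin() + n / 2, v.end());
  const auto stop = std::chrono::steady_clock::now();
  std::cout << std::chrono::duration<double, std::milli>(stop - start).count()
            << " ms\n";
}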

Could you share the benchmark you used to demonstrate the improvement? I also wonder why your patch doesn't change the same condition in __merge_without_buffer, for consistency.

For the moment I'd like to propose the attached patch, that is, just the reduction in the amount of allocated memory and the new/modified benches.

Note that as soon as I forbid any memory allocation I also reduce the size of the range to merge, because the implementation relies on recursion and could otherwise end up with a stack overflow. Maybe I need to check for simulator targets like several other tests do? Or do simulators not run the performance tests?

The performance tests are never run by default.

I don't think we should spend too much time caring about performance
of sorting close to Out Of Memory conditions. We don't try to optimize
std::vector or other cases to work when close to OOM.

So I think just reducing the memory usage is the right approach here.

Regarding this stack overflow issue, is there some routine to find out how many more levels of function calls can be added before overflowing the stack? I could perhaps call __builtin_alloca and check the result, but that smells. If I could find this out, we could fall back on an iterative approach to complete the merge.

No, alloca is inherently unsafe. Please don't even consider it :-)
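
For what it's worth, a fully iterative, constant-space merge does exist and could serve as the last-resort fallback mentioned above; it needs O(len1 * len2) element moves in the worst case, so the helper below (hypothetical name, not a proposal for the library) is only an illustration:

#include <algorithm>
#include <iterator>

// Hypothetical fallback, not part of the patch: merge the consecutive sorted
// runs [first, middle) and [middle, last) using rotations only -- no
// recursion and no extra memory, but O(len1 * len2) element moves.
template<typename It>
void
rotate_merge(It first, It middle, It last)
{
  for (; middle != last; ++middle)
    {
      // Where does *middle belong in the already merged prefix?
      // upper_bound keeps equal elements from the left run first (stable).
      It pos = std::upper_bound(first, middle, *middle);
      // Rotate it into place, shifting [pos, middle) one slot to the right.
      std::rotate(pos, middle, std::next(middle));
      // Nothing at or before pos can be an insertion point later on.
      first = std::next(pos);
    }
}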


    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
    computation in the loop.

Please use "Change __len computation in the loop to avoid truncation."

It's a bit more descriptive.

    * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
    pivot positions.
    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
    execution with different memory constraints.

OK to commit?

François

On 6/9/19 4:27 PM, François Dumont wrote:
On 12/21/18 9:57 PM, Jonathan Wakely wrote:
On 29/10/18 07:06 +0100, François Dumont wrote:
Hi

    Any feedback regarding this patch?

Sorry this got missed, please resubmit during stage 1.

You haven't CC'd the original patch author (chang jc) to give them a
chance to comment on your proposed changes to the patch.

The attached PDF on PR libstdc++/83938 has extensive discussion of the
performance issue, but I don't see any for your version. Some
performance benchmarks for your version would be helpful.

Here is this patch again.

This time it is much closer to John's original one; I only kept my change to the size of the temporary buffer, which doesn't need to be as large as what is currently allocated, especially with John's patch.

The performance tests show good improvements; the before/after results are attached. Surprisingly, I do not see any change in allocated memory; I guess the measurements are not accurate enough. However, working on them also showed that the current implementation is fragile: if you reduce the allowed allocated memory too much, you end up with a stack overflow because of the recursive implementation.

I already have a patch to keep trying to allocate memory as long as the requested size stays above a given threshold rather than going all the way down to 0, but I am also working on making the implementation non-recursive. Then we'll see whether even a buffer of size 1 is better than no buffer at all.

    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__merge_adaptive): When buffer too small consider smallest range first
    and cut based on buffer size.
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
    to reduce requested buffer length on allocation failure.
    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
    execution with different level of memory.

François



diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
index ec651e2cc45..b396437443f 100644
--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -2530,7 +2530,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
      const _DistanceType __len2 = std::distance(__middle, __last);

      typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
-      _TmpBuf __buf(__first, __len1 + __len2);
+      _TmpBuf __buf(__first, std::min(__len1, __len2));

Please add a comment before this saying something like:

      // __merge_adaptive will use a buffer for the smaller of
      // [first,middle) and [middle,last).

      if (__buf.begin() == 0)
        std::__merge_without_buffer
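
To spell out why the smaller of the two lengths is enough: the merge can move one run into the scratch buffer and merge it back against the other run in place. A rough standalone sketch for the case where the left run is the one buffered (not the library internals; the real code also covers the symmetric case where the right run is the smaller one):

#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

// Standalone sketch, not the library code: merge [first, middle) and
// [middle, last) with a scratch buffer that only holds the left run.
template<typename It>
void
merge_with_left_buffer(It first, It middle, It last)
{
  typedef typename std::iterator_traits<It>::value_type V;
  // Move the left run out of the way...
  std::vector<V> buf(std::make_move_iterator(first),
                     std::make_move_iterator(middle));
  typename std::vector<V>::iterator b = buf.begin(), e = buf.end();
  It in = middle, out = first;
  // ...then merge it with the right run back into the original range.
  // Taking the buffered (left) element on ties keeps the merge stable.
  while (b != e && in != last)
    *out++ = std::move(*in < *b ? *in++ : *b++);
  std::move(b, e, out);   // whatever is left of the buffered run
  // Anything still in [in, last) is already in its final position.
}
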
@@ -2738,6 +2738,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
          std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp);
          std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp);
        }
+
      std::__merge_adaptive(__first, __middle, __last,
                            _Distance(__middle - __first),
                            _Distance(__last - __middle),
@@ -5023,8 +5024,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
      typedef typename iterator_traits<_RandomAccessIterator>::difference_type
        _DistanceType;

+      if (__first == __last)
+       return;
+
      typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf;
-      _TmpBuf __buf(__first, std::distance(__first, __last));
+      _TmpBuf __buf(__first, (__last - __first + 1) / 2);

Please add a comment before this saying something like:

        // __stable_sort_adaptive sorts the range in two halves,
        // so the buffer only needs to fit half the range at once.

That explains why (last - first + 1)/2 is correct, and if we ever
change __stable_sort_adaptive we'll know to change this too.
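
A trivial standalone check of that sizing arithmetic, just to make the reasoning explicit:

#include <cassert>

int main()
{
  // (n + 1) / 2 covers either half of an n-element range, whether n is even
  // or odd, so the merge of the two sorted halves can always buffer one of
  // them.
  for (long n = 0; n <= 1000; ++n)
    {
      const long buf = (n + 1) / 2;   // what __stable_sort now requests
      assert(buf >= n / 2);           // one half
      assert(buf >= n - n / 2);       // the other half
    }
}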


      if (__buf.begin() == 0)
        std::__inplace_stable_sort(__first, __last, __comp);
diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h
index fa78ee53eb4..b6ad9ee6a46 100644
--- a/libstdc++-v3/include/bits/stl_tempbuf.h
+++ b/libstdc++-v3/include/bits/stl_tempbuf.h
@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
                                                        std::nothrow));
          if (__tmp != 0)
            return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
-         __len /= 2;
+         __len = __len == 1 ? 0 : ((__len + 1) / 2);
        }
      return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
    }
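
Viewed in isolation, the retry loop now behaves essentially like the hypothetical helper below: the candidate length shrinks by halving rounded up (7 -> 4 -> 2 -> 1 instead of 7 -> 3 -> 1 with plain truncation) and gives up only after a length of 1 has been tried:

#include <cstddef>
#include <new>
#include <utility>

// Hypothetical helper, not the library function: request len elements and,
// on failure, retry with the length halved rounding up, giving up only once
// a length of 1 has been tried.
// (Overflow of len * sizeof(T) is ignored in this sketch.)
template<typename T>
std::pair<T*, std::ptrdiff_t>
try_shrinking_alloc(std::ptrdiff_t len)
{
  while (len > 0)
    {
      void* p = ::operator new(len * sizeof(T), std::nothrow);
      if (p != 0)
        return std::pair<T*, std::ptrdiff_t>(static_cast<T*>(p), len);
      // e.g. 7 -> 4 -> 2 -> 1 -> give up, instead of 7 -> 3 -> 1 -> give up.
      len = len == 1 ? 0 : (len + 1) / 2;
    }
  return std::pair<T*, std::ptrdiff_t>(static_cast<T*>(0), 0);
}
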
diff --git a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
index 0b0c9c59640..9c393ce8fe3 100644
--- a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
+++ b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
@@ -28,7 +28,7 @@ using std::inplace_merge;

typedef test_container<int, bidirectional_iterator_wrapper> container;

-void
+void
test1()
{
  int array[] = { 1 };
@@ -39,7 +39,7 @@ test1()
  inplace_merge(con2.begin(), con2.begin(), con2.end());
}

-void
+void
test2()
{
  int array[] = { 0, 2, 4, 1, 3, 5 };
@@ -60,30 +60,34 @@ struct S
  { return a < _s.a; }
};

-void
+void
test3()
{
  S s[8];
-  s[0].a = 0;
-  s[1].a = 1;
-  s[2].a = 2;
-  s[3].a = 3;
-  s[4].a = 0;
-  s[5].a = 1;
-  s[6].a = 2;
-  s[7].a = 3;
-
-  s[0].b = 0;
-  s[1].b = 1;
-  s[2].b = 2;
-  s[3].b = 3;
-  s[4].b = 4;
-  s[5].b = 5;
-  s[6].b = 6;
-  s[7].b = 7;
-
-  inplace_merge(s, s + 4, s + 8);
-  VERIFY( s[0].b == 0 && s[1].b == 4 && s[2].b == 1 && s[3].b == 5 );
+  for (int pivot_idx = 0; pivot_idx < 8; ++pivot_idx)
+    {
+      int bval = 0;
+      for (int i = 0; i != pivot_idx; ++i)
+       {
+         s[i].a = i;
+         s[i].b = bval++;
+       }
+
+      for (int i = pivot_idx; i != 8; ++i)
+       {
+         s[i].a = i - pivot_idx;
+         s[i].b = bval++;
+       }
+
+      inplace_merge(s, s + pivot_idx, s + 8);
+
+      for (int i = 1; i < 8; ++i)
+       {
+         VERIFY( !(s[i] < s[i - 1]) );
+         if (s[i - 1].a == s[i].a)
+           VERIFY( s[i - 1].b < s[i].b );
+       }
+    }
}

I know it's redundant, but I think I'd prefer to keep test3 unchanged
and add test4 for the more general test that uses different pivots.

OK for trunk with those changes.

Thanks for persisting with this.


