Hello Joel,

On Friday 22 August 2014 17:25:24 Joel Sherrill wrote:
> Pushed.
>
> Followups can just be subsequent patches.
thanks, you are faster than light ...

As for the RTEMS timekeeping code, I can imagine how it could look better. I do not like Clock_driver_nanoseconds_since_last_tick. I am not even sure whether it is really used by TOD (in the ticker test it seems to print rounded values on our board).

In fact, I would like to see RTEMS work completely tickless on hardware with a modern free-running timebase and easily updated compare-event hardware. That would allow all POSIX time-related functions to be implemented with a resolution limited only by the hardware. The scheduler is an open question. When more than one task of the same priority is ready to run, a tick is the easiest solution, but even in that case the slice time can be computed and only a single timer event armed for its expiration. All of that is a huge amount of work, though, so I would start with the easier side now.

It is necessary to have a reliable timebase. Consider a 64-bit value running at some clock-source speed. It is really hard to get that reliable on PC hardware: the common 8254 can be used for it, but access is horribly slow, and all the other mechanisms (HPET, TSC) are problematic - they need probing and checking that they are correct and synchronous between cores, that they do not change across sleep modes, etc. It is a really difficult task, solved by thousands of lines of code in the Linux kernel. ARM and PowerPC based systems, on the other hand, usually provide a reasonable timer source register which is synchronized across all cores. Unfortunately, the ARM ones usually provide only a 32-bit wide register.

I have solved the problem of extending such a 32-bit counter to 64 bits for a friend of mine who worked at BlackBerry. Their phone platform uses Cortex-A and QNX. The design constraints were given by the use case - userspace event timestamping in the QML profiler. This means the code can be called on multiple cores concurrently, using a mutex would degrade performance horribly, privileged instructions cannot be used, and the value available from the core was only 32 bits wide.
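(As an aside on the tickless idea above: with a free-running timebase and a writable compare register, the scheduler would arm one event at the computed slice end instead of taking a periodic tick. A minimal sketch, where timebase_read() and compare_write() are hypothetical stand-ins for whatever timer registers a real BSP has:)

```c
#include <stdint.h>

/* Hypothetical stand-ins for the hardware registers; a real BSP would
   read the free-running timebase and write the compare/match register. */
static uint64_t fake_timebase;
static uint64_t fake_compare;

static uint64_t timebase_read(void)           { return fake_timebase; }
static void     compare_write(uint64_t value) { fake_compare = value; }

/* Tickless slice handling: compute when the current time slice expires
   and arm a single compare event for exactly that moment, instead of
   fielding a periodic tick interrupt. Returns the armed deadline in
   timebase cycles. */
static uint64_t arm_slice_end(uint64_t slice_ns, uint64_t ns_per_cycle)
{
    uint64_t deadline = timebase_read() + slice_ns / ns_per_cycle;
    compare_write(deadline);
    return deadline;
}
```

Nothing fires between deadlines, so the resolution of sleeps and timeouts is limited only by the timer hardware, which is the point of going tickless.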
I designed the attached code fragments for him, and he wrote some Qt-derived code which was used in Q10 phone debugging builds. The main idea is to extend the counter to more than 60 bits without locking, using GCC's builtin atomic support to ensure that a counter overflow results in only a single increment of the high part of the value. The only requirement for correct operation is that clockCyclesExt() is called at least once per half of the counter overflow period and that its execution is not interrupted for longer than the equivalent time. The code even minimizes cache write contention. What do you think about using this approach in RTEMS?

The next step is to base timing on values which are not derived from ticks. I have seen the discussion about the NTP time format (integer seconds + 1/2^32 fractions). The other option is 64-bit nanoseconds, which is better with regard to the 2038 overflow problem. The priority queue for fine-grained timer ordering is a tough task. It would be worth having all operations take an additional parameter specifying the required precision for each interval/time event, etc. But that is a longer discussion and an incremental solution, and I cannot devote my full time to such enhancements anyway. It could be a nice project if funding is found.

I have a friend who has grants from ESA to develop theory for precise time-source fusion (atomic clocks etc.) and who works on real hardware for satellite-based clock synchronization too. We spoke about Linux kernel NTP time synchronization and the PLL loop a long time ago, and we both came to the same conclusion about how it should be done the right way. It would be interesting to have this solution in RTEMS as well, but to do it right would require some agency/company funded project. We even have networking cards with full IEEE-1588 HW support for Intel there, and some articles about our findings on the time-synchronization problem, where the most problematic part is the latencies between the Ethernet card hardware and the CPU core.
They are even more problematic than precise time over a local Ethernet LAN ... So I think there are enough competent people to come up with something useful, but most of them cannot afford to work on it only for their pleasure.

OK, that was some dump of my ideas. I need to switch to other HW testing now to keep our company and university project above sea level.

Best wishes,

Pavel
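(On the two time formats mentioned above - NTP-style integer seconds + 1/2^32 fractions versus plain 64-bit nanoseconds - a minimal conversion sketch. The helper names are chosen for illustration only; the discussion does not fix any concrete API:)

```c
#include <stdint.h>

/* NTP-style timestamp: integer seconds in the high 32 bits,
   1/2^32 fractions of a second in the low 32 bits. */
typedef uint64_t ntp_time_t;

#define NS_PER_SEC 1000000000ULL

/* Convert 64-bit nanoseconds to the NTP fixed-point format. */
static inline ntp_time_t ns_to_ntp(uint64_t ns)
{
    uint64_t sec  = ns / NS_PER_SEC;
    uint64_t frac = ((ns % NS_PER_SEC) << 32) / NS_PER_SEC;
    return (sec << 32) | frac;
}

/* Convert back; the floor in each direction means a round trip
   loses at most one nanosecond. */
static inline uint64_t ntp_to_ns(ntp_time_t t)
{
    uint64_t sec  = t >> 32;
    uint64_t frac = t & 0xffffffffULL;
    return sec * NS_PER_SEC + ((frac * NS_PER_SEC) >> 32);
}
```

Note the trade-off: the NTP form gives sub-nanosecond fraction resolution but only 32 bits of seconds, while the 64-bit nanosecond form runs for roughly 584 years before wrapping, which sidesteps the 2038-style overflow.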
/* gcc -Wall atomic-extend-cc.c */
/***************************************************************************
 * Copyright (c) 2014, Pavel Pisa <p...@cmp.felk.cvut.cz>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer
 *    - or license is changed to one of following standard licenses
 *      BSD, GPL (even with linking exception), LGPL, MPL
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the
 *    distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 ***************************************************************************/

#include <stdint.h>

/* for GCC 4.8 when C11 is implemented */
/* #include <stdatomic.h> */

uint32_t hw_cnt;
uint64_t extended_cnt;

/* Simulated hardware counter readings used to exercise the wrap logic. */
uint32_t hw_cnt_test_seq[] = {
  0x10000000,
  0x20000000,
  0x20000000,
  0x20000001,
  0x20000002,
  0xC0000000,
  0xF0000000,
};

uint64_t ClockCycles(void)
{
  static int seq_pos = 0;

  if (seq_pos >= sizeof(hw_cnt_test_seq) / sizeof(*hw_cnt_test_seq))
    seq_pos = 0;
  hw_cnt = hw_cnt_test_seq[seq_pos++];
  return (uint64_t)hw_cnt << 32;
}

#if 0
/* Naive mutex-protected variant, kept for comparison. */
uint64_t clockCyclesExt(void)
{
  static uint32_t wrap_count = 0;
  static uint32_t recent_cc = 0;

  mutex.lock();
  uint32_t cc = (uint32_t)(ClockCycles() >> 32);
  if (cc < recent_cc)
    wrap_count++;
  recent_cc = cc;
  uint64_t ret = wrap_count;
  mutex.unlock();
  ret <<= 32;
  ret += cc;
  return ret;
}
#else
/* Has to be one or more bits; additional bits result in a relaxed
   requirement on the relative timing of calls, i.e. on the minimal
   required frequency of the calls. The minimal frequency for 1 bit is
   given by CC/2; for more bits the allowed jitter period is almost
   equal to CC. */
#define CC_HI_MASK 0xC0000000

uint64_t clockCyclesExt(void)
{
  static uint32_t cc_wrap_and_hi = 0;
  uint32_t wahi, expected;
  uint32_t cc;

#if (__GNUC__ * 1000 + __GNUC_MINOR__) >= 4007
  wahi = __atomic_load_n(&cc_wrap_and_hi, __ATOMIC_SEQ_CST);
#else /* OLD GCC */
  wahi = *(volatile uint32_t *)&cc_wrap_and_hi;
  __sync_synchronize();
#endif /* OLD GCC */

  cc = (uint32_t)(ClockCycles() >> 32);

  if ((cc ^ wahi) & CC_HI_MASK) {
    /* Slow path: the high bits of the counter changed since the last
       stored snapshot, so update the wrap count and high-bit cache. */
    expected = wahi;
    if (cc < wahi)
      wahi++;
    wahi &= ~CC_HI_MASK;
    wahi |= cc & CC_HI_MASK;
#if (__GNUC__ * 1000 + __GNUC_MINOR__) >= 4007
    /* __atomic_compare_exchange_n(type *ptr, type *expected, type desired,
         bool weak, int success_memmodel, int failure_memmodel) */
    __atomic_compare_exchange_n(&cc_wrap_and_hi, &expected, wahi,
                                0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#else /* OLD GCC */
    /* bool __sync_bool_compare_and_swap(type *ptr, type oldval,
         type newval, ...) */
    __sync_bool_compare_and_swap(&cc_wrap_and_hi, expected, wahi);
#endif /* OLD GCC */
  }

  uint64_t ret = wahi & ~CC_HI_MASK;
  ret <<= 32;
  ret |= cc;
  return ret;
}
#endif

#include <stdio.h>
#include <inttypes.h>

int main(int argc, char *argv[])
{
  int i;

  for (i = 0; i < 100; i++) {
    extended_cnt = clockCyclesExt();
    printf("%016"PRIx64"\n", extended_cnt);
  }
  return 0;
}
/* g++ -Wall -std=c++11 atomic-extend-cc.cpp */
/***************************************************************************
 * Copyright (c) 2014, Pavel Pisa <p...@cmp.felk.cvut.cz>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer
 *    - or license is changed to one of following standard licenses
 *      BSD, GPL (even with linking exception), LGPL, MPL
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the
 *    distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 ***************************************************************************/

#include <atomic>
#include <cstdint>

uint32_t hw_cnt;
uint64_t extended_cnt;

uint64_t ClockCycles(void)
{
  return (uint64_t)hw_cnt << 32;
}

#if 0
/* Naive mutex-protected variant, kept for comparison. */
uint64_t clockCyclesExt(void)
{
  static uint32_t wrap_count = 0;
  static uint32_t recent_cc = 0;

  mutex.lock();
  uint32_t cc = (uint32_t)(ClockCycles() >> 32);
  if (cc < recent_cc)
    wrap_count++;
  recent_cc = cc;
  uint64_t ret = wrap_count;
  mutex.unlock();
  ret <<= 32;
  ret += cc;
  return ret;
}
#else
/* Has to be one or more bits; additional bits result in a relaxed
   requirement on the relative timing of calls, i.e. on the minimal
   required frequency of the calls. The minimal frequency for 1 bit is
   given by CC/2; for more bits the allowed jitter period is almost
   equal to CC. */
#define CC_HI_MASK 0xC0000000

uint64_t clockCyclesExt(void)
{
  static std::atomic<uint32_t> cc_wrap_and_hi = ATOMIC_VAR_INIT(0);
  uint32_t wahi, expected;

  wahi = cc_wrap_and_hi.load(std::memory_order_seq_cst);

  uint32_t cc = (uint32_t)(ClockCycles() >> 32);

  if ((cc ^ wahi) & CC_HI_MASK) {
    /* Slow path: the high bits of the counter changed since the last
       stored snapshot, so update the wrap count and high-bit cache. */
    expected = wahi;
    if (cc < wahi)
      wahi++;
    wahi &= ~CC_HI_MASK;
    wahi |= cc & CC_HI_MASK;
    /* Maps to __atomic_compare_exchange_n(type *ptr, type *expected,
         type desired, bool weak, int success_memmodel,
         int failure_memmodel) */
    cc_wrap_and_hi.compare_exchange_strong(expected, wahi,
                                           std::memory_order_acq_rel);
  }

  uint64_t ret = wahi & ~CC_HI_MASK;
  ret <<= 32;
  ret |= cc;
  return ret;
}
#endif

int main(int argc, char *argv[])
{
  extended_cnt = clockCyclesExt();
  return 0;
}
_______________________________________________
devel mailing list
devel@rtems.org
http://lists.rtems.org/mailman/listinfo/devel