QNX Webinars

Extracting Maximum Performance from Multicore Processors - Questions & Answers

Webinar February 15, 2012 -- Register for the On-demand version


Question 1: What is your opinion on increasing core number with regards to Amdahl's Law and communication overhead problems?

Freescale: Freescale's goal is to provide linear improvements as core counts increase. If done correctly, the power, performance and price of the solution will scale in a predictable fashion. Our 45nm P-series QorIQ families have achieved this goal; we expect the AMP series to offer the same benefits.

QNX: In an SMP system, as cores are added, overhead within the operating system (thread scheduling, etc.) increases, reducing the overall performance gain. In an AMP system, communication between the different cores increases, again reducing overall performance. At some point the overhead may outweigh the performance gain from adding cores; however, we have not seen this so far.
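
For reference, Amdahl's Law bounds the speedup on n cores at 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. The sketch below adds a per-core overhead term to show how growing scheduling or communication cost could eventually outweigh the gain. The overhead model, the 95% parallel fraction and the 0.2% per-core cost are illustrative assumptions, not measured QNX or Freescale figures.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal Amdahl's Law speedup for a workload on `cores` cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


def speedup_with_overhead(parallel_fraction, cores, overhead_per_core):
    """Amdahl's Law plus a hypothetical cost that grows with each added core.

    `overhead_per_core` is an assumed stand-in for scheduling or inter-core
    communication cost, expressed as a fraction of the single-core run time.
    """
    serial = 1.0 - parallel_fraction
    overhead = overhead_per_core * (cores - 1)
    return 1.0 / (serial + parallel_fraction / cores + overhead)


# Assumed workload: 95% parallel, 0.2% extra cost per additional core.
for n in (1, 2, 4, 8, 16, 32):
    ideal = amdahl_speedup(0.95, n)
    modeled = speedup_with_overhead(0.95, n, 0.002)
    print(f"{n:2d} cores: ideal {ideal:5.2f}x, with overhead {modeled:5.2f}x")
```

With these assumed numbers the modeled speedup stops improving somewhere past 16 cores and is lower at 32 cores than at 16; real systems may behave quite differently.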

Question 2: Is the kernel aware of the hardware configuration? For example, does it migrate threads to a core that shares caches with the core currently being used? Same question for x86 hyperthreading (giving low priority to the hyperthread core).

No - it is only aware of which core the thread last ran on, and it attempts to schedule the thread there first to maximize the possibility of the cache still being current. Hyperthreading is supported; however, the kernel is not aware that a particular core is hyperthreaded - it is simply another "core" for scheduling purposes.

Question 3: How well is parallelism in software application design catching up on its traditional lag with respect to advances in multi-core processor hardware capability?

While advances are being made in automatic thread parallelization, it still requires much research. Much of the work of parallelization is still being performed manually.

Question 4: What is the worst case power consumption?

Not sure which device you're referring to, but we can certainly provide the power numbers given more specific information up front.

Question 5: Do you have a Rad Hardened QorIQ CPU?

Freescale's processors are available as extended temp (-40 to +125 °C junction temperature, Tj). For rad hardening you would need to use a third party that specializes in military/aerospace testing.

Question 6: What's your plan for multicore support with regard to C++11 futures and the async function?

This is not implemented at this time and is being investigated for future releases.

Question 7: Does your SoC have FPGAs?

Freescale's microprocessors do not have FPGA-style elements. It's quite common to have FPGAs in a system along with our processors, attached via Serial RapidIO or PCI Express interconnects.

Question 8: What's your plan for multi-thread support in the EGL and OpenGL ES stack? We discovered that the majority of EGL/ES implementations do not support any kind of multi-thread usage (from the same process).

EGL/OpenGL ES supports multi-threaded use; however, there is a restriction that a rendering context cannot be used by more than one thread at a time. This means a process can have two threads rendering simultaneously using two different contexts, or it can have two threads that alternately use the same rendering context. Note also that while the CPU might be multi-core, it does not mean that the GPU is. More specifically, the GPU will do work for a single context on all its cores. Issuing OpenGL ES calls in multiple threads running concurrently is not necessary to achieve parallelism on the GPU.

Question 9: Is the p5 a multi core?

Yes, there are multicore P5 platform processors. Please contact your Arrow or Freescale representative for more information.

Question 10: Greetings, how, in a little more detail, does CoreNet replace busses? Or did I get something wrong here? Thanks!

CoreNet does replace the bus used on our P1 & P2 processors. It's a non-blocking point-to-point/point-to-multipoint deterministic fabric. Access to the fabric is through a series of peripheral access management units.

Question 11: What does the watt rating of X watts 'w/o IO' mean?

The figure excludes I/O power. Since all new microprocessors feature SerDes, the I/O power varies with the interfaces selected; we can provide you with tables to determine your total power (core + selected I/O).

Question 12: So, there are two Memory Controllers, each with 64-bit data bus, right?

It depends on which device you're referring to. Many of our processors feature dual 64-bit memory controllers.

Question 13: If I got it correctly, your board supports debugging for multiple threads/cores to some degree. Does this extend to multiple processes also?

You can debug multiple processes at the same time from the development tools.

Question 14: In P4080, can the power states be changed so as to reduce the power consumption?

Yes, they can be changed. There is detailed information in the P4080 reference manual (available under NDA).

Question 15: What considerations should we consider for managing thermal performance on multi-core parts. Is it as "simple" as lowering power consumption, or are there other techniques you would recommend?

Yes, managing the relative performance of the processor will help, but there are system-level elements you ought to take into consideration as well. Carefully select the switch/PHY chip, as these are known to be power hungry, and if possible use DDR3 memory.

Question 16: I understand that this presentation is QorIQ-centric; however, I would (personally) be interested in how some of this 'rolls out' for the e200zen-core family.

The e200 core is used in Freescale's microcontrollers; its performance is well suited to the intended applications. It's worth noting that the Kinetis microcontroller family is based on ARM cores.

Question 17: What about ARM in the embedded market on slide 20?

Freescale: Freescale uses ARM processors in our Kinetis microcontrollers, and our i.MX multimedia processors. ARM is suitable for "consumer" products.

QNX: While ARM is certainly large in the embedded marketplace, this seminar was not intended to cover this processor family. Please contact your Arrow representative to discuss the Freescale ARM processing family. QNX runs on these processors as well.

Question 18: Could you give an example for embedded units executing compute applications?

There are many embedded accelerators that improve compute performance: hardware blocks that handle the data flows, as well as security and deep packet inspection. These accelerators do the heavy lifting required in embedded systems, allowing the high-performance processor cores and caching subsystems to shine. Many applications benefit from this architecture: switches, routers, load balancers, unified threat management, test equipment, military/aerospace, and more.

Question 19: From a software perspective, it's often stated that issues exist when starving a core or soaking a core, and that the best approach is to balance execution across cores to maximize performance. What if power consumption is a primary driver? Are there then techniques we should consider? For instance, is it advisable to delay execution of threads, or to intentionally starve a core in some circumstances to fit within a required power envelope over time?

The answer is yes, however that is a topic that is beyond the scope of this webinar.

Question 20: Is it possible to have an AMP and BMP architecture in the same processor?

It is possible using something like a hypervisor, however this could have significant impact on real-time processes (you are putting another layer of software between the real-time system and the hardware). It would probably be better to simply use two different machines - one AMP, one BMP, and just provide a high speed communications channel between them.

Question 21: The recent release of 'C'-2011 includes constructs like "atomic" and "mutex", won't these be needed to bring 'highly optimized' legacy code into a BMP approach?

This is not implemented at this time and is being investigated for future releases.

Question 22: How does Bound Multiprocessing work with technologies such as Adaptive Partitioning?

This is discussed in the Adaptive Partitioning User's Guide. Pages 28 and 29 cover "Using the thread scheduler and multicore together" and "Scheduler partitions and BMP".

Question 23: In order to improve performance on a system with multiple I/O channels (e.g., multiple PCIEx cards), it has been proposed (within my company) that interrupts could be reassigned to different cores using virtualization technologies like Intel's VTx/VTd or AMD's IOMMU. Then, dedicated threads running on each of those different cores could handle each I/O channel separately. How would this be done in QNX 6.5.0 with an SMP/BMP microkernel? AMP is not a viable solution for this application, nor is a hypervisor running multiple VMs. The solution would need to be a single QNX OS on a multicore pr?

We cannot reassign the physical interrupt; however, you can definitely dedicate the handling of the interrupt to an individual core. There are two ways to handle interrupts. The first is a combination of an interrupt handler (which must run on the core the physical interrupt comes in on) and an interrupt thread, which can run on any processor and can be locked to any processor (or set of processors, if you prefer). Alternatively, you can use InterruptAttachEvent, which involves only a thread that can be locked to a specific core. For more information, download the QNX Realtime Programmers Manual and refer to the interrupts section.

Question 24: What do you think about MCAPI as IPC in an AMP architecture? Is QNX involved in MCAPI?

QNX is not involved with MCAPI.

Question 25: What is the typical file size to be transmitted over the air?

It depends on the number of changes. The average size of a mobile device update package is around a few MB.

Question 26: What are typical application domains for the QorIQ series besides networking applications?

There has been explosive growth in QorIQ adoption across many different verticals. A few that come to mind include factory automation, military/aerospace, medical, automotive, smart energy, printing & imaging, and home automation. Expanding segments such as wireless access points and digital video recording have standardized on QorIQ processors.

Question 27: How can I deal with the case, e.g. the OMAP that has 2 A15 cores, but also a M4 core? Does it look like a thread, or does it need to be handled separately, like a DSP type architecture?

QNX does not run on the M4 core - it would need to be treated separately.

Question 28: I know e200z has a 'cache-locking' API, is this not present on the e500?

Cache can be treated as SRAM, as a shared resource, or allocated per core (and locked) in chunks of 64k.

Question 29: Is the FOTA update relevant for Safety Critical Systems, e.g. Stability Control ECU ? Also, how is the updated version verified, i.e. how is it guaranteed to be properly patched with the existing software version?

Yes, FOTA is suitable for doing updates in safety critical systems. The update can first be verified after using the delta generation tool, and then tested in the lab before it is distributed.

Question 30: What's the concurrent programming model for QNX? Is it just threads, locks and mutexes or is there a higher level model that is going to be put forward?

The model is thread based.

Question 31: Question to Mr. Logan: Is the QorIQ an opportunity for automotive application? E. G. CAN, LIN are required.

Yes sir, we have derivatives with CAN interfaces, and more. Freescale's Qorivva product family was designed from the ground up for automotive applications.

Question 32: How do I perform a kernel trace of specific processes, not the whole system? I tried to use TraceEvent() and to define a static rule using _NTO_TRACE_SETCLASSPID, but it applies the rule only to the last process specified, i.e. a setting made by a previous call is overwritten. Thank you! Background: While debugging complex issues, we sometimes need to use the kernel trace to record the behaviour of several processes under investigation for a long period of time. However, the problem is that the amount of kernel trace data emitted by system-wide kernel tracing is very high.

This is beyond the scope of this webinar. Please contact your QNX Field Applications Engineer or use your QNX technical support plan for assistance.

Question 33: How do you handle cache line contention here? Is there not a magic number that's required on the sizes of the array that are sent to the different cores? I've hit this a few times before where I need to pad between structs in order to ensure that things don't line up behind each other due to cache contention.

This is processor dependent.
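
To illustrate why there is no single magic number: the padding needed to keep per-core data on separate cache lines follows from the cache line size, which differs between processors. This is a minimal sketch, assuming a hypothetical 24-byte record and line sizes of 32 and 64 bytes chosen only as examples; check the reference manual for the actual value on your part.

```python
def padded_size(struct_size, cache_line_size):
    """Round struct_size up to the next multiple of cache_line_size (both in bytes)."""
    if cache_line_size <= 0:
        raise ValueError("cache_line_size must be positive")
    lines_needed = -(-struct_size // cache_line_size)  # ceiling division
    return lines_needed * cache_line_size


# Hypothetical 24-byte per-core record laid out for two assumed line sizes.
for line_size in (32, 64):
    size = padded_size(24, line_size)
    print(f"{line_size}-byte lines: pad 24 bytes to {size} ({size - 24} bytes of padding)")
```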

Question 34: What factors should we consider for managing security within a multi-processing system? For example, are there any QorIQ-specific features in the fabric of the chip, or software features within QNX that allow asymmetric multi-processing in such a way that one core could be a "secure" core providing a Trusted Execution Environment with access to peripherals on the chip, and the second core run "non secure" code, with restricted access to peripherals, and defined policies to move an application from a "non secure" mode to a "secure" mode?

Freescale: Freescale's architecture takes security into account in many ways: a high-performance FIPS 140 certified security engine is embedded in all our processors, along with an extremely efficient pattern matching engine for deep packet inspection, and a Trusted Architecture is featured in our multicore family.

QNX: There are some things that can be done. This would be best discussed with your QNX Sales Representative.

Question 35: Other than static analysis and rigorous testing, are there any debugging tools available from QNX or Freescale that assist in ensuring that a system is thread-safe? For instance, any "lint-like" tools that make recommendations on standard coding practices to reduce the likelihood of coding errors in how threads are synchronized and resources are locked and accessed?

Not that I am aware of.

Question 36: We are using Freescale ARM processors, single and multi-core. Which is better for lock-free data structures: atomic swap or load and store conditionals?

I simply do not know - please pose to your QNX sales team.

Question 37: QNX available for 64 bits T4240 in simulation?

I do not know which processor you are referring to.

Question 38: Any support for Ada '95?

No.

Question 39: Are the development tools free?

No.

Question 40: For the array example that you demonstrated, if we were to work on the 24 core system would the performance increase by approximately 24 times?

No, it would be less than that due to overhead. It would most likely still be greater than 23x, however.
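
For a sense of what a speedup above 23x on 24 cores implies, Amdahl's Law can be inverted to give the parallel fraction required. This is a back-of-the-envelope sketch that assumes Amdahl's Law is the only limit; it does not model the webinar's array example, and the function name is illustrative.

```python
def required_parallel_fraction(target_speedup, cores):
    """Parallel fraction Amdahl's Law needs to reach target_speedup on `cores` cores.

    Derived by solving S = 1 / ((1 - p) + p / n) for p.
    """
    if cores < 2:
        raise ValueError("need at least two cores")
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / cores)


p = required_parallel_fraction(23.0, 24)
print(f"parallel fraction needed: {p:.5f}")
print(f"serial share allowed:     {(1.0 - p) * 100:.3f}% of single-core run time")
```

Under that assumption, the serial portion plus any overhead would need to stay at or below about 0.19% (1/529) of the single-core run time.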

Question 41: Can we get your e-mail contact details for some post-questions?

Freescale: jeff.logan@freescale.com

QNX: jpschaffer@qnx.com.

Question 42: Any major SMP bug fixes from 6.3.2 vs 6.5.0 that we should know about. Currently we are using 6.3.2.

Since I do not know what you would consider major or minor, please refer to the release notes for 6.4.0, 6.4.1, and 6.5 for details on changes made. These are available in the Download section at www.qnx.com.

Question 43: Can you show how to use the application profiler? How do I set it up for multi-core?

The subject is too broad for a quick answer. There is no special setup for multi-core. For details on using the application profiler, see the Application Profiler documentation in the IDE User's Guide.

Question 44: If one runs multicore code on a single core processor, I'm assuming that this is possible and the performance is roughly the same as single-threaded code?

This is just a multi-threaded program, so it will definitely run on a single core. As far as performance, this would depend on the interactions between the threads.
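
As a small illustration of the first point, the sketch below splits a sum across worker threads and produces the same result however many cores are available. It is written in Python for brevity and is not the webinar's example; note that standard CPython serializes these threads through its global interpreter lock, so it demonstrates correctness rather than speedup. The function names are illustrative.

```python
import threading


def sum_slice(data, start, stop, results, index):
    # Each worker writes only to its own slot, so no lock is needed.
    results[index] = sum(data[start:stop])


def threaded_sum(data, num_threads):
    """Sum `data` using num_threads workers, one contiguous slice each."""
    results = [0] * num_threads
    chunk = (len(data) + num_threads - 1) // num_threads  # ceiling division
    threads = []
    for i in range(num_threads):
        t = threading.Thread(
            target=sum_slice,
            args=(data, i * chunk, (i + 1) * chunk, results, i),
        )
        threads.append(t)
        t.start()
    for t in threads:
        t.join()  # wait for every worker before combining partial sums
    return sum(results)


data = list(range(100_000))
print(threaded_sum(data, 4) == sum(data))
```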

Question 45: Any references for POSIX materials available?

Not sure exactly what you mean here, however QNX is a POSIX compliant operating system and POSIX API calls are noted as such in our documentation.

Question 46: How does QNX multicore support compare to what CUDA or C++ AMP support?

I do not know enough about either one to make a valuable comment here.

Question 47: Are Pentium counters synchronized between cores?

Counters are local to an individual core.

Question 48: Does QNX have any kind of watchdog processes to check for threading deadlocks?

We have a watchdog process (High Availability Manager) which can be used to monitor various types of conditions and, when a situation is detected, take an appropriate action as assigned by the developer. If set up properly, this can include deadlock conditions.

Question 49: Could you comment on the 5 & 10 yr roadmaps wrt the number of cores/threads?

Under NDA, gladly. Please contact your Freescale account manager for a complete update.

Question 50: What are the advantages of QNX's SMP against other OS's SMP technologies?

Standard SMP models permit locking a thread to a particular core only from within the code itself, and the lock does not apply to any child of the thread, so code changes are required in each child thread that needs to be locked to the same core. If source is not available for that code (such as a multi-threaded third party library), its threads would not be locked to a core. The QNX BMP model permits the child of a thread to inherit the same runtime mask as the parent, permitting the locking of entire process trees to a particular core without any of the child threads even being aware that they are locked to a particular core.

Question 51: The undefined memory model of C makes multicore programming hard. What Java implementations are available on the presented platforms?

QNX works with third party vendors to provide Java. Please refer to the Partners section of our website (www.qnx.com) for a listing of Java vendors.

Question 52: Can you say that AMP is not a good way to use multi-processing instead of BMP?

BMP can do everything AMP can do except run different operating systems on the different cores - SMP and BMP require only one operating system, which controls all of the cores.

Question 53: Does the QorIQ series support different cores with different bit-ness (like 32 bit, 64 bit) on the same chip?

It can, however there is a performance penalty associated with running in 32-bit mode.

Question 54: Can the cores be clocked independently?

Yes. Please contact your Freescale or Arrow representative for more details, available under NDA.

Question 55: Are there application notes for real time handling of accelerometer and gyroscopic threads in parallel?

If you are referring to the devices in the BlackBerry PlayBook, these should be addressed through the RIM developers site.

Question 56: For Mr. Logan: Is QorIQ supported by codecs applications? Which ones? Also, which compilers and which other OS's? For Mr. Schaffer: Isn't it logical that locking-threads helps cache optimization? Also, wouldn't it be logical to partition application in terms of dedicated tasks to dedicated cores instead of into generic worker-manager thread manner? Does QNX support 64b?

Freescale: Freescale delivers a standard Linux drop with APIs and function calls to take advantage of the processor's unique capabilities. Updates are available at kernel.org. Freescale's networking processor division, which develops and supports QorIQ processors, works with all operating system vendors except Microsoft. As part of our preferred third party program, there are important synergies between Freescale and QNX that should be considered.

QNX: Locking threads does help with cache optimization; however, that does not mean the application will run best in that configuration. If you remember, in the example performance was actually slightly worse when each thread was locked to a particular core, due to the OS being unable to schedule the threads most efficiently based on current conditions on the cores. In terms of partitioning cores to dedicated tasks, it can be done that way; however, it may not be the best way. Take a network disk drive as an example - logically, it makes sense to break this into one core for disk handling and one for the network interface. However, if there is a flood of network traffic, the networking core may get overwhelmed while there are plenty of cycles available on the other core. It all depends on the system architecture to determine the best approach.
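
The network disk example can be put in numbers with a toy model. The sketch below assumes two cores that each complete 100 units of work per interval, and that work for either task can be split across cores when threads are not locked; the loads and function names are made up for illustration.

```python
def completed_dedicated(loads, core_capacity):
    """Work finished per interval when each task is locked to its own core."""
    return sum(min(load, core_capacity) for load in loads)


def completed_shared(loads, core_capacity):
    """Work finished per interval when any core may run any task.

    Assumes the work is perfectly divisible across cores, which is an
    idealization of what a scheduler can achieve.
    """
    return min(sum(loads), core_capacity * len(loads))


# Assumed loads per interval: light disk activity, a flood of network traffic.
loads = [30, 150]  # [disk handling, network interface]
capacity = 100     # units of work one core completes per interval

print("dedicated cores:", completed_dedicated(loads, capacity), "of", sum(loads))
print("shared cores:   ", completed_shared(loads, capacity), "of", sum(loads))
```

In this model the dedicated layout leaves the disk core mostly idle while network work backs up, whereas the shared layout absorbs the burst. Whether that outweighs the cache benefit of locking depends on the system.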

Question 57: Do you have a basic dev board available with QNX?

We have many different platforms available. Arrow keeps an inventory in their Test Drive program to allow customers to "kick the tires". You may want to check out the P2020 & QNX bundle.

Question 58: Is the inter-core IPC of QNX OS encapsulated in C-style function calls, or is it assembly level primitives?

The API for QNX IPC is implemented as C function calls.

Question 59: Does the QNX OS for Freescale multi-core comply with DO-178B (aerospace certification standard)?

I am not familiar enough with the details of the DO-178B spec to properly answer that. I can say that no operating system is DO-178B certified - it is simply a component of the final product being certified. QNX has been part of DO-178B certified devices, however I do not know if multi-core was used in those products.

Question 60: How easy will it be to migrate an 8641D based application (using altivec) to an AMP series processor?

The SIMD accelerators in both the e600 based 8641D and the next gen T series will look and feel identical to the user.

Question 61: My questions are BlackBerry PlayBook-related. 1) Are those Momentics System Profiler tools available in the standard BBNDK and do they work with stock PlayBooks? 2) Is anything else needed, besides creating an extra worker thread, to run on 2 cores simultaneously? My app written using boost thread gets about a 75% speed-up in win32 when the 2nd thread is enabled, and on PlayBook it actually slows down by a few % - looks like it only runs on one core.

1) The application profiler is part of the BB Tablet NDK 2.0, however the system profiler is not.
2) To run on multiple cores, all that is required is multiple threads (same process or not). I do not use the boost library, so I cannot comment here. One thing to watch for, however, is that the parent is NOT locked to a particular core. If it is, then all children will be locked to that core. This is a case where the system profiler is invaluable - contact your QNX support person and they can help here.

Question 62: What is the typical Ethernet PHY to user space application latency on QNX on QorIQ for Ethernet packets?

Not in the scope of this seminar. Please pose to your QNX Sales team.

Question 63: Is it possible to implement hierarchical scheduling in BMP?

QNX does not natively support hierarchical scheduling, however I suppose you could build one with a lot of manual manipulation of the threads. I believe this would be a lot of work and I am not sure how much would be gained over some of the existing scheduling mechanisms.

Question 64: Are there any power profiling tools similar to system profiler and application profiler?

Freescale provides power numbers in a simple to understand format; these tables take into account speed grade, temperature points and I/O.

Question 65: Is the OS mapped to a single core?

The kernel itself runs on a single core in a multi-core environment; however, resource manager threads (which include protocol stacks, filesystems, drivers, etc.) can run on any core or be locked to a single one.

Question 66: Is Qemu able to run QNX as a platform to develop on?

There is a QNX community page addressing this.

Question 67: Are there associated license and royalty fees for the Momentics tool?

Business issues should be posed to your QNX Sales team.

Question 68: When are there going to be tools like strace, or memory analysis like valgrind?

I will pass on the suggestion.

Question 69: More specifically: can they talk about what a 'context switch' cost in CPU cycles?

Not in the scope of this seminar. Please pose to your QNX Sales team.

Question 70: Is Ada supported by all toolsets?

No.

Question 71: Is it possible for a corporate developer/hobbyist developer to get access to OS source to get a deep understanding of OS mechanisms?

Access to kernel source is restricted.

Question 72: Does QNX support the AltiVec too ?

Yes.

Question 73: Do you expect to support OpenCL on QorIQ ?

Not in the scope of this seminar. Please pose to your QNX Sales team.

Question 74: Are there SIL3 safety level processor versions available?

Yes.

Question 75: Is IOMMU technology like Intel VT-d supported in the chips?

I can't comment on Intel, however Freescale's microprocessors all feature robust memory management units.