

3A GPSE FISE - SATE

# COMPUTING ARCHITECTURES

# LECTURES








# CONTACTS



Teaching team

- Hugo Descoubes (lectures): hugo.descoubes@ensicaen.fr, +33 (0)2 31 45 27 61
- Isabelle Lartigau: isabelle.lartigau@ensicaen.fr
- Emmanuel Cagniot: emmanuel.cagniot@ensicaen.fr

ENSICAEN, 6 boulevard Maréchal Juin, CS 45053, 14050 CAEN cedex 04

# RESOURCES



The digital resources are available on the ENSICAEN learning platform. Download the complete working archive **opt.zip**:

https://foad.ensicaen.fr/course/view.php?id=117




# ASSESSMENT



• Practical exam on a computer (1h30)

The skill is assessed on a personal or school machine and covers the following points:

- Creating a project in the CCS IDE, modelled on the project found in cm/eval/examen\_nom
- Optimising an elementary algorithmic function (see the lab template)
  - canonical C version
  - canonical C6000 ASM version
  - VLIW ASM version
  - optimised version of the algorithm using one of the following advanced techniques:
    - Vectorisation in C using intrinsics
    - Software pipelining in C6000 ASM
    - Radix-2 or radix-4 vectorisation in C6000 ASM






# Chapter 1 Diversity of Processor Architectures






### ON THE DIVERSITY OF PROCESSOR ARCHITECTURES Digital electronics history


### Quick reminder

1947: Invention of the transistor → by Bardeen, Shockley and Brattain (Bell Labs), Nobel Prize winners

1958/1959: Creation of Integrated Circuits by Texas Instruments (hybrid IC), then Fairchild (true monolithic IC)

1960: Invention of the MOS Field-Effect Transistor → by Mohammed Atalla and Dawon Kahng





### ON THE DIVERSITY OF PROCESSOR ARCHITECTURES

First processor



The first commercially available microprocessor is the Intel 4004, released in 1971.

It has 2,300 transistors with a 10 µm etching process (4-bit processor, 16 pins, 740 kHz, 90 kIPS or kilo-Instructions Per Second).



# ON THE DIVERSITY OF PROCESSOR ARCHITECTURES



Processors evolution

Ever since, processors have evolved following natural selection.

Those that matched specific needs improved while others disappeared from markets and research labs.



# ON THE DIVERSITY OF PROCESSOR ARCHITECTURES Processors evolution



As with animals and plants, the evolution of processors never ends. New processor architectures are likely to emerge in the next few years!



Let's take a look at the current processor architectures.

# ON THE DIVERSITY OF PROCESSOR ARCHITECTURES Common processor architectures



| MCU | AP | GPP | SoC / SoB | FPGA | DSP | (GP) GPU |
|-----|----|-----|-----------|------|-----|----------|






### ON THE DIVERSITY OF PROCESSOR ARCHITECTURES

Common processor architectures



| General architectures (control processors) |                       |                           | Hybrid architectures                                                | Specialised architectures (coprocessors or computing processors) |                          |                                              |
|--------------------------------------------|-----------------------|---------------------------|---------------------------------------------------------------------|-------------------------------------------------------------------|--------------------------|----------------------------------------------|
| MCU                                        | AP                    | GPP                       | SoC / SoB                                                           | FPGA                                                              | DSP                      | (GP) GPU                                     |
| Microcontroller Unit                       | Application Processor | General Purpose Processor | System on Chip / Board: FPGA-AP, FPGA-MCU, GPP-GPU, AP, MCU-analog | Field Programmable Gate Array                                     | Digital Signal Processor | Graphics Processing Unit, General Purpose GPU |

Applications Architectures Designers and products Market shares





### MCU – MICROCONTROLLER UNIT Applications

MCUs (Microcontroller Units, fr: *micro-contrôleurs*) are the most common processors in our environment in terms of quantity.

We each use about 200 processors every day, without even being aware of it!





Applications



MCUs are control processors dedicated to the supervision of electronic processes. They drive their input/output interfaces through embedded firmware written specifically for the application.

They target markets that require low cost, low power consumption, small size and high production volumes.
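To make the idea of application-specific firmware driving I/O interfaces more concrete, here is a minimal sketch of the classic "super-loop" structure. It is an illustration under stated assumptions, not code for any particular MCU: the registers `PORT_IN` and `PORT_OUT`, the bit masks and the function names are all hypothetical, and the registers are simulated with plain variables so that the example can run on a PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory-mapped registers, simulated here as plain variables.
 * On a real MCU they would be volatile accesses to fixed addresses. */
static volatile uint8_t PORT_IN;   /* bit 0: push-button state */
static volatile uint8_t PORT_OUT;  /* bit 0: LED drive */

#define BUTTON_MASK 0x01u
#define LED_MASK    0x01u

/* One iteration of the super-loop: read inputs, decide, write outputs. */
static void firmware_step(void)
{
    bool pressed = (PORT_IN & BUTTON_MASK) != 0u;

    if (pressed)
        PORT_OUT |= LED_MASK;            /* switch the LED on */
    else
        PORT_OUT &= (uint8_t)~LED_MASK;  /* switch the LED off */
}

/* Firmware entry point: initialise once, then loop.
 * A real target would loop forever; the bound keeps this sketch testable. */
static void firmware_run(unsigned iterations)
{
    assert(iterations > 0u);
    PORT_OUT = 0u;
    for (unsigned i = 0; i < iterations; ++i)
        firmware_step();
}
```

Setting `PORT_IN = 0x01` and calling `firmware_run(1)` leaves bit 0 of `PORT_OUT` set, which is the whole control law of this toy application.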



# MCU – MICROCONTROLLER UNIT Applications



The IoT (Internet of Things, fr: *objets connectés*) is the major market for MCUs. The IoT is the extension of the Internet to physical objects and places. It is considered the third evolution of the Internet and has therefore been named "Web 3.0".

With 3.6 billion active connections in 2015, 11.7 billion in 2020 and 30 billion forecast for 2025, the IoT accounted for 18% of the MCU population in 2019 and is expected to reach about 29% in 2025.



# MCU – MICROCONTROLLER UNIT Architecture



MCUs are complete digital systems integrated on a single integrated circuit. They are designed to be stand-alone (no need for external RAM, HDD, ...).



# MCU – MICROCONTROLLER UNIT Board and schematic

Example of a schematic that uses a Microchip PIC18 MCU.

Olimex PIC-USB-4550 board.





# MCU – MICROCONTROLLER UNIT Board and schematic



Exercise: match these board devices with the schematic on the previous slide.







There is a large number of MCU products from various designers and foundries, each made for different uses.

MCUs from the same family share the same CPU and associated buses. The ISA (Instruction Set Architecture, fr: *jeu d'instructions*) and the toolchain are therefore similar. Same-family MCUs differ in their peripheral set and memory resources.



# MCU – MICROCONTROLLER UNIT Arduino project



The Arduino project is certainly the most famous MCU-based electronics project. However, it is too user-friendly (too much "magic", too many hidden mechanisms) and is not used in professional environments, which is why it is not studied in engineering schools.



# MCU – MICROCONTROLLER UNIT ARM's Cortex CPU


Even though the MCU market is very competitive, the vast majority of MCU manufacturers (e.g. STMicroelectronics, Renesas, Texas Instruments, NXP, ...) use similar CPU architectures: the Cortex-M family, designed by the British company ARM.

This guarantees access to reliable development tools, libraries and software services. Some of them are open source (IP, graphics, USB and Bluetooth stacks, RTOS, ...).



# MCU – MICROCONTROLLER UNIT ARM's Cortex CPU




ARM offers the Cortex-M series, with 'M' standing for "MCU".

This includes a whole family of MCU cores that are suitable for a wide range of applications.



### MCU – MICROCONTROLLER UNIT STMicroelectronics



As an example, let's take a look at the STM32 range. These are 32-bit MCUs based on a Cortex-M core.

They are designed by the French-Italian company STMicroelectronics, which is also the leading European manufacturer.






The STMicroelectronics Nucleo project offers low-cost (≈ €10) evaluation boards that use ARM-based MCUs and industrial development tools.



# MCU – MICROCONTROLLER UNIT

Market shares

Let's take a look at an annual market study.



Presented By: **EE Times** embedded

© 2019 AspenCore All Rights Reserved



#### Market shares




#### Which of the following 32-bit chip families would you consider for your next embedded project?



#### MCU – MICROCONTROLLER UNIT

Market shares




### Which of the following 8-bit chip families would you consider for your next embedded project?

[Figure: bar chart of the answers per 8-bit chip family, comparing 2019 (N = 351) with 2017 (N = 462). Families shown: Atmel/Microchip AVR, Microchip PIC, STMicroelectronics ST8, TI TMS370/7000, Freescale/NXP HC, Intel 80xx/'251, Atmel/Microchip 80xx, Renesas H8, Xilinx PicoBlaze (soft core), SiLabs 80xx, NXP/Philips P80x/P87x/P89x, Cypress PSoC 1 (M8C) / PSoC 3 (8051), Zilog Z8/Z80/Z180/eZ80, Parallax, Maxim 80xx, Infineon XC800/C500, EFM8, Digi / Rabbit 2000/3000, Toshiba.]

| By region           | World | Americas | EMEA | APAC |
|---------------------|-------|----------|------|------|
| Atmel/Microchip AVR | 44%   | 44%      | 52%  | 39%  |
| Microchip PIC       | 38%   | 41%      | 43%  | 23%  |
| STMicro ST8         | 25%   | 22%      | 31%  | 28%  |

Source: EE Times / embedded, 2019 Embedded Markets Study. © 2019 Copyright by AspenCore. All rights reserved.

Applications Architecture Motherboards Superscalar processor





GPP – GENERAL PURPOSE PROCESSOR Applications



**GPPs (General Purpose Processors)** have a complex CPU architecture that gives them great adaptability, especially when executing non-optimised programs.

Most of the time, those programs contain sequential code with a lot of tests and function calls, which are difficult to accelerate.

```c
    prev = NULL;
    for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
        struct file *file;

        if (mpnt->vm_flags & VM_DONTCOPY) {
            vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
            continue;
        }
        charge = 0;
        if (mpnt->vm_flags & VM_ACCOUNT) {
            unsigned long len = vma_pages(mpnt);

            if (security_vm_enough_memory_mm(oldmm, len)) /* sic */
                goto fail_nomem;
            charge = len;
        }
        tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
        if (!tmp)
            goto fail_nomem;
        *tmp = *mpnt;
        INIT_LIST_HEAD(&tmp->anon_vma_chain);
        retval = vma_dup_policy(mpnt, tmp);
        if (retval)
            goto fail_nomem_policy;
```

Excerpt from root/kernel/fork.c - www.kernel.org

Applications



Their target market is personal and professional desktop and laptop computers.

Their main use is therefore general (i.e. non-specific) applications, for personal and professional purposes. Most of the time, these do not require all the computing power that is actually available.



# GPP – GENERAL PURPOSE PROCESSOR Applications



Of course, some applications do need the full capability of the hardware, even though they are not the most common ones.

One can think of audio, image and video processing or software development as well-known examples.


Audio editing (Ableton)









#### Applications



Industrial applications are a historical part of GPP uses.

GPPs are typically found there on control tasks or specialised computing functions. This market now tends to use integrated solutions instead, such as AP (Application Processor), SoC (System on Chip), DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array) ...



Radar GM400 (Thalès)



Rafale (Dassault)



Automatic bollard Box j200

GPP – GENERAL PURPOSE PROCESSOR Applications

Please note that GPPs can also be used in embedded systems applications. For instance this is the NUC Core i5, an Intel motherboard.





### GPP – GENERAL PURPOSE PROCESSOR Intel architectures



Let's have a look at the major Intel architectures. Note that Intel is the historical and current leader of the GPP market, and also the leader of the semiconductor market.



# GPP – GENERAL PURPOSE PROCESSOR Intel architectures



Today's leading GPP architectures are the Intel Core i3/i5/i7/i9 families.

However there are many other actors and manufacturers aiming for different markets.



#### Architecture



A GPP consists of a bare processing element, with no main memory.

A GPP has one or several CPUs (of the same architecture), each associated with its cache memories. They use a UMA (Uniform Memory Access) memory organisation and an interface controller.



### GPP – GENERAL PURPOSE PROCESSOR Example: Intel Core i5




### Example of the Intel Core i5 family.








# GPP – GENERAL PURPOSE PROCESSOR Example: Intel Core i5




### GPP integrated into a motherboard






# GPP – GENERAL PURPOSE PROCESSOR Motherboard



A GPP must be mounted on a motherboard, which also carries the main memory (RAM) and the external interface peripherals.

Example of a motherboard from ASUS, the world's second-largest vendor in 2016.



**GPP – GENERAL PURPOSE PROCESSOR** Superscalar architecture



GPPs have so-called superscalar CPUs. Processors with this type of CPU pipeline are generally characterised by the following hardware acceleration mechanisms:

- Out-of-order execution stage: instructions are not necessarily executed in program order. A hardware scheduler tracks data dependencies, stores intermediate results in other (renamed) registers and issues instructions in a different order from the one written in the program.
- Branch-prediction stage: uses statistics and counters to predict the outcome of a test statement (if, else, for, while, ...).
- RISC-like execution stage: instructions are executed as RISC-like operations, even if the ISA (Instruction Set Architecture) is CISC.
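The role of branch prediction can be illustrated with a small sketch. The two functions below compute the same result; the first contains a data-dependent `if` whose outcome the predictor has to guess, the second expresses the same test as arithmetic so that no conditional branch is needed in the loop body. The function names are hypothetical, and an optimising compiler is free to turn either version into the other, so this only shows the idea at source level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchy version: the `if` is a data-dependent conditional branch.
 * With random data, the predictor guesses wrong about half of the time. */
static size_t count_above_branchy(const uint8_t *data, size_t n, uint8_t threshold)
{
    assert(data != NULL || n == 0u);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] >= threshold)
            ++count;
    }
    return count;
}

/* Branchless version: the comparison yields 0 or 1, which is simply added. */
static size_t count_above_branchless(const uint8_t *data, size_t n, uint8_t threshold)
{
    assert(data != NULL || n == 0u);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (size_t)(data[i] >= threshold);
    return count;
}
```

On sorted input the branchy version is easy to predict; on random input it typically is not, which is one reason why code with many tests is hard to accelerate.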


# **GPP – GENERAL PURPOSE PROCESSOR** Superscalar architecture

Die of a Core i7 CPU (Intel Sandy Bridge generation).



Sandy Bridge CPU/Core

### **GPP – GENERAL PURPOSE PROCESSOR** Superscalar architecture

However, the great adaptability and hardware complexity of GPPs lead to a lack of determinism and performance when it comes to executing specific algorithms.

For GPPs, the computing power is simply poor relative to the power consumption and the price.

GPPs are designed to support a high-end OS (Operating System, fr: *système d'exploitation*) and to execute application code. As already mentioned, they are not specialised for signal, image, audio or video processing, for instance.







Market shares: Intel vs. AMD





# AP APPLICATION PROCESSOR

Applications Architecture Qualcomm ARM





# AP – APPLICATION PROCESSOR Applications



The AP (Application Processor) market is recent and started with mobile phones and tablets.

APs embed many functionalities and hardware services, and even SoC (System on Chip).



### AP – APPLICATION PROCESSOR Applications



Mobile phones are the main target market for APs.

This market has led to an overwhelming use of the Android operating system in 2016 (Android is a Linux-kernel based OS).



# AP – APPLICATION PROCESSOR Applications



However application processors are seen in many other embedded systems as well, whatever the final application: consumer, defence, transport, ...

In those cases they are usually embedded with an operating system and a graphical interface.







Sony X94C 4K television



Cook tablet (EOLANE, made in Caen)

# AP – APPLICATION PROCESSOR Applications

In most cases, APs are used with high-level operating systems.

On those markets, GNU/Linux systems and their custom variants reign supreme.

Example of EOLANE (French, #2 in Europe): industrial platform built around a Freescale i.MX6 SoC/AP and running a GNU/Linux system.

[Figure: EOLANE product range. SOM SOLO and SOM QUAD: a high-performance multimedia module for your products. Starter Kit: an evaluation platform for your prototypes. SBC: integrated industrial board.]

# AP – APPLICATION PROCESSOR Applications

Here are the two major user-oriented AP-based board solutions: the Raspberry Pi (Broadcom BCMxxxx SoC) and BeagleBone (TI AM335x SoC) projects.

These solutions are also based on GNU/Linux operating systems.

They are mostly used for prototyping or in a teaching environment, and cannot be industrialised as such. However, hardened versions exist.

### AP – APPLICATION PROCESSOR Architecture

An application processor has one or several general-purpose superscalar CPUs. Their role is to run high-level operating systems (virtualised or native) and application code.

An AP may also include many specialised computing functions (GPU, DSP, cryptography, ...), a rich peripheral set and internal memory. However, the latter is too small to hold the operating system; it holds a bootloader instead.

As a consequence, a volatile DDR main memory and a non-volatile mass storage (MMC, eMMC, SD card) must both be added as external components.

# AP – APPLICATION PROCESSOR Architecture

APs are fully operational systems in an integrated circuit (heterogeneous architecture). Nonetheless main memory must be added as an external component.









### AP – APPLICATION PROCESSOR Comparison of control processors



Unlike MCUs, which contain all hardware services in a single chip, application processors have a high unit cost and are therefore not the best solution for low-cost or high-volume production.

Yet if the application needs advanced interfaces and/or connectivity, MCUs are no longer suitable because of their limited performance. APs then become the best solution.



# AP – APPLICATION PROCESSOR Architecture



Note the benefit of a heterogeneous architecture for video game applications.



# AP – APPLICATION PROCESSOR Qualcomm Snapdragon solution



The market leader is Qualcomm.

This is due to its Snapdragon family, dedicated to the mobile phone market.





# AP – APPLICATION PROCESSOR Qualcomm Snapdragon solution



### Internal architecture and hardware functionalities of the Qualcomm Snapdragon 810.



### AP – APPLICATION PROCESSOR ARM Cortex-A solution

Outside of mobile devices, the two market leaders are Texas Instruments and Freescale, two manufacturers backed by large user communities.

Let's take a look at the Freescale i.MX6 family:



AP – APPLICATION PROCESSOR ARM Cortex-A solution



Outside of the mobile phone market, the ARM Cortex-A is the leading architecture in embedded markets. The 'A' stands for "Application".

### ARM® Cortex® Processors across the Embedded Market







# GPU GRAPHICS PROCESSING UNIT

Applications Architecture Nvidia products Markets





GPU – GRAPHICS PROCESSING UNITS Applications



**GPUs (Graphics Processing Units)** are specialised coprocessors dedicated to compute-intensive processing.

The term GPGPU (General Purpose GPU) appeared in the last few years. It refers to massive computing in every sense. Applications are diverse: finance, research, science, medical imaging, video games, ...



http://www.nvidia.com/content/gpu-applications/PDF/gpu-applications-catalog.pdf

# GPU – GRAPHICS PROCESSING UNITS

#### Architecture



GPUs have a shared NUMA (Non-Uniform Memory Access) memory organisation, which allows the data to be processed to be replicated and the execution to be parallelised. They integrate a massively parallel architecture.



# GPU – GRAPHICS PROCESSING UNITS Nvidia products: the Tesla P100 board



Let's take a look at the characteristics of the Tesla P100 board. It was released by Nvidia in 2016 and targeted the most advanced data centres of the time.

Its GPU is an Nvidia GP100.



| GPU Architecture                | NVIDIA Pascal                                                 |
|---------------------------------|---------------------------------------------------------------|
| NVIDIA CUDA® Cores              | 3584                                                          |
| Double-Precision<br>Performance | 5.3 TeraFLOPS                                                 |
| Single-Precision<br>Performance | 10.6 TeraFLOPS                                                |
| Half-Precision<br>Performance   | 21.2 TeraFLOPS                                                |
| GPU Memory                      | 16 GB CoWoS HBM2                                              |
| Memory Bandwidth                | 732 GB/s                                                      |
| Interconnect                    | NVIDIA NVLink                                                 |
| Max Power Consumption           | 300 W                                                         |
| ECC                             | Native support with no<br>capacity or performance<br>overhead |
| Thermal Solution                | Passive                                                       |
| Form Factor                     | SXM2                                                          |
| Compute APIs                    | NVIDIA CUDA,<br>DirectCompute,<br>OpenCL™, OpenACC            |

# GPU – GRAPHICS PROCESSING UNITS

### Nvidia products: Pascal architecture





### **GPU – GRAPHICS PROCESSING UNITS**

Nvidia products: GP100 GPU architecture




# GPU – GRAPHICS PROCESSING UNITS Nvidia products: GP100 GPU architecture

### The Nvidia GP100 GPU in a nutshell

- 6 Graphics Processing Clusters
- 30 Texture Processing Clusters (5 / GPC)
- 60 Streaming Multiprocessors (2 / TPC)
- 3840 single precision cores (64 / SM)
- 1920 double precision units (32 / SM)
- 240 texture units (4 / SM)
- 8 memory controllers
  - 8 x 512 KB = 4096 KB L2 cache
  - 4 pairs that control HBM2 DRAM

Note: the Tesla P100 board uses only 56 SMs out of the 60 available in the GP100 GPU.

| Tesla Products                | Tesla K40      | Tesla M40           | Tesla P100          |
|-------------------------------|----------------|---------------------|---------------------|
| GPU                           | GK110 (Kepler) | GM200 (Maxwell)     | GP100 (Pascal)      |
| SMs                           | 15             | 24                  | 56                  |
| TPCs                          | 15             | 24                  | 28                  |
| FP32 CUDA Cores / SM          | 192            | 128                 | 64                  |
| FP32 CUDA Cores / GPU         | 2880           | 3072                | 3584                |
| FP64 CUDA Cores / SM          | 64             | 4                   | 32                  |
| FP64 CUDA Cores / GPU         | 960            | 96                  | 1792                |
| Base Clock                    | 745 MHz        | 948 MHz             | 1328 MHz            |
| GPU Boost Clock               | 810/875 MHz    | 1114 MHz            | 1480 MHz            |
| Peak FP32 GFLOPs <sup>1</sup> | 5040           | 6840                | 10600               |
| Peak FP64 GFLOPs <sup>1</sup> | 1680           | 210                 | 5300                |
| Texture Units                 | 240            | 192                 | 224                 |
| Memory Interface              | 384-bit GDDR5  | 384-bit GDDR5       | 4096-bit HBM2       |
| Memory Size                   | Up to 12 GB    | Up to 24 GB         | 16 GB               |
| L2 Cache Size                 | 1536 KB        | 3072 KB             | 4096 KB             |
| Register File Size / SM       | 256 KB         | 256 KB              | 256 KB              |
| Register File Size / GPU      | 3840 KB        | 6144 KB             | 14336 KB            |
| TDP                           | 235 Watts      | 250 Watts           | 300 Watts           |
| Transistors                   | 7.1 billion    | 8 billion           | 15.3 billion        |
| GPU Die Size                  | 551 mm²        | 601 mm <sup>2</sup> | 610 mm <sup>2</sup> |
| Manufacturing Process         | 28-nm          | 28-nm               | 16-nm FinFET        |
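The figures in the GP100 list and in the table above can be cross-checked with a few multiplications. The sketch below is only a sanity check, not Nvidia's method: the function names are hypothetical, and the peak-throughput formula assumes that each core completes one fused multiply-add (counted as 2 floating-point operations) per cycle at the GPU boost clock.

```c
#include <assert.h>

/* Hierarchy figures taken from the GP100 list and the Tesla P100 column. */
enum {
    GPC_PER_GPU       = 6,
    TPC_PER_GPC       = 5,
    SM_PER_TPC        = 2,
    FP32_CORES_PER_SM = 64,
    FP64_UNITS_PER_SM = 32,
    P100_ENABLED_SMS  = 56,
    P100_BOOST_MHZ    = 1480
};

/* Number of SMs on a full GP100 die: 6 x 5 x 2 = 60. */
static int gp100_total_sms(void)
{
    return GPC_PER_GPU * TPC_PER_GPC * SM_PER_TPC;
}

/* Number of execution units for a given SM count. */
static int unit_count(int sms, int units_per_sm)
{
    assert(sms >= 0 && units_per_sm >= 0);
    return sms * units_per_sm;
}

/* Peak GFLOPS, assuming 2 FLOPs (one fused multiply-add) per unit per cycle. */
static double peak_gflops(int units, int clock_mhz)
{
    return 2.0 * units * clock_mhz / 1000.0;
}
```

With 56 enabled SMs this gives 3584 FP32 cores and 1792 FP64 units, and peak values of about 10.6 and 5.3 TFLOPS, consistent with the table once rounded.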

# GPU – GRAPHICS PROCESSING UNITS Nvidia products: GP100 GPU architecture

GPUs integrate a large number of classically pipelined CPUs, but with vector (SIMD) execution units.

- EU = Execution Unit
- SIMD = Single Instruction Multiple Data
- GPC = Graphics Processing Cluster
- TPC = Texture Processing Cluster
- SM = Streaming Multiprocessor (multithreaded processor)
- Warp = thread of SIMD instructions
- DP = Double Precision
- LD/ST = Load/Store
- SFU = Special Function Unit
- Tex = Texture
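To make the SIMD and warp vocabulary above more concrete, here is a minimal sketch written in plain C rather than in a GPU language. It emulates what one SIMD instruction does: the same multiply-add is applied to every lane of a warp. The name `warp_saxpy` is hypothetical, and the warp width of 32 lanes is Nvidia's documented value, which the slides do not state; on a real GPU the lanes run in parallel in hardware, whereas the loop here is sequential.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32  /* assumed warp width, not given in the slides */

/* Emulates one SIMD multiply-add: every lane executes y = a * x + y
 * on its own data element, with the same instruction for all lanes. */
static void warp_saxpy(float a, const float x[WARP_SIZE], float y[WARP_SIZE])
{
    assert(x != NULL && y != NULL);
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        y[lane] = a * x[lane] + y[lane];
}
```

A large array would be processed by issuing this operation once per group of 32 elements, each group being handled by a different warp.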







# GPU – GRAPHICS PROCESSING UNITS Nvidia products: Tesla P100 board

#### Communication and interconnection systems (Tesla P100)

- 4 NVLink links per GPU
- 40 GB/s per NVLink link

[Figure: block diagram showing a CPU with its large system memory (medium bandwidth), a PCIe switch, and four GPUs, each with its own high-bandwidth graphics memory, interconnected by NVLink.]

# GPU – GRAPHICS PROCESSING UNITS Nvidia products: application example

Example of an application using the Nvidia Tesla P100 board.




# GPU – GRAPHICS PROCESSING UNITS





The undisputed leader of the GPU/IGP market is Intel, thanks to its IGPs (Integrated Graphics Processors), graphics coprocessors embedded in a wide range of its GPPs (more than 70% market share in 2016).



# GPU – GRAPHICS PROCESSING UNITS Markets



Nonetheless, the leader in high-performance external solutions is the American company Nvidia.



# DSP DIGITAL SIGNAL PROCESSOR

Applications Architecture Texas Instruments





DSP – DIGITAL SIGNAL PROCESSOR Applications



**DSPs (Digital Signal Processors)** are dedicated to digital signal processing applications (fr: *traitement numérique du signal*).



### DSP – DIGITAL SIGNAL PROCESSOR

#### Architecture



DSPs are very close to MCUs: they are autonomous systems. However, their CPU is specialised for signal processing and computation.



# DSP – DIGITAL SIGNAL PROCESSOR Architecture

DSP CPUs have execution units dedicated to MAC (Multiply-Accumulate) or SOP (Sum Of Products) operations. These are elementary operations found in almost every signal processing algorithm.
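As an illustration, a sum of products written in canonical C looks like the sketch below. The function name and the choice of 16-bit samples with a 64-bit accumulator are assumptions made for the example; each loop iteration corresponds to one multiply-accumulate, which is the operation a DSP's dedicated execution units are built to perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of products: acc = x[0]*h[0] + x[1]*h[1] + ... + x[n-1]*h[n-1].
 * The same loop is the core of a dot product or of one FIR filter output. */
static int64_t sum_of_products(const int16_t *x, const int16_t *h, size_t n)
{
    assert(x != NULL && h != NULL);
    int64_t acc = 0;
    for (size_t k = 0; k < n; ++k)
        acc += (int32_t)x[k] * (int32_t)h[k];  /* one MAC: multiply, then accumulate */
    return acc;
}
```

For `x = {1, 2, 3, 4}` and `h = {5, 6, 7, 8}` the function returns 5 + 12 + 21 + 32 = 70.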

Expansion of the Danielson-Lanczos Lemma to 8 terms, with $W_M^n = e^{\frac{-j2\pi n}{M}}$:

$$\begin{split} F(n) &= \sum_{k=0}^{N/8-1} x(8k) e^{\frac{-j2\pi kn}{(\frac{N}{8})}} + W_{\frac{N}{4}}^n \sum_{k=0}^{N/8-1} x(8k+4) e^{\frac{-j2\pi kn}{(\frac{N}{8})}} \\ &+ W_{\frac{N}{2}}^n \sum_{k=0}^{N/8-1} x(8k+2) e^{\frac{-j2\pi kn}{(\frac{N}{8})}} + W_{\frac{N}{2}}^n W_{\frac{N}{4}}^n \sum_{k=0}^{N/8-1} x(8k+6) e^{\frac{-j2\pi kn}{(\frac{N}{8})}} \\ &+ W_{N}^n \sum_{k=0}^{N/8-1} x(8k+1) e^{\frac{-j2\pi kn}{(\frac{N}{8})}} + W_{N}^n W_{\frac{N}{4}}^n \sum_{k=0}^{N/8-1} x(8k+5) e^{\frac{-j2\pi kn}{(\frac{N}{8})}} \\ &+ W_{N}^n W_{\frac{N}{2}}^n \sum_{k=0}^{N/8-1} x(8k+3) e^{\frac{-j2\pi kn}{(\frac{N}{8})}} + W_{N}^n W_{\frac{N}{2}}^n W_{\frac{N}{4}}^n \sum_{k=0}^{N/8-1} x(8k+7) e^{\frac{-j2\pi kn}{(\frac{N}{8})}} \end{split}$$



# DSP – DIGITAL SIGNAL PROCESSOR

#### Architecture


CPU with MAC/SOP dedicated execution units. The ISA (Instruction Set Architecture) contains specific instructions for working with these EUs.



MAC = SOP

- MAC: Multiply-Accumulate
- SOP: Sum of Products
- ISA: Instruction Set Architecture
- EU: Execution Unit

DSP – DIGITAL SIGNAL PROCESSOR Texas Instruments products: C5500

This is the Texas Instruments C5500 DSP, one of the leading DSP solutions.





### DSP – DIGITAL SIGNAL PROCESSOR Texas Instruments products: C5500



# Here is an extract of the C5500 datasheet, with a summary of its characteristics.

#### 1.1 Features

- CORE:
  - High-Performance, Low-Power, TMS320C55x Fixed-Point Digital Signal Processor
    - 20-, 10-ns Instruction Cycle Time
    - 50-, 100-MHz Clock Rate
    - One or Two Instructions Executed per Cycle
    - Dual Multiply-and-Accumulate Units (Up to 200 Million Multiply-Accumulates per Second [MMACS])
    - Two Arithmetic and Logic Units (ALUs)
    - Three Internal Data and Operand Read Buses and Two Internal Data and Operand Write Buses
    - Software-Compatible with C55x Devices
    - Industrial Temperature Devices Available
  - 320KB of Zero-Wait State On-Chip RAM, Composed of:
    - 64KB of Dual-Access RAM (DARAM), 8 Blocks of 4K x 16-Bit
    - 256KB of Single-Access RAM (SARAM), 32 Blocks of 4K x 16-Bit
  - 128KB of Zero Wait-State On-Chip ROM (4 Blocks of 16K x 16-Bit)
  - Tightly Coupled FFT Hardware Accelerator

https://www.ti.com/lit/ds/symlink/tms320c5533.pdf
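As a sanity check on this extract, the cycle times and the peak MAC rate follow directly from the clock rates. The second line assumes that each of the two MAC units completes one multiply-accumulate per cycle; this assumption is consistent with the quoted 200 MMACS but is not stated in the extract itself.

$$T_{cycle} = \frac{1}{f_{clk}}: \qquad \frac{1}{50\ \text{MHz}} = 20\ \text{ns}, \qquad \frac{1}{100\ \text{MHz}} = 10\ \text{ns}$$

$$2\ \text{MAC/cycle} \times 100 \times 10^{6}\ \text{cycles/s} = 200 \times 10^{6}\ \text{MAC/s} = 200\ \text{MMACS}$$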



# DSP – DIGITAL SIGNAL PROCESSOR Texas Instruments products: C6600



Let's switch to the Keystone C6600. This Texas Instruments DSP is among the highest-performing DSPs currently on the market.





# DSP – DIGITAL SIGNAL PROCESSOR Texas Instruments products: C6600

### Texas Instruments C6600 CorePac.

Memory configurable as cache memory or addressable SRAM with no bandwidth loss.

UMA or NUMA models configurable for each core.





### DSP – DIGITAL SIGNAL PROCESSOR

Texas Instruments products: C6600



### C6600 core with:

- 16-stage VLIW hardware pipeline (Very Long Instruction Word)

- software pipeline with a max width of 8 instructions



# DSP – DIGITAL SIGNAL PROCESSOR Texas Instruments products: C6600



These DSPs are designed for both parallel and daisy-chain work.

The parallel configuration suits massively parallel processing, whereas the daisy-chain configuration suits algorithms made of deep processing chains.



DSP – DIGITAL SIGNAL PROCESSOR Solutions Texas Instruments : C6600

## Advantage of using daisy-chain configuration:



DSP – DIGITAL SIGNAL PROCESSOR Texas Instruments products: Keystone II



TI also offers the Keystone II family: AP-SoCs with application processors dedicated to digital signal processing applications.

The main target is the telecommunications area.









# DSP – DIGITAL SIGNAL PROCESSOR Actors



The historical and current leader is by far Texas Instruments. TI was among the first companies to design a DSP, in 1982.





DSP – DIGITAL SIGNAL PROCESSOR Actors



Here is the range of Texas Instruments processors.

| Group                     | Category                                           | Products                                                              |
|---------------------------|----------------------------------------------------|-----------------------------------------------------------------------|
| Microcontrollers (MCUs)   | 16-bit Ultra Low Power MCU                         | MSP430™                                                               |
| Microcontrollers (MCUs)   | 32-bit Real-Time MCU                               | C2000™                                                                |
| Microcontrollers (MCUs)   | 32-bit ARM MCU                                     | TMS570 Cortex® R4<br>RM4 Cortex® R4F<br>TMS470M Cortex® M3 Automotive |
| ARM®-based Processors     | 32-bit ARM Processors for Performance Applications | Sitara™ Cortex A and ARM9<br>KeyStone Cortex® A15 and Cortex® A15 + DSP |
| ARM®-based Processors     | Application Processors                             | OMAP™ Processors<br>DaVinci™ Video Processors                         |
| Digital Signal Processors | Singlecore DSP                                     | C6000™ Power Optimized                                                |
| Digital Signal Processors | Multicore DSP                                      | KeyStone Multicore DSP+ARM<br>C6000™ Multicore                        |
| Digital Signal Processors | Ultra Low Power DSP                                | C5000™                                                                |

### DSP – DIGITAL SIGNAL PROCESSOR

Actors




# Which of the following DSP chip families would you consider for your next embedded project?





Classifying processors according to their execution model

SISD – SIMD – MISD – MIMD





EXECUTION MODELS Disclaimer

The next slides are not intended for proper lecturing. However you'll hear those terms quite a lot, so here are a few slides about execution models.






Flynn's classification



# Flynn's classification (1972)



*Single data stream*: each operand contains only one piece of data (one memory cell per operand).

*Multiple data streams*: each operand contains multiple pieces of data (a fixed-size array per operand).

*Single instruction stream*: the CPU can execute one instruction at once (sequential execution).

*Multiple instruction streams*: the CPU can execute multiple instructions at once, either using data parallelism (e.g. *forall* loop) or using control parallelism (e.g. parallel sections).



Flynn's classification



## SISD – Single Instruction stream, Single Data stream



The processor executes one instruction at a time, each instruction operand containing a single memory cell.

### This is the typical mono-processor architecture:

- → Von Neumann architecture
  - → MCUs and old GPP generations
  - → Sequential processor (no parallelism)
- → Scalar processor
  - → A single piece of data (a single memory cell) for each operand

# EXECUTION MODELS

Flynn's classification

### SISD – Single Instruction stream, Single Data stream

| Example: TI C6600 assembly language<br>Adding two floats | Example canonical C:<br>Adding two floats |
|----------------------------------------------------------|-------------------------------------------|
| ; Single Precision ADD                                   | <pre>float a, b ;</pre>                   |
| ADDSP A17, A5, A5                                        | <pre>// Initialising a and b</pre>        |
| ; Result:<br>; A5 = A5 + A17                             | a = a + b ;                               |



Flynn's classification



### SIMD – Single Instruction stream, Multiple Data streams



The same instruction will be executed by multiple EUs, each processing its own piece of data. It means the whole CPU will execute a single instruction on multiple pieces of data.

### Parallel architecture with centralised control unit:

- → Vector processor
- → GPU
- → Intel SSE and AVX instruction set extensions for x86
  - SSE = Streaming SIMD Extensions (SSE, SSE2, SSE3, SSE4)
  - AVX = Advanced Vector Extensions (AVX, AVX2, AVX-512)


# EXECUTION MODELS

Flynn's classification

### SIMD – Single Instruction stream, Multiple Data streams

| Example: TI C6600 assembly language<br>Adding two pairs of floats |
|-------------------------------------------------------------------|
| ; Dual ADD Single Precision<br>DADDSP A21:A20, A25:A24, A25:A24 |
| ; Result:<br>; A25 = A25 + A21<br>; A24 = A24 + A20 |
| ; Just like the SSE for Intel, the C6600 DSP has a C extension (C functions)<br>; for vector instructions |
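To make the lane-wise behaviour concrete, here is a plain C emulation of what a dual single-precision add computes. It is a scalar sketch, not the TI intrinsic: the type `float2_pair` and the function `dual_add` are hypothetical names introduced for this example only.

```c
/* Two single-precision lanes, standing in for a register pair such as A25:A24. */
typedef struct {
    float lo;   /* even register of the pair, e.g. A24 */
    float hi;   /* odd register of the pair, e.g. A25 */
} float2_pair;

/* Lane-wise addition: one call stands for one SIMD instruction
 * applied to two independent pieces of data. */
float2_pair dual_add(float2_pair a, float2_pair b)
{
    float2_pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi;
    return r;
}
```

A real SIMD unit performs both additions in the same instruction; the emulation only shows which operands are paired.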



| Vector type         | Lanes                                               |
|---------------------|-----------------------------------------------------|
| int8x16             | s0, s1, s2, ..., s15 (16 lanes of 8 bits)           |
| int16x8             | s0, s1, s2, ..., s7 (8 lanes of 16 bits)            |
| int32x4 / float32x4 | s0 (x), s1 (y), s2 (z), s3 (w) (4 lanes of 32 bits) |
| float64x2           | s0 (x), s1 (y) (2 lanes of 64 bits)                 |
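The lane layouts of these vector types can be pictured as different views of the same 128 bits. The C union below is a sketch of that idea; the name `vec128` and the helper `lanes_for` are hypothetical, and the code assumes a platform where `float` is 32 bits and `double` is 64 bits.

```c
#include <stdint.h>

/* One 128-bit vector register viewed with different lane layouts. */
typedef union {
    int8_t  i8[16];   /* int8x16   : s0..s15 */
    int16_t i16[8];   /* int16x8   : s0..s7  */
    int32_t i32[4];   /* int32x4   : s0..s3  */
    float   f32[4];   /* float32x4 : x, y, z, w */
    double  f64[2];   /* float64x2 : x, y */
} vec128;

/* The register width is the same whatever the lane type. */
_Static_assert(sizeof(vec128) == 16, "vec128 must be 128 bits wide");

/* Number of lanes for a given lane size in bytes. */
static inline unsigned lanes_for(unsigned lane_bytes)
{
    return (unsigned)sizeof(vec128) / lane_bytes;
}
```

Narrower lanes mean more pieces of data processed per instruction: 16 lanes of 8 bits against 2 lanes of 64 bits.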

# EXECUTION MODELS Flynn's classification



### MISD – Multiple Instruction streams, Single Data stream



Each EU executes its own instruction on a single piece of data.

Few practical applications

- → code redundancy (for detection of execution errors)
- → VLIW processors (Very Long Instruction Word), e.g. C66xx Texas Instruments DSP




# EXECUTION MODELS Flynn's classification



### MIMD – Multiple Instruction streams, Multiple Data streams

Each EU executes its own instruction stream on its own data stream.

Execution units can be grouped into clusters.

Parallel architectures with independent control units

- → Superscalar processors
- → Any modern GPP: x86-x64 (CISC), Cortex-A (RISC)
- → Includes use of SPMD (Single Program, Multiple Data)


### **EXECUTION MODELS**

Flynn's classification

### MIMD – Multiple Instruction streams, Multiple Data streams

| Example: TI C6600 assembly language<br>Simultaneously adding and subtracting two different pairs of data | Example: C and OpenMP<br>Parallelisation of a for loop |
|------------------------------------------------------------|------------------------------------------------------------|
| ; Dual ADD Single Precision<br>; \|\| Dual SUBTRACT Single Precision<br>DADDSP A21:A20, A25:A24, A25:A24<br>\|\| DSUBSP B25:B24, B23:B22, B23:B22<br>; The pipes (\|\|) explicitly indicate that<br>; instructions must be executed in parallel<br>; (use of software pipeline)<br>; Result:<br>; A25 = A25 + A21<br>; A24 = A24 + A20<br>; B23 = B25 - B23<br>; B22 = B24 - B22 | <pre>#pragma omp parallel reduction(+:acc)<br>{<br>  #pragma omp for schedule(static)<br>  for( k = 0; k < size; k ++ )<br>  {<br>    acc += A[i * size + k] * x[k];<br>  }<br>}</pre> |
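The OpenMP fragment can be completed into a small function as follows. This is a sketch under assumptions: the function name `row_times_vector`, the row-major layout of `A` and the row index `i` passed as a parameter are choices made for the example. The two directives are merged into the equivalent combined `parallel for` form.

```c
/* Dot product of row i of a size x size matrix A (row-major) with vector x.
 * With OpenMP enabled, the iterations are shared between threads and the
 * partial sums are combined by the reduction clause. Without OpenMP the
 * pragma is ignored and the loop runs sequentially. */
float row_times_vector(const float *A, const float *x, int size, int i)
{
    float acc = 0.0f;
    int k;

    #pragma omp parallel for schedule(static) reduction(+:acc)
    for (k = 0; k < size; k++) {
        acc += A[i * size + k] * x[k];
    }
    return acc;
}
```

Each thread runs its own instruction stream on its own slice of the data, which is the MIMD model; the same source code is shared by all threads, which is the SPMD style.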





# APPLICATION VS ALGORITHME



# Chapter 2 Choosing a specialized CPU





CHOOSING A HIGH-PERFORMANCE CPU Software: Applications + System



The objective of an application is to fulfill specifications (or requirements).



# CHOOSING A HIGH-PERFORMANCE CPU Application





About 90% of the time, the processing consists of **simple supervision**.



Algorithm examples: search, sort, digital signal processing (audio, radar, comms, ...), ...

# CHOOSING A HIGH-PERFORMANCE CPU Algorithm



The first choice of processor should always be a general-purpose processor.

However if it does not match the specifications, it is wise to switch to a processing-specialized architecture so that we can:

- Reduce the processing time
- Reduce the code size and/or its memory footprint

Note that switching to a specialized processor should be justified with measurements.



# CHOOSING A HIGH-PERFORMANCE CPU DFT algorithm example





- Each product is independent of the others
- → Parallelism available!
- The same holds for the processing of each frequency sample
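The two levels of independence listed above show up as the two nested loops of a naive DFT. The sketch below is an illustration only: the name `dft_naive` is hypothetical, the input is assumed real, and the twiddle factors are assumed to be precomputed by the caller so that the example needs no math library.

```c
/* Naive DFT: X[n] = sum_k x[k] * W^(k*n), with W^m = tw_re[m] + j*tw_im[m]
 * and W^m = exp(-j*2*pi*m/N) precomputed by the caller for m = 0..N-1.
 * Input x is real. Each bin n is computed independently of the others. */
void dft_naive(const float *x, const float *tw_re, const float *tw_im,
               int N, float *X_re, float *X_im)
{
    for (int n = 0; n < N; n++) {          /* bins are independent */
        float acc_re = 0.0f;
        float acc_im = 0.0f;
        for (int k = 0; k < N; k++) {      /* products are independent */
            int m = (k * n) % N;           /* twiddle index, periodic in N */
            acc_re += x[k] * tw_re[m];
            acc_im += x[k] * tw_im[m];
        }
        X_re[n] = acc_re;
        X_im[n] = acc_im;
    }
}
```

The inner loop is a sum of products, so it maps onto MAC units; the outer loop can be split across cores since no bin depends on another.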



# CHOOSING A HIGH-PERFORMANCE CPU CPU architecture selection


Finally, choose the CPU according to your needs.

**DSP:** low-power, low-cost, very low-level development (C, asm)

**GPU:** high-power, high-cost, high-level development (C++, OpenMP, CUDA, ...), high-parallelism potential

**MPPA**: Massively Parallel Processor Array, not widespread yet, but huge potential (dispatch cores to specific algorithms).









# ARCHITECTURE CPU C6678 VLIW DE TEXAS INSTRUMENTS



# Chapter 3 TI C6678's Architecture





# TMS320C6678 PROCESSOR

Processor and Core specifications









### TMS320C6678 PROCESSOR Processor Architecture

The TI C6678 is a multicore DSP with a homogeneous CPU architecture.

It includes 8 RISC-like VLIW CPUs that can be clocked up to 1.4 GHz.

 $\rightarrow$  44.8 GMAC/core for fixed point @1.4 GHz

 $\rightarrow$  22.4 GFLOP/core for floating point @1.4 GHz
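Dividing these peak figures by the clock frequency gives the work done per cycle, and multiplying by the 8 cores gives the chip-level peak. This is plain arithmetic on the numbers quoted above, read as per-second rates:

$$\frac{44.8 \times 10^{9}}{1.4 \times 10^{9}} = 32\ \text{MAC per cycle per core}, \qquad \frac{22.4 \times 10^{9}}{1.4 \times 10^{9}} = 16\ \text{FLOP per cycle per core}$$

$$8 \times 44.8 = 358.4\ \text{GMAC/s}, \qquad 8 \times 22.4 = 179.2\ \text{GFLOP/s}$$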



Figure: TMS320C6678 functional block diagram (memory subsystem: 4MB MSM SRAM, 64-Bit DDR3 EMIF)

# TMS320C6678 PROCESSOR

Core Architecture

## The C66x CorePac consists of several components:

- The C66x DSP and associated C66x CorePac core
- Level-one and level-two memories (L1P, L1D, L2)
- Data Trace Formatter (DTF)
- Embedded Trace Buffer (ETB)
- Interrupt Controller
- Power-down controller
- External Memory Controller
- Extended Memory Controller
- A dedicated power/sleep controller (LPSC)




TMS320C66x CorePac DSP Block Diagram





### TMS320C6678 PROCESSOR

Core Architecture

### Each core has its own cache memories:

- 32 kB L1P cache memory
- 32 kB L1D cache memory
- 512 kB L2 cache memory

#### Figure 1-1 Flat Versus Hierarchical Memory Architecture



#### TMS320C6678 PROCESSOR

**Processor Architecture** 

In addition, all cores can access a 4 MB multicore shared memory (MSM), which can be configured either as a cache memory or as an addressable SRAM.



Multicore Shared Memory Controller (MSMC)

- 4096KB MSM SRAM Memory Shared by Eight DSP C66x CorePacs
- Memory Protection Unit for Both MSM SRAM and DDR3\_EMIF






### TMS320C6678 PROCESSOR Core Architecture



The IDMA (Internal Direct Memory Access) is a DMA controller local to the CorePac.

It is configurable and fully accessible to the developer.

It can handle data transfers between local memories, or between the peripheral configuration space (CFG) and local memories.

Local transfers to the CPU are deterministic.





# C6600 HARDWARE PIPELINE





C6600 HARDWARE PIPELINE



Reminder: a CPU is a sequential machine, but it can process several instructions simultaneously thanks to the stages of its hardware pipeline.





### The C6600 pipeline has 16 stages (called phases)

| Stage   | Phases                                  |
|---------|-----------------------------------------|
| Fetch   | PG, PS, PW, PR                          |
| Decode  | DP, DC                                  |
| Execute | E1, E2, E3, E4, E5, E6, E7, E8, E9, E10 |

| Fetch packet / clock cycle | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16  | 17  |
|----------------------------|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|-----|-----|
| n                          | PG | PS | PW | PR | DP | DC | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | E10 |     |
| n+1                        |    | PG | PS | PW | PR | DP | DC | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9  | E10 |
| n+2                        |    |    | PG | PS | PW | PR | DP | DC | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8  | E9  |
| n+3                        |    |    |    | PG | PS | PW | PR | DP | DC | E1 | E2 | E3 | E4 | E5 | E6 | E7  | E8  |
| n+4                        |    |    |    |    | PG | PS | PW | PR | DP | DC | E1 | E2 | E3 | E4 | E5 | E6  | E7  |
| n+5                        |    |    |    |    |    | PG | PS | PW | PR | DP | DC | E1 | E2 | E3 | E4 | E5  | E6  |
| n+6                        |    |    |    |    |    |    | PG | PS | PW | PR | DP | DC | E1 | E2 | E3 | E4  | E5  |
| n+7                        |    |    |    |    |    |    |    | PG | PS | PW | PR | DP | DC | E1 | E2 | E3  | E4  |
| n+8                        |    |    |    |    |    |    |    |    | PG | PS | PW | PR | DP | DC | E1 | E2  | E3  |
| n+9                        |    |    |    |    |    |    |    |    |    | PG | PS | PW | PR | DP | DC | E1  | E2  |
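The staircase pattern of the pipeline can be expressed with a one-line index computation. The sketch below assumes the idealised case shown here: one fetch packet enters per cycle, nothing stalls, and every packet traverses all 16 phases. The names `phase_index`, `phase_name` and `total_cycles` are hypothetical.

```c
#include <stddef.h>

#define PHASE_COUNT 16

/* Phase names in pipeline order: 4 fetch, 2 decode, 10 execute. */
static const char *const phase_names[PHASE_COUNT] = {
    "PG", "PS", "PW", "PR", "DP", "DC",
    "E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8", "E9", "E10"
};

/* Index of the phase occupied at clock `cycle` (1-based) by fetch packet
 * n + offset. Returns -1 if the packet is not in the pipeline then. */
int phase_index(int offset, int cycle)
{
    int idx = cycle - 1 - offset;
    return (idx >= 0 && idx < PHASE_COUNT) ? idx : -1;
}

/* Name of that phase, or NULL if the packet is not in the pipeline. */
const char *phase_name(int offset, int cycle)
{
    int idx = phase_index(offset, cycle);
    return idx < 0 ? NULL : phase_names[idx];
}

/* Cycles needed to push `packets` fetch packets through every phase. */
int total_cycles(int packets)
{
    return packets > 0 ? PHASE_COUNT + packets - 1 : 0;
}
```

For example, packet n+8 is in phase E3 at cycle 17. Once the pipeline is full, each extra packet costs a single additional cycle.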

C6600 HARDWARE PIPELINE

CPUs from the C6600 family are equipped with a VLIW (Very Long Instruction Word) hardware pipeline.

It can process up to 8 instructions at once with its 8 execution units.



Pipeline Phases Block Diagram



# C6600 HARDWARE PIPELINE

FETCH stage



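## The FETCH stage is divided into four phases:

- PG: Program address Generate
- PS: Program address Send
- PW: Program access ready Wait
- PR: Program fetch packet Receive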



# C6600 HARDWARE PIPELINE FETCH stage

Each instruction has a 32-bit fixed size (RISC-like instruction set).

The least significant bit (bit 0) of each instruction is named p (parallel) and is set to 0 or 1 either during the compilation phase or directly by the developer (in assembly language).

By reading this bit in each instruction, the hardware knows exactly how many instructions to group for parallel execution, with a maximum of 8.

Figure 3-3 Basic Format of a Fetch Packet

|                          | Instruction A    | Instruction B    | Instruction C    | Instruction D    | Instruction E    | Instruction F    | Instruction G    | Instruction H    |
|--------------------------|------------------|------------------|------------------|------------------|------------------|------------------|------------------|------------------|
| Bits                     | 31..0, p = bit 0 | 31..0, p = bit 0 | 31..0, p = bit 0 | 31..0, p = bit 0 | 31..0, p = bit 0 | 31..0, p = bit 0 | 31..0, p = bit 0 | 31..0, p = bit 0 |
| LSBs of the byte address | 00000b           | 00100b           | 01000b           | 01100b           | 10000b           | 10100b           | 11000b           | 11100b           |
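The role of the p-bit can be sketched in C by counting how many groups of parallel instructions a fetch packet of 8 words contains. The sketch assumes the usual C6000 convention (a p-bit of 1 means the next instruction executes in parallel with the current one) and, as a simplification, treats the last word as closing the group. The name `count_execute_packets` is hypothetical.

```c
#include <stdint.h>

#define FETCH_PACKET_WORDS 8

/* Count the groups of parallel instructions in one fetch packet.
 * Assumption: a p-bit of 1 (bit 0) means "the next instruction runs in
 * parallel with this one"; a p-bit of 0 closes the current group.
 * The p-bit of the last word is treated as 0 here, so the sketch never
 * looks past the fetch packet. */
int count_execute_packets(const uint32_t fp[FETCH_PACKET_WORDS])
{
    int packets = 0;
    for (int i = 0; i < FETCH_PACKET_WORDS; i++) {
        int p = (int)(fp[i] & 1u);
        if (p == 0 || i == FETCH_PACKET_WORDS - 1) {
            packets++;          /* this instruction ends a group */
        }
    }
    return packets;
}
```

A fetch packet whose first seven p-bits are all 1 forms a single group of 8 parallel instructions; with all p-bits at 0, the 8 instructions execute one after the other.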



# C6600 HARDWARE PIPELINE DECODE stage



Since instructions arrive in packets, the decode stage takes two phases:

1. The *Dispatch* phase (DP) routes each instruction to its dedicated Execution Unit.
2. The *Decode* phase (DC) decodes it there: each Execution Unit has its own decoding unit.





The VLIW execution stage has 8 SIMD Execution Units (or Functional Units).

The Execution Units are labeled .L1, .S1, .M1, .D1, .L2, .S2, .M2, .D2, and some instructions are EU-specific.

They are split into two symmetrical sides (1 and 2), each side having its own register file of 32-bit registers (A for side 1, B for side 2).





#### C6600 HARDWARE PIPELINE

**EXECUTION** stage



#### Each EU has its own VLIW pipeline.

Table 5-1 Operations Occurring During Pipeline Phases (Part 2 of 2)

| Stage   | Phase      | Symbol | During This Phase |
|---------|------------|--------|-------------------|
| Execute | Execute 1  | E1     | For all instruction types, the conditions for the instructions are evaluated and operands are read. |
|         |            |        | For load and store instructions, address generation is performed and address modifications are written to a register file.<sup>1</sup> |
|         |            |        | For branch instructions, branch fetch packet in PG phase is affected.<sup>1</sup> |
|         |            |        | For single-cycle instructions, results are written to a register file.<sup>1</sup> |
|         |            |        | For DP compare, ADDDP/SUBDP, and MPYDP instructions, the lower 32 bits of the sources are read. For all other instructions, the sources are read.<sup>1</sup> |
|         |            |        | For MPYSPDP instruction, the src1 and the lower 32 bits of src2 are read.<sup>1</sup> |
|         |            |        | For 2-cycle DP instructions, the lower 32 bits of the result are written to a register file.<sup>1</sup> |
|         | Execute 2  | E2     | For load instructions, the address is sent to memory. For store instructions, the address and data are sent to memory.<sup>1</sup> |
|         |            |        | Single-cycle instructions that saturate results set the SAT bit in the control status register (CSR) if saturation occurs.<sup>1</sup> |
|         |            |        | For multiply unit, nonmultiply instructions, results are written to a register file.<sup>2</sup> |
|         |            |        | For multiply, 2-cycle DP, and DP compare instructions, results are written to a register file.<sup>1</sup> |
|         |            |        | For DP compare and ADDDP/SUBDP instructions, the upper 32 bits of the source are read.<sup>1</sup> |
|         |            |        | For MPYDP instruction, the lower 32 bits of src1 and the upper 32 bits of src2 are read.<sup>1</sup> |
|         |            |        | For MPYI and MPYID instructions, the sources are read.<sup>1</sup> |
|         |            |        | For MPYSPDP instruction, the src1 and the upper 32 bits of src2 are read.<sup>1</sup> |
|         | Execute 3  | E3     | Data memory accesses are performed. Any multiply instructions that saturate results set the SAT bit in the control status register (CSR) if saturation occurs.<sup>1</sup> |
|         |            |        | For MPYDP instruction, the upper 32 bits of src1 and the lower 32 bits of src2 are read.<sup>1</sup> |
|         |            |        | For MPYI and MPYID instructions, the sources are read.<sup>1</sup> |
|         | Execute 4  | E4     | For load instructions, data is brought to the CPU boundary.<sup>1</sup> |
|         |            |        | For multiply extensions, results are written to a register file.<sup>3</sup> |
|         |            |        | For MPYI and MPYID instructions, the sources are read.<sup>1</sup> |
|         |            |        | For MPYDP instruction, the upper 32 bits of the sources are read.<sup>1</sup> |
|         |            |        | For 4-cycle instructions, results are written to a register file.<sup>1</sup> |
|         |            |        | For INTDP and MPYSP2DP instructions, the lower 32 bits of the result are written to a register file.<sup>1</sup> |
|         | Execute 5  | E5     | For load instructions, data is written into a register.<sup>1</sup> |
|         |            |        | For INTDP and MPYSP2DP instructions, the upper 32 bits of the result are written to a register file.<sup>1</sup> |
|         | Execute 6  | E6     | For ADDDP/SUBDP and MPYSPDP instructions, the lower 32 bits of the result are written to a register file.<sup>1</sup> |
|         | Execute 7  | E7     | For ADDDP/SUBDP and MPYSPDP instructions, the upper 32 bits of the result are written to a register file.<sup>1</sup> |
|         | Execute 8  | E8     | Nothing is read or written. |
|         | Execute 9  | E9     | For MPYI instruction, the result is written to a register file.<sup>1</sup> |
|         |            |        | For MPYDP and MPYID instructions, the lower 32 bits of the result are written to a register file.<sup>1</sup> |
|         | Execute 10 | E10    | For MPYDP and MPYID instructions, the upper 32 bits of the result are written to a register file.<sup>1</sup> |

1. This assumes that the conditions for the instructions are evaluated as true. If the condition is evaluated as false, the instruction does not write any results or have any pipeline operation after E1.
2. Multiply unit, nonmultiply instructions: AVG2, AVG4, BITC4, BITR, DEAL, ROT, SHFL, SSHVL, and SSHVR.




An instruction whose execution time is greater than one cycle is followed by delay slots, written with a NOP instruction (No Operation).

The NOP corresponds to the time the instruction spends travelling through its Execution Unit.



# C6600 HARDWARE PIPELINE

EXECUTION stage



### As an example, here is the documentation for the MPYSP instruction.









PROGRAMMING A VLIW CPU Example code

Let's see how a VLIW (Very Long Instruction Word) CPU works by focusing on the execution units. We'll start with a canonical assembly code.

| Instruction | Unit | Operands   |
|-------------|------|------------|
| MPYSP       | .M1  | A2, A3, A4 |
| NOP         |      | 3          |
| ADDSP       | .S1  | A2, A4, A2 |
| NOP         |      | 3          |
| FADDSP      | .S1  | A0, A1, A0 |
| NOP         |      | 2          |
| MV          | .D1  | A0, A1     |
| MV          | .D2  | B9, B7     |

Total: 13 CPU cycles

#### PROGRAMMING A VLIW CPU Rewriting code



Now sort the instructions according to the data dependencies. We'll get three instruction branches.

| Branch           | Instruction | Unit | Operands   |
|------------------|-------------|------|------------|
| 1 (8 CPU cycles) | MPYSP       | .M1  | A2, A3, A4 |
|                  | NOP         |      | 3          |
|                  | ADDSP       | .S1  | A2, A4, A2 |
|                  | NOP         |      | 3          |
| 2 (4 CPU cycles) | FADDSP      | .S1  | A0, A1, A0 |
|                  | NOP         |      | 2          |
|                  | MV          | .D1  | A0, A1     |
| 3 (1 CPU cycle)  | MV          | .D2  | B9, B7     |

In theory, these branches can be executed in parallel.

| Branch 1 (8 CPU cycles) | Branch 2 (4 CPU cycles) | Branch 3 (1 CPU cycle) |
|-------------------------|-------------------------|------------------------|
| MPYSP .M1 A2, A3, A4    |                         |                        |
| NOP 3                   |                         |                        |
| ADDSP .S1 A2, A4, A2    | FADDSP .S1 A0, A1, A0   |                        |
| NOP 3                   | NOP 2                   |                        |
|                         | MV .D1 A0, A1           | MV .D2 B9, B7          |

#### However, we must pay attention to functional dependencies!

ADDSP and FADDSP both use the same unit (.S1), so they cannot be issued in the same cycle.

| Branch 1 (8 CPU cycles) | Branch 2 (4 CPU cycles)           | Branch 3 (1 CPU cycle) |
|-------------------------|-----------------------------------|------------------------|
| MPYSP .M1 A2, A3, A4    |                                   |                        |
| NOP 3                   |                                   |                        |
| ADDSP .S1 A2, A4, A2    | FADDSP .S1 A0, A1, A0 (same unit) |                        |
| NOP 3                   | NOP 2                             |                        |
|                         | MV .D1 A0, A1                     | MV .D2 B9, B7          |

We shall rewrite the code (refactoring) to make parallelism possible.

| Branch 1 (8 CPU cycles) | Branch 2 (4 CPU cycles) | Branch 3 (1 CPU cycle) |
|-------------------------|-------------------------|------------------------|
| MPYSP .M1 A2, A3, A4    |                         |                        |
| NOP 3                   | FADDSP .S1 A0, A1, A0   |                        |
| ADDSP .S1 A2, A4, A2    | NOP 2                   |                        |
| NOP 3                   | MV .D1 A0, A1           | MV .D2 B9, B7          |







#### Here are the canonical and optimized versions of the same code.

#### Canonical asm – 13 CPU cycles

| Instruction | Unit | Operands   |
|-------------|------|------------|
| MPYSP       | .M1  | A2, A3, A4 |
| NOP         |      | 3          |
| ADDSP       | .S1  | A2, A4, A2 |
| NOP         |      | 3          |
| FADDSP      | .S1  | A0, A1, A0 |
| NOP         |      | 2          |
| MV          | .D1  | A0, A1     |
| MV          | .D2  | B9, B7     |

#### Optimized asm – 8 CPU cycles

| Parallel | Instruction | Unit | Operands   |
|----------|-------------|------|------------|
|          | MPYSP       | .M1  | A2, A3, A4 |
|          | NOP         |      | 2          |
|          | FADDSP      | .S1  | A0, A1, A0 |
|          | ADDSP       | .S1  | A2, A4, A2 |
|          | NOP         |      | 2          |
|          | MV          | .D1  | A0, A1     |
| \|\|     | MV          | .D2  | B9, B7     |
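To see why the optimized listing still respects the data dependencies, one can compute the cycle in which each line is issued. The sketch below is a simplified model with illustrative names (`vliw_line_t`, `issue_cycle`, `total_cycles`). It assumes that a line marked `||` is issued in the same cycle as the line before it, and it takes the delay slots from the NOP counts of the canonical listing (3 for MPYSP, 2 for FADDSP).

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *text;      /* instruction text, for readability only */
    unsigned    cycles;    /* 1 for an instruction, n for "NOP n" */
    int         parallel;  /* nonzero if the line starts with || */
} vliw_line_t;

/* Cycle (1-based) in which line idx is issued. */
unsigned issue_cycle(const vliw_line_t *code, size_t idx)
{
    unsigned elapsed = 0;  /* cycles consumed by the lines already seen */
    unsigned start   = 1;
    assert(code != NULL);
    for (size_t i = 0; i <= idx; i++) {
        if (!code[i].parallel) {
            start    = elapsed + 1;
            elapsed += code[i].cycles;
        }
    }
    return start;
}

/* Total length of the listing in cycles: || lines cost nothing extra. */
unsigned total_cycles(const vliw_line_t *code, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        if (!code[i].parallel)
            total += code[i].cycles;
    return total;
}

/* The optimized listing from the slide above. */
static const vliw_line_t optimized[] = {
    {"MPYSP  .M1 A2, A3, A4", 1, 0},
    {"NOP    2",              2, 0},
    {"FADDSP .S1 A0, A1, A0", 1, 0},
    {"ADDSP  .S1 A2, A4, A2", 1, 0},
    {"NOP    2",              2, 0},
    {"MV     .D1 A0, A1",     1, 0},
    {"MV     .D2 B9, B7",     1, 1},  /* || MV */
};
```

In this model MPYSP is issued in cycle 1 and ADDSP in cycle 5, that is exactly 1 + 3 delay slots later. FADDSP is issued in cycle 4 and the dependent MV in cycle 8, after its 2 delay slots have elapsed. The two MV instructions share cycle 8, which gives the total of 8 CPU cycles.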

#### PROGRAMMING A VLIW CPU Code execution

[Figure sequence: the optimized asm listing (8 CPU cycles) moving step by step through the FETCH, DISPATCH / DECODE and EXECUTE stages of the CPU, next to the DATA MEMORY. The EXECUTE stage shows the execution units .S1, .L1, .M1, .D1, .D2, .M2, .L2 and .S2.]


#### PROGRAMMING A VLIW CPU VLIW CPU properties



One particularity of VLIW processors is that their assembly code (and the binary code as well) is stored out of order in the program memory, whereas the instructions go through the pipeline in order.

Such a simple CPU has a very good performance-per-watt ratio.

However, the intelligence and skills lie with the developer and the toolchain.







#### VLIW CPU

- Intelligence is brought by the toolchain and the engineer
- Code in program memory is out of order
- Execution is in order
- Deterministic
- Excellent performance/power consumption ratio

#### Superscalar CPU

- Intelligence lies within the execution stage
- Code in program memory is in order
- Execution is out of order (OOO execution)
- Not deterministic
- Poor performance/power consumption ratio



On the one hand, superscalar CPUs are designed to execute generic code that has had almost no optimisation and contains many branches and tests. The keyword is genericity.

On the other hand, VLIW CPUs must run target-dependent code in order to reach their maximum capability. This means architecture-specific code, hence no portability.

    ptr_x2[l1]     = xt1 * co1 + yt1 * si1;
    ptr_x2[l1 + 1] = yt1 * co1 - xt1 * si1;
    ptr_x2[h2]     = xt0 * co2 + yt0 * si2;
    ptr_x2[h2 + 1] = yt0 * co2 - xt0 * si2;
    ptr_x2[l2]     = xt2 * co3 + yt2 * si3;
    ptr_x2[l2 + 1] = yt2 * co3 - xt2 * si3;

TI DSPLIB, FFT algorithm, floating point
Canonical implementation
→ PORTABLE

    x_1o_x_0o = _daddsp(xh1_0_xh0_0, xh21_0_xh20_0);
    x_3o_x_2o = _daddsp(xh1_1_xh0_1, xh21_1_xh20_1);

    yt0_0_xt0_0 = _dsubsp(xh1_0_xh0_0, xh21_0_xh20_0);
    yt0_1_xt0_1 = _dsubsp(xh1_1_xh0_1, xh21_1_xh20_1);

TI DSPLIB, FFT algorithm, floating point
Optimised implementation (intrinsic functions)
→ NOT PORTABLE
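The intrinsic calls above only compile with the TI toolchain, which is precisely why that version is not portable. To give an idea of what such a call computes, here is a plain-C model. It assumes that `_daddsp` and `_dsubsp` each operate on a packed pair of single-precision values, as their use in the excerpt suggests. The type `float_pair_t` and the functions `pair_add`, `pair_sub` and `scalar_add` are hypothetical stand-ins, not TI definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for a packed pair of floats (not the TI type). */
typedef struct {
    float lo;
    float hi;
} float_pair_t;

/* Model of a two-way SIMD addition: both lanes are added in one call. */
float_pair_t pair_add(float_pair_t a, float_pair_t b)
{
    float_pair_t r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Model of a two-way SIMD subtraction. */
float_pair_t pair_sub(float_pair_t a, float_pair_t b)
{
    float_pair_t r = { a.lo - b.lo, a.hi - b.hi };
    return r;
}

/* The same addition written element by element, as portable code would. */
void scalar_add(const float *a, const float *b, float *out, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Returns 1 if the paired and the scalar versions agree on a small input. */
int pair_matches_scalar(void)
{
    const float a[2] = { 1.5f, -2.0f };
    const float b[2] = { 0.25f, 4.0f };
    float ref[2];
    float_pair_t pa = { a[0], a[1] };
    float_pair_t pb = { b[0], b[1] };
    float_pair_t r;

    scalar_add(a, b, ref, 2);
    r = pair_add(pa, pb);
    return r.lo == ref[0] && r.hi == ref[1];
}
```

On the DSP the pair fits in a register pair and the addition is a single instruction. The portable model needs two scalar additions, which illustrates the trade-off between speed and portability.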



To use the full capability of a processor, one must master the hardware architecture as well as the associated development tools (i.e. the toolchain).

One must also be able to use mathematics to rewrite the algorithm (and its implementation) with the aim of accelerating the code.

As a matter of fact, the best-performing code is most of the time not portable.







# DISCRETE CONVOLUTION



# Chapter 4 Lab's example algorithm





LAB'S EXAMPLE ALGORITHM Discrete convolution



Lab sessions will use a well known algorithm: the **discrete convolution**.

This algorithm has a very simple structure, but it is very difficult to accelerate without mathematical refactoring.








#### Let's have a look at the mathematical definition of the discrete convolution

$$y(k) = \sum_{j=0}^{N-1} a(j) \cdot x(k-j), \qquad k = 0, \dots, Y-1$$

Where:

- x() is the input samples vector
- y() is the output samples vector
- a() is the coefficients vector
- Y is the output vector size
- N is the number of coefficients
- k is the index of the current sample
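The definition can be transcribed almost literally into C. The sketch below is only an illustration: the function name `conv_direct` is an assumption, and samples x(k-j) that fall outside the input vector are taken as zero, a convention the definition above does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* y(k) = sum over j of a(j) * x(k - j), for k = 0 .. ny-1. */
void conv_direct(const float *x, size_t nx,
                 const float *a, size_t n,
                 float *y, size_t ny)
{
    assert(x != NULL && a != NULL && y != NULL);
    for (size_t k = 0; k < ny; k++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++) {
            /* x(k-j) is taken as 0 when k-j falls outside x[0..nx-1] */
            if (j <= k && k - j < nx)
                acc += a[j] * x[k - j];
        }
        y[k] = acc;
    }
}

/* Small check with x = {1, 2, 3} and a = {1, 1}: returns y(k), k < 4. */
float conv_demo(size_t k)
{
    const float x[3] = { 1.0f, 2.0f, 3.0f };
    const float a[2] = { 1.0f, 1.0f };
    float y[4];

    assert(k < 4);
    conv_direct(x, 3, a, 2, y, 4);
    return y[k];
}
```

With these inputs the output vector is {1, 3, 5, 3}: each output sample is the sum of the current and the previous input sample.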

LAB'S EXAMPLE ALGORITHM Typical workflow for algorithm coding



Before being coded in C for the target processor, the algorithm is usually validated with prototyping and simulation tools, such as Matlab/Simulink.

Validating the algorithm consists in coding its canonical implementation and checking the values of the input and output vectors.




#### Here is the Matlab implementation of the discrete convolution algorithm.

```
function yk = fir_sp(xk, coeff, coeffLength, ykLength)
  yk = single(zeros(1,ykLength)); % output array preallocation
  % output array loop (Matlab indices start at 1)
  for i=1:ykLength
    yk(i) = single(0);
    % FIR filter algorithm - dot product
    for j=1:coeffLength
        yk(i) = single(yk(i)) + single(coeff(j)) * single(xk(i+j-1));
    end
  end
end
```

This code is given with lab materials

LAB'S EXAMPLE ALGORITHM Typical workflow for algorithm coding




#### Observe some of the outputs suggested by Matlab sources, for a 64<sup>th</sup>-order FIR filter.



Matlab sources given with lab materials

LAB'S EXAMPLE ALGORITHM Canonical C implementation



Once the algorithm has been validated, it can be implemented in the processor. First make a canonical C implementation, using IEEE-754 single-precision floats.
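For reference, here is the canonical C implementation as it can be reconstructed from the final listing shown later in this lecture (function `fir_sp`). The small `fir_sp_demo` helper and its input values are illustrative additions and are not part of the lab material.

```c
#include <assert.h>

/* Canonical FIR: each output is the dot product of the na coefficients
 * with a window of na input samples starting at xk[i]. */
void fir_sp(const float * restrict xk,
            const float * restrict a,
            float * restrict yk,
            int na,
            int nyk)
{
    int i, j;

    for (i = 0; i < nyk; i++) {
        yk[i] = 0;

        /* FIR filter algorithm - dot product */
        for (j = 0; j < na; j++) {
            yk[i] += a[j] * xk[i + j];
        }
    }
}

/* Illustrative check: 2 coefficients, 3 outputs, so 3 + 2 - 1 = 4 inputs. */
float fir_sp_demo(int i)
{
    const float xk[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    const float a[2]  = { 1.0f, 10.0f };
    float yk[3];

    assert(i >= 0 && i < 3);
    fir_sp(xk, a, yk, 2, 3);
    return yk[i];
}
```

Note that the input array must hold nyk + na - 1 samples, since the last output reads xk[nyk - 1 + na - 1].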



LAB'S EXAMPLE ALGORITHM Canonical C implementation

Another canonical C implementation.

This one is given by Texas Instruments in its library dsplib.

    #pragma CODE_SECTION(DSPF_sp_fir_gen_cn, ".text:ansi");

    #include "DSPF_sp_fir_gen_cn.h"

    void DSPF_sp_fir_gen_cn(const float *x, const float *h,
                            float *y, int nh, int ny)
    {
        int i, j;
        float sum;

        for (j = 0; j < ny; j++)
        {
            sum = 0;
            // note: h coeffs given in reverse order: { h[nh-1], h[nh-2], ..., h[0] }
            for (i = 0; i < nh; i++)
                sum += x[i + j] * h[i];
            y[j] = sum;
        }
    }




Another canonical C implementation, from the Texas Instruments **dsplib**. But this time, it uses **16-bit signed integers** with the **Q1.15 format**.

    #pragma CODE_SECTION(DSP_fir_gen_cn, ".text:ansi");

    #include "DSP_fir_gen_cn.h"

    void DSP_fir_gen_cn (
        const short *restrict x,    /* Input array [nr+nh-1 elements] */
        const short *restrict h,    /* Coeff array [nh elements]      */
        short       *restrict r,    /* Output array [nr elements]     */
        int nh,                     /* Number of coefficients         */
        int nr                      /* Number of output samples       */
    )
    {
        int i, j, sum;

        for (j = 0; j < nr; j++) {
            sum = 0;
            for (i = 0; i < nh; i++)
                sum += x[i + j] * h[i];
            r[j] = sum >> 15;
        }
    }
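The `>> 15` in the listing above comes from the Q1.15 format: a 16-bit signed integer q represents the real value q / 32768. Multiplying two Q1.15 numbers gives a result with 30 fractional bits, so a right shift by 15 brings it back to Q1.15. The helpers below are a minimal sketch of that arithmetic. Their names are illustrative and they are not part of dsplib.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a real value in [-1, 1] to Q1.15 (truncating, with saturation). */
int16_t float_to_q15(float v)
{
    int32_t scaled;

    assert(v >= -1.0f && v <= 1.0f);
    scaled = (int32_t)(v * 32768.0f);
    if (scaled > 32767)
        scaled = 32767;          /* +1.0 is not representable */
    return (int16_t)scaled;
}

/* Convert a Q1.15 value back to a real value. */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}

/* Q1.15 x Q1.15 product, brought back to Q1.15 with a shift by 15.
 * Like the TI listing, this applies no saturation and relies on an
 * arithmetic right shift when the product is negative. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;   /* 30 fractional bits */
    return (int16_t)(product >> 15);
}
```

For example, 0.5 is stored as 16384 and 0.25 as 8192. Their product is 134217728, and shifting right by 15 gives 4096, which is the Q1.15 encoding of 0.125.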






The main goal of the lab sessions is to present a **generic methodology for optimizing algorithms for a specific architecture**.

In our case, we'll optimize a **discrete convolution algorithm for a TI C6678 DSP**.





# C6000 ASSEMBLY









# C6678 INSTRUCTION SET ARCHITECTURE







Let's see the different fields of an instruction line in assembly language for the Texas Instruments C6600 architectures.

Note that some fields are specific to VLIW architectures.





Remember that all fields of an instruction in assembly language correspond to a field in the binary code of the instruction (except for the label and the comment).

#### See for instance the MPYSP instruction.



#### C6678 INSTRUCTION SET ARCHITECTURE

#### Execution units



In order to ease the understanding of the C6600 Instruction Set Architecture, we'll look at the effects of the assembly instructions onto the execution stage.



#### C6678 INSTRUCTION SET ARCHITECTURE

Execution units




#### The C6000 CPU has a Load-Store architecture.

This means that some execution units (.D1 and .D2) are dedicated to memory access, and both of them have a direct access to the L1 cache memory (64-bit bus).

The other execution units are used for control and processing.



#### C6678 INSTRUCTION SET ARCHITECTURE

Addressing modes



Only **3 addressing modes** are supported by the C6000 ISA.

Recall that addressing modes describe how instructions access the data they manipulate.

Being a calculation processor, the C6000 CPU relies heavily on the register addressing mode.

- Register addressing
  - 324 instructions (full ISA)
- Indirect addressing
  - 18 instructions (load/store instructions)
- Immediate addressing







#### DISCRETE CONVOLUTION ALGORITHM IN CANONICAL C6678 ASSEMBLY LANGUAGE Presentation



For the remaining part of this lecture, we'll translate the C algorithm into a C6678 assembly language program.

The canonical C version of the program is on the next slide.



For educational purposes, we will ignore the delay slots associated with the execution time of the instructions.

Leaving out the NOP instructions makes the canonical assembly program easier to understand.

Do not forget to add the delay slots when programming the target DSP!



This is the algorithm that will be studied during the lab sessions. It's a canonical C implementation, using IEEE-754 single-precision floats.



#### DISCRETE CONVOLUTION ALGORITHM IN CANONICAL C6678 ASSEMBLY LANGUAGE Registers use




#### Registers used for passing parameters through function call:

See "TMS320C6000 Optimizing Compiler V7.6" User's guide, Chapter "7.3 Register conventions"



- xk = A4
- a = B4
- yk = A6
- na = B6
- nyk = A8
- yk (temp) = A5

Multiplication-Accumulation



The easiest way to translate the C code into assembly language is to start from the main operation, inside the inner loop.

Like many other digital signal processing algorithms, the discrete convolution uses MAC (Multiply-Accumulate) or **SOP** (Sum Of Products) instructions.

First let's look at the MPYSP instruction.

**4.212 MPYSP** Multiply Two Single-Precision Floating-Point Values

Syntax: MPYSP (.unit) src1, src2, dst

unit = .M1 or .M2

    fir_sp_asm:
            MPYSP .M1  A9, B9, A17



## DISCRETE CONVOLUTION ALGORITHM IN CANONICAL C6678 ASSEMBLY LANGUAGE Multiplication-Accumulation



Now we can use an addition instruction.

**4.14 ADDSP** Add Two Single-Precision Floating-Point Values

Syntax: ADDSP (.unit) src1, src2, dst

unit = .L1, .L2, .S1, .S2

    fir_sp_asm:
            MPYSP .M1x A9, B9, A17
            ADDSP .L1  A17, A5, A5

Well done! You just implemented a MAC operation!
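In C terms, the MPYSP/ADDSP pair is the body of a dot-product loop. The sketch below mirrors the two instructions as two separate statements. The function names `mac_step`, `dot_product` and `mac_demo` are illustrative, and the register names in the comments refer to the listing above.

```c
#include <assert.h>

/* One multiply-accumulate step, split the way the assembly splits it. */
float mac_step(float acc, float a, float x)
{
    float product = a * x;   /* MPYSP: A9 * B9  -> A17 */
    return product + acc;    /* ADDSP: A17 + A5 -> A5  */
}

/* Sum of products over n elements, built from repeated MAC steps. */
float dot_product(const float *a, const float *x, int n)
{
    float acc = 0.0f;

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        acc = mac_step(acc, a[i], x[i]);
    return acc;
}

/* Illustrative check: 1*4 + 2*5 + 3*6. */
float mac_demo(void)
{
    const float a[3] = { 1.0f, 2.0f, 3.0f };
    const float x[3] = { 4.0f, 5.0f, 6.0f };

    return dot_product(a, x, 3);
}
```

Each call to `mac_step` corresponds to one pass through the inner loop of the filter, with the accumulator playing the role of register A5.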

### DISCRETE CONVOLUTION ALGORITHM IN CANONICAL C6678 ASSEMBLY LANGUAGE Multiplication-Accumulation



### Use of execution units:



#### Data management



Move data from a CPU register to another one.

**4.222 MV** Move From Register to Register

**MV** (.unit) *src2, dst*

unit = .L1, .L2, .S1, .S2, .D1, .D2

    fir_sp_asm:
            MV    .L1  A8, B0

            MPYSP .M1x A9, B9, A17
            ADDSP .L1  A17, A5, A5

#### DISCRETE CONVOLUTION ALGORITHM IN CANONICAL C6678 ASSEMBLY LANGUAGE

#### Data management

Before performing the MAC, we must load the cells values (stored in the L1 cache memory) into CPU registers.

We must use one of the LDx (load) instructions:

| Instruction | Suffix           | Size    | C types      |
|-------------|------------------|---------|--------------|
| LDB         | B = Byte         | 1 byte  | char         |
| LDH         | H = Half-word    | 2 bytes | short int    |
| LDW         | W = Word         | 4 bytes | int, float   |
| LDDW        | DW = Double-Word | 8 bytes | long, double |

The ST and LD instructions can only be executed by the .D1 and .D2 execution units.

    fir_sp_asm:
            MV    .L1  A8, B0

            LDW   .D1  *A19, A9
            LDW   .D2  *B19, B9
            MPYSP .M1x A9, B9, A17
            ADDSP .L1  A17, A5, A5


#### Data management



#### Use of execution units:





Data management

Note that a pointer-like style is used.

In this example, the A19 and B19 registers each contain an address. The '*' character before the register name indicates the use of the **indirect addressing mode**.

This is equivalent to the use of pointers in C.

    fir_sp_asm:
            MV    .L1  A8, B0

            LDW   .D1  *A19, A9
            LDW   .D2  *B19, B9
            MPYSP .M1x A9, B9, A17
            ADDSP .L1  A17, A5, A5

DISCRETE CONVOLUTION ALGORITHM IN CANONICAL C6678 ASSEMBLY LANGUAGE Data management

As with pointers in C, registers used in indirect addressing mode support **pre- and post-increment and decrement**.

Also, load and store operations can **be indexed with the [] notation**, like arrays in C.

| Addressing Type                               | No Modification of Address Register | Preincrement or Predecrement of Address Register | Postincrement or Postdecrement of Address Register |
|-----------------------------------------------|-------------------------------------|--------------------------------------------------|----------------------------------------------------|
| Register indirect                             | `*R`                                | `*++R`<br>`*--R`                                 | `*R++`<br>`*R--`                                   |
| Register relative                             | `*+R[ucst5]`<br>`*-R[ucst5]`        | `*++R[ucst5]`<br>`*--R[ucst5]`                   | `*R++[ucst5]`<br>`*R--[ucst5]`                     |
| Register relative with 15-bit constant offset | `*+B14/B15[ucst15]`                 | not supported                                    | not supported                                      |
| Base + index                                  | `*+R[offsetR]`<br>`*-R[offsetR]`    | `*++R[offsetR]`<br>`*--R[offsetR]`               | `*R++[offsetR]`<br>`*R--[offsetR]`                 |

#### Table 3-10 Indirect Address Generation for Load/Store
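The analogy with C pointers can be made concrete. The sketch below models three of the forms of the table with ordinary C pointer expressions. The function names are illustrative, and the model only captures the order in which the address is used and updated, not the actual instruction encoding.

```c
#include <assert.h>
#include <stdint.h>

/* *R++ : read at the current address, then advance R by one element. */
int32_t load_post_inc(const int32_t **r)
{
    assert(r != 0 && *r != 0);
    return *(*r)++;
}

/* *++R : advance R by one element first, then read at the new address. */
int32_t load_pre_inc(const int32_t **r)
{
    assert(r != 0 && *r != 0);
    return *++(*r);
}

/* *+R[k] : read k elements after R, leaving R unchanged. */
int32_t load_indexed(const int32_t *r, unsigned k)
{
    assert(r != 0);
    return r[k];
}

/* Returns 1 if the three forms behave as described on a small array. */
int addressing_demo(void)
{
    static const int32_t mem[4] = { 10, 20, 30, 40 };
    const int32_t *r = mem;
    int ok = 1;

    ok &= (load_post_inc(&r) == 10) && (r == mem + 1);
    ok &= (load_pre_inc(&r)  == 30) && (r == mem + 2);
    ok &= (load_indexed(r, 1) == 40) && (r == mem + 2);
    return ok;
}
```

Starting at `mem`, the post-increment load returns 10 and leaves the pointer on 20. The pre-increment load then moves to 30 before reading it, and the indexed load reads 40 without moving the pointer.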





Data management

To summarize:

The A19 and B19 registers contain the addresses of the current cells of the xk[] and a[] arrays.

The two LDW instructions each load 4 bytes from the L1 cache memory into the CPU registers A9 and B9.

The addresses contained in the A19 and B19 registers are incremented afterwards, so that these registers point to the next cells.

    fir_sp_asm:
            MV    .L1  A8, B0

            LDW   .D1  *A19++, A9
            LDW   .D2  *B19++, B9
            MPYSP .M1x A9, B9, A17
            ADDSP .L1  A17, A5, A5





Control and branch



The C6000 family historically supported only one branch instruction: the **B (branch) instruction**.

It makes it possible to perform function calls as well as all known control structures (if, for, while, ...).

The B instruction uses the .S1 and .S2 execution units.

    fir_sp_asm:
            MV    .L1  A8, B0

    fir_sp_asm_l2:
            LDW   .D1  *A19++, A9
            LDW   .D2  *B19++, B9
            MPYSP .M1x A9, B9, A17
            ADDSP .L1  A17, A5, A5
            B     .S1  fir_sp_asm_l2

## DISCRETE CONVOLUTION ALGORITHM IN CANONICAL C6678 ASSEMBLY LANGUAGE Control and branch



#### Use of execution units:



Control and branch



A condition can be added to the execution of an instruction.

Five registers (A1, A2, B0, B1, B2) can be used as condition values.

Syntax:

- [R] = instruction executed if R ≠ 0
- [!R] = instruction executed if R = 0

All instructions can be executed conditionally.

    fir_sp_asm:
            MV    .L1  A8, B0

    fir_sp_asm_l2:
            LDW   .D1  *A19++, A9
            LDW   .D2  *B19++, B9
            MPYSP .M1x A9, B9, A17
            ADDSP .L1  A17, A5, A5
    [A1]    B     .S1  fir_sp_asm_l2

DISCRETE CONVOLUTION ALGORITHM IN CANONICAL C6678 ASSEMBLY LANGUAGE Control and branch



Implement the internal loop's counter.

    fir_sp_asm:
            MV    .L1  A8, B0

            MV    .L1  B6, A1

    fir_sp_asm_l2:
            LDW   .D1  *A19++, A9
            LDW   .D2  *B19++, B9
            MPYSP .M1x A9, B9, A17
            ADDSP .L1  A17, A5, A5
            SUB   .L1  A1, 1, A1
    [A1]    B     .S1  fir_sp_asm_l2
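The counter pattern above (load the count, decrement it at the end of the body, branch back while it is non-zero) behaves like a C `do ... while` loop. The sketch below models it with a `goto` to stay close to the assembly. The function name is illustrative, and the variable `A1` simply stands for the register of the same name.

```c
#include <assert.h>

/* Number of times the loop body runs for a given coefficient count. */
unsigned inner_loop_iterations(unsigned na)
{
    unsigned A1 = na;          /* MV  B6, A1 : counter = na */
    unsigned body_runs = 0;

    /* With an unconditional SUB, na = 0 would wrap around the counter. */
    assert(na >= 1);
loop:
    body_runs++;               /* LDW, LDW, MPYSP, ADDSP go here */
    A1 = A1 - 1;               /* SUB A1, 1, A1 */
    if (A1 != 0)
        goto loop;             /* [A1] B fir_sp_asm_l2 */
    return body_runs;
}
```

The body always runs at least once, and the branch is taken only while A1 is non-zero, so the body runs exactly na times for na ≥ 1.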


The **return address** of a function is always given by the calling function through the **B3 register**.



#### DISCRETE CONVOLUTION ALGORITHM IN CANONICAL C6678 ASSEMBLY LANGUAGE

#### Final version



Without the required delay slots!

| C code | Canonical C6678 assembly |
|--------|--------------------------|
| <pre>void fir_sp(const float * restrict xk,<br>            const float * restrict a,<br>            float * restrict yk,<br>            int na,<br>            int nyk)<br>{<br>    int i, j;<br><br>    for (i=0; i&lt;nyk; i++) {<br>        yk[i] = 0;<br><br>        /* FIR filter algorithm - dot product */<br>        for (j=0; j&lt;na; j++) {<br>            yk[i] += a[j]*xk[i+j];<br>        }<br>    }<br>}</pre> | <pre>fir_sp_asm:<br>        MV    .L1   A8, B0<br><br>fir_sp_asm_l1:<br>        ZERO  .L1   A5<br>        MV    .L1   B6, A1<br>        MV    .L1   A4, A19<br>        MV    .L1   B4, B19<br><br>fir_sp_asm_l2:<br>        LDW   .D1   *A19++, A9<br>        LDW   .D2   *B19++, B9<br>        MPYSP .M1x  A9, B9, A17<br>        ADDSP .L1   A17, A5, A5<br>  [A1]  SUB   .L1   A1, 1, A1<br>  [A1]  B     .S1   fir_sp_asm_l2<br><br>        STW   .D1   A5, *A6++<br>        ADD   .L1   A4, 4, A4<br>  [B0]  SUB   .L2   B0, 1, B0<br>  [B0]  B     .S1   fir_sp_asm_l1<br><br>        B     B3</pre> |
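To cross-check the assembly routine against a known-good result, a host-side C reference can be useful. The sketch below is a minimal version of the C code shown above, not the course's official implementation. The parameter order (`xk`, `a`, `yk`, `na`, `nyk`) is an assumption inferred from the registers the assembly reads (A4, B4, A6, B6, A8). The local accumulator `acc` and the `assert` preconditions are added for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Reference single-precision FIR filter (dot-product form).
 * Assumptions (not stated on the slide): xk holds at least
 * nyk + na - 1 samples, a holds na coefficients, and yk has
 * room for nyk output samples.
 */
void fir_sp(const float *restrict xk, const float *restrict a,
            float *restrict yk, int na, int nyk)
{
    assert(xk != NULL && a != NULL && yk != NULL);
    assert(na > 0 && nyk >= 0);

    for (int i = 0; i < nyk; i++) {
        float acc = 0.0f;              /* accumulator, like A5 after ZERO */

        /* FIR filter algorithm - dot product */
        for (int j = 0; j < na; j++) {
            acc += a[j] * xk[i + j];   /* two loads, multiply, accumulate */
        }

        yk[i] = acc;                   /* one store per output sample */
    }
}
```

Calling `fir_sp(xk, a, yk, 3, 4)` on a 6-sample input produces 4 outputs, each the dot product of the 3 coefficients with a sliding window of `xk`. Comparing these values with the buffer written by `fir_sp_asm` is one simple way to validate the assembly version.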

---



### EMBEDDED SYSTEMS GLOSSARY

#### A

- ABI : Application Binary Interface
- ADC : Analog to Digital Converter
- ALU : Arithmetic and Logical Unit
- AMD : Advanced Micro Devices
- ANSI : American National Standards Institute
- AP : Application Processor
- API : Application Programming Interface
- APU : Accelerated Processing Unit
- ARM : British fabless company designing 32-bit RISC CPUs
- ASCII : American Standard Code for Information Interchange
- ASIC : Application Specific Integrated Circuit

#### B

- BP : Base Pointer
- BSL : Board Support Library
- BSP : Board Support Package

#### C

- CCS : Code Composer Studio
- CEM : Compatibilité ElectroMagnétique (electromagnetic compatibility, EMC)
- CISC : Complex Instruction Set Computer
- CMS : Composant Monté en Surface (surface-mount device, SMD)
- CPU : Central Processing Unit
- CSL : Chip Support Library

#### D

- DAC : Digital to Analog Converter
- DDR : Double Data Rate
- DDR SDRAM: Double Data Rate Synchronous Dynamic Random Access Memory
- DIP : Dual Inline Package (component package)
- DMA : Direct Memory Access
- DSP : Digital Signal Processor
- DSP : Digital Signal Processing

#### E

- EDMA : Enhanced Direct Memory Access
- EUSART : Enhanced Universal Synchronous Asynchronous Receiver Transmitter
- EMIF : External Memory Interface
- EPIC : Explicitly Parallel Instruction Computing



#### F

- FPU : Floating Point Unit
- FLOPS : Floating-Point Operations Per Second
- FMA: Fused Multiply-Add

#### G

- GCC : GNU Compiler Collection
- GLCD : Graphical Liquid Crystal Display
- GNU : GNU's Not Unix
- GPIO : General Purpose Input Output
- GPGPU : General Purpose GPU
- GPP : General Purpose Processor, or MPU
- GPU : Graphics Processing Unit

#### I

- IA-64 : Intel Architecture 64-bit
- I2C : Inter Integrated Circuit
- IC : Integrated Circuit
- ICC : Intel C++ Compiler
- ICC : Interface Chaise Clavier ("chair-keyboard interface", i.e. the user: the main root of problems)
- IDE : Integrated Development Environment
- IDMA : Internal Direct Memory Access
- IHM : Interface Homme Machine (human-machine interface, HMI)
- IRQ : Interrupt ReQuest
- ISR : Interrupt Software Routine
- ISR : Interrupt Service Routine

#### L

- L1D : Level 1 Data Memory
- L1I : Level 1 Instruction Memory (same as L1P)
- L1P : Level 1 Program Memory (same as L1I)
- Lx : Level x Memory
- LCD : Liquid Crystal Display
- LRU : Least Recently Used



#### M

- MAC: Multiply Accumulate
- MCU : Micro Controller Unit
- MFLOPS : Mega Floating Point Operations Per Second
- MIMD : Multiple Instruction, Multiple Data
- MIPS : Million Instructions Per Second
- MISD : Multiple Instruction, Single Data
- MMACS : Mega MACs Per Second (see the MAC definition above)
- MMU : Memory Management Unit
- MPLABX : MicrochiP LABoratory 10, Microchip IDE
- MPU : Micro Processor Unit, or GPP
- MPU : Memory Protection Unit
- MPPA : Massively Parallel Processor Array

#### O

- OS : Operating System

#### P

- PC : Program Counter
- PC : Personal Computer
- PCB : Printed Circuit Board
- PIC18 : Microchip 8-bit MCU family
- PIC : Programmable Interrupt Controller
- PLD : Programmable Logic Device
- POSIX : Portable Operating System Interface (IEEE 1003 standard)
- PPC : Power PC

#### R

- RAM : Random Access Memory
- RISC : Reduced Instruction Set Computer
- RS232 : Standard defining an asynchronous serial protocol
- RTOS : Real Time Operating System
- RTS : Real Time System



#### S

- SDK : Software Development Kit
- SIMD : Single Instruction Multiple Data
- SIP : System In Package
- SOB : System On Board
- SOC : System On Chip
- SOP : Sum of Products
- SP : Stack Pointer
- SP : Serial Port
- SPI : Serial Peripheral Interface
- SRAM : Static Random Access Memory
- SSE : Streaming SIMD Extensions
- STM32 : STMicroelectronics 32-bit MCU

#### T

- TI : Texas Instruments
- TNS : Traitement Numérique du Signal (digital signal processing, DSP)
- TSC : Time Stamp Counter
- TTM : Time To Market

#### U

- UART : Universal Asynchronous Receiver Transmitter
- USB : Universal Serial Bus

#### V

- VHDL : VHSIC Hardware Description Language
- VHSIC : Very High Speed Integrated Circuit
- VLIW : Very Long Instruction Word



























