This project provides guidelines, tips and tutorials for building modern parallel code in C++17/20, implemented using the CL/SYCL programming model, and running it on Raspberry Pi 4B IoT boards, based on the quad-core, 64-bit ARM Cortex-A72 (ARMv8-A) CPU.
Readers will learn how to set up a Raspberry Pi 4B IoT board out of the box and use it for parallel computing: writing parallel code in C++17 with the open-source Khronos triSYCL and Aksel Alpay's hipSYCL distributions, installing and configuring the GNU Compiler Collection (GCC) and LLVM/Clang-9.x.x Arm/Aarch64 toolchains, building the executables and running them in the Raspbian Buster 10.6 OS.
Raspberry Pi 4B IoT Boards Overview
The next generation of Raspberry Pi 4B IoT boards, based on ARM's symmetric multi-core 64-bit CPUs, offers enough performance to make parallel computing at the edge practical. Using the latest Raspberry Pi boards can noticeably speed up edge workloads, such as collecting and pre-processing data in real time before delivering it to a data center for exascale processing. Running these processes in parallel increases the efficiency of cloud-based solutions that serve billions of client requests or provide data analytics and inference.
Before we discuss building and running parallel code in C++17, designed using the CL/SYCL heterogeneous programming model specification, on Raspberry Pi boards with the Arm/Aarch64 architecture, let’s take a short look at the Raspberry Pi 4B boards and their technical specs:
The Raspberry Pi 4B IoT boards are built around the Broadcom BCM2711B0 SoC, equipped with a quad-core, 64-bit ARM Cortex-A72 CPU @ 1.5 GHz, which provides the performance and scalability needed for parallel computing at the edge.
The Raspberry Pi is known as a “reliable” and “fast” palm-sized computer that can be used for data mining and parallel computing. The hardware features of ARM's symmetric multi-core 64-bit CPUs, such as DSP, SIMD, VFPv4 and hardware virtualization support, can significantly improve the performance and scalability of IoT clusters that process data at the edge.
Specifically, one of the most important advantages of the latest Raspberry Pi 4B boards is the low-profile LPDDR4 memory, available in 2, 4 or 8 GiB capacities, operating at 3200 MHz and providing high memory bandwidth, which benefits the performance of parallel computing in general. Boards with 4 GiB of RAM or more are strongly recommended for data mining and parallel computing. The BCM2711B0 SoC also integrates a variety of devices and peripherals, such as a Broadcom VideoCore VI GPU @ 500 MHz, a PCIe interface and a Gigabit Ethernet controller.
For building and running modern parallel code in C++17, implemented using the CL/SYCL heterogeneous programming model, the first thing we need is a Raspberry Pi 4B IoT board with the latest Raspbian Buster 10.6 OS installed and configured for first use.
Here is a brief checklist of the hardware and software requirements that must be met beforehand:
Hardware:
- Raspberry Pi 4 Model B, 4 GB IoT board;
- Micro-SD Card 16GB For Raspbian OS And Data Storage;
- DC Power Supply: 5.0V/2-3A via USB Type-C connector (minimum 3A - for data mining and parallel computing);
Software:
- Raspbian Buster 10.6.0 Full OS;
- Raspberry Pi Imager 1.4;
- MobaXterm 20.3 build 4396, or any other SSH-client;
Now that we have a Raspberry Pi 4B IoT board, we can proceed with turning it on and setting it up out of the box.
Setting Up A Raspberry Pi 4B IoT Board
Before we begin, we must download the latest release of the Raspbian Buster 10.6.0 Full OS image from the official Raspberry Pi repository. To write the Raspbian OS image to the SD card, we will also need to download the Raspberry Pi Imager 1.4 application, available for a variety of platforms, such as Windows, Linux and macOS.
Additionally, we must download and install the MobaXterm application for connecting to the Raspberry Pi board remotely over the SSH or FTP protocols.
Once the Raspbian Buster OS image and the Imager application have been downloaded and installed, we will use the Imager application to do the following:
1. Erase the SD card, formatting it with the FAT32 filesystem, by default;
2. Write the Raspbian Buster OS image (*.img) to the SD card;
Once the steps above have been completed, remove the SD card from the card reader and plug it into the Raspberry Pi board’s SD card slot. Then attach the micro-HDMI and Ethernet cables. Finally, plug in the DC power supply cable's connector and turn on the board. The system will boot into the Raspbian Buster OS installed on the SD card and prompt you to perform several post-installation steps to configure it for first use.
Once the board has been powered on, make sure that all of the following post-installation steps have been completed:
1. Open the bash-console and set the ‘root’ password:
pi@raspberrypi4:~ $ sudo passwd root
2. Log in to the Raspbian bash-console with 'root' privileges:
pi@raspberrypi4:~ $ sudo -s
3. Upgrade the Raspbian's Linux base system and firmware, using the following commands:
root@raspberrypi4:~# sudo apt update
root@raspberrypi4:~# sudo apt full-upgrade
root@raspberrypi4:~# sudo rpi-update
4. Reboot the system, for the first time:
root@raspberrypi4:~# sudo shutdown -r now
5. Install the latest bootloader (EEPROM) update and reboot the system once again:
root@raspberrypi4:~# sudo rpi-eeprom-update -d -a
root@raspberrypi4:~# sudo shutdown -r now
6. Launch the 'raspi-config' setup tool:
root@raspberrypi4:~# sudo raspi-config
7. Complete the following steps, using the 'raspi-config' tool:
* Update the 'raspi-config' tool;
* Disable the Raspbian's Desktop GUI on boot (System Options >> Boot / Autologin >> Console Autologin);
* Expand the root ‘/’ partition size on the SD-card.
After completing the Raspbian post-install configuration, reboot the system. After rebooting, you will be prompted to log in. Use the ‘root’ username and the password set previously to log in to the bash-console with root privileges.
Once you have logged in, install the following packages from the APT repositories by using this command in the bash-console:
root@raspberrypi4:~# sudo apt install -y net-tools openssh-server
These two packages are required for configuring the Raspberry Pi's network interface and the OpenSSH server, so that the board can be reached remotely over SSH by using MobaXterm.
Configure the board’s network interface ‘eth0’ by modifying /etc/network/interfaces, for example:
auto eth0
iface eth0 inet static
address 192.168.87.100
netmask 255.255.255.0
broadcast 192.168.87.255
gateway 192.168.87.254
dns-nameservers 192.168.87.254
After configuring the network interface, perform a basic configuration of the OpenSSH server by uncommenting (and, where necessary, changing) these lines in /etc/ssh/sshd_config:
PermitRootLogin yes
StrictModes no
PasswordAuthentication yes
PermitEmptyPasswords yes
These settings enable the 'root' login to the bash-console over SSH with password authentication; PermitEmptyPasswords additionally permits logins for accounts whose password is empty.
Finally, try connecting to the board over the network: use the MobaXterm application to open a remote SSH session to the host with the IP address 192.168.87.100. You should be able to log in to the Raspbian bash-console with the credentials set previously.
In 2020, Khronos Group, Intel Corp. and other vendors announced a new heterogeneous parallel compute platform (XPU), which makes it possible to offload the execution of "heavy" data processing workloads to a wide range of hardware acceleration targets (e.g. GPGPUs or FPGAs) other than the host CPU. Conceptually, parallel code development for the XPU platform is based entirely on the Khronos CL/SYCL programming model specification, an abstraction layer on top of OpenCL.
Here's a tiny example, illustrating the code in C++17, implemented using the CL/SYCL-model abstraction layer:
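#include <CL/sycl.hpp>
#include <cstddef>

constexpr std::size_t N = 1000;

int main() {
    // Default-constructed queue: kernels are submitted to the default device
    cl::sycl::queue q{};
    q.submit([&](cl::sycl::handler &cgh) {
        cgh.parallel_for<class Kernel>(cl::sycl::range<1>{N},
            [=](cl::sycl::id<1> idx) {
                // Do some work in parallel
            });
    });
    // Block until the submitted kernel has finished
    q.wait();
    return 0;
}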
The C++17 code fragment shown above relies entirely on the CL/SYCL programming model. It instantiates a cl::sycl::queue{} object with a default initializer list, so that SYCL kernels are submitted for execution to the default acceleration target, which here is the host CPU. Next, it invokes the cl::sycl::queue::submit(...) method, passing a lambda-function that takes a single cl::sycl::handler{} argument. The handler gives access to the methods that provide the basic kernel functionality, built on a variety of parallel algorithms, including the cl::sycl::handler::parallel_for(...) method.
This method implements a tight parallel loop that is executed as a kernel. Each iteration of the loop is executed in parallel, by its own thread. The cl::sycl::handler::parallel_for(...) method accepts two main arguments: a cl::sycl::range<>{} object and a lambda-function invoked for each loop iteration. The cl::sycl::range<>{} object defines the number of parallel loop iterations executed in each dimension, for the case when multiple nested loops are collapsed while processing multi-dimensional data.
In the code above, the cl::sycl::range<1>{N} object schedules N iterations of the parallel loop in a single dimension. The lambda-function passed to parallel_for(...) accepts a single argument, a cl::sycl::id<>{} object. Like cl::sycl::range<>{}, this object is a vector-like container in which each element is the index value of the current iteration in the corresponding dimension. Passed as an argument into the lambda-function's scope, it is used for retrieving those index values. The lambda-function's body contains the code that does the actual data processing, in parallel.
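To make the relationship between a multi-dimensional range and its per-iteration index more concrete, here is a minimal host-only sketch in plain C++17, with no CL/SYCL headers involved. It assumes a row-major layout; the names Range2D, Id2D, total_iterations, linearize and delinearize are hypothetical helpers introduced for illustration and are not part of any CL/SYCL library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for a 2-D iteration range and a 2-D index.
struct Range2D { std::size_t rows; std::size_t cols; };
struct Id2D    { std::size_t row;  std::size_t col;  };

// Total number of iterations once the two nested loops are collapsed.
inline std::size_t total_iterations(Range2D r) {
    return r.rows * r.cols;
}

// Map a 2-D index to its position in the collapsed 1-D loop (row-major).
inline std::size_t linearize(Range2D r, Id2D id) {
    assert(id.row < r.rows && id.col < r.cols);
    return id.row * r.cols + id.col;
}

// Recover the 2-D index from a position in the collapsed 1-D loop.
inline Id2D delinearize(Range2D r, std::size_t linear) {
    assert(r.cols > 0 && linear < total_iterations(r));
    return Id2D{linear / r.cols, linear % r.cols};
}
```

For a 3 x 4 range, the index (2, 1) maps to position 2 * 4 + 1 = 9 in the collapsed loop, and position 9 maps back to (2, 1).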
After the kernel has been submitted to the queue and launched, the code invokes the cl::sycl::queue::wait() method with no arguments. It acts as a synchronization barrier, ensuring that no further code is executed until the kernel has completed its parallel work.
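The submit-then-wait pattern described above can be mimicked on the host with plain C++17 threads. The sketch below is not CL/SYCL code and uses no CL/SYCL API; parallel_for_1d and squares are hypothetical helpers. The first splits n independent iterations across worker threads and joins them before returning, which plays the role of the barrier set by wait().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: run body(i) for every i in [0, n) on up to
// num_threads worker threads, then block until all of them finish.
template <typename Body>
void parallel_for_1d(std::size_t n, std::size_t num_threads, Body body) {
    assert(num_threads > 0);
    num_threads = std::min(num_threads, std::max<std::size_t>(n, 1));
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        // Each worker handles one contiguous block of indices.
        workers.emplace_back([=]() {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    // Joining every worker is the barrier: nothing after this loop
    // runs until all iterations have completed.
    for (auto &w : workers) w.join();
}

// Example use: fill a vector with the squares of its indices.
inline std::vector<int> squares(std::size_t n) {
    std::vector<int> data(n);
    int *out = data.data();
    parallel_for_1d(n, 4, [out](std::size_t i) {
        out[i] = static_cast<int>(i * i);
    });
    return data;
}
```

Every iteration writes to a different element, so no locking is needed. With GCC or Clang on Linux, code that uses std::thread is typically compiled with the -pthread option.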
The CL/SYCL heterogeneous programming model is highly efficient and can be used for a wide range of applications.
However, Intel Corp. and CodePlay Software Inc. soon deprecated the support of CL/SYCL for hardware architectures other than x86_64. This made it impossible to deliver parallel C++ code that uses their CL/SYCL libraries and targets Arm/Aarch64 or other architectures.
Presently, there are a number of open-source CL/SYCL library projects, developed by a large community of developers and enthusiasts, that support hardware architectures other than x86_64.
Since 2016, Khronos Group, Inc. has been releasing revisions of its open-source triSYCL library project (https://github.com/triSYCL/triSYCL), recommended as a testbed for evaluating the latest CL/SYCL programming model specification and for sending feedback to the Khronos and ISO committees. However, this library distribution is not "stable" and is suitable for demonstration purposes only, not for building production CL/SYCL code. In addition, the Khronos triSYCL distribution supports cross-compilation on an x86_64 development machine with the GNU Arm/Aarch64 cross-platform toolchain, as an alternative to building code "natively" with the LLVM/Clang compilers on a Raspberry Pi.
In 2019, Aksel Alpay, at Heidelberg University (Germany), implemented a library for the latest CL/SYCL programming model specification that targets a variety of hardware architectures, including the Raspberry Pi's Arm/Aarch64 architecture, and published the most "stable" release of the open-source hipSYCL library distribution on GitHub (https://github.com/illuhad/hipSYCL).
Further in this story, we will discuss installing and configuring GNU's cross-platform GCC/G++-10.x.x and the "native" Arm/Aarch64 LLVM/Clang-9.x.x toolchains, and using the triSYCL and hipSYCL library distributions to deliver modern parallel code in C++17.
Building A CL/SYCL-Code On Debian/Ubuntu Development Machine (x86_64) And Raspberry Pi IoT-Boards
There are basically two methods of building the CL/SYCL code in C++17 introduced above: using GNU's GCC/G++-10.x.x cross-platform toolchain on an x86_64 Debian/Ubuntu-based development machine, or "natively" on a Raspberry Pi IoT board with LLVM/Clang-9.x.x for Arm/Aarch64 hardware architectures installed.
The first method builds C++17/20 sources, implemented using the Khronos triSYCL library, with GNU's cross-platform Arm/Aarch64 toolchain on the Debian/Ubuntu-based x86_64 development machine, prior to running the result on a Raspberry Pi.
Deploying the x86_64 development machine requires an installation of the latest Debian Buster 10.6.0 or Ubuntu 20.04 LTS.
To use the development machine on a host computer running Microsoft Windows 10, any existing virtualization environment of choice (e.g. Oracle VirtualBox or VMware Workstation) can be used.
To get started with the development machine deployment, first set up the virtualization environment, create a virtual machine and launch the Debian or Ubuntu installation.
Once the virtual machine has been created and Debian/Ubuntu has been installed, we can proceed with installing and configuring GNU's GCC/G++-10.x.x cross-platform compilers, the development tools and the Khronos triSYCL library, required for building code that targets the Raspberry Pi's Arm/Aarch64 architecture.
Prior to installing and configuring the GCC/G++ compilers toolchain and runtime libraries, make sure that the following prerequisite steps have been completed:
- Upgrade the Debian/Ubuntu’s Linux base system:
root@uarmhf64-dev:~# sudo apt update
root@uarmhf64-dev:~# sudo apt upgrade -y
root@uarmhf64-dev:~# sudo apt full-upgrade -y
This step ensures that the Debian/Ubuntu installation on the x86_64 development machine is up to date, with the latest kernel and packages installed.
- Install ‘net-tools’ and OpenSSH-server packages from APT-repository:
root@uarmhf64-dev:~# sudo apt install -y net-tools openssh-server
The ‘net-tools’ and ‘openssh-server’ packages make it possible to configure the development machine's network interface and to connect to the running development machine remotely over the SSH and FTP protocols.
Once the system has been upgraded and all required packages have been installed, we can proceed with installing and configuring the compilers and toolchains.
Installing And Configuring GNU's GCC/G++-10.x.x
1. Install the GNU Compiler Collection (GCC) toolchain for the x86_64 platform:
root@uarmhf64-dev:~# sudo apt install -y build-essential
2. Install the GNU’s cross-platform Arm64/Armhf toolchains:
root@uarmhf64-dev:~# sudo apt install -y crossbuild-essential-arm64
root@uarmhf64-dev:~# sudo apt install -y crossbuild-essential-armhf
The cross-platform toolchains for the Arm64/Armhf hardware architectures are required for building parallel C++17 code that uses the triSYCL library on the x86_64 development machine.
3. Install the required GNU GCC/G++ compilers and the OpenMP 5.0, Boost, Range-v3, POSIX Threads and C/C++ standard runtime libraries:
root@uarmhf64-dev:~# sudo apt install -y g++-10 libomp-dev libomp5 libboost-all-dev librange-v3-dev libc++-dev libc++1 libc++abi-dev libc++abi1 libpthread-stubs0-dev libpthread-workqueue-dev
4. Install the GNU GCC/G++-10.x.x cross-platform compilers for building code that targets the 32-bit Arm (armel/armhf) architectures:
root@uarmhf64-dev:~# sudo apt install -y gcc-10-arm-linux-gnueabi g++-10-arm-linux-gnueabi gcc-10-arm-linux-gnueabihf g++-10-arm-linux-gnueabihf
5. Make GCC/G++-10.x.x the default “native” x86_64 compilers by updating the alternatives:
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 1
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 2
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 1
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 2
sudo update-alternatives --install /usr/bin/cc cc /usr/bin/gcc 3
sudo update-alternatives --set cc /usr/bin/gcc
sudo update-alternatives --install /usr/bin/c++ c++ /usr/bin/g++ 3
sudo update-alternatives --set c++ /usr/bin/g++
6. Make GCC/G++-10.x.x the default cross-platform Arm (armhf) compilers by updating the alternatives:
sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-gcc arm-linux-gnueabihf-gcc /usr/bin/arm-linux-gnueabihf-gcc-9 1
sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-gcc arm-linux-gnueabihf-gcc /usr/bin/arm-linux-gnueabihf-gcc-10 2
sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-g++ arm-linux-gnueabihf-g++ /usr/bin/arm-linux-gnueabihf-g++-9 1
sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-g++ arm-linux-gnueabihf-g++ /usr/bin/arm-linux-gnueabihf-g++-10 2
sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-cc arm-linux-gnueabihf-cc /usr/bin/arm-linux-gnueabihf-gcc 3
sudo update-alternatives --set arm-linux-gnueabihf-cc /usr/bin/arm-linux-gnueabihf-gcc
sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-c++ arm-linux-gnueabihf-c++ /usr/bin/arm-linux-gnueabihf-g++ 3
sudo update-alternatives --set arm-linux-gnueabihf-c++ /usr/bin/arm-linux-gnueabihf-g++
7. Finally, check if the correct versions of the GNU’s “native” and cross-platform toolchains are installed:
root@uarmhf64-dev:~# gcc --version && g++ --version
root@uarmhf64-dev:~# arm-linux-gnueabihf-gcc --version
root@uarmhf64-dev:~# arm-linux-gnueabihf-g++ --version
8. Navigate to the /opt directory and clone the Khronos triSYCL library distribution from the GitHub repository:
root@uarmhf64-dev:~# cd /opt
root@uarmhf64-dev:~# git clone --recurse-submodules https://github.com/triSYCL/triSYCL
These commands create the /opt/triSYCL sub-directory, containing the sources of the triSYCL library distribution.
9. Copy the triSYCL library’s C++ header files from the /opt/triSYCL/include directory to the default location /usr/include/c++/10/ on the development machine, by using the ‘rsync’ command:
root@uarmhf64-dev:~# cd /opt/triSYCL
root@uarmhf64-dev:~# sudo rsync -r ./include/ /usr/include/c++/10/
10. Set the environment variable required for using the triSYCL library with the GNU cross-platform toolchain installed previously, and append it to the .bashrc profile script:
export CPLUS_INCLUDE_PATH=/usr/include/c++/10
echo "export CPLUS_INCLUDE_PATH=/usr/include/c++/10" >> /root/.bashrc
11. Perform a simple clean-up, by removing the /opt/triSYCL sub-directory:
root@uarmhf64-dev:~# rm -rf /opt/triSYCL
12. Build the ‘hello.cpp’ code sample using the “native” x86_64 GNU’s GCC/G++ compiler:
root@uarmhf64-dev:~# g++ -std=c++17 -o hello hello.cpp -lpthread -lstdc++
Building C++17/20 code that uses the Khronos triSYCL library requires linking against the POSIX threads and C++ standard runtime libraries.
13. Build the ‘hello.cpp’ code sample using the GNU’s cross-platform GCC/G++ compiler:
root@uarmhf64-dev:~# arm-linux-gnueabihf-g++ -std=c++17 -o hello_rpi4b hello.cpp -lpthread -lstdc++
Once the executable for the Arm architecture has been generated, download it from the development machine via FTP or SSH using the MobaXterm application. Then upload the 'hello_rpi4b' executable to the Raspberry Pi board over another SSH session.
To run the 'hello_rpi4b' executable, use the following commands in the Raspbian bash-console, for example:
root@raspberrypi4:~# chmod +rwx hello_rpi4b
root@raspberrypi4:~# ./hello_rpi4b > output.txt && cat output.txt
This writes the program's output to the 'output.txt' file and prints its contents to the bash-console:
Hello from triSYCL on Rasberry Pi 4B+!!!
Hello from triSYCL on Rasberry Pi 4B+!!!
Hello from triSYCL on Rasberry Pi 4B+!!!
Hello from triSYCL on Rasberry Pi 4B+!!!
Hello from triSYCL on Rasberry Pi 4B+!!!
Note: Normally, the first method does not require building the Khronos triSYCL library distribution from its sources, unless you plan to use triSYCL with other HPC libraries, such as OpenCL, OpenMP or TBB. For more information on using triSYCL along with other libraries, refer to the following guidelines and documentation: https://github.com/triSYCL/triSYCL/blob/master/doc/cmake.rst
The second method uses Aksel Alpay's open-source hipSYCL library distribution and the "native" LLVM/Clang-9.x.x compiler toolchain, targeting the Arm/Aarch64 architecture, to build CL/SYCL code in C++17/20 for running on Raspberry Pi boards. Building the code natively is only possible when both the LLVM/Clang-9.x.x toolchain and the hipSYCL library distribution are installed on the Raspberry Pi board itself, not on the x86_64 development machine.
Further, we will discuss everything needed for installing and configuring the LLVM/Clang-9.x.x compiler toolchain on a Raspberry Pi board, as well as building Aksel Alpay's hipSYCL library from sources.
Installing And Configuring LLVM/Clang-9.x.x
Before using Aksel Alpay's hipSYCL library distribution, the LLVM/Clang-9.x.x compilers and the Arm/Aarch64 toolchains must be properly installed and configured. To do that, make sure that you've completed the steps listed below:
1. Update the Raspbian's APT-repositories and install the following prerequisite packages:
root@raspberrypi4:~# sudo apt update
root@raspberrypi4:~# sudo apt install -y bison flex python python3 snapd git wget
The command above installs the alternative 'snap' package manager (the 'snapd' package), required for installing a suitable version (>= 3.18.0) of the 'cmake' utility, as well as the 'python' and 'python3' distributions and the 'bison' and 'flex' utilities, needed for building the hipSYCL open-source project from scratch with 'cmake'.
2. Install the 'cmake' >= 3.18.0 utility and the LLVM/Clang 'clangd' language server by using the 'snap' package manager:
root@raspberrypi4:~# sudo snap install cmake --classic
root@raspberrypi4:~# sudo snap install clangd --classic
After installing the 'cmake' utility, check that it works and that the correct version has been installed from the 'snap' repository, by using the command below:
root@raspberrypi4:~# sudo cmake --version
You should see the following output after running this command:
cmake version 3.18.4
CMake suite maintained and supported by Kitware (kitware.com/cmake).
3. Install the latest Boost, POSIX-Threads and C/C++ standard runtime libraries for the LLVM/Clang toolchain:
root@raspberrypi4:~# sudo apt install -y libc++-dev libc++1 libc++abi-dev libc++abi1 libpthread-stubs0-dev libpthread-workqueue-dev
root@raspberrypi4:~# sudo apt install -y clang-format clang-tidy clang-tools clang libc++-dev libc++1 libc++abi-dev libc++abi1 libclang-dev libclang1 liblldb-dev libllvm-ocaml-dev libomp-dev libomp5 lld lldb llvm-dev llvm-runtime llvm python-clang libboost-all-dev
4. Download and add the LLVM/Clang's APT-repositories security key:
root@raspberrypi4:~# wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
5. Append the LLVM/Clang repository URLs to the APT sources list (/etc/apt/sources.list.d/raspi.list):
root@raspberrypi4:~# echo "deb http://apt.llvm.org/buster/ llvm-toolchain-buster main" >> /etc/apt/sources.list.d/raspi.list
root@raspberrypi4:~# echo "deb-src http://apt.llvm.org/buster/ llvm-toolchain-buster main" >> /etc/apt/sources.list.d/raspi.list
Steps 4 and 5 are necessary for installing the LLVM/Clang-9.x.x compilers and toolchains from that APT repository.
6. Remove the existing symlinks to the previous versions of the LLVM/Clang, installed:
root@raspberrypi4:~# cd /usr/bin && rm -f clang clang++
7. Update the APT-repositories, once again, and install the LLVM/Clang's compilers, debugger and linker:
root@raspberrypi4:~# sudo apt update
root@raspberrypi4:~# sudo apt install -y clang-9 lldb-9 lld-9
8. Create the corresponding symlinks to the 'clang-9' and 'clang++-9' compilers, installed:
root@raspberrypi4:~# cd /usr/bin && ln -s clang-9 clang
root@raspberrypi4:~# cd /usr/bin && ln -s clang++-9 clang++
9. Finally, the 'clang' and 'clang++' commands should be available in the bash-console. Check the version of LLVM/Clang that has been installed:
root@raspberrypi4:~# clang --version && clang++ --version
After running the command, you should see the following output:
clang version 9.0.1-6+rpi1~bpo10+1
Target: armv6k-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin
clang version 9.0.1-6+rpi1~bpo10+1
Target: armv6k-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin
Downloading And Building hipSYCL Library Distribution
Another essential step is downloading and building the open-source hipSYCL library staging distribution from its sources, published on GitHub.
This is typically done by completing the following steps:
1. Download the hipSYCL project's distribution, cloning it from GitHub:
root@raspberrypi4:~# git clone https://github.com/llvm/llvm-project llvm-project
root@raspberrypi4:~# git clone --recurse-submodules https://github.com/illuhad/hipSYCL
Aksel Alpay's hipSYCL project has several dependencies on another open-source project, LLVM/Clang. That is why we need to clone both distributions for building the hipSYCL library runtime from scratch.
2. Set the environment variables required for building the hipSYCL project from sources by using the 'export' command, and append the same lines to the .bashrc profile script so that they persist across sessions:
export LLVM_INSTALL_PREFIX=/usr
export LLVM_DIR=~/llvm-project/llvm
export CLANG_EXECUTABLE_PATH=/usr/bin/clang++
export CLANG_INCLUDE_PATH=$LLVM_INSTALL_PREFIX/include/clang/9.0.1/include
echo 'export LLVM_INSTALL_PREFIX=/usr' >> /root/.bashrc
echo 'export LLVM_DIR=~/llvm-project/llvm' >> /root/.bashrc
echo 'export CLANG_EXECUTABLE_PATH=/usr/bin/clang++' >> /root/.bashrc
echo 'export CLANG_INCLUDE_PATH=$LLVM_INSTALL_PREFIX/include/clang/9.0.1/include' >> /root/.bashrc
3. Create and change to the ~/hipSYCL/build sub-directory under the hipSYCL project's main directory:
root@raspberrypi4:~# mkdir ~/hipSYCL/build && cd ~/hipSYCL/build
4. Configure the hipSYCL project's sources using 'cmake' utility:
root@raspberrypi4:~# cmake -DCMAKE_INSTALL_PREFIX=/opt/hipSYCL ..
5. Build and install the hipSYCL runtime library using the GNU's 'make' command:
root@raspberrypi4:~# make -j $(nproc) && make install -j $(nproc)
6. Copy the libhipSYCL-rt.so runtime library to the Raspbian's default libraries location:
root@raspberrypi4:~# cp /opt/hipSYCL/lib/libhipSYCL-rt.so /usr/lib/libhipSYCL-rt.so
7. Set the environment variables required for using the hipSYCL runtime library and the LLVM/Clang compilers to build source code, and append the same lines to the .bashrc profile script:
export PATH=$PATH:/opt/hipSYCL/bin
export C_INCLUDE_PATH=$C_INCLUDE_PATH:/opt/hipSYCL/include
export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/opt/hipSYCL/include
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hipSYCL/lib
echo 'export PATH=$PATH:/opt/hipSYCL/bin' >> /root/.bashrc
echo 'export C_INCLUDE_PATH=$C_INCLUDE_PATH:/opt/hipSYCL/include' >> /root/.bashrc
echo 'export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/opt/hipSYCL/include' >> /root/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hipSYCL/lib' >> /root/.bashrc
Running A Parallel CL/SYCL-Code In C++17 On Raspberry Pi 4B
Now that LLVM/Clang and the hipSYCL library are installed and configured, it's strongly recommended to build and run the 'matmul_hipsycl' sample's executable to make sure that everything works.
Here are the most common steps for building this sample from sources:
rm -rf ~/sources
mkdir ~/sources && cd ~/sources
cp ~/matmul_hipsycl.tar.gz ~/sources/matmul_hipsycl.tar.gz
tar -xvf matmul_hipsycl.tar.gz
rm -f matmul_hipsycl.tar.gz
The commands above create the ~/sources sub-directory and extract the sample's sources from the matmul_hipsycl.tar.gz archive.
To build the sample's executable, simply use the GNU's 'make' command:
root@raspberrypi4:~# make all
This will invoke the 'syclcc-clang' compiler wrapper to build the executable:
syclcc-clang -O3 -std=c++17 -o matrix_mul_rpi4 src/matrix_mul_rpi4b.cpp -lstdc++
This command compiles the C++17 code with the highest level of optimization (-O3) enabled and links it with the C++ standard library runtime.
Note: Along with the library runtime, the hipSYCL project, once built, also provides the 'syclcc' and 'syclcc-clang' tools, used for building parallel C++17 code implemented with the hipSYCL library. Using these tools is slightly different from the regular usage of the 'clang' and 'clang++' commands. However, 'syclcc' and 'syclcc-clang' accept the same compiler and linker options as the original 'clang' and 'clang++' commands.
After compiling with these tools, grant execution privileges to the 'matrix_mul_rpi4' file generated by the compiler, using the command below:
root@raspberrypi4:~# chmod +rwx matrix_mul_rpi4
Then run the executable in the bash-console:
root@raspberrypi4:~# ./matrix_mul_rpi4
Running it produces the following output:
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Multiplication C = A x B:
Matrix C:
323 445 243 343 363 316 495 382 463 374
322 329 328 388 378 395 392 432 470 326
398 357 337 366 386 407 478 457 520 374
543 531 382 470 555 520 602 534 639 505
294 388 277 314 278 330 430 319 396 372
447 445 433 485 524 505 604 535 628 509
445 468 349 432 511 391 552 449 534 470
434 454 339 417 502 455 533 498 588 444
470 340 416 364 401 396 485 417 496 464
431 421 325 325 272 331 420 385 419 468
Execution time: 5 ms
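The source of the 'matrix_mul_rpi4b.cpp' sample is not reproduced here. As a rough illustration of the kind of work such a kernel performs, the following host-only C++17 sketch multiplies two square integer matrices, assigning the rows of C to separate threads, and times the run with std::chrono. It is a stand-in written under stated assumptions, not the sample's actual code: the names Matrix, matmul_rows_parallel and time_matmul_ms are hypothetical, the matrices are assumed to be square and stored row-major, and plain std::thread is used instead of the hipSYCL runtime.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical row-major storage for an n x n integer matrix.
using Matrix = std::vector<int>;

// Compute C = A x B; thread t handles rows t, t + num_threads, ...
inline Matrix matmul_rows_parallel(const Matrix &a, const Matrix &b,
                                   std::size_t n, std::size_t num_threads) {
    assert(a.size() == n * n && b.size() == n * n && num_threads > 0);
    Matrix c(n * n, 0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&a, &b, &c, n, t, num_threads]() {
            // Rows are disjoint between threads, so no two threads
            // ever write the same element of C.
            for (std::size_t i = t; i < n; i += num_threads) {
                for (std::size_t j = 0; j < n; ++j) {
                    int sum = 0;
                    for (std::size_t k = 0; k < n; ++k)
                        sum += a[i * n + k] * b[k * n + j];
                    c[i * n + j] = sum;
                }
            }
        });
    }
    // Wait for all rows to be computed before returning the result.
    for (auto &w : workers) w.join();
    return c;
}

// Measure the wall-clock time of one multiplication, in milliseconds.
inline long long time_matmul_ms(const Matrix &a, const Matrix &b,
                                std::size_t n) {
    const auto start = std::chrono::steady_clock::now();
    const Matrix c = matmul_rows_parallel(a, b, n, 4);
    const auto stop = std::chrono::steady_clock::now();
    (void)c;  // the result itself is not needed for timing
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               stop - start).count();
}
```

For the 2 x 2 case A = [1 2; 3 4] and B = [5 6; 7 8], the function returns C = [19 22; 43 50]. Four threads are used here only because the board's CPU has four cores.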
Optionally, we can evaluate the performance of the running parallel code by installing and using the 'htop' utility ('top' is normally already available as part of the 'procps' package):
root@raspberrypi4:~# sudo apt install -y htop
The 'htop' utility visualizes the CPU and system memory utilization while the parallel code executable is running.
Using micro-FPGAs, as well as pocket-sized GPGPUs with compute capabilities, connected to an IoT board externally via GPIO or USB interfaces, is the next major step for parallel computing with IoT. Such tiny FPGAs and GPGPUs make it possible to perform even more complex and “heavy” computations in parallel, further increasing the speed-up when processing huge amounts of big data in real time.
Another essential aspect of parallel computing with IoT is the continued development of the libraries and frameworks that implement the CL/SYCL model specification and, thus, support the heterogeneous compute platform (XPU). Presently, the latest versions of these libraries support offloading parallel code execution to host CPU acceleration targets only, since other acceleration hardware, such as small-sized GPGPUs and FPGAs for nano-computers, has not yet been designed and manufactured by vendors.
Parallel computing with Raspberry Pi and other IoT boards is of special interest to software developers and hardware technicians who assess the performance of existing computational processes when running them in parallel on IoT devices.
In conclusion, IoT-based parallel computing generally benefits the overall performance of cloud-based solutions intended for collecting and massively processing big data in real time and, as a result, positively impacts the quality of machine learning (ML) and data analytics.