FaaM: FPGA-as-a-Microservice - A Case Study for Data Compression

Field-programmable gate arrays (FPGAs) have largely been used in communication and high-performance computing, and given the recent advances in big data and emerging trends in cloud computing (e.g., serverless [18]), FPGAs are increasingly being introduced into these domains (e.g., Microsoft's datacenters [6] and Amazon Web Services [10]). To address these domains' processing needs, recent research has focused on using FPGAs to accelerate workloads, ranging from analytics and machine learning to databases and network function virtualization. In this paper, we present an ongoing effort to realize a high-performance FPGA-as-a-microservice (FaaM) architecture for the cloud. We discuss some of the technical challenges and propose several solutions for efficiently integrating FPGAs into virtualized environments. Our case study, which deploys multithreaded, multi-user compression as a microservice using the FaaM architecture, indicates that microservices-based FPGA acceleration can sustain high performance compared to a straightforward implementation, with minimal to no communication overhead despite the hardware abstraction.


Introduction
With the rapidly increasing demand for cloud computing, there is a corresponding increased interest in using field-programmable gate arrays (FPGAs) to accelerate datacenter workloads. Given an FPGA's computational flexibility, FPGA-based accelerators have generally been applied to applications with intensive, high-performance computing (HPC) demands, achieving orders-of-magnitude improvements in performance and power efficiency compared to functionally equivalent central processing unit (CPU)-based implementations [4], [5], [6]. Additionally, an FPGA's reprogrammability makes FPGA-based accelerators highly suitable for datacenter-wide deployments, especially for workloads whose algorithms may change over time. However, the economics of scaling new, non-homogeneous datacenter architectures combining traditional CPUs with FPGAs remains a significant resource management challenge, encompassing deployment, maintainability, and composability across an entire datacenter infrastructure. Addressing these challenges is critical for minimizing operational costs and service downtime in large-scale production environments.
In spite of this management complexity, the emergence of hyperscale datacenters (i.e., datacenters with high scale-out capabilities) presents an opportunity for accelerator systems that tightly integrate CPUs and FPGAs (e.g., the Xeon+FPGA server platform [7]). While these tightly coupled servers enable acceleration of local applications that run on each server, to meet performance demands, users must be able to access and distribute applications across a large, global pool of FPGA accelerators that shares an optimized communication infrastructure. Finally, for ease of use, this pool must appear as a single datacenter resource that is accessible to multiple, simultaneous cloud users.

FPGA Microservices
Our approach uses microservices (a collection of loosely coupled accelerator services) to offer FPGA accelerators as a set of shared, lightweight services that scale dynamically with constantly changing datacenter workload demands. Using an FPGA-as-a-microservice (FaaM) architecture for the cloud, FPGA accelerator functionality can be offered as a microservice, enabling application developers to easily leverage many microservice characteristics, including auto-deployment, scalability, dynamic configuration, and disaster recovery [13]. Additionally, since a microservice is stateless, FPGA resources can be quickly provisioned without relying on extra virtualization technology [14], further reducing the time needed to relocate a microservice in the event of failure.

Design and Implementation
In this section, we describe our FaaM design and implementation, which is based on Docker containers. We prototype FaaM using x86-based Xeon+FPGA physical machines running a Linux operating system. We note that the proposed FaaM architecture is not restricted to Docker containers and Xeon+FPGA platforms but can be realized with other virtualization technologies and FPGA platforms. Fig. 1 depicts the general FaaM architecture, consisting of a Worker Node and a Service Node. To enable dynamic scaling with workload type and size, Worker Nodes are decoupled from Service Nodes. Service Nodes host FaaM services, providing a set of hardware-accelerated functions (e.g., a compression service). FaaM services are deployed in Service Containers, simplifying manageability by the FaaM Service Manager. To support load balancing, multiple instances of a Service Container can be deployed by the FaaM Service Manager as a group of identical services, providing fault-tolerant redundancy and scalability. Service Containers may also serve unique FaaM services, depending on how the FaaM Service Manager and the FPGAs in the Worker Nodes have been configured.
Each Worker Node runs a single instance of the FaaM Accelerator Manager, a separate (privileged) Docker container that provides accelerator management functions (e.g., reprogramming the FPGA or providing control and monitoring features). Under the control of the FaaM Accelerator Manager, a Worker Node hosts one or more Worker Containers drawn from a container repository that is accessible by all Worker Nodes. Each Worker Container abstracts a specific hardware accelerator function (e.g., a compression service), exposing the function as a web service and consequently enabling remote access by Service Containers. A high-speed Ethernet network connects Service Nodes with Worker Nodes. Worker Nodes sit behind a secured network, and cloud users have no way of directly interacting with the FPGAs or Worker Nodes, except through a set of web application programming interfaces (APIs) exposed by Service Nodes through Service Containers. The APIs are implemented using Java WebSockets, enabling point-to-point inter-node communication.

As shown in Fig. 1, a Worker Node is organized into three distinct layers: the FPGA accelerator, the task scheduler, and the Java virtual machine (JVM) runtime system. Fig. 2 illustrates the FPGA accelerator layer, where accelerator function units (AFUs) provide specific hardware functionality (e.g., compression, machine learning inference, etc.). The AFUs act as a pool of FPGA configurable resources to which these hardware functions can be assigned. A hardware function is constrained in size by the amount of logic resources on the AFU, and it must expose a Cache Coherent Interface Protocol (CCI-P) interface that connects to the CCI-P Interconnect block and to the rest of the components on the FPGA.
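To make this layering concrete, the following sketch shows how a Worker Container might expose a hardware function as a web service using the standard Java WebSocket API (JSR 356). The /compress endpoint path and the CompressionScheduler helper are illustrative names introduced here, not the actual FaaM code:

import java.io.IOException;
import java.nio.ByteBuffer;
import javax.websocket.OnMessage;
import javax.websocket.Session;
import javax.websocket.server.ServerEndpoint;

// A Worker Container endpoint that receives raw blocks from a Service
// Container and returns compressed blocks over the same connection.
@ServerEndpoint("/compress")
public class CompressionEndpoint {

    @OnMessage
    public void onMessage(ByteBuffer input, Session session) throws IOException {
        // Hand the request to the container-local task scheduler, which
        // dispatches it to an available AFU (or to the CPU fallback path).
        // CompressionScheduler is a hypothetical helper, not the paper's code.
        ByteBuffer compressed = CompressionScheduler.submit(input);
        session.getBasicRemote().sendBinary(compressed);
    }
}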

Scheduler
To schedule cloud users' jobs, we use a task scheduler that is local to each Worker Container. The scheduler's role is to admit threads from the web service and schedule them on the FPGA. When an accelerator request arrives, the scheduler examines low-level information from the hardware (such as which AFUs are currently idle) and makes dispatch decisions that match the corresponding thread to an available AFU. To maintain fair sharing of the AFUs, we use a first-come-first-serve (FCFS) scheduling policy and buffer sizes with minimal overhead, ranging between 32 KB and 128 KB, as further discussed in Section 4.2. When no accelerator function is available, the scheduler falls back to executing the thread on the CPU to maintain acceptable throughput.
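A minimal sketch of this dispatch logic is shown below. The FpgaAfu handle type is an assumption for illustration; the CPU fallback uses the JDK's built-in deflate implementation:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.zip.DeflaterOutputStream;

public class FcfsScheduler {

    /** Handle to one accelerator function unit on the FPGA (assumed type). */
    public interface FpgaAfu {
        ByteBuffer compress(ByteBuffer input);
    }

    // Idle AFUs form a FIFO pool; requests are served first-come-first-serve.
    private final BlockingQueue<FpgaAfu> idleAfus = new LinkedBlockingQueue<>();

    public FcfsScheduler(Iterable<FpgaAfu> afus) {
        afus.forEach(idleAfus::add);
    }

    public ByteBuffer schedule(ByteBuffer input) throws InterruptedException {
        FpgaAfu afu = idleAfus.poll(); // returns null if no AFU is idle
        if (afu == null) {
            return compressOnCpu(input); // fall back to the CPU path
        }
        try {
            return afu.compress(input);  // offload to the FPGA
        } finally {
            idleAfus.put(afu);           // return the AFU to the pool
        }
    }

    // Software ZLIB path using the JDK's deflate implementation.
    private ByteBuffer compressOnCpu(ByteBuffer input) {
        byte[] raw = new byte[input.remaining()];
        input.get(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return ByteBuffer.wrap(out.toByteArray());
    }
}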

Runtime System
We implemented an FPGA runtime system written in Java. The runtime system is designed as a dynamic library shared among the multiple threads that can be associated with user requests. We prototyped the runtime system atop the Accelerator Abstraction Layer (AAL) software stack provided by Intel. AAL provides low-level accelerator management functionality to the scheduler, allowing the scheduler to call into AAL's native C/C++ libraries. While the FPGA runtime system and task scheduler run together as a single JVM process, the web service runs as a separate process, allowing different types of application-facing web services to be integrated with the runtime system.

FPGA Integration Challenges and Solutions
In this section, we present several challenges and solutions when designing the FaaM architecture. To verify the proposed approach, we implement compression [15] as a microservice (CaaM) and evaluate runtime performance as well as overheads.

Software-FPGA Interaction
While FPGA accelerators are normally manipulated through C/C++ code or low-level libraries, many datacenter-scale applications and frameworks are written in Java (or another runtime-based language, such as Scala) running within a Java virtual machine (JVM). FPGAs are not natively supported by JVMs; thus, the first step in FPGA-to-application integration is to enable support for the FPGA in the JVM and bridge the gap between native C/C++ code and the application runtime. The Java Native Interface (JNI) is typically used to address this issue; however, JNI does not always deliver an efficient solution. In particular, the cost of moving data between the JVM heap and native memory can adversely impact application performance. Using SWIG (a wrapper and interface generator), we wrote a domain-specific language (DSL) script that automatically generates Java wrappers from native C++ classes. This approach saves a significant amount of time compared to debugging JNI code directly, while generating clean interfaces that are optimized for our specific native libraries (i.e., the AAL runtime libraries).
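As a rough illustration, SWIG consumes the native C++ headers (plus a small interface script) and emits Java proxy classes with native method declarations, together with a JNI glue library that binds them to the C++ implementation. The class, library, and method names below are hypothetical stand-ins, not the actual generated AAL bindings:

public class ZlibAfu {
    static {
        // Load the JNI glue library generated by SWIG, which wraps the
        // native AAL runtime libraries (library name is illustrative).
        System.loadLibrary("zlib_afu_jni");
    }

    // Native methods whose signatures SWIG derives from the C++ classes.
    public static native long open(int afuId);
    public static native int compress(long handle, java.nio.ByteBuffer src,
                                      java.nio.ByteBuffer dst);
    public static native void close(long handle);
}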

FPGA-to-Host-Memory Communication
Since data movement between the JVM and native memory can incur significant overhead, we leverage the non-blocking I/O (Java NIO) mechanism that is natively built into the Java framework. A Java NIO buffer is essentially a block of memory wrapped in a Java buffer object. This object is accessible in Java as a streaming Java class and is free of JVM garbage collection, since the underlying memory lies outside the JVM heap. We create NIO buffers of fixed sizes (one per software thread) and reuse each buffer every time the thread associated with it is dispatched. Because allocating NIO buffers can incur overhead (just as direct memory allocation does in C), reusing buffers across non-overlapping threads helps amortize this cost. As empirically suggested in Fig. 4, we choose a buffer size of 64 KB as the optimal transfer size. We also observe that the JVM requires a relatively large amount of time to establish large NIO buffers: up to 1 ms for buffers as large as 1 GB.
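A minimal sketch of this per-thread buffer reuse follows, assuming one worker thread per request. ByteBuffer.allocateDirect places the buffer outside the JVM heap, so the garbage collector never relocates it and native code can address it directly:

import java.nio.ByteBuffer;

public final class TransferBuffers {

    private static final int BUFFER_SIZE = 64 * 1024; // 64 KB, the Fig. 4 optimum

    // One direct buffer per software thread, allocated once and reused on
    // every dispatch of that thread to amortize the allocation cost.
    private static final ThreadLocal<ByteBuffer> LOCAL =
            ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(BUFFER_SIZE));

    public static ByteBuffer acquire() {
        ByteBuffer buf = LOCAL.get();
        buf.clear(); // reset position and limit for reuse; memory is not zeroed
        return buf;
    }

    private TransferBuffers() {
    }
}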

CPU-FPGA Thread Co-Existence
FPGAs are naturally suited for highly parallel tasks such as compression [16] and can rapidly offload such tasks from CPU threads. It is therefore necessary to maintain high resource utilization by sharing the FPGA accelerator across multiple CPU threads. To achieve sharing, we implement three versions of an accelerator function interacting with the CPU: single-threaded C++ (ZLIB-FPGA), single-threaded Java (ZLIB-FPGA/JVM), and multithreaded Java (ZLIB-FPGA/JVM-T). We also compare performance with the default ZLIB running on a CPU for single-threaded (ZLIB-CPU) and multithreaded (ZLIB-CPU-T) tasks.
While ZLIB-FPGA is a straightforward C++ implementation, ZLIB-FPGA/JVM is a Java wrapper around it. ZLIB-FPGA/JVM-T is coupled with our task scheduler and integrated into a multithreaded data processing system (Spark [17]) to demonstrate real-world benefits. It is important to note that we first evaluate the performance of ZLIB-FPGA/JVM to ensure that the developed wrapper matches the performance of the straightforward ZLIB-FPGA implementation with the least possible overhead. In the single-threaded scenario (Fig. 5), ZLIB-FPGA/JVM shows an average speedup of 9.8x over ZLIB-CPU. This is roughly the same speedup (10x) achieved when comparing the straightforward ZLIB-FPGA implementation with ZLIB-CPU, meaning that ZLIB-FPGA/JVM has minimal overhead despite the JVM abstraction.
Having created an efficient JVM version of the FPGA accelerator, we can integrate ZLIB-FPGA/JVM into a multithreaded environment. We use our task scheduler and set the buffer size for individual threads to 64 KB (from Fig. 4, 64 KB is the optimal transfer size). This transfer size is also congruent with the chunk sizes on the FPGA accelerator. Moreover, we find that this buffer size is most effective when taking into consideration the Resilient Distributed Dataset (RDD) block size used by Spark. As shown in Fig. 6, using our multithreaded JVM implementation, the total application run time is reduced from 7 minutes to 5 minutes.
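The sketch below illustrates how the multithreaded path might drive the scheduler, splitting an input into 64 KB chunks that worker threads compress concurrently. FcfsScheduler and TransferBuffers refer to the earlier sketches and are assumptions for illustration, not the paper's actual classes:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedCompressor {

    private static final int CHUNK = 64 * 1024; // matches the FPGA chunk size

    public static List<ByteBuffer> compressAll(byte[] data, FcfsScheduler scheduler,
                                               int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<ByteBuffer>> pending = new ArrayList<>();
        for (int off = 0; off < data.length; off += CHUNK) {
            final int start = off;
            final int len = Math.min(CHUNK, data.length - off);
            pending.add(pool.submit(() -> {
                // Each worker thread reuses its private direct buffer.
                ByteBuffer buf = TransferBuffers.acquire();
                buf.put(data, start, len).flip();
                // FPGA if an AFU is idle, CPU otherwise (see FcfsScheduler).
                return scheduler.schedule(buf);
            }));
        }
        List<ByteBuffer> results = new ArrayList<>();
        for (Future<ByteBuffer> f : pending) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }
}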

Resiliency
An important design factor in a hyperscale cloud is the ability to recover from unforeseen failures and minimize downtime. For FPGAs deployed in the cloud, this can be particularly challenging because of the setup and initialization steps required. To address this challenge, we extend ZLIB-FPGA/JVM-T and leverage Docker's device passthrough [18] to create a compression-as-a-microservice (CaaM) framework. The CaaM framework, which exposes ZLIB-FPGA/JVM-T as a containerized service, achieves the same performance as the non-containerized ZLIB-FPGA/JVM-T implementation. To provide fault recovery and improve service availability, we use the FaaM Accelerator Manager to configure the CaaM framework to restart automatically upon failure, which takes only a fraction of a second, as with any standard Docker container configured with an automatic restart policy.

Discussions and Conclusions
We presented an architecture for deploying FPGAs in the cloud and highlighted several challenges and solutions for harnessing FPGA accelerators in virtualized environments, such as Docker containers. Motivated by the dynamic nature of datacenter workloads, we proposed an FPGA-as-a-Microservice (FaaM) architecture to allow multiple cloud users to share FPGA accelerator services. Using this FaaM architecture, we implemented compression-as-a-microservice (CaaM) and demonstrated that FPGA microservices achieve high performance with minimal runtime overhead. We observed that by efficiently designing the buffer movement mechanisms between Java's heap memory and the native memory used by the FPGA accelerator, it is possible to reduce unnecessary data transfer overheads and achieve performance close to that of a straightforward FPGA implementation in C/C++. Our Java implementation incurs less than a 1% reduction in application performance for CaaM. Contrary to previous work, where a single, shared buffer is created and shared among multiple threads (resulting in thread contention), our implementations create multiple private, non-blocking NIO buffers, resulting in a more efficient computation-to-memory access pattern. By scaling buffer sizes up or down (to a certain threshold), along with the number of threads, in relation to the total input work size, a more balanced degree of concurrency (i.e., interleaving) across threads can be achieved. Based on our experimentation, choosing a buffer size that matches the block size of the underlying file system typically results in fewer block misses for data fetched directly from disk.
For accelerator service requests, the CaaM framework assumes that the input dataset resides locally on an FPGA-attached node. There is active research on integrating FPGAs with YARN and other cluster managers, whereby datasets are distributed across multiple nodes (both FPGA- and non-FPGA-attached). With more aggressive data locality, such cluster managers could schedule FPGA-specific tasks on the FPGA-attached nodes, provided the working sets of the overall data are already cached locally on those nodes. Because FPGA acceleration services implemented using FaaM are encapsulated and isolated in user-space containers, container managers such as Mesos and Kubernetes can easily orchestrate such services in a datacenter environment. Future work includes conducting further studies on FaaM with a diverse set of workloads (including machine learning inference), as well as integrating the CaaM framework into streaming applications (e.g., network function virtualization) and data serialization frameworks such as Apache Thrift and Microsoft Bond.