Skip to content

Latest commit

 

History

History
75 lines (69 loc) · 4.94 KB

File metadata and controls

75 lines (69 loc) · 4.94 KB

GPU & FPGA - A Great Heterogeneous Acceleration Engine for Federated Learning

This project is an industrial-level heterogeneous acceleration system to support and speed up federated learning. We've designed and implemented two different heterogeneous acceleration solutions using GPU and FPGA, respectively, that can significantly accelerate the Paillier cryptosystem while maintaining functionality, accuracy and scalability.

How to test GPU engine

  • Requirements / Recommendations:
    • At least one capable NVIDIA GPU device is required.
      • We would recommend such device with a GPU microarchitecture of or later than Volta, such as Tesla V100 or Tesla V100S, to fully utilize the functional supports in our CUDA code.
    • CentOS with version >= 7.8.2003
      • We haven't tested if our engine works well in other Linux releases, such as Ubuntu and Debian. However, it should work with at most some slight modifications.
    • Python with version >= 3.6.8
      • The latest version of NumPy (1.19.4 as of now) is recommended.
      • You may need to install other essential Python packages.
    • If you would like to compile the CUDA code:
      • gcc version 4.8.5 would suffice. Don't use gcc later than 7 since nvcc doesn't support it.
      • nvcc version 10.0.130 would suffice.
  • To test GPU engine functionality
    python3 -m paillier_gpu.tests.test_gpu_engine
  • To test GPU engine performance (profiling)
    python3 -m paillier_gpu.tests.test_gpu_performance
  • You may switch the RAND_TYPE variable between INT64_TYPE and FLOAT_TYPE in the test file, which is recommended to make sure that both float64 (double) and int64 (long long) types can pass all assertions.

How to test FPGA engine

  • Requirements / Recommendations:
    • At least one capable Xilinx FPGA device, such as Alveo U250, is required.
    • CentOS with version >= 7.8.2003
      • We haven't tested if our engine works well in other Linux releases, such as Ubuntu and Debian. However, it should work with at most some slight modifications.
    • Python with version >= 3.6.8
      • The latest version of NumPy (1.19.4 as of now) is recommended.
      • You may need to install other essential Python packages.
    • GCC with version >= 4.8.5 if you would like to compile the C source code.
    • Superuser privileges are required as we need access sensitive directories.
      • Note that the Python path with and without sudo may differ.
  • To test FPGA engine functionality
    sudo python3 -m paillier_fpga.tests.test_fpga_engine
  • To test FPGA engine performance (profiling)
    sudo python3 -m paillier_fpga.tests.test_fpga_performance
  • You may switch the RAND_TYPE variable between INT64_TYPE and FLOAT_TYPE in the test file, which is recommended to make sure that both float64 (double) and int64 (long long) types can pass all assertions.

Profiling Information

The profiling result was obtained from a server with the following configuration.

Hardware Type Model Quantity Remark
CPU Intel Xeon Silver 4114 CPU @ 2.20GHz 2 only 1 core is used in profiling
GPU NVIDIA Tesla V100 PCIe 32GB 4 only 1 GPU card is used in profiling
FPGA Xilinx Alveo U250 1
Memory Samsung 16GiB DIMM DDR4 Synchronous 2666 MHz (0.4 ns) 12 192 GB in total
Hard Disk 2TB WDC WD20SPZX-60U 1

The chart is an overview of the profiling information of our GPU and FPGA engines compared to a CPU implementation under a unified shape of 666*666, where the throughput means the number of operations (instances, either fixed-point numbers or Paillier-encrypted numbers) a device is capable to compute within a second. For matrix multiplication, we consider the number of operations as the number of modular exponentiations we have to compute under a naive O(n^3) algorithm.
We don't count the memory allocation time as it could take a significant amount of time for I/O-bound operators like those involving modular multiplication instead of modular exponentiation. As a result, we would recommend users to reuse the already-allocated CPU memory space as much as possible in a way similar to register renaming.

Operator CPU Throughput GPU Throughput GPU Speedup FPGA Throughput FPGA Speedup
fp_encode 62303.97 33611720.05 539.48 7215836.85 115.82
fp_decode 567913.21 25958708.28 45.71 583509.90 1.03
pi_encrypt 205864.74 24814051.60 120.54 687947.44 3.34
pi_gen_obf_seed 444.05 86766.80 195.40 33653.43 75.79
pi_obfuscate 60236.27 11101085.43 184.29 2035691.96 33.80
pi_decrypt 1590.48 299298.46 188.18 69354.57 43.61
fp_mul 228424.79 11480248.47 50.26 1695313.95 7.42
pi_add 29759.90 1203071.88 40.43 423378.92 14.23
pi_mul 6175.70 1068244.51 172.98 359942.47 58.28
pi_matmul 4178.43 620310.10 148.46 150362.36 35.99
pi_sum(axis=0) 12865.10 1675271.14 130.22 844531.30 65.65
pi_sum(axis=1) 15919.62 4651463.65 292.18 947461.90 59.52
pi_sum(axis=None) 10277.66 4677684.56 455.13 877720.61 85.40