
II. Lattice-based Signatures

6.5. Implementation

6. Improvement of GPV Signatures

standard libraries. In particular, it employs (amongst others) new implementations of polynomial representation and multiplication using enhanced algorithms such as custom FFT subroutines that exploit the AVX and AVX2 instruction sets. Our optimizations also cover the sampling algorithms, including an improved perturbation generation algorithm and the use of the FastCDT sampler. We considered both the matrix and the ring variant of the scheme presented in Section 6.3.5.

6.5.1. Implementation using Standard Libraries

We implemented the GPV signature scheme, the trapdoor generation, and the sampling algorithms in C using the Fast Library for Number Theory (FLINT 2.3) and the GNU Scientific Library (GSL 1.15). FLINT comprises different data types for matrices and vectors operating in rings such as $\mathbb{Z}_q$ and $\mathbb{Z}_q[X]$, whereas GSL provides a wide variety of mathematical tools from linear algebra that can be applied to different primitive data types. We also included the Automatically Tuned Linear Algebra Software library (ATLAS), an empirical tuning system that builds a BLAS (Basic Linear Algebra Subprograms) library tailored to the target platform on which it is installed. Specifically, this library provides optimized BLAS routines which have a significant impact on the running times of the mathematical operations used in the key and signature generation steps. Hence, it is recommended to include this library whenever one works with GSL. For the representation of matrices in $\mathbb{Z}_q^{n \times m}$, FLINT provides the data structure nmod_mat_t, which we use in our implementation of the matrix version.

Regarding the ring version, polynomials are handled using the data structure nmod_poly_t. FLINT makes use of a highly optimised Fast Fourier Transform routine for polynomial multiplication and some integer multiplication operations.

The experiments were performed on a Sun XFire 4400 server with 16 Quad-Core AMD Opteron(tm) Processor 8356 CPUs running at 2.3GHz, having 64GB of memory and running 64bit Debian 6.0.6. We used only one core in our experiments.

The experimental results for this implementation are given in [P9].

Sampling

For sampling discrete Gaussian distributed integers in the key generation step, we used the inversion transform method rather than rejection sampling, because the number of stored entries is small and the table can be deleted afterwards. This improves the running time of the sampling step significantly. In particular, suppose the underlying parameter is denoted by $s$. We precompute a table of cumulative probabilities $p_t$ from the discrete Gaussian distribution with $t \in \mathbb{Z}$ in the range $[-\omega(\sqrt{\log n}) \cdot s,\ \omega(\sqrt{\log n}) \cdot s]$. We then choose a uniformly random $x \in [0,1)$ and find $t$ such that $x \in [p_{t-1}, p_t]$. This can be done using binary search. The same method is applied when sampling preimages from the set $\Lambda^\perp_u(G)$ with parameter $r$.

This parameter is always fixed and relatively small. Storing this table takes about 150 bytes of memory. In this case, signature generation is much faster than with simple rejection sampling. Unfortunately, this does not carry over to the randomized rounding step, because the center changes with every sample and would thus require a costly recomputation of the tables each time. Therefore, we used rejection sampling from [GPV08] instead. For sampling continuous Gaussians with parameter $t = 1$, we used the Ziggurat algorithm [MT84], which is one of the fastest algorithms for producing continuous Gaussians. It belongs to the class of rejection sampling algorithms and uses precomputed tables. When operating with multiprecision vectors, such as when sampling continuous random vectors, one should use at least $\lambda$ bits of precision for a cryptographic scheme ensuring a security level of $\lambda$ (e.g., 16-byte floating-point numbers for $\lambda = 100$).

Random Oracle Instantiation

For the GPV signature scheme, a random oracle $H(\cdot)$ is required which, on an input message msg, outputs a uniformly random response $H(\text{msg})$ from its image space. In most practical applications this is achieved by a cryptographic hash function together with a pseudorandom generator which provides additional random strings in order to extend the output length. In our implementation we used SHA-256 together with the GMSS-PRNG [BDK⁺07], since strings of arbitrary size have to be mapped to vectors from $\mathbb{Z}_q^n$. Each component of the vector has at most $\lfloor \log q \rfloor$ bits.

Rand ← H(Seed_in)

Seed_out ← (1 + Seed_in + Rand) mod 2ⁿ.

The first Seed_in is the input message, and the function is repeated until enough random output Rand is generated.
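The seed-update iteration above can be illustrated with the following sketch. It makes several simplifying assumptions that are not part of the original implementation: the state is a single 64-bit word (so the reduction mod 2ⁿ becomes natural unsigned wrap-around with n = 64), the hash H is replaced by a non-cryptographic 64-bit mixing function named toy_hash in place of SHA-256, the modulus is taken to be q = 2^k, and one hash output is spent per vector component. The function names prng_next and expand_to_zq are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for H: a splitmix64-style bit mixer.
 * NOT cryptographic; the text uses SHA-256 at this point. */
uint64_t toy_hash(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* One PRNG step:
 *   Rand     <- H(Seed_in)
 *   Seed_out <- (1 + Seed_in + Rand) mod 2^64
 * The seed is updated in place and Rand is returned. */
uint64_t prng_next(uint64_t *seed)
{
    uint64_t rnd = toy_hash(*seed);
    *seed = 1u + *seed + rnd;   /* unsigned arithmetic wraps mod 2^64 */
    return rnd;
}

/* Expand a seed into len components of Z_q for q = 2^k, keeping the
 * k low-order bits of each Rand value. */
void expand_to_zq(uint64_t seed, unsigned k, uint32_t *out, size_t len)
{
    assert(k >= 1 && k <= 32);
    uint64_t mask = (1ULL << k) - 1;
    for (size_t i = 0; i < len; i++)
        out[i] = (uint32_t)(prng_next(&seed) & mask);
}
```

The expansion is deterministic in the seed, which is what a random oracle instantiation needs: signer and verifier derive the same vector from the same message.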

6.5.2. Optimized Implementation

In the following section we present an implementation that is based on custom subroutines, such as polynomial and matrix multiplication, optimized for different parameter sets. Furthermore, we applied enhanced sampling algorithms that are used in the signing step and are a key determinant of the running time.

The respective algorithms also make use of the AVX instruction sets, which run similar operations in parallel and thereby realize remarkable speed-ups. Such speed-ups were also observed in other works, e.g. [GOPS13]. We therefore adopt this approach in order to enhance the performance of the scheme from Section 6.3.5.


Since the platform used for the experiments with the implementation from Section 6.5 lacks the AVX and AVX2 instruction sets, the following implementation and the corresponding experiments were run on a notebook with an Intel Core i7-4500U processor operating at 1.8 GHz and 4 GB of RAM. We used the gcc 4.8.2 compiler with the compilation flags -Ofast, -mavx2, -msse2avx, -march=corei7-avx, and -march=core-avx2.

Discrete Gaussian Sampling

In order to sample discrete Gaussian distributed vectors $x \leftarrow D_{\Lambda^\perp_v(G),r}$, which can be reduced to sampling entries from $D_{2\mathbb{Z},r}$ or $D_{1+2\mathbb{Z},r}$, we apply the improved discrete Gaussian sampler FastCDT introduced in Section 5.1, which is well suited to this kind of distribution. Furthermore, we sampled the entries of the private key in both the matrix and the ring variant using FastCDT with parameter $\alpha q = p \cdot 4.7$ for $p = \lceil \sqrt{n}/4.7 \rceil$ such that $\alpha q > \sqrt{n}$. However, for the randomized rounding operation, which follows the discrete Gaussian distribution, we apply the rejection sampling algorithm. In particular, we need to sample $\lceil c \rfloor_a$, which is equivalent to $c + D_{\mathbb{Z}^m - c,\,a}$. Because the center $c \in \mathbb{R}^m$ is a real vector, the support changes with every call, so generating the corresponding tables is quite inefficient. Since $\rho_{a,c_i}(\mathbb{Z}) = \rho_a(\mathbb{Z} - c_i) \in \rho_a(\mathbb{Z}) \cdot \left[\frac{1-\epsilon}{1+\epsilon},\, 1\right]$ for $a \ge \eta_\epsilon(\mathbb{Z})$ as per Lemma 3.1, we need to compute $\rho_a(\mathbb{Z})$ only once for all $c \in \mathbb{R}^m$, hence saving unnecessary computations. Furthermore, it is useful to sample $\lceil \bar{c} \rfloor_a$ for $\bar{c} = \lceil c \rceil - c \in (0,1)$, since $\lceil c \rfloor_a = \lceil c \rceil - \lceil \bar{c} \rfloor_a$ and the center of the distribution is then always within the range $\bar{c} \in (0,1)$.

AVX and AVX2

We already explained the significance of the AVX and AVX2 instruction sets in Section 5.4, when implementing our A-LWE based encryption scheme. In our implementations, we use AVX and AVX2 whenever possible. For instance, the FFT for polynomial multiplication is optimized by use of AVX, since it computes with double-precision complex numbers. Furthermore, AVX is exploited for scaling operations such as $\tilde{p}_2 = \sqrt{b} \cdot d_2$ and for the multiplication of the decomposition matrix $L$ with continuous Gaussians in the signature generation step (see Figure 6.2 and Figure 6.3). In fact, one observes remarkable speed-ups.

Polynomial Representation and Multiplication

Following the efficient implementation [GOPS13] of the NTT [Win96], we implemented the FFT for polynomial multiplication by use of AVX and AVX2. Due to the non-prime modulus $q = 2^k$, it is not possible to apply the NTT. We are considering cyclotomic rings of the special form $R_q = \mathbb{Z}_q[X]/\langle X^n + 1 \rangle$ for $n$ a power of 2. Therefore, the FFT is instantiated with the (complex) $n$-th root of unity. Similar to [GOPS13], we precomputed tables of the relevant constants prior to invoking the signing and verification algorithms. As a result, we achieve fast signing and verification engines.

Matrix-Vector Multiplication

Matrix-vector operations accomplished via additions and multiplications over the integers were performed by use of the AVX2 instruction set. In fact, our implementation of the matrix variant is built upon the implementation specified in [P3], which has been optimized with respect to matrix-vector operations.

Random Oracle Instantiation

For the random oracle instantiation, we applied the Salsa20 stream cipher as in Section 5.4. It stretches a uniformly random input seed to a uniformly random output of arbitrary length. Its good performance has been observed in several works such as [GOPS13, P3]. We refer to Section 5.4 for details on how to generate uniformly random elements such as polynomials or vectors.
