Ideas for a New Elliptic Curve Cryptography Library
Why Create a New Elliptic Curve Cryptography Library?
Commonly used crypto libraries like NSS and OpenSSL implement ECDH and
ECDSA signing and verification for a large number of prime and non-prime
curves. In order to support a wide variety of curves, these libraries internally
have frameworks that abstract the general algorithms from the curve-specific
details. These frameworks make sense when a wide variety of curves and
algorithms need to be supported, but they add significant complexity that
would be unneeded for software that only needs to deal with a small number
of well-known curves, such as the NIST P-256 and P-384 curves.
OpenSSL exposes its internal framework as a public API
for applications to use, so changing the framework risks breaking
applications built on top of it. This makes it difficult to make
improvements to the framework. Consequently, the implementations of
curve-specific optimizations in OpenSSL generally avoid the lower-level parts
of that framework, causing those implementations to have very similar but
separate implementations of the generic logic. Thus, for example, one set of optimized
implementations of the NIST curves carries its own copy of the generic scalar multiplication logic, separate from the nearly identical copies in the other optimized implementations.
With today’s C compilers, some key components need to be implemented
in assembly language to give reasonable performance and to prevent side channel leaks
of secret information. Assemblers usually have
macros and other high-level features that make assembly-language programming
convenient. However, some assemblers do not
have such features, and the ones that do all have different syntaxes.
OpenSSL has its own assembly language macro framework called perlasm to deal with this.
Every OpenSSL assembly language source file is actually a Perl program
that generates the assembly language file. The result is several large files of
interleaved Perl and assembly language code. The fastest ECC implementations in OpenSSL are written this way.
Imagine that we must find a high-performance implementation of one of these curves for a platform that these libraries do not already cover well: we would have to work through all of these layers of generated, duplicated code in order to adapt it.
The fundamental problem isn't the sheer amount of code in the optimized implementations of ECC, or that the techniques they use are bad. Rather, after the simple ideas used in the implementations were translated into low-level C and assembly language code, with multiple layers of repetition, it is now difficult to map the code back to the abstractions it was derived from. Many individually reasonable trade-offs, each sacrificing a little ease of understanding by people for ease of processing by computers, quickly accumulated into one large, unreasonable trade-off.
The specialized implementations of ECC in OpenSSL are optimized for
ECDH key agreement and ECDSA signing, so they use constant-time algorithms to avoid
timing attacks. However, ECDSA signature verification does not need to be constant
time, and the variable-time WNAF scalar elliptic curve point multiplication algorithm is
significantly more efficient than constant-time algorithms. The constant-time
ecp_nistz256 code in OpenSSL is much faster than OpenSSL's variable-time implementations, but an equally well-optimized variable-time implementation should be faster still for signature verification.
The IETF is currently standardizing new ECC cryptosystems based on two new curves,
Curve25519 and Ed448-Goldilocks. It seems likely that Curve25519 will be critically
important for a variety of reasons; for example, it seems like it will be a
key part of the bridge connecting the new Internet of Things (IoT) to the
normal Internet of Nothings. If we were to implement Curve25519 and
Ed448-Goldilocks using the same strategies that have been used for the NIST curves,
we would end up with yet more near-duplicate copies of the same generic logic. Also,
Curve25519 and EdDSA (or whatever signature scheme ultimately gets
standardized) will take a while to become ubiquitous. This makes it tempting
to try to use an implementation approach in which the NIST curves and the newer curves share as much code as possible.
Goals
With those motivations in mind, it is worth exploring whether it makes sense to create a new ECC implementation with the following goals:
- We must be confident in the correctness of what we created, preferably with the assistance of automated tools for correctness checking.
- In high-end smartphones, tablets, and computers, P-256 and P-384 signature verification should take one millisecond or less, so that all the signature verifications needed for a TLS handshake can complete in under 10 ms and the CPU cost of the handshake is not a factor in terms of latency. On more constrained devices, we will have to settle for “as efficient as the code size budget allows,” for now.
- The code and any precomputed static data should be as compact as possible. In constrained applications the allowed code size is a parameter that can vary considerably, so it is hard to put a specific figure on this. However, informally, we will aim to create something that is smaller than the ECC portions of other implementations like BoringSSL, OpenSSL, and NSS.
- It should be straightforward to reuse large amounts of the implementation of P-256 and P-384 to implement Curve25519 and maybe other curves.
- Without significantly sacrificing signature verification performance, the code for the arithmetic should be easy to reuse (preferably) or adapt (less preferably) for side-channel resistant implementation of signing and ECDH key agreement.
An Implementation Strategy
ECDSA signature verification consists of parsing inputs, doing an inversion and some multiplications modulo n (the group order), doing two scalar elliptic curve point multiplications, and then verifying that the result of the multiplication matches a value in the signature. We precompute a bunch of sums of the curve’s group generator point G and a bunch of sums of the public key, which requires multiplication, squaring, and inversion modulo q (the field modulus). We compute the WNAF representations of the scalars of the two points, and then perform a pattern of point doublings and point additions based on the pattern in the WNAF representation. The point doublings and additions are specific sequences of multiplications, additions, subtractions, and squarings modulo q. Multiplication and squaring modulo q are, by far, the most performance-sensitive operations because they are executed many times in a scalar point multiplication. In order to do all these arithmetic operations, we generally need to precompute some quantities that are functions of n or q and the word size of the machine. ECDSA signing and ECDH key agreement more-or-less involve the same kinds of operations, except they need everything to be constant time.
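As a concrete sketch of that flow for verification, here is roughly what the top-level function might look like, assuming hypothetical helper types and functions (Scalar, Point, scalar_inv, twin_mul_wnaf, and so on); e is the truncated message hash, (r, s) is the signature, and the public key point is the last argument:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t limbs[4]; } Scalar;          /* value mod n; P-256-sized, illustrative */
typedef struct { Scalar x, y, z; } Point;              /* Jacobian coordinates; illustrative */

/* Hypothetical helpers for arithmetic mod n and for point arithmetic. */
void scalar_inv(Scalar *r, const Scalar *a);                  /* r = a^-1 mod n */
void scalar_mul(Scalar *r, const Scalar *a, const Scalar *b); /* r = a*b  mod n */
bool twin_mul_wnaf(Point *out, const Scalar *u1, const Scalar *u2,
                   const Point *pub_key);  /* out = u1*G + u2*pub_key; false if infinity */
bool point_x_equals_mod_n(const Point *p, const Scalar *r);   /* x(p) mod n == r? */

/* e: message hash already reduced mod n; (r, s): already range-checked to be
 * in [1, n-1]; pub_key: already validated as a point on the curve. */
bool ecdsa_verify(const Scalar *e, const Scalar *r, const Scalar *s,
                  const Point *pub_key) {
    Scalar s_inv, u1, u2;
    Point sum;

    scalar_inv(&s_inv, s);                   /* the one inversion mod n */
    scalar_mul(&u1, e, &s_inv);              /* u1 = e / s mod n */
    scalar_mul(&u2, r, &s_inv);              /* u2 = r / s mod n */

    /* The two scalar point multiplications, done as one twin multiplication. */
    if (!twin_mul_wnaf(&sum, &u1, &u2, pub_key)) {
        return false;                        /* reject the point at infinity */
    }
    return point_x_equals_mod_n(&sum, r);    /* compare against r from the signature */
}
```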
One way to minimize code size, at the cost of speed, is to implement all the algorithms generically. For example, there would be one modular multiplication function that takes two multiplicands and the modulus as inputs, one modular inversion function that takes the element to invert and the modulus as input, etc. We would probably find that such generic implementations work well enough in many cases, especially for the operations modulo n that are not frequently executed, so that we can avoid writing curve-specific code for many of them. If we start with generic implementations of every algorithm, then we can use them as the foundation for a data-driven code-profile-optimize development loop that ensures that we are always optimizing the most impactful part of the implementation, without guessing. Having the generic implementations of everything available would also give us flexibility in making size vs. speed tradeoffs later in the development process.
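To make the contrast concrete, here is a sketch of the two API styles, with all names hypothetical; only the generic functions would exist at first, and specialized ones would be added where profiling shows they pay for themselves:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Limb;

/* Generic style: one routine serves every modulus, so it is written once
 * and shared, at the cost of runtime parameters and generic inner loops. */
void mod_mul(Limb *r, const Limb *a, const Limb *b,
             const Limb *m, size_t num_limbs);
void mod_inv(Limb *r, const Limb *a, const Limb *m, size_t num_limbs);

/* Specialized style: the modulus and limb count are fixed at compile time,
 * so the loop structure and the reduction can be tailored to, e.g., P-256. */
void p256_elem_mul_mont(Limb r[4], const Limb a[4], const Limb b[4]);
```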
Besides trying to share code across curves, we could also minimize object
code size by moving work from runtime to build time.
For example, instead of parsing the string representation of q into
an array of limbs at runtime, we parse the value of q during the
build and generate a static const
array definition that contains the
array. This way, we save not only the runtime cost of the call to the parsing
function, but also the code size of the parsing function itself.
More generally, we should try to do everything that can be done at build
time at build time.
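For example, the generated output for the P-256 field modulus might look something like this (the type and identifier names are illustrative; the limbs are 64-bit, least significant first):

```c
#include <stdint.h>

typedef uint64_t Limb;

/* q for P-256, i.e. 2^256 - 2^224 + 2^192 + 2^96 - 1, emitted by the
 * build-time generator as little-endian 64-bit limbs. */
static const Limb P256_Q[4] = {
    0xffffffffffffffff, 0x00000000ffffffff,
    0x0000000000000000, 0xffffffff00000001,
};
```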
Also, we should avoid doing things manually that can be done automatically by the build system. For example, in order to do Montgomery multiplication, we have to calculate and save the values of the constants k and r2. Rather than calculating these values manually and hard-coding them into the program or the code generator, the code generator should calculate them from the basic inputs. This would have a much better documentation effect than hard-coding the values, because it guarantees that the formulas and the results match. Further, adding this small extra complexity to the code generator would make it much easier to expand the implementation to support additional curves and platforms.
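As one example of such a derived constant, the Montgomery constant k is just -q^-1 mod 2^w for the word size w, and it can be computed from the lowest limb of q alone; a sketch of how a generator (or the runtime) might compute it for 64-bit words, with a hypothetical function name:

```c
#include <stdint.h>

/* Compute k = -q^{-1} mod 2^64 from the lowest limb of an odd modulus q,
 * using Newton/Hensel iteration: each step doubles the number of correct
 * low-order bits of the inverse. */
static uint64_t montgomery_k(uint64_t q0) {
    uint64_t inv = q0;                 /* an odd q0 is its own inverse mod 2^3 */
    for (int i = 0; i < 5; i++) {
        inv *= 2 - q0 * inv;           /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits */
    }
    return 0 - inv;                    /* negate mod 2^64 */
}
```

Computing r2 (R^2 mod q, where R is the Montgomery radix) requires multi-precision arithmetic, which is exactly the kind of work that is cheap in a code generator but pure bloat at runtime.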
An ECC Compiler
We can take the ideas in the preceding section further and think about the code generator as a compiler for an ECC domain-specific language (DSL). The terms of the language would be field elements and scalars, its statements would be a series of assignments of the results of field operations (additions, subtractions, multiplications, and squarings), and its runtime library would consist of the curve-agnostic implementations of the various algorithms plus any curve-specific functions that we hand code.
There are already several programs written in this language in the Explicit-Formulas Database. For example, here is a program from the Explicit-Formulas Database which, given a point (X1, Y1, Z1) in Jacobian coordinates, calculates the double of that point (X3, Y3, Z3), alongside a very simple translation to a C function that calls other functions that perform the actual field operations:
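The following sketch shows the dbl-2001-b doubling program for a = -3 Jacobian coordinates, with a C translation that uses hypothetical field-operation helpers (elem_sqr, elem_mul, elem_mul_by_3, and so on); outputs are assumed not to alias inputs:

```c
#include <stdint.h>

typedef struct { uint64_t limbs[4]; } Elem;    /* field element mod q; illustrative */

/* Hypothetical field-operation helpers. */
void elem_add(Elem *r, const Elem *a, const Elem *b);
void elem_sub(Elem *r, const Elem *a, const Elem *b);
void elem_mul(Elem *r, const Elem *a, const Elem *b);
void elem_sqr(Elem *r, const Elem *a);
void elem_mul_by_3(Elem *r, const Elem *a);
void elem_mul_by_4(Elem *r, const Elem *a);
void elem_mul_by_8(Elem *r, const Elem *a);

/* EFD "dbl-2001-b" (Jacobian coordinates, a = -3):
 *   delta = Z1^2;  gamma = Y1^2;  beta = X1*gamma;
 *   alpha = 3*(X1-delta)*(X1+delta);
 *   X3 = alpha^2 - 8*beta;
 *   Z3 = (Y1+Z1)^2 - gamma - delta;
 *   Y3 = alpha*(4*beta-X3) - 8*gamma^2;
 */
void point_double(Elem *x3, Elem *y3, Elem *z3,
                  const Elem *x1, const Elem *y1, const Elem *z1) {
    Elem delta, gamma, beta, alpha, t;

    elem_sqr(&delta, z1);                  /* delta = Z1^2 */
    elem_sqr(&gamma, y1);                  /* gamma = Y1^2 */
    elem_mul(&beta, x1, &gamma);           /* beta = X1*gamma */

    elem_sub(&alpha, x1, &delta);          /* X1 - delta */
    elem_add(&t, x1, &delta);              /* X1 + delta */
    elem_mul(&alpha, &alpha, &t);
    elem_mul_by_3(&alpha, &alpha);         /* alpha = 3*(X1-delta)*(X1+delta) */

    elem_sqr(x3, &alpha);
    elem_mul_by_8(&t, &beta);
    elem_sub(x3, x3, &t);                  /* X3 = alpha^2 - 8*beta */

    elem_add(z3, y1, z1);
    elem_sqr(z3, z3);
    elem_sub(z3, z3, &gamma);
    elem_sub(z3, z3, &delta);              /* Z3 = (Y1+Z1)^2 - gamma - delta */

    elem_mul_by_4(&t, &beta);
    elem_sub(&t, &t, x3);                  /* 4*beta - X3 */
    elem_mul(y3, &alpha, &t);
    elem_sqr(&t, &gamma);
    elem_mul_by_8(&t, &t);
    elem_sub(y3, y3, &t);                  /* Y3 = alpha*(4*beta-X3) - 8*gamma^2 */
}
```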
By implementing a compiler, all the classic optimization techniques
used in compilers would be available to us. For example, notice in the C code
above that all the multiplications by constants were translated to calls
of functions that are optimized for multiplication by that constant; this
is a basic form of classic strength reduction. As another example, the
compiler could reorder the statements in order to
optimize the program for the out-of-order execution behavior of a
particular processor. In particular, the compiler could put effort into
ensuring, whenever possible, that a statement that stores a value into
a variable x never directly precedes a statement that reads the
value of x. Or, if our
compiler were to generate assembly language code, then we might implement a
custom calling convention where the result of a calculation is always stored
in a specific set of registers and where the inputs to a function are
always stored in those same registers, to avoid otherwise-unnecessary
load/stores. Our compiler might even transform the program into
continuation-passing style (replacing ret + call sequences with jmp) or do
other things that would be easy for a compiler to do, but would be
error-prone to do (or verify) correctly by hand.
More interestingly, our compiler could go beyond classic compiler techniques. It seems hard to justify the amount of effort that would go into building a sophisticated optimizing compiler for our language. Instead, we might implement a dumb randomized compiler that, given a seed value, configures flags that control various simple optimizations like reordering and various attempts at strength reduction. Then we could operate our compiler in two modes: profiling mode and production build mode. The profiling mode would generate a random seed, generate the code, compile the code, run a program that measures the performance of the code, and compare the execution time to the previously-known best execution time. Then we would just let it run until we run out of development time, take the best seed the program found, update the build system so that the new seed is used for production builds, and say we’re done. We could call our ridiculous compiler the “super dumb compiler” and hope that people confuse it with a superoptimizing compiler. To scale this incredibly inefficient optimization strategy, we could mint a new cryptocurrency, Superdumbcompilercoin, where we trick people into running the super dumb compiler on their computers, cross-certifying each others’ results, and sending us increasingly better optimization seeds in return for plausible delusions of becoming rich.
Besides generic compiler and superoptimization techniques that apply to all languages, the compiler could also assist in the implementation of ECC-specific optimizations. For example, it could discover, or at least assist in the implementation of, cases where we can coalesce reductions modulo q after each field operation. As another example, the compiler could choose completely different formulas based on the cost of multiplications vs. the cost of squarings and/or the cost of multiplications vs. the cost of inversions.
The more complex the compiler becomes, the more important it would be to automate the verification of the validity of the optimizations. We could do this by splitting our compiler into two parts: the optimizer and the back-end. The optimizer would take programs in the three-operand statement language as input and output an optimized form of the program in a similar, lower-level, intermediate language. The back-end would then transform this intermediate language into assembly language (and/or C) code. If we design the intermediate language carefully, we should be able to make it easy for proof checkers or computer algebra systems to process it and verify that the intermediate program is correct, much more easily than if we were to try to formally prove that equivalent C or assembly language code is correct. It is possible that the back-end could have a flaw that causes it to generate bad code even for good inputs, but we can minimize the risk of that happening by keeping the back-end simple; eventually we could try to formally prove the compiler’s correctness.
Optimized Implementation of Field Operations
The implementations of the field operations, especially multiplication and squaring modulo q, are the most performance-critical and the most platform-specific parts of the library, and they are the parts most likely to need hand-written assembly language.
We could implement another code generator to generate optimized implementations of field operations from more abstract, but still low-level, expressions of them, like qhasm does. However, it seems likely that we will find that optimal implementations of field operations for different architectures and sub-architectures are going to be so different that it would not be worthwhile to find a way to abstract away architecture-specific bits into portable code. Instead, we might pick a single cross-platform assembler, like Yasm, and try to express the abstract patterns primarily using its macro language, hoping that the consistency of using the same macro language would make it easier to maintain the various platform-specific implementations. Assembler macros work quite similarly to how C++’s templates work (particularly how C++ templates worked before C++11) and so are surprisingly powerful. It seems like we would need to make at least a few implementations in “native” assembler before we could understand the costs and benefits of such abstractions.
There are aspects of the implementation of field operations that cannot be solved using a macro assembler in a way that is easy to understand. For example, when implementing Montgomery multiplication we need to multiply a single limb by the field modulus, and the instruction sequence—not just the operands of the operations, but the sequence of operations itself—depends on the value of the modulus. Again, it would not be clear how automating the generation of such code would help us until a few implementations have been made by hand.
One of the major design issues is the way the large integers are split into limbs. For example, should a 32-bit implementation use full 32-bit limbs to minimize the number of multiplications, partially-filled 29-bit limbs to minimize adds-with-carry, or something else? I suspect it will be best to use full limbs in most constrained implementations, because on such processors the cost of multiplication is high relative to addition-with-carry and they do not try to be so clever with respect to out-of-order execution. Again, it is hard to say without implementing some prototypes.
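To make the trade-off concrete, here is a sketch of the two layouts for a 256-bit field element on a 32-bit machine (the type names are illustrative):

```c
#include <stdint.h>

/* Full limbs: 8 x 32 bits. Fewest limb multiplications, but every addition
 * of partial products needs an add-with-carry chain. */
typedef struct { uint32_t limbs[8]; } ElemFull;

/* Partially-filled limbs: 9 x 29 bits (9 * 29 = 261 >= 256). A product of
 * two 29-bit limbs is at most 58 bits, leaving 6 bits of headroom in a
 * 64-bit accumulator, so dozens of partial products can be summed before
 * any carry propagation is needed, at the cost of one extra limb's worth
 * of multiplications. */
typedef struct { uint32_t limbs[9]; } ElemUnsaturated;
```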
Dynamic Compilation and Self-Modifying Code
On x86, the most efficient way to implement modular multiplication depends on whether the processor implements SSE2, MULX, and/or the ADCX/ADOX instruction set extensions and/or the processor family/generation (Atom vs. Core i7, Sandy Bridge vs. Haswell, etc.). Similarly on ARM the best implementation depends on what instructions (NEON, in particular) are available and whether the chip was designed for in-order or out-of-order execution. Shipping a binary with every curve optimized in each possible way would result in a lot of bloat. It may be better, instead, to ship the compiler as part of the library, and then have the compiler generate just the one implementation of each curve that is optimized for the processor that it is running on. It would likely be hard to make the compiler small enough for this to be a win, though.
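For comparison, the conventional way to handle this is to compile several variants into the binary and select one at runtime via CPU feature detection, which is exactly the bloat in question; a minimal sketch, with hypothetical function names (a real feature check would be built on CPUID):

```c
#include <stdint.h>

typedef void (*P256MulMont)(uint64_t r[4], const uint64_t a[4],
                            const uint64_t b[4]);

/* Several compiled-in variants of the same field multiplication. */
void p256_mul_mont_adx(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]);
void p256_mul_mont_bmi2(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]);
void p256_mul_mont_generic(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]);

int cpu_has_adx(void);    /* hypothetical CPUID wrappers */
int cpu_has_bmi2(void);

static P256MulMont select_p256_mul_mont(void) {
    if (cpu_has_adx())  { return p256_mul_mont_adx; }
    if (cpu_has_bmi2()) { return p256_mul_mont_bmi2; }
    return p256_mul_mont_generic;
}
```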
On a constrained device, one could generate the executable code for a curve, run the code, and then throw the generated code away to make room for other code (e.g. the generated implementation of another curve). Or, perhaps we could ensure that the code generated for each different curve is similar enough that we could implement an efficient patching mechanism that would allow us to patch the implementation of one curve to morph it into the implementation of another curve, and then use another patch to get back to the implementation of the initial curve. Of course, this would be completely crazy unless it were automated by the compiler (and maybe even then).
Smart Caching of Intermediate Values
Although the cost/benefit ratio of a runtime compiler is questionable, there are many more straightforward optimizations we can do, based on the same idea of being flexible about how we distribute work between the compiler and the runtime system. Normally we need to precompute a bunch of additions of the curve’s group generator. If we were to do that at build time, then we would bloat the size of the executable by a significant amount. (In ecp_nistz256, the size of the static table is about 150KB; in libsecp256k1 that table would be over a megabyte.) If we were to do that at runtime, then we would need to do the precomputation at a time when we can afford to compute a multiplicative inverse modulo q, which is quite expensive. Instead of doing either one of those, we could precompute just that inverse at build time and then do the rest of the precomputation at runtime.
A similar consideration applies to the precomputation of points related to public keys; if we have a public key cache, we should consider having that cache store the result of the inversion that is needed for the precomputation.
More generally, higher-level code that makes use of ECC primitives is often not very efficient about caching intermediate results. For example, when verifying a TLS server certificate chain, it is common to need to use each P-384 certificate public key twice: once to verify the signature of a certificate, and again to verify the signature on an OCSP response for that certificate. It may be quite worthwhile to cache all the precomputed points for the last few public keys used during certificate validation. In particular, if we know that we are going to use a given public key twice then we should precompute more points than if we expect to only use the public key once.
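A minimal sketch of such a cache, with all names hypothetical (a real version would need locking, a real table type, and a more considered eviction policy):

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 4
#define PUB_KEY_LEN 97              /* uncompressed P-384 point: 1 + 48 + 48 bytes */

typedef struct PrecompTable PrecompTable;   /* precomputed multiples of a public key */

typedef struct {
    uint8_t key[PUB_KEY_LEN];
    const PrecompTable *table;      /* NULL means the slot is empty */
} CacheSlot;

static CacheSlot cache[CACHE_SLOTS];
static unsigned next_victim;

const PrecompTable *cache_lookup(const uint8_t key[PUB_KEY_LEN]) {
    for (unsigned i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].table != NULL &&
            memcmp(cache[i].key, key, PUB_KEY_LEN) == 0) {
            return cache[i].table;
        }
    }
    return NULL;
}

void cache_insert(const uint8_t key[PUB_KEY_LEN], const PrecompTable *table) {
    CacheSlot *slot = &cache[next_victim];
    next_victim = (next_victim + 1) % CACHE_SLOTS;   /* simple round-robin eviction */
    memcpy(slot->key, key, PUB_KEY_LEN);
    slot->table = table;
}
```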
The Dangers of Error Checking
When verifying an ECDSA signature, we have to validate that the inputs are within acceptable ranges. We could do this range checking when we parse them, or we could do it just before the computations that depend on the values being within range, or we could implement redundant checks. It is natural to expect that more error checking is better, but OpenSSL’s history demonstrates that bugs frequently lie in error-handling code paths. Also, when the same error condition is checked multiple times, the first check usually makes the redundant checks dead code, so we would either have to leave the code implementing the redundant checks untested or jump through hoops to write tests that reach the secondary checks.
The special case of failure to allocate memory (e.g. malloc returning NULL) is
particularly terrible, because this kind of edge case is almost never thoroughly
tested. Frequently, developers on open source projects have significant apathy for
handling and testing these cases because they have configured their own applications
to abort on memory allocation failure, so any bugs in the error handling will not
affect them. This apathy leaves everybody else who does not use the same strategy
at risk.
It seems it would be better, then, to do all range checking, memory allocation, and other possibly-failing actions as early as possible, before we start calculating any results. Then all of our low-level calculation functions would become infallible. And then our mid-level calculation functions would become infallible, since they would not have to worry about error cases from the low-level functions. This pattern would result in a recursive simplification up to the top-level functions. It should also make the compiler simpler because the compiler would not have to generate any code to do error checking, and it should make our correctness proving tasks easier since there would be many fewer code paths to prove correct.
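A sketch of what that structure might look like at the API level, with hypothetical names; only the parsing boundary can fail, and everything below it is infallible:

```c
#include <stdint.h>

/* Invariant maintained by construction: the value is in [1, n-1]. */
typedef struct { uint64_t limbs[4]; } Scalar;   /* P-256-sized, illustrative */

/* Fallible boundary: rejects out-of-range encodings; returns 0 on failure. */
int scalar_parse(Scalar *out, const uint8_t in[32]);

/* Infallible interior: the Scalar type's invariant guarantees the
 * preconditions, so these functions have no error paths to propagate. */
void scalar_mul(Scalar *r, const Scalar *a, const Scalar *b);   /* mod n */
void scalar_inv(Scalar *r, const Scalar *a);                    /* mod n */
```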
Since all the low-level functions would have important preconditions that they do not
check, it would be critically important that those preconditions be clearly documented.
The C assert macro could be used for this. assert’s primary advantage as a
documentation tool is that it can be machine-checked to a certain extent, so it is
less likely to get out of sync with the code, especially with good test coverage.
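For example (the type and the debug-only helper are hypothetical; with NDEBUG defined the checks compile away entirely):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t limbs[4]; } Elem;   /* field element; illustrative */

int elem_is_reduced(const Elem *a);           /* hypothetical debug-only helper */

/* Precondition (documented, not checked in release builds): both inputs are
 * fully reduced mod q. The asserts catch violations during testing. */
void elem_add(Elem *r, const Elem *a, const Elem *b) {
    assert(elem_is_reduced(a));
    assert(elem_is_reduced(b));
    /* ... the addition itself is elided in this sketch ... */
    (void)r; (void)a; (void)b;
}
```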
What about the Algorithms?
It seems silly to build an implementation that uses a bunch of crazy optimization techniques and hand-coded assembly language to implement obsolete algorithms. However, a little experimentation has shown that ecp_nistz256 is implemented so efficiently that it beats pretty much any implementation of better algorithms that is not very carefully coded. Choosing and implementing better algorithms is much easier, and has less risk of failure, than figuring out all the optimization techniques that would be needed regardless of which algorithms are chosen. An advantage of implementing the low-level optimizations first is that we could benchmark against ecp_nistz256 in an apples-to-apples manner, so that we could measure the benefits of the low-level optimizations distinctly from the benefits of the improved algorithms. The compiler would help in deferring the choice of algorithms until later, because it would make it easy to switch from commonly-used algorithms to new ones.
In any case, a description of the best algorithms to use, and why they are the best, is worth considering separately. For now, the main points to keep in mind with respect to algorithm choices are: (1) there are algorithms that are not commonly used that seem to be significantly better than the ones used in the most commonly-used open source software, and (2) in order to realize the gains from those algorithmic improvements, the implementations of the better algorithms have to be very good.
Conclusion
It is plausible that a DSL-based approach to implementing an ECC library would have significant advantages with respect to attaining strong assurance of correctness, ease of maintenance, and optimal performance. This is particularly likely to matter for constrained platforms, where we are limited in our ability to trade space for time and where we cannot rely on C compilers to generate efficient enough code. It seems worth trying to build a prototype to see if such an approach makes sense.
Recommended Reading
- Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes [2013] by Shay Gueron and Vlad Krasnov. The code is ecp_nistz* in OpenSSL.
- Practical realisation and elimination of an ECC-related software bug attack [2011] by B. B. Brumley, M. Barbosa, D. Page, and F. Vercauteren.
- #3607: nistz256 is broken [2014-2015] by Adam Langley and Andy Polyakov.
- On why 0.10's release notes say "we have reason to believe that libsecp256k1 is better tested and more thoroughly reviewed than the implementation in OpenSSL" [2015] by Greg Maxwell.
- [openssl-dev] ecp_nistz256 correctness/constant-timedness [2015] and RE: [openssl-dev] ecp_nistz256 correctness/constant-timedness [2015] by Brian Smith.
- Verifying Curve25519 Software [2014] by Yu-Fang Chen, Chang-Hong Hsu, Hsin-Hung Lin, Peter Schwabe, Ming-Hsien Tsai, Bow-Yaw Wang, Bo-Yin Yang, and Shang-Yi Yang.
- Montgomery Modular Multiplication on ARM-NEON Revisited [2014] by Hwajeong Seo, Zhe Liu, Johann Großschädl, Jongseok Choi, and Howon Kim.
- Efficient Arithmetic on ARM-NEON and Its Application for High-Speed RSA Implementation [2015] by Hwajeong Seo, Zhe Liu, Johann Großschädl, and Howon Kim.
- Curve25519: new Diffie-Hellman speed records [2006] by Daniel J. Bernstein.
- Fast Elliptic Curve Cryptography in OpenSSL [2012] by Emilia Käsper.
- Analysis of Efficient Techniques for Fast Elliptic Curve Cryptography on x86-64 based Processors [2010] by Patrick Longa and Catherine Gebotys.
- Analyzing and Comparing Montgomery Multiplication Algorithms [1996] by Çetin Kaya Koç, Tolga Acar, and Burton S. Kaliski Jr.
- [curves] Goldilocks optimizations [2014] by Michael Hamburg (reformatted and retitled by Trevor Perrin).
- New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors [2013] by Erdinc Ozturk, James Guilford, Vinodh Gopal, and Wajdi Feghali.
- Large Integer Squaring on Intel® Architecture Processors [2012] by Erdinc Ozturk, James Guilford, and Vinodh Gopal.