کد مقاله | کد نشریه | سال انتشار | مقاله انگلیسی | نسخه تمام متن |
---|---|---|---|---|
432646 | 689003 | 2016 | 18 صفحه PDF | دانلود رایگان |

• Most operations hitherto deemed hard to realize in RNS share the same bottleneck.
• Finding the LSB of an unknown in the Chinese Remainder Theorem is that bottleneck.
• Show 2 new fast methods to solve the bottleneck, one meets analytic speed bound.
• Moduli selection enables exhaustive pre-computation even at very long word lengths.
We identify a canonical parameter in the Chinese Remainder Theorem (CRT) and call it the “Reconstruction Coefficient”, (denoted by “RCRC”); and introduce the notions of “Partial” and “Full” Reconstruction. If the RCRC can be determined efficiently, then arithmetic operations that are (relatively) harder to realize in RNS; including Sign Detection, Base change/extension and Scaling or division by a constant can also be implemented efficiently. This paper therefore focuses on and presents two distinct methods to efficiently evaluate the RCRC at long wordlengths. A straightforward application of these methods leads to ultra-fast sign-detection.An independent contribution of this paper is to illustrate non-trivial trade-offs between run-time computation vs. pre-computation and look-up. We show a simple method to select the moduli which leads to both the (i) number of RNS channels nn; as well as (ii) the largest channel modulus mnmn satisfying {O(n),O(mn)}⪅N≡ the full-precision bit-length. The net result is that for many canonical operations; exhaustive look-up covering all possible input values is feasible even at long cryptographic bit-lengths NN. Under fairly general and realistic assumptions about the capabilities of current hardware, the memory needed for exhaustive look-up tables is shown to be bounded by a low degree polynomial of nn. Moreover, both methods to compute RCRC can achieve a delay of O(lgn) in a RN system with nn channels. To the best of our knowledge, no other method published to date has shown a path to achieve that lower bound on the execution delay. Further, small values of channel moduli make it ideal to implement each individual RNS channel on a simple core in a many-core processor or as a distributed node, and our algorithms require a limited number of inter-channel communications, averaging O(n)O(n). Results from a multi-core GPU implementation corroborate the theory.
Journal: Journal of Parallel and Distributed Computing - Volume 97, November 2016, Pages 78–95