1 July 1999: Add Gilmore and Brazier comments.
29 June 1999
Source: John Brazier
The paper in original .DOC format: http://cryptome.org/nsa-study.doc (278K) or Zipped .DOC: http://cryptome.org/nsa-study.zip (44K)
Date: 29 June 1999
Version: Draft 1.0 for Discussion
John R T Brazier
19 Barttelot Rd
This document estimates the possible capabilities of the NSA in breaking certain types of cipher by exhaustive key search. It is a theoretical document, and only covers technical ability: in many cases the NSA will use alternative methods to break target cipher systems. The aim of this document is to give some estimates for the security of certain key lengths for some symmetrical (block) ciphers.
This is a draft discussion document; it still requires more work before being generally released.
This paper could not exist without the work carried out by the EFF in the construction of their DES cracker. Much of this work is based on their information and experience.
Theoretical research such as this always has to be based on some constraints and assumptions. Basic ones are listed here; others are discussed in more detail in the main body of the paper.
The starting point for this study is the DES-cracking machine that has been built by the Electronic Frontier Foundation, and has been fully described by their publication [EFF]. The key points of their publication are outlined below, and overall their device may be regarded as being made up of a PC, as controlling computer, plus massively replicated components for a highly parallelized architecture. This architecture is the basis of this paper.
1. Search Unit
The basic unit is a search unit, which can be loaded with information about the plaintext, the first 16 bytes of the associated cyphertext, its starting key value, and any initial vector (for the cipher block chaining, or CBC, mode of DES operation). It has some other registers that allow the unit to search for partially known plaintexts (ie in the situation where it is known that the plaintext is ASCII, but not what the letters actually are) and to investigate certain specialized challenges.
The search unit can then step through all lower 32 bits of a key, testing the plaintext against cyphertext. It runs at 40MHz, doing one decryption every 16 clock cycles, and so will test 2.5 million keys per second. When it has stepped through all 32 low bits of the key, it halts until the controlling computer reloads a new key and the unit continues. This happens about every half-hour.
Periodically, the unit will register an apparent match: a candidate solution. Many of these will be false positives because of the slightly odd way the unit matches the plaintext against cyphertext (it only does an approximate match for speed). The search unit halts and the controlling computer fully checks out the candidate. Because the number of false positives is relatively low, the controlling computer is not a bottleneck in this process.
2. Main Chip
The main chip contains 24 search units. They share the plaintext and cyphertext information, but each has its own key register as it searches its own part of the keyspace. There are address and data wires, power and ground lines, and also signal wires to indicate when a unit has stopped: this happens when there is a candidate solution or when the unit needs a new starting key. Thus each chip will search 2.5 x 24 = 60 million keys per second.
Each board has 64 search chips along with support circuitry, lights showing chip status and parallel interfaces to the PC. Each board can thus try 3.84 x 109 keys per second.
4. Overall Device
There are 24 boards, fitted into 2 chassis of twelve boards each. Thus the overall machine will check 9.216 x 1010 keys per second.
DES has a keyspace of 256 keys; on average a search will find the correct key when 255 keys have been tested. Thus as 255 = 3.603 x 1016, then the cracker will on average find a key in 3.603 x 1016 / 9.216 x 1010 = 390,937 seconds, or 4.52 days.
The costs of the DES breaker are given as follows:
Assuming that the costs can be averaged out, and ignoring the cost of the PC (which is minor), we can produce the following cost estimates for the construction of the machine (excluding design):
Naturally, the costs are all in, so the cost per search unit includes its proportion of manufacturing, board assembly, the chassis, the power unit, and so forth.
Let us now postulate that some large organisation such as the NSA wishes to produce a high-performance cracking device. This machine should be able to attack a modern, effective crypto system that is likely to be a DES replacement, or is similar to one. One possibility is the RC5 and RC6 family. RC5 was developed by Ron Rivest in 1994 [RLR], and RC6 by Rivest et al for the Advanced Encryption Standard [RRS].
1. The RC5 Algorithm
This algorithm is heavily backed by RSA Laboratories, perhaps the premier crypto design and licensing company in the world. It is the basis of a very public series of cipher-breaking challenges offered by RSA Labs. These challenges get progressively harder due to a unique feature of RC5: it is a parametized crypto system, and so is adjustable in its strength.
RC5 is, in fact, a family of algorithms, and should correctly be specified as RC5-w/r/b, where each of the parameters are:
For typical modern computers, the optimum word size is 32, and most people opt for the recommended 12 (double) rounds. So RC5 typically becomes RC5-32/12/b, with the key length being the variable factor in increasing the strength of the algorithm family. RSA Labs challenges go from RC5-32/12/5 all the way to RC5-32/12/16, with the 5, 6 and 7 byte keys having been broken, and RC5-32/12/8 under attack by Distributed.nets world-wide distributed keybreaking effort [DIN].
RC5 has been very heavily analysed; whilst there have been some theoretical observations about its effectiveness, and a suggestion that 16 rounds rather than 12 should be used, no real attack against RC5 has been mounted except for exhaustive keysearch. Thus RC5 is an excellent model on which to base a keycracking machine because it is meaningful to speak of an 80-bit or 96-bit key for RC5, and the length of the key is the measure of the algorithms difficulty (all other factors being equal).
However, RC5 is unlikely to be of interest to the NSA, for the simple reason that it is not one of the algorithms proposed for the Advanced Encryption Standard (AES). However RC6 is, and thus may well become the standard algorithm (and thus a primary NSA target).
2. The RC6 Algorithm
The RC6 algorithm is a son of RC5. Although based on RC5, a new primitive operation has been added (integer multiplication), and other modifications made to increase the diffusion achieved in each round (the improvements have come from analyses of RC5). The other main difference is that there are four registers rather than two: the cipher operates on 128-bit blocks of data at a time.
Otherwise, RC6 is a family of algorithms in exactly the same way as RC5,
and these are specified in the same way, as
RC6 actually encrypts slightly faster than RC5 in software [RRS]; although the operations are slightly more complex, the increased block size allows a faster throughput. The key setup is practically identical in both ciphers. In general, for the purposes of this paper, we will assume that breaking RC5 and RC6 by exhaustive keysearch are equally difficult, and that the time taken to examine the key space (when specified for the same key length and number of rounds) will be approximately the same.
3. Overall Machine Specifications
The machine can now be specified in a general sense. As in the EFF DES cracker, the core of the machine is the search unit. This unit would be targeted against the RC6 chip.
We will then state that each dedicated chip will have 64 search units and each board will have 64 chips (to reflect the higher densities that are now achievable). Boards will then be rack-mounted with their supporting electronics. Rather than going for parallel connections to PCs, each board will have its own Ethernet network interface card (NIC) built in and be connected to a LAN. In addition, the boards will have a special controller unit.
A group of boards will effectively have their own 10/100 Mbps LAN, with a host PC managing it. In turn, the PCs will connect back via a logically separate LAN to an overall controlling PC.
4. The Search Unit
As the search unit is targeted against RC6, it would have the following:
It is likely that for very fast decryption both the key schedule and the main encryption loops would be unwound to some extent in hardware. The search unit will only do one decryption on a 128-bit block to decide if it is a candidate, so the key schedule now carries a significant overhead on the search unit operations and must be optimised.
In addition, the matching operations would be optimized on the chip, so that while the decryption run on the current key is in progress the results of the previous key decryption are being matched. After matching, the key schedule processing would commence for the next key whilst the current key decryption is still running.
We will make the assumption that the NSA will spend a considerable amount
of time designing the search units, chips and boards. With this assumption,
we will define that the unit will have the following capabilities:
Because of the considerable effort in up-front design, we will take it that the overall cost per search unit will be the same for the RC6 cracker as for the EFF machine, at least in small numbers (production runs will be taken into account later). The increase in complexity is countermatched by the greater effort in design. Note that the chips will be more expensive as they will have more search units.
5. False Positive Rate
The expected rate of false positives has a major effect on the overall system design, as it specifies the frequency and bandwidth of communications between the search units and the external monitoring systems. This puts a limit on what can be done within certain time periods.
We have assumed that for speed a similar mechanism to that specified for the EFF machine is used: that there is one Plaintext Vector of 256 bits, that specifies allowed plaintext byte values irrespective of their position. This system is very flexible, but can generate quite a large number of false positives.
Consider a message that is to be tested; it is known to start with #0a394bdf LINK-CIRCUIT . The first sixteen characters are all different values (RC6 handles 16 bytes at a time). Thus the Plaintext Vector will be set up with 16 allowable bytes. The probability that by chance the first byte in the message decodes to one of the allowable values is 16/256 = 1/16. However, the probability that all 16 bytes, by chance, decrypt to an allowable value is (1/16)16 = 1/264. If the key length being searched for is 64 bits, and thus one is searching for 264 keys, then the probability is about 1 false positive in a complete search of the full key space: as often as finding the key.
However, a working machine may have to deal with the case where all that is known is that the text is ASCII. Depending on assumptions about the usage of punctuation, about 64 characters might be used in the message. This will lead to 64 acceptable values in the vector, which means that the probability of one byte matching is 1/4, and 16 bytes matching by chance as (1/4)16 = 1/232. If the key is 64 bits long, this means that there will be 232 false positives in a complete run. The false positive number will increase with key length (reaching 296 for a 128-bit key).
A complete 64-bit keybreak may be seen as 232 search unit runs, each one of which steps through 232 keys in 7.16 minutes (as specified above). This implies that each search unit is likely to find a false positive in each of its runs. This has implications, as will be discussed below.
6. The Chip
The chip will have 64 search units. It will be more complex than the EFF equivalent, but most of the increased complexity has already been defined in the search unit. The other increase in complexity is due to the fact that in each 32-bit key run in a known to be ASCII test a search unit may well find a false positive (but still a candidate solution). Clearly, the unit cannot stop and wait to be interrogated.
The chip needs to be able to put the search unit ID, the chip ID and the candidate key into registers that can be read by the board controller. It then allows the unit to carry on. The read/write process might delay the unit for a couple of cycles, but this is trivial over a search run.
This will increase the cost of the chip slightly for the increased logic above and beyond the EFF chip, and for the extra registers.
7. The Board
The boards have 64 chips, but their physical size would depend on the actual
fabrication method of the chips. Thus they may or may not be larger than
the EFF ones. The boards would service all the chips, but depart from the
EFF ones in two basic ways:
It is proposed that there will be 220 = 1,048,576 boards in the machine, racked into cabinets. This means that there are 220 * 64 * 64 = 232 search units. We can see that for a 64-bit key search, every search unit will do one run. Thus the maximum length of a search will be 7.16 minutes; on average it will be 3.6 minutes for a 64-bit key.
8. Host Controllers and Network
Each board will be connected onto an Ethernet with Cat-5 cabling and switches. There will be 512 PCs, thus each PC will manage 2048 boards. Although the physical cabling and how it will be switched will depend on an optimization study, for the purposes of this paper we can regard each PC as being on a private network to 2048 boards. Each board will connect to a hub at 10mbps, and the switches will feed back to the PC at 100mbps (the backbone). A number of hubs will connect to a switch.
The PCs will in turn be linked to an overall control PC, which starts the process off and gets the final results. There will be a logical host PC network, which has a much lower throughput then the PC to search boards one.
9. The Overall Machine
There are 1,048,576 boards. If we assume high chassis we can envisage 64 boards to a chassis: we are not assuming the 9U VMEbus specified by EFF. This gives 16,384 chassis. If each chassis sits on a square metre, and requires a square metre access space, on average 1.5 square metres will do per chassis (as two facing chassis can share the access space, and they would be in back-to-back rows). This leads to a requirement of 24,576 square metres, and will take up about 21% of the Pentagon (which is quoted at 117,400 square metres).
However, because the machine is so modularized, each PC with its 2048 boards could be physically quite separate, especially if the host PC network backbone were upgraded to fibre (at costs which would be minor compared to the overall cost of the machine, see below). Thus the machine could be distributed over a university campus or even a city, making its footprint more manageable.
The greatest potential bottleneck in the machine is related to the communications. This is under the situation where on a 64-bit key run the machine may generate 232 false positives as candidate results. Each PC will see 1/512 of the total false positives in a full run, or 223 positives. There are two potential bottlenecks: network throughput and PC processing.
10.1 Network Throughput
For each candidate result, the following information would be sent:
Given Ethernet frame (26 bytes) and IP headers (24 bytes), we could say that the whole package would be about 100 bytes to report back a key. There are 2048 (211) managed by each PC, thus each board will generate 212 false positive candidates (64 chips with 64 search units = 212). Thus each board will generate 212 * 100 = 409,600 bytes of candidate key reports in one run.
A run is 7.16 minutes, so a board will generate 953.4 bytes per second of communications, or 7,628bps. This is well within the capability of the 10Mbps link back to the hub.
There are 2048 boards linked to the PC, so the PC will see 7,628 * 2048 = 15.62Mbps on the 100Mbps links: 16% utilization should not be a problem, and be well within the switches capacity. Naturally, the PC will need a high-performance network card to handle input. Note that the actual amount of data is only about a third of the network throughput.
10.2 PC Throughput
Each PC will have to analyse 223 false positives. It will do this by decrypting the second, third and fourth 16-byte blocks in software with the candidate key (the first has already passed). Even under known to be ASCII conditions the chances of this matching by luck is (1/4)48, or one in 296. Thus this test will efficiently weed out false positives for up to a 128-bit key (more decipherments would have to be done for longer keys).
The PC must analyse 223 keys in the 7.16 minute run, or 19,927 keys per second. Currently, the Distributed.net client running the RC5 64-bit key breaking trial does 748,982.85 keys per second on a Pentium-II 300MHz processor running the Windows client. Given that the host PC would have to do three blocks, where the Distributed.net RC5 client may well only do one, at a first estimate we can see that the PCs would have the capacity to do some 249,660 keys per second (assuming that RC6 can be run as fast in software as RC5).
The conclusions are that the throughput should not be limiting. The components should be able to search a 64-bit keyspace in the given time.
11. Machine Functioning
The machine would function as follows:
There are several contributing factors to the cost.
12.1 Parallel Component Costs
Identifying the costs for the main bulk of the machine needs some estimations.
These may be described as so:
We thus get a total cost of $240,375,000 for the boards in their chassis, plus power and all fixings. This production run will give us 1,111,000 boards, an excess of 62,424 (approximately 6% spares).
12.2 PC Costs
Although EFFs board costs included the PC costs, we have decided to calculate them separately. This means that there is, in fact, extra margin of safety in the board cost estimates. The PCs need to be at least 300MHz Pentium II equivalents, with a high-performance 100Mbps network board. There are no special disk space or monitor requirements, nor for multimedia accessories. We have allocated a generous $1,000 per PC, thus making a total PC cost of $513,000 (512 hosts plus 1 overall controller). This includes all hardware and an operating system thrown in.
12.3 Networking Costs
A modern network based on switched Ethernet has most of the cost associated with the switches and the hubs. A review of pricing and performance indicates that we do not need special high-throughput switching for the design we have gone for, so a cost of $20 per port seems reasonable (estimate from catalogue prices). This is probably an over-estimate (especially for 1,048,576 + 513 ports, but again leaves room for error). This gives a total electronics cost of $20,981,780.
If we assume that the hub and switch cabinets are distributed amongst their client boards, then we may estimate that one switch/hub cabinet with 2048 ports will be surrounded by some 32 board cabinets with 2048 boards. In a typical configuration, this would be a space 8 metres by 5 metres, with the comms rack in the middle. Assuming that cables have to dog-leg and that there will be twists and turns to get into and out of cabinets (and go via floor and ceiling), the average cable run might be 8 metres.
So 2048 boards will require 8 * 2048 = 16,384 metres of Cat-5 cabling. This is about $100 for 300 metres (catalogue price), which gives us a cost of $5,461 for the board network cabling. We can raise this to $5,500 for the cabling back to the host PC. Thus for 512 PCs we reach a cable cost of $2,816,000. We will assume that the cable costs for the host PC network can be subsumed within this cost.
We need two RJ45 (or better) adapters per port. We have estimated at $1 per pair, giving a cost of $1,049,089 for the machine.
We thus end up with a total cost of $24,846,869 for the network.
12.4 Design Costs
The EFF machine cost $80,000 to design. We have taken this base figure, but
note that we have made the following important modifications:
We have thus made the generous assumption that each change doubles the design work (we are not expecting efficiencies). Thus 27 * $80,000 gives us a bill of $10,400,000 for design. It should be noted that at this level of design the chip may be much more efficient than we have specified.
12.5 Software Development Costs
Given that there is a model to follow in the EFF machine, we believe that the software development costs should not be excessive. As it is in assembler, we will assume 5 man-years at $100,000 per year, giving $500,000. Again, this is probably a generous overestimate.
12.6 Total Costs
Total costs are as follows:
|Main cracking unit
Or to all intents and purposes $280 million.
13. Machine Capabilities
We have seen that the machine can do a full exhaustive key search of a 64-bit key in 7.16 minutes. On average, only 50% of the keyspace needs to be searched, so the average keybreak will be in 3.58 minutes.
A 72-bit key will take 256 times the effort (it can be regarded as running the 64-bit key run 256 times). The machine is designed to take up to 256-bit keys, so the increase in key length can be directly handled. As each 64-bit key search ends the boards are assigned new parts of the key space. Thus a 72-bit key will take 3.58 * 256 = 916.48 minutes, or 15.27 hours.
An 80-bit key would take 15.27 * 256 = 163 days, and an 88-bit key some 144.3 years, which seems a bit long.
Thus this paper implies that the NSA can if they wish break 64 to 80-bit keys routinely at the investment cost of $280 million.
14. NSA Budget
It may be suggested that $280 million would not be credible. However, the NSA budget as of 1998/1999 is estimated at slightly under $4 billion, of which about $900 million is payroll and $2 billion operations [FAS]. This leaves some 1 billion for other purposes, of which around $250 million would appear to be procurement. Thus the proposed machine would be within 1 years NSA purchasing budget, and the actual cost would probably be spread over at least two or three years. Given that the project would also be experimental, there would probably be other budgets (such as direct military and R&D) that could help contribute.
Given the sort of money the US Government is willing to invest in projects (such as $8 billion so far for the new Space Lab), a budget of $280 million does not seem incredible. Whether the NSA really wants to build such a machine is another issue (see below).
There are a number of weak points in the machine specification. They are covered in the next section. However, there is evidence that indicates the estimates in this document are not unreasonable.
Matt Blaze et al in 1996 [BDR] estimated a $300 million machine could crack DES in 12 seconds. If we imagine an equivalent machine being capable of dealing with RC6 and longer keys, then a 64-bit key would take 12 * 256 = 51.2 minutes. If we apply Moores law, their machine would now do it in a quarter of the time (doubling raw capability every 18 months), which would imply that their machine could do it in 12.8 minutes compared to the 3.58 minutes for the machine proposed in this paper. The two estimates are extraordinarily close given that they appear to have been calculated from completely different premises.
Whilst it is unlikely that the machine described in this paper would ever
be built, there are a number of areas where the estimates and design could
The machine specified in this document would never be built, mostly because
the NSA is not going to spend $300 million to break a few keys. This is because
there are more economical ways of attacking peoples cryptography, such
Lastly, such a machine is of no use unless it is targeted against the standard cipher and it can break keys in seconds (so allowing for trawling over bulk messages from target locations). Thus whilst it is possible that the NSA have a DES cracker, they will not construct any replacement until the candidate AES is chosen, and if such a machine becomes feasible against the AES cipher.
[BDR] M Blaze, W Diffie, R L Rivest, B Schneier, T Shimomura, E Thompson and M Wiener. Minimal Key Lengths For Symmetric Ciphers To Provide Adequate Commercial Security, www.bsa.org/policy/encryption/cryptographers_c.html, January 1996.
[EFF] Cracking DES. Secrets of Encryption Research, Wiretap Politics & Chip Design. Electronic Frontier Foundation, May 1998, ISBN 1-56592-520-3.
[FAS] National Security Agency. Budget and Personnel. Maintained by John Pike, www.fas.org/irp/nsa/nsabudget.html.
[RLR] R L Rivest. The RC5 Encryption Algorithm. Proceedings of the 2nd workshop on Fast Software Encryption, Springer, 1995.
[RRS] R L Rivest, M J B Robshaw, R Sidney and Y L Yin. The RC6™ Block Cipher. www.rsa.com/rsalabs/aes/rc6v11.pdf, 1998.
To: John Young <firstname.lastname@example.org>, email@example.com cc: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org Subject: Re: Paper on NSA Crack Machine Date: Wed, 30 Jun 1999 18:22:50 -0700 From: John Gilmore <email@example.com> I was unable to read the paper, since it was in a Microsoft format, but I seemed to be able to skim the gist of it. EFF made many simplifying assumptions in its design, for two reasons: * To limit the risk of design failure (resulting in months of delay and higher cost to re-make chips). * To build a demonstration machine at the minimum easy price. If these constraints were not there, a very different design would be appropriate. A tuned custom chip would run at 400MHz or better (like modern CPUs). A design that could attack multiple ciphers would be far preferable for an intelligence organization. A field-programmable gate array would be best, if it can be had in high enough speeds and densities. More distributed intelligence, rather than a single central PC, would make the machine both faster and more flexible. The job of handing out blocks of keyspace to a million boards would take a significant fraction of 7 minutes, if done serially! However, this significantly complicates the programming, debugging, and diagnosis of the machine. I tend to agree that a small processor with an Ethernet chip should probably reside in each chassis, if not on each board. Power distribution was a major issue in the real machine, which was not addressed in the book. A large cost of such a brute force cracker is the electrical power needed to run it over its lifetime. The EFF machine ended up requiring about 60 amps at 110 volts (three separate US power circuits). Designing the chips for low power will be critical to meeting power and heat constraints that permit them to be closely packed in physical space. Our boards were hard to manufacture and debug even in small volumes; the design would need to take ease of manufacturing into account. E.g. each chip on our boards can short the global data bus; a design that partitioned the bus (so that most of the board would work and only part of it would fail if a short existed) would aid both diagnosis, and offer degraded operation. A more complex key sequencer would be far preferable too. Access Data is reportedly working on a brute force machine, which spends as much effort generating the next key to be tested, as it does on testing the key. Once you have large key sizes, being able to test the keys in the order of their likelihood (based on poor software randomness, other factors visible from outside, etc) give a big win in overall cracking time. The EFF boards should've had individual power regulation to each chip, individual clock generation for each chip, and temperature sensing. This would permit the design to exploit process variations to run each chip at its individual maximum clock rate and minimum voltage, and thus intelligently permit the chips to use less power and be more closely packed. The cost estimates are off. The critical cost is of a chip in mass production. The cost of search units is a red herring. The cost of a chip of a given area, with a given number of masks, is relatively fixed, no matter what the chip does. In a machine of this size, the design cost is inconsequential. Hire ten people at the top of their fields, for two years, pay them extravagantly: cost under 10 million dollars, or under 4% of the total cost. The machine will need to be very fault-tolerant, keeping track of everything and discarding units that seem to be failing. Software should also be able to partition the machine and run several searches simultaneously at slower speed. Any such design will need to be internally modular, since it will contain elements from several generations of design development. I.e. once you've deployed say a quarter million boards of the first design, when you add better boards designed a year or two later, they need to work together easily. We could go on like this for days, probably... John
From: "John R T Brazier" <firstname.lastname@example.org> To: "'John Gilmore'" <email@example.com>, "'John Young'" <firstname.lastname@example.org> Cc: <email@example.com>, <firstname.lastname@example.org>, <email@example.com>, <firstname.lastname@example.org> Subject: RE: Paper on NSA Crack Machine Date: Thu, 1 Jul 1999 21:17:49 +0100 Dear John, Thanks for the comments - exactly what's needed to develop the thing further. The whole thing started originally to answer a question that had been raised a while ago: what length key could the NSA crack? However, I was asked to provide evidence: which is why the machine is a bit odd. It's grounded in the EFF design, allowing extrapolations, whilst trying to be pessimistic (ie only 100MHz chips rather than 400). So it is a slightly weird machine, but it's easy to follow the logic as to how one got to it from a machine that actually exists. I didn't want to get into multi-purpose chips as I couldn't resolve the effective cost/benefits of different options. For example, is it better to have four farms, each highly targeted on one cipher, or one farm four times the size but with multi-purpose processors? Assuming, of course, that we want similar overall throughputs. I used the search unit cost simply as a way of measuring chip cost, as a chip with more search units will take up a larger area. I would like to explore this in much more detail - and will! I'm very interested in the Access Data 'directed key search' idea - any references? Thanks again, John B