S370/135 Unusual Design Flaw

Let me describe a "design flaw" on IBM's S370/135 processor. It will take a while to explain.

Memory Error Correction

The main memories of all IBM 370's used ECC (Error-Correcting Codes) in which 8 check bits are appended to 64 data bits. If any one of these 72 bits flips incorrectly, that error can be corrected; this is called an SBE (single-bit-error). If any two of the 72 bits flip, the DBE (double-bit-error) cannot be corrected, but it will be detected. (Never mind triple-bit-errors, quadruple-bit-errors, etc.)

Most of the models used exactly the same ECC code. For our purpose, three simple examples, presented in hexadecimal, will suffice:

Data bits            Check bits  Expected check bits  Syndrome  Response
0000 0000 0000 0000  FE          FE                   00        (none)
0000 0000 0000 1000  99          99                   00        (none)
0000 0000 0000 1000  FE          99                   67        Invert bit 51

In the examples, you see that a change to bit 51 leads to a change in five of the check bits. The check code is computed when the data is written, and again when the data is read, for comparison with the stored check bits. Exclusive-or'ing the expected and actual check bits yields a "syndrome," which will be zero if there has been no error at all.
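The syndrome arithmetic for the third line of the table can be checked in a few lines of Python. This is only a sketch of the exclusive-or step; the variable names are mine, and the two check-bit values are taken straight from the table.

```python
# Values from the third row of the table (hexadecimal).
stored_check = 0xFE    # check bits actually read back from storage
expected_check = 0x99  # check bits recomputed from the data that was read

# The syndrome is the exclusive-or of the two.
syndrome = stored_check ^ expected_check

# Number of check bits that differ between the two values.
bits_changed = bin(syndrome).count("1")

print(f"syndrome = {syndrome:02X}, differing check bits = {bits_changed}")
```

A zero syndrome (rows one and two of the table) means the stored and recomputed check bits agree.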

The syndrome will have 2, 4, 6 or 8 bits set when an uncorrectable double-bit-error has occurred, but an odd number of set bits when an odd number of bit flips has occurred. (There are 128 possible such odd-bit syndromes; 72 of them map to specific single-bit-errors; the other 56 can only arise from triple-bit-errors, quintuple-bit-errors, etc.)
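The parity rule and the counts in the parenthetical can be verified with a short sketch. The function name `classify` and its return strings are my own labels, not anything from the hardware.

```python
def classify(syndrome: int) -> str:
    """Classify an 8-bit syndrome by the number of bits set (illustrative labels)."""
    weight = bin(syndrome & 0xFF).count("1")
    if weight == 0:
        return "no error"
    if weight % 2 == 0:
        return "uncorrectable (even number of bit flips)"
    return "odd number of bit flips"

# Count the 8-bit syndromes with an odd number of bits set.
odd_syndromes = [s for s in range(256) if bin(s).count("1") % 2 == 1]
single_bit_positions = 72  # 64 data bits + 8 check bits
unused_odd = len(odd_syndromes) - single_bit_positions

print(len(odd_syndromes), unused_odd)
print(classify(0x67))
```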

As shown in the third line of the table, if bit 51 flips due to a memory error, the expected and actual check bits will mismatch. The computed syndrome ('67') will direct the hardware to "correct" (flip) bit 51. Data rewritten to storage will then be that shown in the table's first line. (The stored data will continue to be that of the third line if the memory cell has a permanent flaw and can't record a 0. ... But never mind, the ECC circuitry will continue to correct that error on every read. ... Well ... almost any read.)

Single-bit-errors are corrected, so aren't really errors and are supposed to be completely invisible to the application software. (In the case of the Model 135, they do introduce a 165 nanosecond delay for the correction to occur.)

Double-bit-errors, on the other hand, cannot be ignored. Memory reads of such a location must cause a Machine-Error interrupt.

Note that part of a checking block cannot be written without first reading the entire block. That's because the entire block is needed to compute the check bits; thus even a partial write to a location with a double-bit-error must also cause a Machine-Error interrupt.

Storage Validation

IBM S370 Models 155 and larger can replace eight bytes at once. This introduces no ECC problem; if the written location had a prior error that error will go unnoticed since the location isn't prefetched -- a "Fast Write" is performed.

The smaller models, however, could not do Fast Write. The Model 135 was a 16-bit machine and needed special hardware just to do a 32-bit write. It was impossible to replace a complete 64-bit ECC block (72 with check bits) in one swoop with a Fast Write operation. It might seem logical that the 135 should have used a 32-bit ECC block. That they didn't may be largely due to the need for 7 check bits with 32 data bits -- an 8% increase in the number of bits compared with the 72-bit block and (more importantly?) a deviation from their common packaging in which many chips and circuit boards were standardized to support 8 or 9 bits.
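The "7 check bits" and "8%" figures follow from the usual bound for single-error-correcting, double-error-detecting codes. Here is a sketch of that arithmetic, assuming an extended-Hamming-style code; the function name is mine.

```python
def secded_check_bits(data_bits: int) -> int:
    """Minimum check bits for a SEC-DED code over `data_bits` data bits.

    Single-error correction needs r bits with 2**r >= data_bits + r + 1;
    one more bit is added for double-error detection.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

# Storage needed for 64 data bits under each scheme.
one_64bit_block = 64 + secded_check_bits(64)           # one 72-bit block
two_32bit_blocks = 2 * (32 + secded_check_bits(32))    # two 39-bit blocks
increase = two_32bit_blocks / one_64bit_block - 1

print(secded_check_bits(32), secded_check_bits(64))
print(one_64bit_block, two_32bit_blocks, f"{increase:.1%}")
```

Under these assumptions two 32-bit blocks cost 78 bits against 72 for one 64-bit block, which is where the roughly 8% comes from.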

When all writes are partial writes, there must be a programmable mechanism to override an uncorrectable error. Otherwise, whenever a word develops such an error, the error can never be removed! To support operation on these smaller models, IBM defined a "Validating Write" operation, where the DBE signal is overridden to allow the write to complete. Any macroinstruction which completely replaces a checking block could be a "validating write" but IBM's Principles of Operation specify that a macroinstruction must be a validating write when:

The Model 135 had a 16-bit format for microinstructions. When the opcode was Storage-Write, one of the microinstruction bits was an On/Off flag to specify "Validating Write." The purpose of that signal was to suppress an error signal on the fetch portion of the partial write; the Model 135 did that in the simplest way: by suppressing the ECC circuitry altogether.

370/135 Registers

All S370 models have 16 32-bit general purpose registers (64 bytes altogether), 64 bytes for floating point registers, as well as registers useful to the particular Model. Sounds like a lot.

However, the 128 bytes of registers available to the application code are of no direct use to the microprogrammer -- he can't overwrite application data! The 135 has only a paltry 14 bytes of register space useful to the microprogrammer! These are W0-W6, seven 16-bit registers. (Actually there is an eighth register, W7, but with a special function that renders it useless to the microprogrammer. Moreover, there are eight times as many registers as just implied: one set of registers for each of eight zones. But the zones are used for channel emulators, etc. Only one zone is available when interpreting S370 macroinstructions.)

Having only 14 bytes of working register space makes programming difficult! (There is some other space available, but slower to access, and carrying its own restrictions.)

It's been 35 years since I've actually looked at S370/135 microcode, but it is easy to see that even a simple operation like Move Characters strains the limit of working registers. To exploit 32-bit writes, you need 2 registers just for the data, perhaps 2 registers each for source and destination addresses, and another register for the byte count. You're left with zero working registers in reserve!

Virtual Addressing

The S370 introduced a memory-mapping feature such that any storage access had both a virtual and a physical address associated with it. For our purpose here, we need just note that the test mentioned above ("Source and destination do not overlap") used virtual addresses for the test on the Model 135.

The test just mentioned will give the same result whether virtual or physical addresses are used unless one physical address is mapped to two different virtual addresses. (I do not know whether any IBM Operating System would do such a thing.)
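To make the aliasing case concrete, here is a small sketch in which an overlap test passes on virtual addresses but would fail on physical ones. Everything in it is hypothetical: the page size, the page table contents, and the function names are chosen only for illustration and are not taken from the Model 135.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only

# Hypothetical page table: two different virtual pages map to one physical frame.
page_table = {0x10: 0x7, 0x20: 0x7}

def translate(vaddr: int) -> int:
    """Map a virtual address to a physical address via the toy page table."""
    return page_table[vaddr // PAGE_SIZE] * PAGE_SIZE + vaddr % PAGE_SIZE

def overlaps(a: int, b: int, length: int) -> bool:
    """True if the byte ranges [a, a+length) and [b, b+length) intersect."""
    return a < b + length and b < a + length

src, dst, length = 0x10000, 0x20000, 8

virtual_overlap = overlaps(src, dst, length)
physical_overlap = overlaps(translate(src), translate(dst), length)
print(virtual_overlap, physical_overlap)
```

The virtual test reports no overlap, while the same test on the translated addresses reports that source and destination are the same storage.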

Bizarre Case -- Should this be considered a Bug?

On an MVC or MVCL instruction, the Model 135 must check for the validating write condition and branch to a special routine when it is fulfilled. Ideally, it will fetch eight bytes, then write eight bytes, and loop. But it can't do that because, as explained above, it lacks sufficient work registers. So it reads four bytes, writes four bytes, reads four bytes, writes four bytes and loops. (As a challenge, explain why the first write in this loop must differ from the second write before reading ahead.)

The first write in each checking block must have the Validate flag set in the microinstruction. That means that a single-bit-error in the other four bytes of that block will not be corrected. (Validate must be turned OFF for the second write so that a single-bit-error in the first four bytes doesn't get made permanent.)

But consider what happens if the source and destination of the Move Characters are the same! (This should function as a No-operation.) Suppose, for definiteness, that the 72 bits in storage are 0000-0000-0000-1000-FE as in the table above; that is, all zeros with a correctable single-bit-error. The Model 135 will read the first word (0000-0000) and write it back with the Validation flag set. The data is now 0000-0000-0000-1000-99. The damage is done; the next step will read 0000-1000 and write 0000-1000. A single-bit-error has been made permanent!

The 135 microcode checks for overlap before branching to the special Validating-Write loop, so this can only happen when virtual addresses differ but physical addresses are the same.
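The sequence can be replayed with a toy model. Be warned that this is a sketch under loud assumptions: I do not have IBM's actual ECC matrix, so `COLUMNS` below is a made-up code that agrees with the table only on the two facts the table gives (all-zero data has check bits FE, and bit 51 has syndrome 67). All function names are mine, and the model ignores microcode details entirely.

```python
MASK64 = (1 << 64) - 1
ALL_ZERO_CHECK = 0xFE  # from the table: check bits for all-zero data

def _make_columns():
    # Hypothetical code: each data bit gets a distinct weight-3 or weight-5
    # syndrome. Only the entry for bit 51 (0x67) comes from the table.
    candidates = [v for v in range(256) if bin(v).count("1") in (3, 5)]
    candidates.remove(0x67)
    cols = candidates[:63]
    cols.insert(51, 0x67)
    return cols

COLUMNS = _make_columns()

def check_bits(data):
    """Compute the 8 check bits for 64 data bits (bit 0 is the leftmost bit)."""
    c = ALL_ZERO_CHECK
    for bit in range(64):
        if (data >> (63 - bit)) & 1:
            c ^= COLUMNS[bit]
    return c

def ecc_read(data, check):
    """Read a block through the ECC logic, correcting a single-bit error."""
    syndrome = check_bits(data) ^ check
    if syndrome == 0:
        return data
    if syndrome in COLUMNS:                      # single data-bit error
        return data ^ (1 << (63 - COLUMNS.index(syndrome)))
    if bin(syndrome).count("1") == 1:            # single check-bit error
        return data
    raise RuntimeError("uncorrectable error: Machine-Error interrupt")

def read_word(block, half):
    """Read one 32-bit half (0 = first word, 1 = second word) with ECC."""
    data = ecc_read(*block)
    return (data >> (32 if half == 0 else 0)) & 0xFFFFFFFF

def write_word(block, half, word, validate):
    """Partial write of one 32-bit half; Validate suppresses ECC on the fetch."""
    data, check = block
    if not validate:
        data = ecc_read(data, check)             # normal fetch, with correction
    shift = 32 if half == 0 else 0
    data = (data & ~(0xFFFFFFFF << shift) & MASK64) | (word << shift)
    return data, check_bits(data)                # fresh check bits are stored

def move_block_onto_itself(block):
    """Read 4, write 4 (Validate on), read 4, write 4 (Validate off)."""
    block = write_word(block, 0, read_word(block, 0), validate=True)
    block = write_word(block, 1, read_word(block, 1), validate=False)
    return block

before = (0x0000_0000_0000_1000, 0xFE)           # third row of the table
after = move_block_onto_itself(before)
print(f"before: reads as {ecc_read(*before):016X}")
print(f"after:  stored {after[0]:016X}-{after[1]:02X}, reads as {ecc_read(*after):016X}")
```

In this model the block reads back as all zeros before the move, but afterwards it is stored with matching check bits and reads back with bit 51 set, with no error reported.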

The MRST/370 Diagnostic Program

I'm very proud of my MRST memory diagnostic. It had too many fine features to detail here. :-) At some point I learned that its "Test 9" was reporting errors that no other diagnostic reported, though only on the Model 135. Very briefly, Test 9 mapped virtual addresses to physical addresses randomly, and then did copies, hoping to toggle address bits in a complicated zigzag fashion all over memory. Sometimes, randomly, two virtual addresses would be mapped to the same physical address.

I'm afraid the conclusion seems anti-climactic. I guess I won't even try to relate another of those old bugs. :-(


Go back to James Allen's home page.