21: Storage Technology

Storage media

Magnetic storage

A refresher. Recall that E/M are closely related. Current is a flow of electrons. The flow of electrons generates a magnetic field according to the right hand rule.

You know what else generates a magnetic field? Magnets (aka magnetic materials). You can align the field in a magnet in various ways, one of which is inducing a strong enough external magnetic field near it (like, say, by use of an electrical current).

The movement of a conductor across a magnetic field (such as generated by magnets) generates a current, which you can measure.

In other words, the generation of a current by movement of a conductor lets you "read" the magnetic field. And applying a strong enough current lets you "overwrite" an existing magnet's orientation.

These two facts are the basis of magnetic storage media.

Magnetic data read/write

The term flux describes a magnetic field with a specific direction (or "polarity"). The drive head creates "flux reversals" on the medium to record data. The given pattern it creates for a particular sequence of bits is described by the "encoding method."

To create a reversal, the write head reverses the voltage (the direction of the current flow), aka "reverses the polarity" of the electricity to reverse the polarity of the flux. When reading over the disk, the head sees no voltage until it crosses a reversal, then the flux transition induces a small amount of current flow (and thus voltage) in the head.

Lots of electronics and precision machinery come together to make this process work at high speeds, sending one or more heads to the right part of a spinning platter of magnetic media and timing the careful application of voltage (to write) or sensors, amplifiers, filters, etc. (to read).

Data encoding

Hard disks use run-length limited (RLL) encoding (or similar techniques) to store bits in a sequence of flux reversals. RLL encodes bits a group at a time. The term RLL is derived from its two primary parameters: the minimum number (the run length) and the maximum number (the run limit) of transition cells allowed between two actual flux transitions.

A simple code is the "Modified Non-Return-to-Zero-Inverted" code, where 1 is a transition and 0 is a non-transition. We write N for no-flux-transition, and T for flux-transition:

0 -> N
1 -> T

The problem you run into is that if there's any instability in the speed of the drive, it's possible to desynchronize during long runs of zeros.
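A tiny sketch of this code (the function name is mine, purely for illustration) makes the failure mode visible: a run of zeros produces a stretch of medium with no transitions at all, leaving the read electronics nothing to synchronize against.

```python
def nrzi_encode(bits):
    """Map each bit to a flux slot: 1 -> 'T' (transition), 0 -> 'N' (none)."""
    return "".join("T" if b else "N" for b in bits)

# A long run of zeros yields a featureless stretch of medium:
print(nrzi_encode([1, 0, 0, 0, 0, 0, 0, 1]))  # TNNNNNNT
```

If the platter drifts even slightly in speed across those six Ns, the head can't tell whether it read six zeros or five or seven.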

RLL encodings help by making sure there are occasional transitions regardless of underlying bit pattern; this helps make sure the mechanical spin of the drive is kept synchronized with what the electronics are expecting (this is called "self-clocking").

A very simple RLL code, (0,1) RLL, is essentially FM encoding:

0 -> T N
1 -> T T

This was used in very early floppy drives.
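The same kind of sketch for FM / (0,1) RLL (again an illustrative function, not anything real) shows the self-clocking property: the first slot of every bit cell is a transition, so the read head is guaranteed a transition at least every other slot regardless of the data.

```python
def fm_encode(bits):
    """FM / (0,1) RLL: 0 -> 'TN', 1 -> 'TT'. The first slot of every
    bit cell is a clock transition, so a run of Ns can never exceed one."""
    return "".join("TT" if b else "TN" for b in bits)

encoded = fm_encode([0, 1, 0])
print(encoded)        # TNTTTN
print(encoded[0::2])  # TTT -- every even slot is a clock transition
```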

But it's got a lot of overhead (two flux slots per bit). Other popular codes are (1,7) and (2,7) RLL. For example, (1,7) RLL encodes two bits as three N/T slots, and certain four-bit sequences as six N/T slots:

00 -> TNT
01 -> TNN
10 -> NNT
11 -> NTN

0000 -> TNT NNN
0001 -> TNN NNN
1000 -> NNT NNN
1001 -> NTN NNN

In this encoding, there's always at least one non-transition slot between two transitions, and at most seven N slots between Ts. These bounds help us keep things synchronized without too much overhead.
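A greedy encoder over the table above can be sketched as follows. The selection rule is my reconstruction, not stated in the text: the two-bit codes for 00 and 10 end in T, and those for 00 and 01 begin with T, so concatenating them naively would put two transitions back to back; the four-bit codes exist precisely for those four collisions (0000, 0001, 1000, 1001).

```python
TWO_BIT = {"00": "TNT", "01": "TNN", "10": "NNT", "11": "NTN"}
FOUR_BIT = {"0000": "TNTNNN", "0001": "TNNNNN",
            "1000": "NNTNNN", "1001": "NTNNNN"}

def rll17_encode(bits: str) -> str:
    """Encode a bit string using the (1,7) RLL table above (toy version)."""
    assert len(bits) % 2 == 0
    out, i = [], 0
    while i < len(bits):
        chunk = bits[i:i + 4]
        if chunk in FOUR_BIT:      # collision-avoiding substitution
            out.append(FOUR_BIT[chunk])
            i += 4
        else:                      # plain two-bit code
            out.append(TWO_BIT[bits[i:i + 2]])
            i += 2
    return "".join(out)

encoded = rll17_encode("110000")
print(encoded)              # NTNTNTNNN
assert "TT" not in encoded  # never two adjacent transitions
```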

Flash memory

Flash memory of the type we see in SSDs is built atop floating-gate metal–oxide–semiconductor field-effect transistors. Say that five times fast. These are programmable transistors that act like NAND gates (readable). They start in one state and can be "set" (written) to another, or "reset" (erased).

For those of you who aren't EE/ECE (including me), what you need to know is that flash memory can be addressed bit-by-bit if need be. It's organized into (conceptually vertical) strings (linear arrays of transistors that behave like NAND gates) of 32 or 64 transistors, each representing a bit. Individual gates are addressed by horizontal "word lines".

We can read. We can write, once. But to re-write, a whole "block" of NAND gates must be reset to their base state first, for reasons we're not going to go into since this isn't an ECE course. Notably, erasing is the slowest operation by far. (Typically by at least one order of magnitude.)

SSD controllers then have a choice. They can allow overwrites of a given "sector" by erasing then writing. Or they can dynamically map sectors to arbitrary, ready-to-write (already erased) locations in their flash arrays. The "empty" but not erased sectors can then be erased by the drive controller when the drive is otherwise idle.

The former strategy leads to lackluster performance, since writes are then only as fast as erasures.

If the drive has "extra" space, it can add "empty" sectors to this space and erase them as time permits, then return them to the pool of "available" space. But the best plan is to get extra information from the OS about when a sector can be erased (like, when it's deleted in the filesystem) and then add that sector to the list of "empty" sectors to be scheduled to be erased.
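The remapping strategy can be sketched as a toy model (real flash translation layers are far more involved, and all names here are mine):

```python
class ToySSD:
    """Toy flash translation layer: logical sectors map to physical
    pages; an overwrite goes to a pre-erased page instead of paying
    for an erase on the write path."""

    def __init__(self, n_pages):
        self.mapping = {}                   # logical sector -> physical page
        self.erased = list(range(n_pages))  # pool of ready-to-write pages
        self.dirty = []                     # written-but-stale pages

    def write(self, sector, data):
        if sector in self.mapping:
            self.dirty.append(self.mapping[sector])  # old page goes stale
        page = self.erased.pop()  # fast path: no erase needed right now
        self.mapping[sector] = page
        # (data would be programmed into `page` here)

    def garbage_collect(self):
        """Run while idle: erase stale pages back into the ready pool."""
        self.erased.extend(self.dirty)
        self.dirty.clear()

ssd = ToySSD(n_pages=8)
ssd.write(0, b"old")
ssd.write(0, b"new")   # overwrite lands on a fresh page...
print(len(ssd.dirty))  # 1  ...leaving one stale page to erase later
ssd.garbage_collect()
print(len(ssd.dirty))  # 0
```

The design point: the slow erase is moved off the write path and deferred to idle time, which is exactly why the controller wants to know as early as possible which pages are stale.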

Enter the TRIM command.

TRIM

TRIM is, in short, a command that the disk driver in the OS can send to SSD-based disk controllers. TRIMming a sector tells the SSD that the sector in question can be erased -- it is not being used to store data any more. Most modern OSes now support TRIM. The better integrated the OS/hardware/etc., the more likely it is to work, as the OS, driver, interface, and controller must all correctly implement TRIM for the command to be issued. This is becoming more common as SSDs become more standard.

Blocks scheduled for TRIMming will be erased by the controller even if the device is behind a write blocker. Whether or not the old data will be visible if the block is read depends upon the type of TRIM implemented:

  • non-deterministic: Who knows? Could be the original data, or zeros, or something else. No guarantee of consistency.
  • deterministic zero after trim (DZAT): All read commands after a TRIM return zeroes until new data are written to the page.
  • deterministic read after trim (DRAT): The data returned by SSDs supporting DRAT (as opposed to DZAT) can be all zeroes or other data (such as the original pre-TRIM data stored in that logical page). The essential point is that the values read from a TRIMmed logical page do not change between when the TRIM command is issued and when new data get written into that logical page.
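The three modes above can be contrasted in a toy model (the modes and their guarantees come from the list; the function itself is illustrative):

```python
import random

def read_after_trim(mode, original, page_size=4):
    """Model what a read of a TRIMmed logical page may return."""
    if mode == "DZAT":
        return bytes(page_size)  # always zeroes, guaranteed
    if mode == "DRAT":
        # Some fixed value -- maybe zeroes, maybe the old data -- but the
        # SAME value on every read until the page is rewritten. Returning
        # the original data is one legal choice, frozen here.
        return original
    # non-deterministic: anything goes, possibly different on each read
    return random.choice([original, bytes(page_size)])

print(read_after_trim("DZAT", b"\xde\xad\xbe\xef"))  # b'\x00\x00\x00\x00'
```

This is why DZAT matters forensically: under DZAT, nothing recoverable is visible through normal reads once the TRIM lands, even though the flash cells may not have been physically erased yet.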

Interfaces

OK, so now we know something about how data are stored on disks. How do we connect disks to our machines?

ATA

Internal drives are typically connected by "Serial ATA," which has mostly obsoleted the previous parallel ATA used in PCs (and SCSI used in older Macs).

SATA transfers data at 150, 300, or 600 MB/s (signaling at 1.5, 3, or 6 Gbps) as opposed to parallel ATA, which topped out around 133 MB/s (there are various EE-related challenges with parallel data transmission, that is, with wires run in parallel, that SATA overcomes).

The connector has seven pins: G T+ T- G R- R+ G then a notch, so it can only be inserted one way. The signal(s) sent along the Transmit and Receive paths are differentially encoded, that is, a positive and negative version of the signal are sent along parallel wires to help mitigate interference (the signal is encoded in the difference between the values on the two wires, so that interference, which tends to affect all nearby electronics equally, is canceled out). This technique is used all over the place: "twisted pair" Ethernet, HDMI, USB, etc.
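Why the difference cancels common-mode noise can be shown in a few lines (an idealized numeric sketch, not real signal processing):

```python
def transmit(bit, noise):
    """Drive the pair: T+ carries the signal, T- its inverse. The same
    noise couples onto both wires (common mode)."""
    signal = 1.0 if bit else -1.0
    plus = signal + noise
    minus = -signal + noise
    return plus, minus

def receive(plus, minus):
    """The receiver looks only at the difference, so the shared noise
    term subtracts away."""
    return (plus - minus) > 0

# Even with noise several times larger than the signal itself:
plus, minus = transmit(1, noise=5.0)
print(receive(plus, minus))  # True
```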

As you know, ATA drives used to be addressed by CHS values, but modern disk controllers expect an LBA, and translate this into disk geometry out of the view of the OS.
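The classic translation goes like this, shown for a hypothetical geometry of 16 heads and 63 sectors per track (sector numbers are 1-based; cylinders and heads are 0-based):

```python
# Assumed geometry for illustration only -- real drives report their own.
HEADS, SECTORS_PER_TRACK = 16, 63

def chs_to_lba(cylinder, head, sector):
    """Flatten a cylinder/head/sector triple into a linear block address."""
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)

print(chs_to_lba(0, 0, 1))  # 0    -- the very first sector on the disk
print(chs_to_lba(1, 0, 1))  # 1008 -- one full cylinder (16 * 63) later
```

Modern drives run this in reverse internally: the OS hands over an LBA, and the controller decides where that block actually lives.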

SCSI

SCSI was for a while a more scalable alternative to PATA, seen mostly in Macs and high-end PC systems. It's mostly been deprecated for internal use in favor of SATA, and for external use in favor of USB or Firewire.

IEEE 1394 aka Firewire

Firewire was developed by Apple as a successor to SCSI, kinda accidentally in parallel with USB. It is much, much faster than the USB of that era, though.

The original 1394 spec (from 1994, two years before USB1.0) was a 400 Mbps protocol ("Firewire 400"). Later revisions brought it up to 1.6 Gbps.

Steve Jobs infamously pronounced Firewire dead in 2008. Macs have clearly moved on to USB and Thunderbolt (which is a combination of PCI Express and DisplayPort over a single serial signal). Confusingly, Thunderbolt 3 (the latest) uses a USB-C connector and is an "alternate mode" for USB-C.

USB

USB is, like SATA, a serial interface designed for general data transmission. USB has gone through a series of versions, each faster than the previous. USB is mostly backward compatible from a device perspective (that is, old devices can be plugged into a newer controller, but not necessarily the reverse). USB 3 introduced a new form factor for the plugs that physically enforces this limitation.

USB 1.0 (1996) topped out at 1.5 Mbps; 1.1 went to 12 Mbps.

USB 2 goes to 480 Mbps in theory, around 280 Mbps in practice due to bus access constraints.

USB 3 signals at up to 5 Gbps in theory, but due to encoding overhead (10 bits sent per 8 bits of payload) the payload rate is actually 4 Gbps, and the spec says 3.2 Gbps effective throughput (due to clock sync, etc.) is reasonable.
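The encoding arithmetic behind those numbers:

```python
# USB 3's 8b/10b line code sends 10 raw bits for every 8 payload bits.
signal_gbps = 5.0
payload_gbps = signal_gbps * 8 / 10
print(payload_gbps)  # 4.0 -- before protocol/clock-sync overhead
```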

USB 3.1 signals at 10 Gbps with an effective rate of around 7.2 Gbps.

There are a variety of connectors for the various versions of USB:

https://en.wikipedia.org/wiki/USB#Host_and_device_interface_receptacles

USB-C connectors are also used for Thunderbolt 3. Thanks, Apple!