Warren Gish

In 1985, with a view toward rapid identification of restriction enzyme recognition sites in DNA, Gish developed a DFA function library in the C language.

The idea to apply a finite-state machine to this problem had been suggested by fellow graduate student and BSD UNIX developer Mike Karels.
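The finite-state approach can be illustrated with a minimal Python sketch. It is not Gish's C library: the function names `build_dfa` and `find_sites` are hypothetical, and the Aho-Corasick-style construction is an assumption chosen for illustration. The three enzymes and their recognition sequences (EcoRI, BamHI, HindIII) are standard textbook examples rather than details from the account above. The point is that a single pass over the DNA, with one table lookup per base, reports every site for every enzyme at once.

```python
from collections import deque

ALPHABET = "ACGT"


def build_dfa(sites):
    """Build a multi-pattern DFA (Aho-Corasick style) over ACGT.

    sites maps an enzyme label to its recognition sequence.
    Returns (delta, output): delta[state][base] is the next state and
    output[state] lists the labels of all sites ending at that state.
    """
    goto = [{}]      # trie edges for each state
    output = [[]]    # labels reported when a state is entered
    for name, pattern in sites.items():
        state = 0
        for base in pattern:
            if base not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][base] = len(goto) - 1
            state = goto[state][base]
        output[state].append(name)

    # Breadth-first pass: complete every missing transition by borrowing
    # from the state's failure (longest proper suffix) state.
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for base in ALPHABET:
        nxt = goto[0].get(base, 0)
        delta[0][base] = nxt
        if nxt:
            queue.append(nxt)
    while queue:
        state = queue.popleft()
        output[state] = output[state] + output[fail[state]]
        for base in ALPHABET:
            if base in goto[state]:
                nxt = goto[state][base]
                fail[nxt] = delta[fail[state]][base]
                delta[state][base] = nxt
                queue.append(nxt)
            else:
                delta[state][base] = delta[fail[state]][base]
    return delta, output


def find_sites(dna, delta, output, sites):
    """Scan dna once and return (0-based start, label) for every site found."""
    hits = []
    state = 0
    for i, base in enumerate(dna.upper()):
        # Any character outside ACGT (for example N) resets the machine.
        state = delta[state].get(base, 0)
        for name in output[state]:
            hits.append((i - len(sites[name]) + 1, name))
    return hits


# Illustrative recognition sequences (well-known textbook examples).
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}
DELTA, OUTPUT = build_dfa(SITES)
print(find_sites("TTGAATTCGGATCCAAGCTT", DELTA, OUTPUT, SITES))
```

Because the transition table is complete, the scan never backs up in the input, so its cost grows with the length of the DNA rather than with the number of enzymes being searched.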

Berkeley in December 1986, Gish sped up the FASTP program[7] (later known as FASTA[8]) of William R. Pearson and David J. Lipman by 2- to 3-fold without altering the results.

His other contributions to BLAST include: the use of compressed nucleotide sequences, both as an efficient storage format and as a rapid, native search format; parallel processing; memory-mapped I/O; the use of sentinel bytes and sentinel words at the start and end of sequences to improve the speed of word-hit extension; the original implementations of BLASTX,[9]

Gish was also the creator of and project manager for the earliest NCBI Dispatcher for distributed services (inspired by CORBA's Object Request Broker).

His WU-BLAST development received little NIH funding: support averaged 20% FTE, starting in November 1995 and ending shortly after the September 1997 release of the NCBI gapped BLAST (“blastall”).

As an option in WU-BLAST, Gish implemented a two-hit BLAST algorithm that was faster, more memory-efficient and more sensitive than the one used by the NCBI software for many years.

WU-BLAST with XDF was the first BLAST suite to support indexed retrieval of NCBI standard FASTA-format sequence identifiers (including the entire range of NCBI identifiers); the first to allow retrieval of individual sequences in part or in whole, natively, translated or reverse-complemented; and the first able to dump the entire contents of a BLAST database back into human-readable FASTA format.

In 2000, unique support for reporting of links (consistent sets of HSPs; also called chains in some later software packages) was added, along with the ability for users to limit the distance between HSPs allowed in the same set to a biologically relevant length (e.g., the length of the expected longest intron in the species of interest) and with the distance limitation entering into the calculation of E-values.
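The effect of a distance limit on link formation can be sketched in Python as follows. This is a simplified greedy grouping written for illustration, not the WU-BLAST algorithm: the `HSP` class, the `link_hsps` function and the `max_gap` parameter are hypothetical names, coordinates are assumed to be 1-based and inclusive, the subject is assumed to be the genomic sequence, and the sketch ignores strand, scoring and the E-value calculation entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSP:
    q_start: int   # query coordinates, 1-based inclusive
    q_end: int
    s_start: int   # subject coordinates, 1-based inclusive
    s_end: int
    score: float


def link_hsps(hsps, max_gap):
    """Greedily group HSPs into consistent sets with bounded subject gaps.

    An HSP joins an existing set only if it follows that set's last HSP
    in both query and subject order, and the number of subject positions
    separating the two is at most max_gap (for example, the expected
    longest intron). Otherwise it starts a new set.
    """
    links = []
    for hsp in sorted(hsps, key=lambda h: (h.s_start, h.q_start)):
        for link in links:
            last = link[-1]
            gap = hsp.s_start - last.s_end - 1
            if hsp.q_start > last.q_end and 0 <= gap <= max_gap:
                link.append(hsp)
                break
        else:
            # No compatible set was found, so this HSP begins its own.
            links.append([hsp])
    return links


# Three hypothetical HSPs: the first two are 1,000 positions apart on the
# subject, while the third lies far downstream.
example = [
    HSP(1, 50, 100, 249, 90.0),
    HSP(51, 100, 1250, 1399, 85.0),
    HSP(101, 150, 50000, 50149, 80.0),
]
for link in link_hsps(example, max_gap=5000):
    print([(h.s_start, h.s_end) for h in link])
```

With a 5,000-position limit the distant third HSP is reported separately, whereas a much larger limit would merge all three into one set; choosing a biologically plausible limit therefore keeps unrelated downstream matches out of a link.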