Substructure search

Substructure search (SSS) is a method to retrieve from a database only those chemicals matching a pattern of atoms and bonds which a user specifies.

The mathematical foundations for the method were laid in the 1870s, when it was suggested that chemical structure drawings were equivalent to graphs with atoms as vertices and bonds as edges.

There are many commercial systems that provide SSS, typically having a graphical user interface and chemical drawing software.

Large publicly available databases like PubChem and ChemSpider can be searched this way, as can Wikipedia's articles describing individual chemicals.

Substructure search is implemented using a specialist type of query language, and in real-world applications the search may be further constrained using logical operators on additional data held in the database.
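As a rough illustration of constraining a search with logical operators on additional data, the sketch below combines a list of structure hits with conditions on other database columns. It uses Python's standard sqlite3 module. The table name, the column names, the "tested" flags, and the helper constrain_hits are all hypothetical and do not describe any particular commercial system; the list of hit identifiers is simply assumed to come from an earlier substructure search step.

```python
import sqlite3

# Hypothetical in-memory database; schema and "tested" flags are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compounds ("
    "id INTEGER PRIMARY KEY, name TEXT, mol_weight REAL, tested INTEGER)"
)
conn.executemany(
    "INSERT INTO compounds VALUES (?, ?, ?, ?)",
    [
        (1, "ethanol", 46.07, 1),
        (2, "diethyl ether", 74.12, 1),
        (3, "benzene", 78.11, 0),
    ],
)


def constrain_hits(conn, hit_ids, max_weight, tested_only):
    """Keep structure-search hits that also satisfy the extra data conditions."""
    if not hit_ids:
        return []
    placeholders = ",".join("?" for _ in hit_ids)
    sql = (
        f"SELECT name FROM compounds WHERE id IN ({placeholders}) "
        "AND mol_weight <= ? AND (tested = 1 OR ? = 0) ORDER BY id"
    )
    rows = conn.execute(sql, [*hit_ids, max_weight, int(tested_only)])
    return [name for (name,) in rows]


# Assume an earlier substructure search returned compounds 1 and 2.
structure_hits = [1, 2]
print(constrain_hits(conn, structure_hits, max_weight=50.0, tested_only=True))
```

With these made-up records, only ethanol survives both the weight limit and the "tested" condition. Real systems may evaluate the structural and non-structural conditions in a single query rather than in two steps as here.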

If a user wished to limit the hits to alcohols, then the query structure would have to be drawn with an "explicit hydrogen", as C–C–O–H, and ether would then no longer match.
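The effect of the explicit hydrogen can be sketched by treating molecules and queries as graphs and testing for a subgraph match by backtracking. This is a minimal illustration, not the algorithm of any particular product: the helpers make_molecule and has_substructure are hypothetical, bond orders, charges and stereochemistry are ignored, and the plain C–C–O query is included only as an assumed point of comparison.

```python
def make_molecule(heavy_atoms, bonds, h_counts):
    """Build (atoms, adjacency) with every hydrogen added as an explicit atom."""
    atoms = list(heavy_atoms)
    adj = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    for i, n in enumerate(h_counts):
        for _ in range(n):
            h = len(atoms)
            atoms.append("H")
            adj[h] = {i}
            adj[i].add(h)
    return atoms, adj


def has_substructure(molecule, query):
    """True if every query atom and bond can be mapped onto the molecule."""
    m_atoms, m_adj = molecule
    q_atoms, q_adj = query

    def extend(mapping):
        q = len(mapping)  # index of the next query atom to place
        if q == len(q_atoms):
            return True
        for cand in range(len(m_atoms)):
            if cand in mapping.values() or m_atoms[cand] != q_atoms[q]:
                continue
            # Each already-placed neighbour of q must be bonded to cand.
            if all(mapping[n] in m_adj[cand] for n in q_adj[q] if n in mapping):
                mapping[q] = cand
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


ethanol = make_molecule(["C", "C", "O"], [(0, 1), (1, 2)], [3, 2, 1])
diethyl_ether = make_molecule(
    ["C", "C", "O", "C", "C"], [(0, 1), (1, 2), (2, 3), (3, 4)], [3, 2, 0, 2, 3]
)
query_cco = make_molecule(["C", "C", "O"], [(0, 1), (1, 2)], [0, 0, 0])
query_ccoh = make_molecule(["C", "C", "O"], [(0, 1), (1, 2)], [0, 0, 1])

for name, mol in [("ethanol", ethanol), ("diethyl ether", diethyl_ether)]:
    print(name, has_substructure(mol, query_cco), has_substructure(mol, query_ccoh))
```

In this sketch both molecules match the C–C–O query, but only ethanol matches C–C–O–H, because the oxygen in diethyl ether carries no hydrogen.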

[25][26] The idea that chemical structures as depicted using drawings of the type introduced by Kekulé were related to what is now called graph theory was suggested by the mathematician J. J. Sylvester in 1878.

[27][28] Arthur Cayley had already, in 1874, considered how to enumerate chemical isomers, in what was an early approach to molecular graphs, where atoms are at vertices and bonds correspond to edges.

[31] In the 20th century, chemists developed standard ways to show structural formulae, especially for individual organic compounds that were increasingly being synthesized and tested as potential drugs or agrochemicals.[32][6] By the 1950s, as the number of compounds made and tested grew, the first attempts to create chemical databases were made and the sub-discipline of cheminformatics was established.

Patent examiners have to search published literature to decide whether an invention is novel, which for chemical patents often means finding known examples within the generic claims of a Markush structure.

Importantly, the existing literature had to be made searchable, and a way to input a chemical structure query and return the matching results had to be devised.

These requirements had been partially met as early as 1881, when Friedrich Konrad Beilstein introduced the Handbuch der organischen Chemie (Handbook of Organic Chemistry), which classified known chemicals systematically so that all examples containing a given heterocycle would be located together.

This weekly subscription service included a printed publication with summaries of articles in thousands of scholarly journals and claims in worldwide patents.

[38] However, it was only when the CAS records had been fully converted into machine-readable form and the internet was available to connect its database to end-users that comprehensive searching became possible.

[46] The need to combine chemistry search with biological data produced by screening compounds at ever-larger scales led to the implementation of systems such as MACCS.

The drug lenalidomide contains substructures isoindoline (red) and glutarimide (blue)
Kekulé structure of benzene, 1872
Example of a Markush structure