Gene expression programming

These computer programs are complex tree structures that learn and adapt by changing their sizes, shapes, and composition, much like a living organism.

And like living organisms, the computer programs of GEP are also encoded in simple linear chromosomes of fixed length.

Thus, GEP is a genotype–phenotype system, benefiting from a simple genome to keep and transmit the genetic information and a complex phenotype to explore the environment and adapt to it.

In gene expression programming the linear chromosomes work as the genotype and the parse trees as the phenotype, creating a genotype/phenotype system.

Masood Nekoei et al. used this style of expression programming in ABC optimization to create ABCEP, a method that outperformed other evolutionary algorithms.

The genome of gene expression programming consists of a linear, symbolic string or chromosome of fixed length composed of one or more genes of equal size.

The reason for these noncoding regions is to provide a buffer of terminals so that all k-expressions encoded in GEP genes always correspond to valid programs or expressions.
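The buffering role of the noncoding region can be illustrated with a minimal sketch. It assumes the usual GEP conventions: a gene is a head followed by a tail, the tail holds only terminals, the gene is read breadth-first, and the tail length is t = h(n_max − 1) + 1 for head length h and maximum arity n_max. The function set and all names below are hypothetical and chosen only for illustration.

```python
import operator

# Hypothetical function set: symbol -> (arity, implementation).
FUNCTIONS = {
    "+": (2, operator.add),
    "-": (2, operator.sub),
    "*": (2, operator.mul),
}

def arity(symbol):
    """Arity of a symbol; anything outside FUNCTIONS is treated as a terminal."""
    return FUNCTIONS[symbol][0] if symbol in FUNCTIONS else 0

def tail_length(head_length, max_arity):
    """Tail size that always supplies enough terminals: t = h * (n_max - 1) + 1."""
    return head_length * (max_arity - 1) + 1

def coding_length(gene):
    """Number of leading symbols that are expressed (the k-expression)."""
    open_slots = 1  # the root still has to be filled
    used = 0
    while open_slots > 0:
        open_slots += arity(gene[used]) - 1
        used += 1
    return used

def evaluate(gene, env):
    """Evaluate a gene read breadth-first; env maps terminals to values."""
    n = coding_length(gene)
    arities = [arity(s) for s in gene[:n]]
    # In breadth-first order, the children of node i occupy a contiguous block.
    first_child, next_free = [], 1
    for a in arities:
        first_child.append(next_free)
        next_free += a
    values = [None] * n
    for i in range(n - 1, -1, -1):  # children are always to the right of parents
        if arities[i] == 0:
            values[i] = env[gene[i]]
        else:
            args = values[first_child[i]:first_child[i] + arities[i]]
            values[i] = FUNCTIONS[gene[i]][1](*args)
    return values[0]

# Head "+*a" (h = 3), tail "bbab" (t = 4): only the first 5 symbols are expressed.
print(evaluate("+*abbab", {"a": 2, "b": 3}))
```

In this sketch the gene "+*abbab" encodes (b * b) + a, and the last two tail symbols are the unused noncoding region. Even a head made entirely of binary functions, such as "+++", finds exactly enough terminals in a tail of length 4.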

It is also easy to see that all kinds of genetic modification (mutation, inversion, insertion, recombination, and so on) are trivial to implement, with the guarantee that all resulting offspring encode correct, error-free programs.
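A minimal sketch of why such a guarantee can hold, using point mutation as the example. It assumes the head/tail organization of GEP genes: any symbol may appear in the head, but only terminals may appear in the tail. The symbol sets, the `mutate` and `is_valid` names, and the mutation rate are hypothetical.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2}  # hypothetical symbol -> arity
TERMINALS = ["a", "b"]                # hypothetical terminal set

def mutate(gene, head_length, rate, rng):
    """Point mutation that respects the head/tail boundary."""
    symbols = list(gene)
    for i in range(len(symbols)):
        if rng.random() < rate:
            if i < head_length:
                pool = list(FUNCTIONS) + TERMINALS  # head: anything goes
            else:
                pool = TERMINALS                    # tail: terminals only
            symbols[i] = rng.choice(pool)
    return "".join(symbols)

def is_valid(gene, head_length):
    """True if the tail holds only terminals and the k-expression closes."""
    if any(s not in TERMINALS for s in gene[head_length:]):
        return False
    open_slots = 1
    for s in gene:
        open_slots += FUNCTIONS.get(s, 0) - 1
        if open_slots == 0:
            return True
    return False

rng = random.Random(0)
gene = "+*abbab"  # head length 3, tail length 4
for _ in range(5):
    gene = mutate(gene, head_length=3, rate=0.3, rng=rng)
    print(gene, is_valid(gene, 3))
```

Because the operator never places a function in the tail, no repair step is needed after mutation. The same boundary rule is what the other operators would have to respect in an implementation along these lines.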

Some examples of more complex linkers include taking the average, the median, the midrange, thresholding their sum to make a binomial classification, applying the sigmoid function to compute a probability, and so on.
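The linkers listed above can be sketched as plain functions over the list of values returned by the sub-ETs. All function names are hypothetical, and the default threshold of 0 as well as the choice to apply the sigmoid to the sum of the outputs are assumptions made for this illustration.

```python
import math
import statistics

def link_average(outputs):
    return sum(outputs) / len(outputs)

def link_median(outputs):
    return statistics.median(outputs)

def link_midrange(outputs):
    return (min(outputs) + max(outputs)) / 2

def link_threshold(outputs, threshold=0.0):
    """Binomial classification: 1 if the summed outputs reach the threshold."""
    return 1 if sum(outputs) >= threshold else 0

def link_sigmoid(outputs):
    """Probability from the summed outputs (numerically stable sigmoid)."""
    s = sum(outputs)
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

sub_et_outputs = [1.0, 2.0, 6.0]  # made-up values for three sub-ETs
for linker in (link_average, link_median, link_midrange,
               link_threshold, link_sigmoid):
    print(linker.__name__, linker(sub_et_outputs))
```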

These linking functions are usually chosen a priori for each problem, but they can also be evolved elegantly and efficiently by the cellular system[6][7] of gene expression programming.

In other words, homeotic genes determine which sub-ETs are called upon and how often in which main program or cell and what kind of connections they establish with one another.

The expression of the normal genes results as usual in different sub-ETs, which in the cellular system are called ADFs (automatically defined functions).

Each homeotic gene in this system puts together a different combination of sub-expression trees or ADFs, creating multiple cells or main programs.

These extra domains usually encode random numerical constants that the algorithm relentlessly fine-tunes in order to find a good solution.

Of these preparative steps, the crucial one is the creation of the initial population, which is generated randomly from the elements of the function and terminal sets.
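As a rough sketch, a random initial population could be generated as follows. It assumes the head/tail gene layout with tail length t = h(n_max − 1) + 1; the function and terminal sets, and the names `random_gene` and `random_population`, are hypothetical.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2, "/": 2}  # hypothetical symbol -> arity
TERMINALS = ["a", "b", "c"]                   # hypothetical terminal set

def random_gene(head_length, rng):
    """One gene: a head over functions and terminals, a tail over terminals."""
    max_arity = max(FUNCTIONS.values())
    tail_len = head_length * (max_arity - 1) + 1
    head_pool = list(FUNCTIONS) + TERMINALS
    head = [rng.choice(head_pool) for _ in range(head_length)]
    tail = [rng.choice(TERMINALS) for _ in range(tail_len)]
    return "".join(head + tail)

def random_population(size, genes_per_chromosome, head_length, seed=None):
    """A list of fixed-length chromosomes, each made of equally sized genes."""
    rng = random.Random(seed)
    return [
        "".join(random_gene(head_length, rng)
                for _ in range(genes_per_chromosome))
        for _ in range(size)
    ]

for chromosome in random_population(size=3, genes_per_chromosome=2,
                                    head_length=4, seed=1):
    print(chromosome)
```

Every chromosome produced this way has the same length, and every gene in it is structurally valid from the start.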

Broadly speaking, there are three kinds of problems, distinguished by the kind of prediction being made. The first is regression; the second is classification, with logistic regression as a special case in which, besides crisp classifications such as "Yes" or "No", a probability is also attached to each outcome; and the third is related to Boolean algebra and logic synthesis.

One way to improve this type of hits-based fitness function consists of expanding the notion of correct and incorrect classifications.

So by counting the TP, TN, FP, and FN and further assigning different weights to these four types of classifications, it is possible to create smoother and therefore more efficient fitness functions.
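One possible way to turn weighted confusion-matrix counts into a fitness value is sketched below. The weighting scheme (weighted correct classifications divided by the weighted total) and the names `confusion_counts` and `weighted_fitness` are assumptions made for illustration, not a formula taken from the text. With all weights equal to 1 it reduces to plain accuracy.

```python
def confusion_counts(predicted, actual):
    """Count TP, TN, FP, and FN for binary labels encoded as 0 and 1."""
    tp = tn = fp = fn = 0
    for p, a in zip(predicted, actual):
        if p == 1 and a == 1:
            tp += 1
        elif p == 0 and a == 0:
            tn += 1
        elif p == 1 and a == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def weighted_fitness(predicted, actual,
                     w_tp=1.0, w_tn=1.0, w_fp=1.0, w_fn=1.0):
    """Weighted share of correct classifications, in the range [0, 1]."""
    tp, tn, fp, fn = confusion_counts(predicted, actual)
    reward = w_tp * tp + w_tn * tn
    penalty = w_fp * fp + w_fn * fn
    total = reward + penalty
    return reward / total if total else 0.0

predicted = [1, 0, 1, 0, 1]
actual = [1, 0, 0, 1, 1]
print(weighted_fitness(predicted, actual))            # plain accuracy
print(weighted_fitness(predicted, actual, w_fn=3.0))  # false negatives cost more
```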

For instance, one can combine some measure based on the confusion matrix with the mean squared error evaluated between the raw model outputs and the actual values.
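A hedged sketch of such a combination: accuracy (a confusion-matrix measure) is scaled by 1 / (1 + MSE), so that two models with the same number of hits are ranked by how close their raw outputs are to the targets. The multiplicative form, the 0.5 rounding threshold, and all function names are assumptions chosen for illustration.

```python
def mean_squared_error(raw_outputs, targets):
    n = len(targets)
    return sum((r - t) ** 2 for r, t in zip(raw_outputs, targets)) / n

def accuracy(raw_outputs, targets, threshold=0.5):
    """Share of hits after converting raw outputs to 0/1 at the threshold."""
    hits = sum((1 if r >= threshold else 0) == t
               for r, t in zip(raw_outputs, targets))
    return hits / len(targets)

def combined_fitness(raw_outputs, targets, threshold=0.5):
    """Accuracy damped by the error of the raw outputs (higher is better)."""
    mse = mean_squared_error(raw_outputs, targets)
    return accuracy(raw_outputs, targets, threshold) / (1.0 + mse)

targets = [1, 0, 0, 1]
hesitant = [0.9, 0.2, 0.6, 0.4]   # two hits, somewhat off on the hits
confident = [1.0, 0.0, 0.6, 0.4]  # the same two hits, exact on both
print(combined_fitness(hesitant, targets))
print(combined_fitness(confident, targets))
```

Both made-up models have the same accuracy, but the second scores higher because its raw outputs lie closer to the actual values, which is the kind of finer granularity that a purely hits-based measure cannot provide.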

More exotic fitness functions that explore model granularity include the area under the ROC curve and rank measure.

Popular examples of fitness functions based on the probabilities include maximum likelihood estimation and hinge loss.

The replication of the selected programs is a fundamental piece of all artificial evolutionary systems, but for evolution to occur it needs to be implemented not with the usual precision of a copy instruction, but rather with a few errors thrown in.

For example, in a simple chromosome composed of only one gene with a head size of 7, the Dc stretches over positions 15–22, and the terminal "?" represents the placeholder for the RNCs.
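The position arithmetic and the role of the "?" placeholder can be sketched as follows. The sketch assumes a maximum arity of 2, a Dc of the same length as the tail, and Dc symbols that are single digits indexing into an array of random constants, consumed in the order in which the "?" placeholders appear in the k-expression. The function names and the sample values are hypothetical.

```python
def domain_positions(head_length, max_arity):
    """Inclusive, zero-based position ranges of the head, tail, and Dc."""
    tail_len = head_length * (max_arity - 1) + 1
    dc_len = tail_len  # assumption: the Dc is as long as the tail
    head = (0, head_length - 1)
    tail = (head_length, head_length + tail_len - 1)
    dc = (tail[1] + 1, tail[1] + dc_len)
    return {"head": head, "tail": tail, "dc": dc}

def resolve_constants(k_expression, dc, constants):
    """Replace each '?' with the constant selected by the next Dc symbol."""
    dc_symbols = iter(dc)
    resolved = []
    for symbol in k_expression:
        if symbol == "?":
            resolved.append(constants[int(next(dc_symbols))])
        else:
            resolved.append(symbol)
    return resolved

print(domain_positions(head_length=7, max_arity=2))

constants = [1.5, 2.0, -0.7, 4.2]  # made-up array of random constants
print(resolve_constants("+?*a?", dc="30", constants=constants))
```

With a head size of 7 and binary functions, the tail occupies positions 7–14 and the Dc positions 15–22, which agrees with the range given above. Using single digits as Dc symbols limits this sketch to at most ten constants per gene.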

Furthermore, Dc-specific operators, such as mutation, inversion, and transposition, are also used to aid a more efficient circulation of the RNCs among individual programs.

An artificial neural network (ANN or NN) is a computational device that consists of many simple connected units or neurons.

So, in order to fully simulate an artificial neural network one must somehow encode these components in a linear chromosome and then be able to express them in a meaningful way.

For each NN-gene, the weights and thresholds are created at the beginning of each run, but their circulation and adaptation are guaranteed by the usual genetic operators of mutation, transposition, inversion, and recombination.

Decision trees (DT) are classification models in which a series of questions and answers is mapped using nodes and directed edges.

Note that the edges connecting the nodes are properties of the data, specifying the type and number of branches of each attribute, and therefore don't have to be encoded.

The process of decision tree induction with gene expression programming starts, as usual, with an initial population of randomly created chromosomes.

The chromosomal architecture includes an extra domain for encoding random numerical constants, which are used as thresholds for splitting the data at each branching node.

Expression of GEP genes as sub-ETs. a) A three-genic chromosome with the tails shown in bold. b) The sub-ETs encoded by each gene.
Expression of a unicellular system with three ADFs. a) The chromosome composed of three conventional genes and one homeotic gene (shown in bold). b) The ADFs encoded by each conventional gene. c) The main program or cell.
Expression of a multicellular system with three ADFs and two main programs. a) The chromosome composed of three conventional genes and two homeotic genes (shown in bold). b) The ADFs encoded by each conventional gene. c) Two different main programs expressed in two different cells.