Inherently, multi-task learning is a multi-objective optimization problem with trade-offs between different tasks.[1][2][3]
In a widely cited 1997 paper, Rich Caruana gave the following characterization: "Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias."
One example is a spam filter, which can be treated as distinct but related classification tasks across different users.
To make this more concrete, consider that different people have different distributions of features which distinguish spam emails from legitimate ones; for example, an English speaker may find that all emails in Russian are spam, while a Russian speaker would not.
Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance.
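One classical way to let per-user solutions inform each other is to model each user's weight vector as a shared component plus a small user-specific deviation, penalizing both. The sketch below is illustrative only: the data is synthetic, and the sizes and penalty values (`mu`, `rho`) are arbitrary choices, not part of any specific published system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "users": each task t has weights w0 + v_t, a shared component
# plus a small user-specific deviation (illustrative data, not a real corpus).
T, d, n = 3, 4, 50
w0_true = rng.normal(size=d)
V_true = 0.3 * rng.normal(size=(T, d))
Xs = [rng.normal(size=(n, d)) for _ in range(T)]
ys = [X @ (w0_true + v) + 0.01 * rng.normal(size=n) for X, v in zip(Xs, V_true)]

# Joint estimation: minimize
#   sum_t ||y_t - X_t (w0 + v_t)||^2 + mu ||w0||^2 + rho sum_t ||v_t||^2
# by stacking [w0; v_1; ...; v_T] into one ridge-style least-squares problem.
mu, rho = 1.0, 1.0
p = d * (T + 1)
ZtZ = np.zeros((p, p))
Zty = np.zeros(p)
for t, (X, y) in enumerate(zip(Xs, ys)):
    Z = np.zeros((n, p))
    Z[:, :d] = X                              # shared block
    Z[:, d * (t + 1): d * (t + 2)] = X        # task-specific block
    ZtZ += Z.T @ Z
    Zty += Z.T @ y
penalty = np.diag([mu] * d + [rho] * (d * T))
theta = np.linalg.solve(ZtZ + penalty, Zty)

w0_hat = theta[:d]
w_hats = [w0_hat + theta[d * (t + 1): d * (t + 2)] for t in range(T)]
```

Each user's effective weights `w0_hat + v_t` borrow statistical strength from the other users through the shared component, which is exactly the benefit the joint formulation is after.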
One situation where MTL may be particularly helpful is when the tasks share significant commonalities and are generally slightly undersampled.[8] However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks.
Within the MTL paradigm, information can be shared across some or all of the tasks, and there are several ways to structure this sharing.
For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric.
Suppose, as developed more formally below, that the parameter vector modeling each task is a linear combination of some underlying basis; similarity among tasks in terms of this basis can then indicate their relatedness.[7][11] Hierarchical task relatedness can also be exploited implicitly, without assuming a priori knowledge or learning task relations explicitly.
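When task parameter vectors truly are combinations of a shared basis, the matrix stacking them is low rank, and that structure is visible even from independently fitted tasks. The following sketch, on synthetic data with arbitrary sizes, fits each task separately with ridge regression and then inspects the singular values of the stacked estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tasks whose true weight vectors are linear combinations of a shared basis B:
# w_t = B @ s_t with B in R^{d x r}, so W = [w_1 ... w_T] has rank <= r.
d, r, T, n = 10, 2, 6, 200
B = rng.normal(size=(d, r))
S = rng.normal(size=(r, T))
W_true = B @ S

# Fit each task independently with ridge regression.
lam = 1e-3
W_hat = np.zeros((d, T))
for t in range(T):
    X = rng.normal(size=(n, d))
    y = X @ W_true[:, t] + 0.01 * rng.normal(size=n)
    W_hat[:, t] = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The stacked estimates are numerically low rank: the top-r singular
# values capture almost all of the Frobenius energy, revealing the basis.
sv = np.linalg.svd(W_hat, compute_uv=False)
energy_top_r = (sv[:r] ** 2).sum() / (sv ** 2).sum()
```

MTL methods exploit this by estimating the basis and the per-task combinations jointly, rather than recovering the structure after the fact as done here.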
In many applications, joint learning of unrelated tasks which use the same input data can be beneficial.
Novel methods that build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed.
The programmer can impose a penalty on tasks from different groups which encourages the two representations to be orthogonal.
Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods.
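The orthogonality penalty described above can be sketched very compactly: measure the overlap between the two group representations with a Frobenius norm, which is zero exactly when their column spaces are orthogonal. All names and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def orthogonality_penalty(U, V):
    """Penalty ||U^T V||_F^2: zero iff the column spaces of U and V are
    orthogonal, so minimizing it pushes the two group representations apart."""
    return np.linalg.norm(U.T @ V, ord="fro") ** 2

# Two task-group representations sharing an ambient space (toy sizes).
d, k = 8, 3
U = np.linalg.qr(rng.normal(size=(d, k)))[0]   # orthonormal columns
V = rng.normal(size=(d, k))

before = orthogonality_penalty(U, V)
# Deflating V against U (removing its component in span(U)) drives the
# penalty to zero, which is what the regularizer encourages during training.
V_perp = V - U @ (U.T @ V)
after = orthogonality_penalty(U, V_perp)
```

In practice the penalty is added to the joint training objective with a weight, so gradient descent trades task accuracy against separation of the group representations.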
Large scale machine learning projects such as the deep convolutional neural network GoogLeNet,[13] an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks.
For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm.
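A minimal sketch of this feature-extractor pattern is below. The frozen network here is a stand-in with random weights, a placeholder for a genuinely pretrained model such as GoogLeNet; only the cheap linear "head" is trained on the new task, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a pretrained network: a frozen one-hidden-layer MLP with
# fixed random weights (placeholder -- in practice these weights would
# come from a model trained on a large source task).
d_in, d_hidden = 20, 64
W1 = rng.normal(size=(d_in, d_hidden)) / np.sqrt(d_in)
b1 = rng.normal(size=d_hidden)

def extract_features(X):
    """Frozen forward pass up to the penultimate layer."""
    return np.maximum(X @ W1 + b1, 0.0)   # ReLU activations

# New downstream task: train only a ridge-regression "head" on the features.
n = 200
X = rng.normal(size=(n, d_in))
H = extract_features(X)
w_head_true = rng.normal(size=d_hidden)
y = H @ w_head_true + 0.01 * rng.normal(size=n)

lam = 1e-3
w_head = np.linalg.solve(H.T @ H + lam * np.eye(d_hidden), H.T @ y)
train_mse = np.mean((H @ w_head - y) ** 2)
```

Because the extractor is frozen, the downstream learner only sees the representation, which is precisely how robust features developed on one task transfer to related ones.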
The extension of such methods to non-stationary environments is termed Group online adaptive learning (GOAL).
Such group-adaptive learning has numerous applications, from predicting financial time-series, through content recommendation systems, to visual understanding for adaptive autonomous agents.
Multitask optimization: In some cases, the simultaneous training of seemingly related tasks may hinder performance compared to single-task models, a phenomenon known as negative transfer.
Commonly, the per-task gradients are combined into a joint update direction through various aggregation algorithms or heuristics.
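One published heuristic of this kind is PCGrad ("gradient surgery", Yu et al.): when two task gradients conflict (negative dot product), one is projected onto the normal plane of the other before the gradients are summed. A minimal sketch, not the reference implementation:

```python
import numpy as np

def pcgrad_combine(grads):
    """Combine per-task gradients in the spirit of PCGrad: when two task
    gradients conflict (negative dot product), project one onto the
    normal plane of the other before summing."""
    projected = [g.astype(float).copy() for g in grads]
    for i, g_i in enumerate(projected):
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            dot = g_i @ g_j
            if dot < 0:  # conflict: remove the opposing component
                g_i -= (dot / (g_j @ g_j)) * g_j
    return sum(projected)

# Two conflicting task gradients.
g1 = np.array([1.0, 0.0])
g2 = np.array([-1.0, 1.0])
update = pcgrad_combine([g1, g2])   # -> array([0.5, 1.5])
```

The resulting joint update no longer opposes either task's own gradient, which is the property such aggregation heuristics aim for.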
The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel).
In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below.
Given data {(xᵢ, yᵢ)}ᵢ₌₁ⁿ with xᵢ ∈ X and yᵢ ∈ Rᵀ (one component per task), the learning problem is the regularized empirical risk minimization

    min_{f ∈ H} (1/n) Σᵢ L(yᵢ, f(xᵢ)) + λ‖f‖²_H    (1)

over an RKHSvv H with matrix-valued reproducing kernel Γ : X × X → R^{T×T}, and the following reproducing property holds:

    ⟨f(x), c⟩_{Rᵀ} = ⟨f, Γ(x, ·)c⟩_H    for all f ∈ H, x ∈ X, c ∈ Rᵀ.

The reproducing kernel gives rise to a representer theorem showing that any solution to equation 1 has the form

    f(x) = Σᵢ Γ(xᵢ, x) cᵢ.

The form of the kernel Γ induces both the representation of the feature space and the structure of the output across tasks.
A separable kernel factorizes as Γ(x, x′) = k(x, x′) A, where k is a scalar reproducing kernel on the inputs and A is a symmetric positive semi-definite T × T matrix encoding the relations among tasks. This factorization property, separability, implies that the input feature space representation does not vary by task.
With the separable kernel, equation 1 can be rewritten as

    min_{C ∈ R^{n×T}} V(Y, KCA) + λ tr(KCACᵀ),    (P)

where K is the n × n scalar kernel matrix with entries k(xᵢ, xⱼ), C collects the coefficients cᵢ as rows, and V is a (weighted) average of L applied entry-wise to Y and KCA.
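With squared loss, this rewritten problem reduces to a linear system through the Kronecker structure of the separable kernel. The following is a minimal numpy sketch on synthetic data; the sizes, the linear-plus-identity kernel, and the random task matrix are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy separable-kernel MTL instance with squared loss: the kernel
# Gamma(x, x') = k(x, x') A gives block Gram matrix G = kron(K, A),
# and ridge-style minimization is solved by one linear system.
n, T, d = 15, 3, 4
X = rng.normal(size=(n, d))
K = X @ X.T + np.eye(n)                  # scalar kernel matrix (positive definite)
M = rng.normal(size=(T, T))
A = M @ M.T + np.eye(T)                  # task-structure matrix, symmetric PSD
Y = rng.normal(size=(n, T))

lam = 1e-8
G = np.kron(K, A)                        # blocks G_ij = k(x_i, x_j) A
c = np.linalg.solve(G + lam * n * np.eye(n * T), Y.reshape(-1))
C = c.reshape(n, T)                      # coefficient matrix (rows c_i)

Y_hat = K @ C @ A                        # fitted outputs; row i is f(x_i)
```

With this tiny regularization the fit essentially interpolates the training outputs; larger λ trades fit for smoothness across both inputs and tasks.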
The scalar kernel k can additionally map inputs to a higher dimensional space to encode complex structures such as trees, graphs and strings.
Learning the task structure matrix A together with the coefficients leads to the problem

    min_{C ∈ R^{n×T}, A ∈ S₊ᵀ} V(Y, KCA) + λ tr(KCACᵀ) + (λ/γ) F(A),    (Q)

where F penalizes the complexity of the task structure A. Spectral penalties: Dinuzzo et al.[17] suggested setting F as the Frobenius norm √(tr(AᵀA)). They optimized Q directly using block coordinate descent, not accounting for difficulties at the boundary of the positive semi-definite cone.
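A block coordinate descent of this flavor can be sketched as follows: an exact solve in C (for squared loss) alternated with a projected gradient step in A, kept in the PSD cone by eigenvalue clipping. This is a generic sketch, not Dinuzzo et al.'s implementation; for smoothness it uses the squared Frobenius norm as the penalty F, and all sizes, penalties, and the step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Alternating minimization for a Q-style objective with squared loss and
# a squared-Frobenius penalty on the task structure A.
n, T, d = 20, 3, 5
X = rng.normal(size=(n, d))
K = X @ X.T + np.eye(n)
Y = rng.normal(size=(n, T))
lam, gam, step = 0.1, 1.0, 1e-4

def loss(C, A):
    return (np.linalg.norm(Y - K @ C @ A) ** 2 / n
            + lam * np.trace(K @ C @ A @ C.T)
            + (lam / gam) * np.linalg.norm(A) ** 2)

A = np.eye(T)
history = []
for _ in range(5):
    # C-step: exact minimizer for fixed A via the Kronecker linear system.
    G = np.kron(K, A)
    C = np.linalg.solve(G + lam * n * np.eye(n * T), Y.reshape(-1)).reshape(n, T)
    history.append(loss(C, A))
    # A-step: gradient step on the (convex in A) objective, then
    # projection onto the PSD cone by clipping negative eigenvalues.
    Mat = K @ C
    grad = (2 / n) * Mat.T @ (Mat @ A - Y) + lam * C.T @ K @ C + 2 * (lam / gam) * A
    A = A - step * grad
    A = (A + A.T) / 2
    w, V = np.linalg.eigh(A)
    A = (V * np.clip(w, 0, None)) @ V.T
```

The eigenvalue clipping is where the boundary difficulty shows up: iterates can sit on the rim of the cone, where A becomes singular, which is the regime the barrier analysis mentioned below is designed to handle.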
However, these penalties are not convex, and the analysis of the barrier method proposed by Ciliberto et al. does not go through in these cases.