Machine learning in earth sciences

[7] It is difficult to apply well-known and well-described mathematical models to the natural environment; machine learning is therefore commonly a better alternative for such non-linear problems.

[9][10] A number of researchers have found that machine learning outperforms traditional statistical models in earth science, such as in characterizing forest canopy structure,[11] predicting climate-induced range shifts,[12] and delineating geologic facies.

[13] Characterizing forest canopy structure enables scientists to study vegetation response to climate change.

[14] Predicting climate-induced range shifts enables policy makers to adopt suitable conservation methods to overcome the consequences of climate change.

[10] For example, geological mapping in tropical rainforests is challenging because of the thick vegetation cover and the poorly exposed rock outcrops.

[17] Incorporation of remote sensing and machine learning approaches can provide an alternative solution to eliminate some field mapping needs.

[18] In a labelling task from that research, if one kind of dinoflagellate occurs rarely in the samples, expert ecologists commonly will not classify it correctly.

Choosing the optimal algorithm for a specific purpose can lead to a significant boost in accuracy:[19] for example, the lithological mapping of gold-bearing granite-greenstone rocks in Hutti, India, with AVIRIS-NG hyperspectral data shows a difference of more than 10% in overall accuracy between support vector machines (SVMs) and random forest.
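Overall accuracy, the metric used in this comparison, is the fraction of samples (for example, pixels) whose predicted class matches the reference class. The following is a minimal sketch in Python; the label lists are made-up illustrative values, not data from the Hutti study, and `overall_accuracy` is a hypothetical helper name.

```python
def overall_accuracy(reference, predicted):
    """Fraction of samples whose predicted label equals the reference label."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("label sequences must not be empty")
    matches = sum(1 for r, p in zip(reference, predicted) if r == p)
    return matches / len(reference)


# Hypothetical reference labels and two hypothetical classifier outputs.
reference = ["granite", "greenstone", "granite", "schist", "greenstone"]
model_a = ["granite", "greenstone", "granite", "schist", "granite"]
model_b = ["granite", "schist", "greenstone", "schist", "granite"]

acc_a = overall_accuracy(reference, model_a)
acc_b = overall_accuracy(reference, model_b)
print(f"model A: {acc_a:.2f}, model B: {acc_b:.2f}")
```

Comparing two algorithms this way is only meaningful when both are evaluated against the same reference labels.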

[7] In contrast, decision trees are transparent and easily understood, and the user can observe and fix the bias if any is present in such models.

[5] Entries from a flattened table of geological mapping studies (algorithms, results, input data, and locations; row boundaries are not recoverable here): Random Forest, Support Vector Machine (SVM); (1) a map generated with remote sensing data only has a 52.7% accuracy when compared to the geological map, but several new possible lithological units are identified; (2) a map generated with remote sensing data and spatial constraints has a 78.7% accuracy, but no new possible lithological units are identified; geophysical data; Morocco; frequency electromagnetic, radiometric measurements, ground gravity measurements; Liaoning Province, China; Remote Predictive Mapping (RPM); Landsat Reflectance, High-Resolution Digital Elevation Data; Northwest Territories, Canada; Random Forest.

Landslide susceptibility refers to the probability of a landslide occurring at a certain geographical location, which depends on local terrain conditions.

[31] Rock fractures can be recognized automatically by machine learning through photogrammetric analysis, even with the presence of interfering objects such as vegetation.

[32] Data augmentation was performed, increasing the training dataset size to 8704 images by flipping and random cropping.
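Flipping and random cropping can be sketched in a few lines. The example below is a minimal illustration using nested lists as a stand-in for image arrays; the function names, the toy image, and the crop size are assumptions made for the sketch, not details of the cited study.

```python
import random


def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def random_crop(image, crop_height, crop_width, rng):
    """Cut a crop_height x crop_width window at a random position."""
    height, width = len(image), len(image[0])
    if crop_height > height or crop_width > width:
        raise ValueError("crop size exceeds image size")
    # randint is inclusive on both ends, so every valid offset can be drawn.
    top = rng.randint(0, height - crop_height)
    left = rng.randint(0, width - crop_width)
    return [row[left:left + crop_width] for row in image[top:top + crop_height]]


def augment(image, crop_height, crop_width, n_crops, seed=0):
    """Return random crops of the image and of its mirrored copy."""
    rng = random.Random(seed)
    samples = []
    for source in (image, flip_horizontal(image)):
        for _ in range(n_crops):
            samples.append(random_crop(source, crop_height, crop_width, rng))
    return samples


# A toy 4 x 6 "image" whose pixel values are simply their index.
image = [[row * 6 + col for col in range(6)] for row in range(4)]
samples = augment(image, crop_height=3, crop_width=3, n_crops=4)
print(len(samples))
```

Each source image here yields eight training samples (four crops of the original and four of its mirror image), which is how a modest set of photographs can grow into a much larger training set.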

[33] Carbon dioxide leakage from a geological sequestration site can be detected indirectly with the aid of remote sensing and an unsupervised clustering algorithm such as Iterative Self-Organizing Data Analysis Technique (ISODATA).

[35] The NDRE may not be accurate for reasons such as higher chlorophyll absorption, variation in vegetation, and shadowing effects; therefore, some stressed pixels can be incorrectly classed as healthy.
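The normalized difference red edge index (NDRE) is computed per pixel from near-infrared (NIR) and red-edge (RE) reflectance as (NIR − RE) / (NIR + RE). The sketch below assumes reflectance values between 0 and 1; the 0.3 threshold and the sample reflectances are illustrative assumptions, not values from the cited study.

```python
def ndre(nir, red_edge):
    """Normalized difference red edge index for one pixel."""
    total = nir + red_edge
    if total == 0:
        return 0.0  # avoid division by zero on pixels with no signal
    return (nir - red_edge) / total


def flag_stressed(pixels, threshold=0.3):
    """Label each (nir, red_edge) pixel as 'stressed' or 'healthy'.

    The threshold is a hypothetical cut-off chosen for illustration only.
    """
    return ["stressed" if ndre(nir, edge) < threshold else "healthy"
            for nir, edge in pixels]


# Hypothetical reflectance pairs (near-infrared, red edge).
pixels = [(0.60, 0.20), (0.40, 0.30), (0.50, 0.25)]
labels = flag_stressed(pixels)
print(labels)
```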

Quantification of the water inflow in the faces of a rock tunnel was traditionally carried out by visual observation in the field, which is labour-intensive, time-consuming, and fraught with safety concerns.

[37] The classification used in this approach mostly follows the Rock Mass Rating (RMR) system but combines the damp and wet states, as they are difficult to distinguish by visual inspection alone.
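Merging the two states amounts to a simple relabelling of the training classes. In the sketch below, the five groundwater conditions are listed from general knowledge of the RMR system rather than from the text above, and the merged label name is hypothetical.

```python
# Groundwater conditions in the RMR system, from driest to wettest
# (an assumption drawn from general knowledge of RMR, not from the text).
RMR_CONDITIONS = ["completely dry", "damp", "wet", "dripping", "flowing"]

# Hypothetical merged label for the two states that are hard to tell apart.
MERGED = {"damp": "damp or wet", "wet": "damp or wet"}


def simplify(condition):
    """Map an RMR groundwater condition to the reduced class set."""
    if condition not in RMR_CONDITIONS:
        raise ValueError(f"unknown groundwater condition: {condition!r}")
    return MERGED.get(condition, condition)


# Collect the reduced classes in order, without duplicates.
classes = []
for condition in RMR_CONDITIONS:
    label = simplify(condition)
    if label not in classes:
        classes.append(label)
print(classes)
```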

[4] Exposed geological structures such as anticlines, ripple marks, and xenoliths can be identified automatically with deep learning models.

[40] False alerts can be eliminated by discriminating the earthquake waveforms from noise signals with the aid of ML methods.

The algorithm applied was a random forest trained on a set of slip events, and it performed strongly in predicting the time to failure.

[41] Real-time streamflow data is integral to decision making (e.g., evacuations, or regulation of reservoir water levels during flooding).

However, water and debris from flooding may damage stream gauges, resulting in a lack of essential real-time data.

Many machine learning algorithms, for example artificial neural networks (ANNs), are considered 'black box' approaches because clear relationships and descriptions of how the results are generated in the hidden layers are unknown.

Methods of Splitting the Dataset into Training and Testing Datasets
Training a machine learning model for landslide susceptibility mapping requires both a training and a testing dataset, so the data must be split. Two splitting methods are presented on the geologic map of the east Cumberland Gap. The method presented on the left, 'Splitting into two adjacent areas', is more useful, as the automation algorithm can then map a new area using expert-processed data from adjacent land as input. The cyan pixels show the training dataset, while the remaining pixels show the testing dataset.
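The difference between the two splitting strategies can be sketched on a small pixel grid. The second method is not named in the caption above, so a random per-pixel split is assumed here; the grid size, function names, and 50% training fraction are likewise illustrative assumptions.

```python
import random


def split_adjacent(n_rows, n_cols, train_fraction=0.5):
    """Train on the left block of columns, test on the adjacent right block."""
    split_col = int(n_cols * train_fraction)
    train = [(r, c) for r in range(n_rows) for c in range(split_col)]
    test = [(r, c) for r in range(n_rows) for c in range(split_col, n_cols)]
    return train, test


def split_random(n_rows, n_cols, train_fraction=0.5, seed=0):
    """Train on pixels drawn at random across the whole map, test on the rest."""
    pixels = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    rng = random.Random(seed)
    rng.shuffle(pixels)
    n_train = int(len(pixels) * train_fraction)
    return pixels[:n_train], pixels[n_train:]


adj_train, adj_test = split_adjacent(n_rows=4, n_cols=6)
rnd_train, rnd_test = split_random(n_rows=4, n_cols=6)
print(len(adj_train), len(adj_test), len(rnd_train), len(rnd_test))
```

With the adjacent split, every test pixel lies in terrain the model has never seen, which mirrors the practical task of mapping a new area from data on neighbouring land.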
Data Augmentation Technique
In the preparation of the dataset for rock fracture recognition, data augmentation was performed. This technique is commonly used to increase the size and variability of the training dataset. Although the randomly cropped samples and the flipped samples come from the same image, each processed sample is unique. This technique can mitigate data scarcity and model overfitting.
Effect of Colour and Greyscale Images
The figure shows an image of a fold. The image on the left is in colour, while the one on the right is in greyscale. The difference in accuracy between classifying the geological structure from colour images and from greyscale images is small.
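A colour image can be reduced to greyscale by taking a weighted sum of its red, green, and blue channels. The sketch below uses the ITU-R BT.601 luma weights, a common convention; the source does not state which conversion was used, and the four-pixel test image is made up for illustration.

```python
def to_greyscale(image):
    """Convert rows of (R, G, B) pixels to single luminance values.

    Uses the ITU-R BT.601 luma weights (0.299, 0.587, 0.114); this choice
    is an assumption, since the conversion used in the study is not given.
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in image]


# Hypothetical 2 x 2 image: white, black, pure red, pure blue.
image = [[(255, 255, 255), (0, 0, 0)],
         [(255, 0, 0), (0, 0, 255)]]
grey = to_greyscale(image)
print(grey)
```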
Black-box Operation of some Machine Learning Algorithms
In a black-box operation, a user knows only the input and the output, not the process in between. An artificial neural network (ANN) is an example of a black-box operation: the user has no way to understand the logic of the hidden layers.