Data wrangling

The "wrangler" non-technical term is often said to derive from work done by the United States Library of Congress's National Digital Information Infrastructure and Preservation Program (NDIIPP) and their program partner the Emory University Libraries based MetaArchive Partnership.

One of the first mentions of data wrangling in a scientific context was by Donald Cline during the NASA/NOAA Cold Lands Processes Experiment.

Cline also specified duties typically handled by a storage administrator when working with large amounts of data.

Such data-wrangling work can occur in areas like major research projects and the making of films with large amounts of complex computer-generated imagery.

These wrangling steps form an iterative process that should yield a clean, usable data set that can then be used for analysis.

This process is tedious but rewarding, as it allows analysts to get the information they need out of a large data set that would otherwise be unreadable.

These transformations are typically applied to distinct entities within a data set (e.g., fields, rows, columns, or data values) and can include actions such as extraction, parsing, joining, standardizing, augmenting, cleansing, consolidating, and filtering, in order to create desired wrangling outputs that can be leveraged downstream.
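As a minimal sketch of what a few of these transformations can look like in practice, the following Python snippet uses pandas; the file names and column names (orders.csv, regions.csv, order_date, customer, amount, region) are illustrative assumptions, not part of the source.

```python
import pandas as pd

# Hypothetical input files; a real project would substitute its own sources.
orders = pd.read_csv("orders.csv")    # assumed columns: order_id, customer, order_date, amount
regions = pd.read_csv("regions.csv")  # assumed columns: customer, region

# Parsing: convert a text column into proper datetime values.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")

# Standardizing: normalize inconsistent capitalization and whitespace in the join key.
orders["customer"] = orders["customer"].str.strip().str.lower()
regions["customer"] = regions["customer"].str.strip().str.lower()

# Joining (augmenting): enrich the orders with region information.
wrangled = orders.merge(regions, on="customer", how="left")

# Cleansing: drop rows whose essential values could not be parsed.
wrangled = wrangled.dropna(subset=["order_date", "amount"])

# Filtering: keep only the rows relevant to the downstream analysis.
wrangled = wrangled[wrangled["amount"] > 0]

# Consolidating: aggregate to one row per region for downstream use.
summary = wrangled.groupby("region", as_index=False)["amount"].sum()
print(summary)
```

Each comment maps one line of code to one of the actions named above; in practice these steps are reordered and repeated as the iterative process uncovers new problems in the data.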

Early prototypes of visual data wrangling tools include OpenRefine and the Stanford/Berkeley Wrangler research system;[7] the latter evolved into Trifacta.

Given a set of data that contains information on medical patients, the goal might be to find correlations associated with a disease.

This process can be beneficial for determining correlations relevant to disease diagnosis, as it reduces a vast amount of data to something that can be easily analyzed for an accurate result.
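A hedged sketch of that workflow is shown below, again in pandas: the file name patients.csv and every column name (age, bmi, blood_pressure, smoker, has_disease) are hypothetical placeholders chosen for illustration. The wrangling steps reduce the raw table to clean numeric columns, after which a simple correlation against the disease indicator becomes straightforward.

```python
import pandas as pd

# Hypothetical patient records; column names are illustrative only.
patients = pd.read_csv("patients.csv")  # assumed columns: age, bmi, blood_pressure, smoker, has_disease

# Cleansing: remove records missing the outcome or key measurements.
patients = patients.dropna(subset=["has_disease", "age", "bmi", "blood_pressure"])

# Standardizing: encode a yes/no field as 1/0 so it can enter a correlation.
patients["smoker"] = patients["smoker"].str.strip().str.lower().map({"yes": 1, "no": 0})

# Filtering: restrict to plausible measurement ranges to reduce noise.
patients = patients[patients["age"].between(0, 110) & patients["bmi"].between(10, 80)]

# With the data reduced to clean numeric columns, compute how strongly
# each variable correlates with the disease indicator.
numeric = patients[["age", "bmi", "blood_pressure", "smoker", "has_disease"]]
print(numeric.corr()["has_disease"].sort_values(ascending=False))
```

The correlation itself is a single line; the value of the wrangling lies in getting the data into a state where that line produces a trustworthy answer.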

Turning messy data into useful statistics