BioJava

BioJava is an open-source software project dedicated to providing Java tools for processing biological data.[1][2][3]

BioJava is a set of library functions written in the programming language Java for manipulating sequences and protein structures, together with file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS), access to AceDB, dynamic programming, and simple statistical routines.

The BioJava libraries are useful for automating many routine bioinformatics tasks, such as parsing a Protein Data Bank (PDB) file or interacting with Jmol.

The BioJava project grew out of work by Thomas Down and Matthew Pocock to create an API to simplify development of Java-based bioinformatics tools.

[5] Examples of projects that fall under Bio*, apart from BioJava, are BioPython,[6] BioPerl,[7] BioRuby,[8] and EMBOSS.[9]

The package was also integrated with the RCSB PDB web application and added protein modification annotations to the sequence diagram and structure display.

[13] The project has been moved to a separate repository, BioJava-legacy, and is still maintained for minor changes and bug fixes.

BioJava 5.0.0 is the first release based on Java 8, which introduces the use of lambda functions and streaming API calls.
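To illustrate what lambda functions and stream calls look like in practice, here is a minimal sketch in plain Java 8. It is not BioJava code: the class `StreamSketch` and its methods `gcContent` and `filterByGc` are hypothetical names chosen for this example.

```java
import java.util.List;
import java.util.stream.Collectors;

// Plain Java 8 sketch; none of these names come from BioJava.
class StreamSketch {

    // Fraction of G and C bases in a DNA string (case-insensitive).
    static double gcContent(String dna) {
        if (dna.isEmpty()) {
            return 0.0;
        }
        long gc = dna.toUpperCase().chars()
                .filter(c -> c == 'G' || c == 'C') // lambda used as a predicate
                .count();
        return (double) gc / dna.length();
    }

    // Keep only the sequences whose GC content is at least minGc.
    static List<String> filterByGc(List<String> sequences, double minGc) {
        return sequences.stream()
                .filter(s -> gcContent(s) >= minGc)
                .collect(Collectors.toList());
    }
}
```

Calling `StreamSketch.filterByGc` on a list of sequences returns the subset that passes the threshold, in the original order, without an explicit loop.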

The following sections will describe several of the new modules and highlight some of the new features that are included in the latest version of BioJava.

A major change between the legacy BioJava project and BioJava3 lies in the way the framework has been designed to exploit then-new innovations in Java.

Specific classes for common sequences such as DNA and proteins have been defined in order to improve usability for biologists.

The translation engine leverages this work by allowing conversions between DNA, RNA and amino acid sequences.
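The conversions described here can be sketched in plain Java. This is an illustration of the idea, not BioJava's translation engine: the class `TranslationSketch` and its methods are hypothetical. It assumes the standard genetic code, reads only the first frame of the coding strand, and ignores a trailing incomplete codon.

```java
// Plain Java sketch; BioJava's own translation engine is not used here.
class TranslationSketch {

    // Base order used to index the codon table below.
    private static final String BASES = "TCAG";

    // Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
    private static final String AMINO_ACIDS =
            "FFLLSSSSYY**CC*W"
          + "LLLLPPPPHHQQRRRR"
          + "IIIMTTTTNNKKSSRR"
          + "VVVVAAAADDEEGGGG";

    // Transcribe the coding strand of DNA into RNA by replacing T with U.
    static String transcribe(String dna) {
        return dna.toUpperCase().replace('T', 'U');
    }

    // Translate DNA or RNA in frame 1. '*' marks a stop codon and
    // 'X' marks a codon containing a base other than A, C, G, T or U.
    static String translate(String nucleotides) {
        String seq = nucleotides.toUpperCase().replace('U', 'T');
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= seq.length(); i += 3) {
            int b1 = BASES.indexOf(seq.charAt(i));
            int b2 = BASES.indexOf(seq.charAt(i + 1));
            int b3 = BASES.indexOf(seq.charAt(i + 2));
            if (b1 < 0 || b2 < 0 || b3 < 0) {
                protein.append('X');
            } else {
                protein.append(AMINO_ACIDS.charAt(16 * b1 + 4 * b2 + b3));
            }
        }
        return protein.toString();
    }
}
```

Because `translate` normalizes U to T first, it accepts either the DNA string or the output of `transcribe`.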

Special attention has been paid to designing the storage of sequences to minimize space needs.
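One common way to reduce the space needed for DNA is to store each base in two bits instead of a full character. The sketch below shows that general technique in plain Java. It is an assumption about how compact storage can work, not a description of BioJava's storage classes, and the names `TwoBitSketch`, `pack` and `unpack` are hypothetical.

```java
// Plain Java sketch of 2-bit DNA packing; not BioJava's implementation.
class TwoBitSketch {

    // Each base maps to its index here: A=0, C=1, G=2, T=3.
    private static final String BASES = "ACGT";

    // Pack a DNA string into 2 bits per base, four bases per byte.
    static byte[] pack(String dna) {
        byte[] packed = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int code = BASES.indexOf(Character.toUpperCase(dna.charAt(i)));
            if (code < 0) {
                throw new IllegalArgumentException(
                        "Not an unambiguous DNA base: " + dna.charAt(i));
            }
            packed[i / 4] |= (byte) (code << (2 * (i % 4)));
        }
        return packed;
    }

    // Recover the first `length` bases from a packed array.
    static String unpack(byte[] packed, int length) {
        StringBuilder dna = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> (2 * (i % 4))) & 0b11;
            dna.append(BASES.charAt(code));
        }
        return dna.toString();
    }
}
```

The sequence length has to be stored alongside the packed bytes, since the last byte may be only partly used. Ambiguity codes such as N do not fit in two bits, which is why this sketch rejects them.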

This concept can be extended to handle very large genomic datasets, such as NCBI GenBank or a proprietary database.

This module contains several classes and methods that allow users to perform pairwise and multiple sequence alignment.
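As a rough illustration of what a pairwise global alignment computes, the following plain Java sketch fills a Needleman-Wunsch dynamic programming table and returns the optimal score. It does not use the BioJava alignment module. The class `GlobalAlignmentSketch`, the linear gap penalty, and the simple match/mismatch scoring are assumptions made for this example.

```java
// Plain Java sketch of Needleman-Wunsch scoring; not the BioJava API.
class GlobalAlignmentSketch {

    // Optimal global alignment score of a and b with a linear gap penalty.
    // match is added for equal characters, mismatch for unequal ones,
    // and gap for every position aligned against a gap.
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];

        // Aligning a prefix against the empty string costs one gap per character.
        for (int i = 1; i <= a.length(); i++) {
            dp[i][0] = i * gap;
        }
        for (int j = 1; j <= b.length(); j++) {
            dp[0][j] = j * gap;
        }

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diagonal = dp[i - 1][j - 1] + substitution; // align a[i-1] with b[j-1]
                int up = dp[i - 1][j] + gap;                    // a[i-1] against a gap
                int left = dp[i][j - 1] + gap;                  // b[j-1] against a gap
                dp[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        return dp[a.length()][b.length()];
    }
}
```

A full aligner would also trace back through the table to recover the aligned strings, and would typically use a substitution matrix and affine gap penalties rather than the fixed scores used here.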

Over 400 different types of protein modifications are covered, such as phosphorylation, glycosylation, disulfide bonds, and metal chelation.

New amino acid molecules and their molecular weights can also be defined using simple XML configuration files.

BioJava 3.0.5 makes use of Java's support for multithreading to improve performance by up to 3.2 times[37] on a modern quad-core machine, compared to the legacy C implementation.

Similar to BioJava, open-source software projects such as BioPerl, BioPython, and BioRuby all provide toolkits with multiple functions that make it easier to create customized pipelines or analyses.

For beginners, and for writing larger programs in the Bio domain, especially those to be shared and supported by others, Python’s clarity and brevity make it very attractive.

This window shows two proteins with IDs "4hhb.A" and "4hhb.B" aligned against each other. The code is given on the left side. This is produced using BioJava libraries, which in turn use the Jmol viewer.[4] The FATCAT[17] rigid algorithm is used here to do the alignment.

An example application using the ModFinder module and the protein structure module. Protein modifications are mapped onto the sequence and structure of ferredoxin I (PDB ID 1GAO).[33] Two possible iron–sulfur clusters are shown on the protein sequence (3Fe–4S (F3S): orange triangles/lines; 4Fe–4S (SF4): purple diamonds/lines). The 4Fe–4S cluster is displayed in the Jmol structure window above the sequence display.