diff --git a/README.md b/README.md index 13a8f30..12924e3 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ A brief introduction into [BioJava](https://www.biojava.org). The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava. This tutorial is still under development, hence not yet comprehensive for the entire library. Please also check other sources of [documentation](https://biojava.org/wiki/Documentation). -The examples within the tutorial are intended to work with the most recent version of BioJava, although most examples will work with BioJava 3.0 and higher. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems. +The examples within the tutorial are intended to work with the most recent version of BioJava. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems. The tutorial is subdivided into several books, corresponding to the respective BioJava modules. Each book is further subdivided into several chapters that intend to describe the main functionality of the module in order of increasing complexity. @@ -24,7 +24,7 @@ Book 4: [The Genomics Module](genomics/README.md), working with genomic data. Book 5: [The Protein-Disorder Module](protein-disorder/README.md), predicting protein-disorder. 
-Book 6: [The ModFinder Module](modfinder/README.md), identifying potein modifications in 3D structures +Book 6: [The ModFinder Module](modfinder/README.md), identifying protein modifications in 3D structures ## License diff --git a/alignment/smithwaterman.md b/alignment/smithwaterman.md index 0f38bf6..5de8acf 100644 --- a/alignment/smithwaterman.md +++ b/alignment/smithwaterman.md @@ -36,7 +36,7 @@ public static void main(String[] args) throws Exception { } private static ProteinSequence getSequenceForId(String uniProtId) throws Exception { - URL uniprotFasta = new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtId)); + URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId)); ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId); System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader()); System.out.println(); diff --git a/core/README.md b/core/README.md index 0fd20de..7995c81 100644 --- a/core/README.md +++ b/core/README.md @@ -34,7 +34,7 @@ Chapter 4 - [Translating](translating.md) DNA and protein sequences. ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). ## Please Cite diff --git a/core/readwrite.md b/core/readwrite.md index 1ab278b..432a419 100644 --- a/core/readwrite.md +++ b/core/readwrite.md @@ -13,7 +13,7 @@ Here an example that parses a UniProt FASTA file into a protein sequence. 
```java public static ProteinSequence getSequenceForId(String uniProtId) throws Exception { - URL uniprotFasta = new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtId)); + URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId)); ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId); System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader()); System.out.println(); @@ -79,6 +79,27 @@ BioJava can also be used to parse large FASTA files. The example below can parse } ``` +BioJava can also process large FASTA files using the Java streams API. + +```java + FastaStreamer + .from(path) + .stream() + .forEach(sequence -> System.out.printf("%s -> %ss\n", sequence.getOriginalHeader(), sequence.getSequenceAsString())); +``` + +If you need to specify a header parser other than `GenericFastaHeaderParser` or a sequence creator other than a +`ProteinSequenceCreator`, these can be specified before streaming the contents as follows: + +```java + FastaStreamer + .from(path) + .withHeaderParser(new PlainFastaHeaderParser<>()) + .withSequenceCreator(new CasePreservingProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())) + .stream() + .forEach(sequence -> System.out.printf("%s -> %ss\n", sequence.getOriginalHeader(), sequence.getSequenceAsString())); +``` + diff --git a/core/translating.md b/core/translating.md index 9b83643..10b953a 100644 --- a/core/translating.md +++ b/core/translating.md @@ -63,7 +63,7 @@ An example for how to parse a sequence from a String and using the Translation e // define the Ambiguity Compound Sets AmbiguityDNACompoundSet ambiguityDNACompoundSet = AmbiguityDNACompoundSet.getDNACompoundSet(); - CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getDNACompoundSet(); + CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getRNACompoundSet(); FastaReader proxy = 
new FastaReader( diff --git a/genomics/README.md b/genomics/README.md index 37c44d8..a7ff27e 100644 --- a/genomics/README.md +++ b/genomics/README.md @@ -41,7 +41,7 @@ Chapter 6 - Reading genomic DNA sequences using UCSC's [.2bit file format](twobi ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). ## Please Cite diff --git a/modfinder/README.md b/modfinder/README.md index 4226061..ec8ed8c 100644 --- a/modfinder/README.md +++ b/modfinder/README.md @@ -29,7 +29,7 @@ Chapter 4 - [How to define a new protein modification](add-protein-modification. ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). ## Please Cite diff --git a/protein-disorder/README.md b/protein-disorder/README.md index 888ae04..7bee8c3 100644 --- a/protein-disorder/README.md +++ b/protein-disorder/README.md @@ -94,7 +94,7 @@ Map ranges = Jronn.getDisorder(sequences); ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). 
## Please Cite diff --git a/structure/README.md b/structure/README.md index 03af437..9552ebc 100644 --- a/structure/README.md +++ b/structure/README.md @@ -66,7 +66,7 @@ Chapter 18 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md) ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). ## Please Cite diff --git a/structure/alignment.md b/structure/alignment.md index 4f11c54..6053e4a 100644 --- a/structure/alignment.md +++ b/structure/alignment.md @@ -20,12 +20,12 @@ acid sequences converge on a common tertiary structure. A **structural alignment** of other biological polymers can also be made in BioJava. For example, nucleic acids can be structurally aligned to find common structural motifs, -independent of sequence simililarity. This is specially important for RNAs, because their +independent of sequence similarity. This is specially important for RNAs, because their 3D structure arrangement is important for their function. For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment). -## Alignment Algorithms supported by BioJava +## Alignment Algorithms Supported by BioJava BioJava comes with a number of algorithms for aligning structures. The following five options are displayed by default in the graphical user interface (GUI), @@ -45,9 +45,9 @@ in 3D. See below for descriptions of the algorithms. Since BioJava version 4.1.0, multiple structures can be compared at the same time in a **multiple structure alignment**, that can later be visualized in Jmol. The algorithm is described in detail below. As an overview, it uses any pairwise alignment -algorithm and a **reference** structure to per perform an alignment of all the structures. 
+algorithm and a **reference** structure to perform an alignment of all the structures. Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among -all the strucutures, identifying conserved **structural motifs**. +all the structures, identifying conserved **structural motifs**. ## Alignment User Interface @@ -91,7 +91,7 @@ This code shows the following user interface: ![Multiple Alignment GUI](img/multiple_gui.png) The input format is a free text field, where the structure identifiers are -indidcated, space separated. A **structure identifier** is a String that +indicated, space separated. A **structure identifier** is a String that uniquely identifies a structure. It is basically composed of the pdbID, the chain letters and the ranges of residues of each chain. For the formal description visit [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html). @@ -125,12 +125,12 @@ The Combinatorial Extension (CE) algorithm was originally developed by 1998](http://peds.oxfordjournals.org/content/11/9/739.short) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/9796821). It works by identifying segments of the two structures with similar local structure, and then combining those to try to align the most residues possible -while keeping the overall RMSD of the superposition low. +while keeping the overall root-mean-square deviation (RMSD) of the superposition low. CE is a rigid-body alignment algorithm, which means that the structures being compared are kept fixed during superposition. 
In some cases it may be desirable to break large proteins up into domains prior to aligning them (by manually -inputing a subrange, using the [SCOP or CATH databases](externaldb.md), or by +inputting a subrange, using the [SCOP or CATH databases](externaldb.md), or by decomposing the protein automatically using the [Protein Domain Parser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/domain/LocalProteinDomainParser.html) algorithm). @@ -146,10 +146,8 @@ to the C-terminal part of the other, and vice versa. CE-CP allows circularly permuted proteins to be compared. For more information on circular permutations, see the [Wikipedia](http://en.wikipedia.org/wiki/Circular_permutation_in_proteins) or -[Molecule of the Month] -(http://www.pdb.org/pdb/101/motm.do?momID=124&evtc=Suggest&evta=Moleculeof%20the%20Month&evtl=TopBar) -articles [![pubmed] -(http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628). +[Molecule of the Month](https://pdb101.rcsb.org/motm/124) +articles [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628). For proteins without a circular permutation, CE-CP results look very similar to @@ -173,8 +171,7 @@ It performs similarly to CE for most structures. The 'rigid' flavor uses a rigid-body superposition and only considers alignments with matching sequence order. -BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid] -(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html) +BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html) ### FATCAT - flexible @@ -186,11 +183,9 @@ calmodulin with and without calcium bound can be much better aligned with FATCAT-flexible than with one of the rigid alignment algorithms. 
The downside of this is that it can lead to additional false positives in unrelated structures. -![(Left) Rigid and (Right) flexible alignments of -calmodulin](img/1cfd_1cll_fatcat.png) +![(Left) Rigid and (Right) flexible alignments of calmodulin](img/1cfd_1cll_fatcat.png) -BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible] -(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html) +BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html) ### Smith-Waterman @@ -204,8 +199,7 @@ locating gaps can lead to high RMSD in the resulting superposition due to a small number of badly aligned residues. However, this method is faster than the structure-based methods. -BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain] -(http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html) +BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html) ### Other methods @@ -250,43 +244,7 @@ by the pairwise alignment algorithm limitations. The algorithm performs similarly to other multiple structure alignment algorithms for most protein families. The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific use case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing any structure alignment. -BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain] -(www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html) - -## PDB-wide Database Searches - -The Alignment GUI also provides functionality for PDB-wide structural searches. 
-This systematically compares a structure against a non-redundant set of all -other structures in the PDB at either a chain or a domain level. Representatives -are selected using the RCSB's clustering of proteins with 40% sequence identity, -as described -[here](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp). -Domains are selected using either SCOP (when available) or the -ProteinDomainParser algorithm. - -![Database Search GUI](img/database_search.png) - -To perform a database search, select the 'Database Search' tab, then choose a -query structure based on PDB ID, SCOP domain id, or from a custom file. The -output directory will be used to store results. These consist of individual -alignments in compressed XML format, as well as a tab-delimited file of -similarity scores and statistics. The statistics are displayed in an interactive -results table, which allows the alignments to be sorted. The 'Align' column -allows individual alignments to be visualized with the alignment GUI. - -![Database Search Results](img/database_search_results.png) - -Be aware that this process can be very time consuming. Before -starting a manual search, it is worth considering whether a pre-computed result -may be available online, for instance for -[FATCAT-rigid](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp) -or [DALI](http://ekhidna.biocenter.helsinki.fi/dali/start). For custom files or -specific domains, a few optimizations can reduce the time for a database search. -Downloading PDB files is a considerable bottleneck. This can be solved by -downloading all PDB files from the [FTP -server](ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) and setting -the `PDB_DIR` environmental variable. This operation sped up the search from -about 30 hours to less than 4 hours. 
+BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html) ## Creating Alignments Programmatically @@ -363,8 +321,7 @@ MultipleAlignmentJmolDisplay.display(result); Many of the alignment algorithms are available in the form of command line tools. These can be accessed through the main methods of the StructureAlignment -classes. Tar bundles are also available with scripts for running -[CE and FATCAT](http://source.rcsb.org/jfatcatserver/download.jsp). +classes. Example: ```bash @@ -378,7 +335,7 @@ file in various formats. ## Alignment Data Model -For details about the structure alignment data models in biojava, see [Structure Alignment Data Model](alignment-data-model.md) +For details about the structure alignment data models in BioJava, see [Structure Alignment Data Model](alignment-data-model.md) ## Acknowledgements diff --git a/structure/bioassembly.md b/structure/bioassembly.md index ab667e5..de2c2c5 100644 --- a/structure/bioassembly.md +++ b/structure/bioassembly.md @@ -153,7 +153,7 @@ List bioAssemblies = StructureIO.getBiologicalAssemblies(pdbId); ## Further Reading -The RCSB PDB web site has a great [tutorial on Biological Assemblies](http://www.rcsb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/bioassembly_tutorial.html). +The RCSB PDB web site has a great [tutorial on Biological Assemblies](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies). diff --git a/structure/caching.md b/structure/caching.md index e2da072..7be2be1 100644 --- a/structure/caching.md +++ b/structure/caching.md @@ -53,10 +53,8 @@ This example turns on the use of chemical components when loading a `Structure`. 
AtomCache cache = new AtomCache(); cache.setPath("/tmp/"); - + FileParsingParameters params = cache.getFileParsingParams(); - - params.setLoadChemCompInfo(true); StructureIO.setAtomCache(cache); diff --git a/structure/contact-map.md b/structure/contact-map.md index 57b6818..bb9236d 100644 --- a/structure/contact-map.md +++ b/structure/contact-map.md @@ -9,7 +9,7 @@ Contacts are a useful tool to analyse protein structures. They simplify the 3-Di ## Getting the contact map of a protein chain -This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT): +This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT): ```java AtomCache cache = new AtomCache(); @@ -51,7 +51,7 @@ One can also find the contacting atoms between two protein chains. For instance ``` -See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above. +See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above. 
diff --git a/structure/crystal-contacts.md b/structure/crystal-contacts.md index cf1fcbe..f610610 100644 --- a/structure/crystal-contacts.md +++ b/structure/crystal-contacts.md @@ -11,7 +11,7 @@ Looking at crystal contacts can also be important in order to assess the quality ## Getting the set of unique contacts in the crystal lattice -This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT): +This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT): ```java AtomCache cache = new AtomCache(); @@ -42,7 +42,7 @@ The algorithm to find all unique interfaces in the crystal works roughly like th + Searches all cells around the original one by applying crystal translations, if any 2 chains in that search is found to contact then the new contact is added to the final list. + The search is performend without repeating redundant symmetry operators, making sure that if a contact is found then it is a unique contact. -See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above. +See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above. ## Clustering the interfaces One can also cluster the interfaces based on their similarity. The similarity is measured through contact overlap: number of common contacts over average number of contact in both chains. 
The clustering can be done as following: diff --git a/structure/firststeps.md b/structure/firststeps.md index 8effe51..ef13be2 100644 --- a/structure/firststeps.md +++ b/structure/firststeps.md @@ -6,14 +6,10 @@ First Steps The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. ```java - public static void main(String[] args){ - try { - Structure structure = StructureIO.getStructure("4HHB"); - // and let's print out how many atoms are in this structure - System.out.println(StructureTools.getNrAtoms(structure)); - } catch (Exception e){ - e.printStackTrace(); - } + public static void main(String[] args) throws Exception { + Structure structure = StructureIO.getStructure("4HHB"); + // and let's print out how many atoms are in this structure + System.out.println(StructureTools.getNrAtoms(structure)); } ``` @@ -53,23 +49,17 @@ Talking about startup properties, it is also good to mention the fact that many If you have the *biojava-structure-gui* module installed, you can quickly visualise a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) via this: ```java - public static void main(String[] args){ - try { - - Structure struc = StructureIO.getStructure("4hhb"); - - StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol(); - - jmolPanel.setStructure(struc); - - // send some commands to Jmol - jmolPanel.evalString("select * ; color chain;"); - jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; "); - jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;"); - - } catch (Exception e){ - e.printStackTrace(); - } + public static void main(String[] args) throws Exception { + Structure struc = StructureIO.getStructure("4hhb"); + + StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol(); + + jmolPanel.setStructure(struc); + + // send some commands to Jmol + 
jmolPanel.evalString("select * ; color chain;"); + jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; "); + jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;"); } ``` @@ -91,15 +81,10 @@ This will result in the following view: By default many people work with the *asymmetric unit* of a protein. However for many studies the correct representation to look at is the *biological assembly* of a protein. You can request it by calling ```java - public static void main(String[] args){ - - try { - Structure structure = StructureIO.getBiologicalAssembly("1GAV"); - // and let's print out how many atoms are in this structure - System.out.println(StructureTools.getNrAtoms(structure)); - } catch (Exception e){ - e.printStackTrace(); - } + public static void main(String[] args) throws Exception { + Structure structure = StructureIO.getBiologicalAssembly("1GAV"); + // and let's print out how many atoms are in this structure + System.out.println(StructureTools.getNrAtoms(structure)); } ``` diff --git a/structure/mmcif.md b/structure/mmcif.md index 230488e..769b851 100644 --- a/structure/mmcif.md +++ b/structure/mmcif.md @@ -12,12 +12,15 @@ The mmCIF file format has been around for some time (see [Westbrook 2000][] and ## The Basics -BioJava provides you with both a mmCIF parser and a data model that reads PDB and mmCIF files into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](mmcif.md)). If you don't want to use that data model, you can still use BioJava's file parsers, and more on that later, let's start first with the most basic way of loading a protein structure. +BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then has its own data model that reads PDB and mmCIF files +into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](chemcomp.md)). 
+If you don't want to use that data model, you can still use the CIFTools-java parser; please refer to its documentation. +Let's start first with the most basic way of loading a protein structure. ## First Steps -The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. +The simplest way to load a PDBx/mmCIF file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. ```java Structure structure = StructureIO.getStructure("4HHB"); @@ -25,9 +28,7 @@ The simplest way to load a PDB file is by using the [StructureIO](http://www.bio System.out.println(StructureTools.getNrAtoms(structure)); ``` - - -BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things: +BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things: + BioJava can automatically download and install files locally + BioJava by default writes those files into a temporary location (The system temp directory "java.io.tempdir"). @@ -38,14 +39,16 @@ If you already have a local PDB installation, you can configure where BioJava sh -DPDB_DIR=/wherever/you/want/ -## From PDB to mmCIF +## Switching AtomCache to use different file types -By default BioJava is using the PDB file format for parsing data. In order to switch it to use mmCIF, we can take control over the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations. +By default BioJava is using the BCIF file format for parsing data. 
In order to switch it to use mmCIF, we can take control over +the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which +manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations. ```java AtomCache cache = new AtomCache(); - - cache.setUseMmCif(true); + + cache.setFiletype(StructureFiletype.CIF); // if you struggled to set the PDB_DIR property correctly in the previous step, // you could set it manually like this: @@ -59,7 +62,7 @@ By default BioJava is using the PDB file format for parsing data. In order to sw System.out.println(structure.getChains().size()); ``` -As you can see, the AtomCache will again download the missing mmCIF file for 4HHB in the background. +For other supported file types, see the `StructureFiletype` enum. ## URL based parsing of files @@ -67,13 +70,8 @@ StructureIO can also access files via URLs and fetch the data dynamically. E.g. ```java String u = "http://ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/divided/nw/4nwr-assembly1.cif.gz"; - try { - Structure s = StructureIO.getStructure(u); - - System.out.println(s); - } catch (Exception e) { - e.printStackTrace(); - } + Structure s = StructureIO.getStructure(u); + System.out.println(s); ``` ### Local URLs @@ -86,34 +84,12 @@ BioJava can also access local files, by specifying the URL as ## Low Level Access -If you want to learn how to use the BioJava mmCIF parser to populate your own data structure, let's first take a look this lower-level code: +You can load a BioJava `Structure` object using the ciftools-java parser with: ```java InputStream inStream = new FileInputStream(fileName); - - MMcifParser parser = new SimpleMMcifParser(); - - SimpleMMcifConsumer consumer = new SimpleMMcifConsumer(); - - // The Consumer builds up the BioJava - structure object. - // you could also hook in your own and build up you own data model. 
- parser.addMMcifConsumer(consumer); - - try { - parser.parse(new BufferedReader(new InputStreamReader(inStream))); - } catch (IOException e){ - e.printStackTrace(); - } - // now get the protein structure. - Structure cifStructure = consumer.getStructure(); -``` - -The parser operates similar to a XML parser by triggering "events". The [SimpleMMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifConsumer.html) listens to new categories being read from the file and then builds up the BioJava data model. - -To re-use the parser for your own datamodel, just implement the [MMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/MMcifConsumer.html) interface and add it to the [SimpleMMcifParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifParser.html). -```java - parser.addMMcifConsumer(myOwnConsumerImplementation); + Structure cifStructure = CifStructureConverter.fromInputStream(inStream); ``` ## I Loaded a Structure Object, What Now? diff --git a/structure/secstruc.md b/structure/secstruc.md index 7216d84..fbd0f94 100644 --- a/structure/secstruc.md +++ b/structure/secstruc.md @@ -10,8 +10,8 @@ Secondary structure can be formally defined by the pattern of hydrogen bonds of More specifically, the secondary structure is defined by the patterns of hydrogen bonds formed between amine hydrogen (-NH) and carbonyl oxygen (C=O) atoms contained in the backbone peptide bonds of the protein. -For more info see the Wikipedia article on [protein secondary structure] -(https://en.wikipedia.org/wiki/Protein_secondary_structure). +For more info see the Wikipedia article +on [protein secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure). ## Secondary Structure Annotation @@ -106,8 +106,8 @@ input Structure overriding any previous annotation, like in the DSSPParser. 
An e ssp.calculate(s, true); //true assigns the SS to the Structure ``` -BioJava Class: [org.biojava.nbio.structure.secstruc.SecStrucCalc] -(http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html) +BioJava Class: +[org.biojava.nbio.structure.secstruc.SecStrucCalc](http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html) ### Storage and Data Structures diff --git a/structure/seqres.md b/structure/seqres.md index db64971..2d03e04 100644 --- a/structure/seqres.md +++ b/structure/seqres.md @@ -5,12 +5,11 @@ How molecular sequences are linked to experimentally observed atoms. ## Sequences and Atoms -In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB oftein contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments). +In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB often contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments). -Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt. +Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of the regions that have been observed in an experiment and are available in the PDB, mapped to UniProt. 
-![Screenshot of Protein Feature View at RCSB] -(https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)") +![Screenshot of Protein Feature View at RCSB](https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)") As you can see, there are three PDB entries (PDB IDs [3LOH](http://www.rcsb.org/pdb/explore.do?structureId=3LOH), [2HR7](http://www.rcsb.org/pdb/explore.do?structureId=2RH7), [3BU3](http://www.rcsb.org/pdb/explore.do?structureId=3BU3)) that cover different regions of the UniProt sequence for the insulin receptor. @@ -18,7 +17,7 @@ The blue-boxes are regions for which atoms records are available. For the grey r ## Seqres and Atom Records -The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequences as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure. +The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequence as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure. The **Atom** records provide coordinates where it was possible to observe them. diff --git a/structure/structure-data-model.md b/structure/structure-data-model.md index c8db2c0..6ea6ce4 100644 --- a/structure/structure-data-model.md +++ b/structure/structure-data-model.md @@ -28,7 +28,7 @@ Structure All `Structure` objects contain one or more `Models`. That means also X-ray structures contain a "virtual" model which serves as a container for the chains. This allows to represent multi-model X-ray structures, e.g. from time-series analysis. 
The most common way to access chains is via: ```java - List chains = structure.getChains(); + List chains = structure.getChains(); ``` This works for both NMR and X-ray based structures and by default the first `Model` is getting accessed. @@ -58,7 +58,7 @@ Here an example that loops over the whole data model and prints out the HEM grou for (Chain c : chains) { - System.out.println(" Chain: " + c.getChainID() + " # groups with atoms: " + c.getAtomGroups().size()); + System.out.println(" Chain: " + c.getId() + " # groups with atoms: " + c.getAtomGroups().size()); for (Group g: c.getAtomGroups()){ @@ -87,24 +87,24 @@ The [Group](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Group.htm In order to get all amino acids that have been observed in a PDB chain, you can use the following utility method: ```java - Chain chain = s.getChainByPDB("A"); - List groups = chain.getAtomGroups("amino"); + Chain chain = structure.getPolyChainByPDB("A"); + List groups = chain.getAtomGroups(GroupType.AMINOACID); for (Group group : groups) { - AminoAcid aa = (AminoAcid) group; + SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC); - // do something amino acid specific, e.g. 
print the secondary structure assignment - System.out.println(aa + " " + aa.getSecStruc()); + // print the secondary structure assignment + System.out.println(group + " -- " + secStrucInfo); } ``` In a similar way you can access all nucleotide groups by ```java - chain.getAtomGroups("nucleotide"); + chain.getAtomGroups(GroupType.NUCLEOTIDE); ``` The Hetatom groups are access in a similar fashion: ```java - chain.getAtomGroups("hetatm"); + chain.getAtomGroups(GroupType.HETATM); ``` @@ -112,10 +112,10 @@ Since all 3 types of groups are implementing the Group interface, you can also i ```java List allgroups = chain.getAtomGroups(); - for (Group group : groups) { - if ( group instanceof AminoAcid) { - AminoAcid aa = (AminoAcid) group; - System.out.println(aa.getSecStruc()); + for (Group group : allgroups) { + if (group.isAminoAcid()) { + SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC); + System.out.println(group + " -- " + secStrucInfo); } } ```
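Note on the last hunks of `structure/structure-data-model.md`: they replace string arguments such as `"amino"` with `GroupType` constants and the `instanceof AminoAcid` test with `group.isAminoAcid()`. The sketch below illustrates why an enum-typed filter is harder to misuse than a string-typed one. It is a minimal, self-contained illustration under stated assumptions: `GroupFilterSketch`, its nested `Group` and `GroupType`, and `groupsOfType` are hypothetical stand-ins written for this note, not the BioJava classes, and nothing here claims to mirror how BioJava implements `getAtomGroups`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types for illustration only; NOT the BioJava API.
public class GroupFilterSketch {

    // Closed set of group kinds: a typo such as "amnio" cannot compile here,
    // whereas a string argument would only fail (or silently match nothing) at run time.
    public enum GroupType { AMINOACID, NUCLEOTIDE, HETATM }

    // Minimal stand-in for a residue/ligand entry in a chain.
    public static class Group {
        private final String name;
        private final GroupType type;

        public Group(String name, GroupType type) {
            this.name = name;
            this.type = type;
        }

        public String getName() { return name; }
        public GroupType getType() { return type; }

        // Convenience predicate, replacing an instanceof check on a subclass.
        public boolean isAminoAcid() { return type == GroupType.AMINOACID; }
    }

    // Returns the groups of the requested type, preserving input order.
    public static List<Group> groupsOfType(List<Group> all, GroupType wanted) {
        List<Group> result = new ArrayList<>();
        for (Group g : all) {
            if (g.getType() == wanted) {
                result.add(g);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Group> all = List.of(
                new Group("ALA", GroupType.AMINOACID),
                new Group("HEM", GroupType.HETATM),
                new Group("GLY", GroupType.AMINOACID));

        // Print the names of the amino-acid groups only.
        for (Group g : groupsOfType(all, GroupType.AMINOACID)) {
            System.out.println(g.getName());
        }
    }
}
```

With the sample data in `main`, only the two amino-acid entries are printed and the `HEM` entry is skipped. The same idea applies to the `isAminoAcid()` loop in the final hunk: the predicate keeps the caller independent of the concrete class behind each group.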