diff --git a/README.md b/README.md
index a97ff6e..12924e3 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,18 @@
Tutorial
===
-A brief introduction into [BioJava](https://github.com/biojava/biojava).
+A brief introduction to [BioJava](https://www.biojava.org).
-----
-The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava.
+The goal of this tutorial is to provide an educational introduction to some of the features provided by BioJava. The tutorial is still under development and does not yet cover the entire library, so please also check the other sources of [documentation](https://biojava.org/wiki/Documentation).
-At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wikis/BioJava:CookBook4.0) for a more comprehensive collection of examples about what is possible with BioJava and how to do things.
+The examples within the tutorial are intended to work with the most recent version of BioJava. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems.
-The tutorial is intended to work with the most recent version of BioJava, although most examples will work with BioJava 3.0 and higher.
+The tutorial is subdivided into several books, one for each BioJava module. Each book is further subdivided into chapters that describe the main functionality of the module in order of increasing complexity.
## Index
-Quick [Installation](installation.md)
+[Quick Installation](installation.md)
Book 1: [The Core Module](core/README.md), basic working with sequences.
@@ -24,20 +24,18 @@ Book 4: [The Genomics Module](genomics/README.md), working with genomic data.
Book 5: [The Protein-Disorder Module](protein-disorder/README.md), predicting protein-disorder.
-Book 6: [The ModFinder Module](modfinder/README.md), identifying potein modifications in 3D structures
+Book 6: [The ModFinder Module](modfinder/README.md), identifying protein modifications in 3D structures
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
-
-[view license](license.md)
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md).
## Please Cite
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/alignment/README.md b/alignment/README.md
index 3ea8858..3f093fe 100644
--- a/alignment/README.md
+++ b/alignment/README.md
@@ -36,19 +36,16 @@ Chapter 5 - Reading and writing of multiple alignments
Chapter 6 - BLAST - why you don't need BioJava for parsing BLAST
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/alignment/smithwaterman.md b/alignment/smithwaterman.md
index 0f38bf6..5de8acf 100644
--- a/alignment/smithwaterman.md
+++ b/alignment/smithwaterman.md
@@ -36,7 +36,7 @@ public static void main(String[] args) throws Exception {
}
private static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
- URL uniprotFasta = new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtId));
+ URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader());
System.out.println();
diff --git a/core/README.md b/core/README.md
index 3638712..7995c81 100644
--- a/core/README.md
+++ b/core/README.md
@@ -32,19 +32,16 @@ Chapter 3 - [Reading and Writing sequences](readwrite.md)
Chapter 4 - [Translating](translating.md) DNA and protein sequences.
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/core/readwrite.md b/core/readwrite.md
index 1ab278b..432a419 100644
--- a/core/readwrite.md
+++ b/core/readwrite.md
@@ -13,7 +13,7 @@ Here an example that parses a UniProt FASTA file into a protein sequence.
```java
public static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
- URL uniprotFasta = new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtId));
+ URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader());
System.out.println();
@@ -79,6 +79,27 @@ BioJava can also be used to parse large FASTA files. The example below can parse
}
```
+BioJava can also process large FASTA files using the Java streams API.
+
+```java
+ FastaStreamer
+ .from(path)
+ .stream()
+ .forEach(sequence -> System.out.printf("%s -> %s\n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
+```
+
+If you need a header parser other than `GenericFastaHeaderParser` or a sequence creator other than
+`ProteinSequenceCreator`, you can specify them before streaming the contents as follows:
+
+```java
+ FastaStreamer
+ .from(path)
+ .withHeaderParser(new PlainFastaHeaderParser<>())
+ .withSequenceCreator(new CasePreservingProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()))
+ .stream()
+ .forEach(sequence -> System.out.printf("%s -> %s\n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
+```
+
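To illustrate why streaming helps with large files, here is a minimal sketch in plain Java that turns FASTA text into a lazily evaluated `Stream`, so that only one record is held in memory at a time. It is not how `FastaStreamer` is implemented; the class `FastaStreamSketch` and its `Entry` type are hypothetical names used only for this example, and the parser ignores everything except `>` header lines and sequence lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical illustration class, not part of BioJava.
final class FastaStreamSketch {

    /** One FASTA record: the header without '>' and the concatenated sequence lines. */
    static final class Entry {
        final String header;
        final String sequence;

        Entry(String header, String sequence) {
            this.header = header;
            this.sequence = sequence;
        }
    }

    /** Returns a sequential stream that reads one record at a time from the reader. */
    static Stream<Entry> stream(BufferedReader reader) {
        Iterator<Entry> records = new Iterator<Entry>() {
            // Header of the next record, or null once the input is exhausted.
            private String nextHeader = readUntilHeader();

            private String readLine() {
                try {
                    return reader.readLine();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            // Skips any lines before the first '>' line.
            private String readUntilHeader() {
                String line;
                while ((line = readLine()) != null) {
                    if (line.startsWith(">")) {
                        return line.substring(1).trim();
                    }
                }
                return null;
            }

            @Override
            public boolean hasNext() {
                return nextHeader != null;
            }

            @Override
            public Entry next() {
                if (nextHeader == null) {
                    throw new NoSuchElementException();
                }
                String header = nextHeader;
                nextHeader = null;
                StringBuilder sequence = new StringBuilder();
                String line;
                // Collect sequence lines until the next header or the end of input.
                while ((line = readLine()) != null) {
                    if (line.startsWith(">")) {
                        nextHeader = line.substring(1).trim();
                        break;
                    }
                    sequence.append(line.trim());
                }
                return new Entry(header, sequence.toString());
            }
        };
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(records, Spliterator.ORDERED | Spliterator.NONNULL),
                false);
    }

    public static void main(String[] args) {
        String fasta = ">seq1 example\nMKV\nLLA\n>seq2\nGG\n";
        stream(new BufferedReader(new StringReader(fasta)))
                .forEach(e -> System.out.printf("%s -> %s%n", e.header, e.sequence));
    }
}
```

Because the stream pulls records on demand, operations such as `filter` or `limit` can stop reading early instead of loading the whole file first.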
diff --git a/core/translating.md b/core/translating.md
index 9b83643..10b953a 100644
--- a/core/translating.md
+++ b/core/translating.md
@@ -63,7 +63,7 @@ An example for how to parse a sequence from a String and using the Translation e
// define the Ambiguity Compound Sets
AmbiguityDNACompoundSet ambiguityDNACompoundSet = AmbiguityDNACompoundSet.getDNACompoundSet();
- CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getDNACompoundSet();
+ CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getRNACompoundSet();
FastaReader proxy =
new FastaReader(
diff --git a/genomics/README.md b/genomics/README.md
index d5a8470..a7ff27e 100644
--- a/genomics/README.md
+++ b/genomics/README.md
@@ -39,19 +39,16 @@ Chapter 5 - Reading [karyotype (cytoband)](karyotype.md) files
Chapter 6 - Reading genomic DNA sequences using UCSC's [.2bit file format](twobit.md)
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/modfinder/README.md b/modfinder/README.md
index 202ff31..ec8ed8c 100644
--- a/modfinder/README.md
+++ b/modfinder/README.md
@@ -27,24 +27,21 @@ Chapter 3 - [How to identify protein modifications in a structure](identify-prot
Chapter 4 - [How to define a new protein modification](add-protein-modification.md)
-## Please cite
+## License
+
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
**BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank**
*Jianjiong Gao; Andreas Prlic; Chunxiao Bi; Wolfgang F. Bluhm; Dimitris Dimitropoulos; Dong Xu; Philip E. Bourne; Peter W. Rose*
[Bioinformatics. 2017 Feb 17.](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx101)
[](https://doi.org/10.1093/bioinformatics/btx101) [](http://www.ncbi.nlm.nih.gov/pubmed/28334105)
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
-## License
-
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
-
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/protein-disorder/README.md b/protein-disorder/README.md
index 2238bb6..7bee8c3 100644
--- a/protein-disorder/README.md
+++ b/protein-disorder/README.md
@@ -92,18 +92,16 @@ Map ranges = Jronn.getDisorder(sequences);
```
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/structure/README.md b/structure/README.md
index 84df6be..9552ebc 100644
--- a/structure/README.md
+++ b/structure/README.md
@@ -64,22 +64,16 @@ Chapter 17 - [Special Cases](special.md)
Chapter 18 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md)
-### Author:
-
-[Andreas Prlić](https://github.com/andreasprlic)
-
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/structure/alignment.md b/structure/alignment.md
index 4f11c54..6053e4a 100644
--- a/structure/alignment.md
+++ b/structure/alignment.md
@@ -20,12 +20,12 @@ acid sequences converge on a common tertiary structure.
A **structural alignment** of other biological polymers can also be made in BioJava.
For example, nucleic acids can be structurally aligned to find common structural motifs,
-independent of sequence simililarity. This is specially important for RNAs, because their
+independent of sequence similarity. This is especially important for RNAs, because their
3D structure arrangement is important for their function.
For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment).
-## Alignment Algorithms supported by BioJava
+## Alignment Algorithms Supported by BioJava
BioJava comes with a number of algorithms for aligning structures. The following
five options are displayed by default in the graphical user interface (GUI),
@@ -45,9 +45,9 @@ in 3D. See below for descriptions of the algorithms.
Since BioJava version 4.1.0, multiple structures can be compared at the same time in
a **multiple structure alignment**, that can later be visualized in Jmol.
The algorithm is described in detail below. As an overview, it uses any pairwise alignment
-algorithm and a **reference** structure to per perform an alignment of all the structures.
+algorithm and a **reference** structure to perform an alignment of all the structures.
Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among
-all the strucutures, identifying conserved **structural motifs**.
+all the structures, identifying conserved **structural motifs**.
## Alignment User Interface
@@ -91,7 +91,7 @@ This code shows the following user interface:

The input format is a free text field, where the structure identifiers are
-indidcated, space separated. A **structure identifier** is a String that
+indicated, space separated. A **structure identifier** is a String that
uniquely identifies a structure. It is basically composed of the pdbID, the
chain letters and the ranges of residues of each chain. For the formal description
visit [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html).
@@ -125,12 +125,12 @@ The Combinatorial Extension (CE) algorithm was originally developed by
1998](http://peds.oxfordjournals.org/content/11/9/739.short) [](http://www.ncbi.nlm.nih.gov/pubmed/9796821).
It works by identifying segments of the two structures with similar local
structure, and then combining those to try to align the most residues possible
-while keeping the overall RMSD of the superposition low.
+while keeping the overall root-mean-square deviation (RMSD) of the superposition low.
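As a reminder of what the RMSD measures, the following plain-Java sketch computes it for two equally sized lists of 3D coordinates that are assumed to be already superposed and paired residue by residue. It does not use BioJava and is not the code that CE runs; `RmsdSketch` and its `rmsd` method are hypothetical names for this illustration only.

```java
// Hypothetical illustration class, not part of BioJava.
final class RmsdSketch {

    /**
     * Root-mean-square deviation between paired points a[i] and b[i].
     * Each point is an array {x, y, z}; the two sets must already be superposed.
     */
    static double rmsd(double[][] a, double[][] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("Both sets need the same, non-zero number of points");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < 3; k++) {
                double d = a[i][k] - b[i][k];
                sumOfSquares += d * d;
            }
        }
        // Mean of the squared distances, then the square root.
        return Math.sqrt(sumOfSquares / a.length);
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0, 0}, {1, 0, 0}};
        double[][] y = {{3, 4, 0}, {4, 4, 0}};
        // Every paired point is exactly 5 units apart, so this prints 5.0
        System.out.println(rmsd(x, y));
    }
}
```

A lower value means the aligned residues sit closer together after superposition, which is why an alignment algorithm trades off the number of aligned residues against this quantity.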
CE is a rigid-body alignment algorithm, which means that the structures being
compared are kept fixed during superposition. In some cases it may be desirable
to break large proteins up into domains prior to aligning them (by manually
-inputing a subrange, using the [SCOP or CATH databases](externaldb.md), or by
+inputting a subrange, using the [SCOP or CATH databases](externaldb.md), or by
decomposing the protein automatically using the [Protein Domain
Parser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/domain/LocalProteinDomainParser.html)
algorithm).
@@ -146,10 +146,8 @@ to the C-terminal part of the other, and vice versa. CE-CP allows circularly
permuted proteins to be compared. For more information on circular
permutations, see the
[Wikipedia](http://en.wikipedia.org/wiki/Circular_permutation_in_proteins) or
-[Molecule of the Month]
-(http://www.pdb.org/pdb/101/motm.do?momID=124&evtc=Suggest&evta=Moleculeof%20the%20Month&evtl=TopBar)
-articles [![pubmed]
-(http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628).
+[Molecule of the Month](https://pdb101.rcsb.org/motm/124)
+articles [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628).
For proteins without a circular permutation, CE-CP results look very similar to
@@ -173,8 +171,7 @@ It performs similarly to CE for most structures. The 'rigid' flavor uses a
rigid-body superposition and only considers alignments with matching sequence
order.
-BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)
+BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)
### FATCAT - flexible
@@ -186,11 +183,9 @@ calmodulin with and without calcium bound can be much better aligned with
FATCAT-flexible than with one of the rigid alignment algorithms. The downside of
this is that it can lead to additional false positives in unrelated structures.
-
+
-BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)
+BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)
### Smith-Waterman
@@ -204,8 +199,7 @@ locating gaps can lead to high RMSD in the resulting superposition due to a
small number of badly aligned residues. However, this method is faster than
the structure-based methods.
-BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain]
-(http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)
+BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)
### Other methods
@@ -250,43 +244,7 @@ by the pairwise alignment algorithm limitations.
The algorithm performs similarly to other multiple structure alignment algorithms for most protein families.
The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific use case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing any structure alignment.
-BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
-
-## PDB-wide Database Searches
-
-The Alignment GUI also provides functionality for PDB-wide structural searches.
-This systematically compares a structure against a non-redundant set of all
-other structures in the PDB at either a chain or a domain level. Representatives
-are selected using the RCSB's clustering of proteins with 40% sequence identity,
-as described
-[here](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp).
-Domains are selected using either SCOP (when available) or the
-ProteinDomainParser algorithm.
-
-
-
-To perform a database search, select the 'Database Search' tab, then choose a
-query structure based on PDB ID, SCOP domain id, or from a custom file. The
-output directory will be used to store results. These consist of individual
-alignments in compressed XML format, as well as a tab-delimited file of
-similarity scores and statistics. The statistics are displayed in an interactive
-results table, which allows the alignments to be sorted. The 'Align' column
-allows individual alignments to be visualized with the alignment GUI.
-
-
-
-Be aware that this process can be very time consuming. Before
-starting a manual search, it is worth considering whether a pre-computed result
-may be available online, for instance for
-[FATCAT-rigid](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp)
-or [DALI](http://ekhidna.biocenter.helsinki.fi/dali/start). For custom files or
-specific domains, a few optimizations can reduce the time for a database search.
-Downloading PDB files is a considerable bottleneck. This can be solved by
-downloading all PDB files from the [FTP
-server](ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) and setting
-the `PDB_DIR` environmental variable. This operation sped up the search from
-about 30 hours to less than 4 hours.
+BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
## Creating Alignments Programmatically
@@ -363,8 +321,7 @@ MultipleAlignmentJmolDisplay.display(result);
Many of the alignment algorithms are available in the form of command line
tools. These can be accessed through the main methods of the StructureAlignment
-classes. Tar bundles are also available with scripts for running
-[CE and FATCAT](http://source.rcsb.org/jfatcatserver/download.jsp).
+classes.
Example:
```bash
@@ -378,7 +335,7 @@ file in various formats.
## Alignment Data Model
-For details about the structure alignment data models in biojava, see [Structure Alignment Data Model](alignment-data-model.md)
+For details about the structure alignment data models in BioJava, see [Structure Alignment Data Model](alignment-data-model.md).
## Acknowledgements
diff --git a/structure/bioassembly.md b/structure/bioassembly.md
index ab667e5..de2c2c5 100644
--- a/structure/bioassembly.md
+++ b/structure/bioassembly.md
@@ -153,7 +153,7 @@ List bioAssemblies = StructureIO.getBiologicalAssemblies(pdbId);
## Further Reading
-The RCSB PDB web site has a great [tutorial on Biological Assemblies](http://www.rcsb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/bioassembly_tutorial.html).
+The RCSB PDB web site has a great [tutorial on Biological Assemblies](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies).
diff --git a/structure/caching.md b/structure/caching.md
index fafec7d..7be2be1 100644
--- a/structure/caching.md
+++ b/structure/caching.md
@@ -31,6 +31,8 @@ you can configure the AtomCache by setting the PDB_DIR system property
-DPDB_DIR=/wherever/you/want/
+BioJava will also check for a `PDB_DIR` environment variable. If you launch BioJava from the command line, it can be useful to include `export PDB_DIR=/wherever/you/want` in your `.bashrc` file.
+
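The lookup described above can be pictured with a small plain-Java sketch: try the system property first, then the environment variable, then a fallback directory. This is an illustration only and not BioJava's code; `DirLookupSketch` is a hypothetical class, and the precedence of the property over the environment variable is an assumption of this example.

```java
import java.util.Map;

// Hypothetical illustration class, not part of BioJava.
final class DirLookupSketch {

    /**
     * Resolves a directory setting: system property first (assumed precedence),
     * then the given environment map, then the fallback.
     */
    static String resolve(String name, Map<String, String> env, String fallback) {
        String fromProperty = System.getProperty(name);
        if (fromProperty != null && !fromProperty.trim().isEmpty()) {
            return fromProperty;
        }
        String fromEnv = env.get(name);
        if (fromEnv != null && !fromEnv.trim().isEmpty()) {
            return fromEnv;
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Uses the real process environment and the JVM temp directory as fallback.
        String dir = resolve("PDB_DIR", System.getenv(), System.getProperty("java.io.tmpdir"));
        System.out.println("Structure files would be stored under: " + dir);
    }
}
```

Passing the environment as a `Map` keeps the method easy to test, since the real process environment cannot be modified from within Java.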
An alternative is to hard-code the path in this way (but setting it as a property is better style)
```java
@@ -51,10 +53,8 @@ This example turns on the use of chemical components when loading a `Structure`.
AtomCache cache = new AtomCache();
cache.setPath("/tmp/");
-
+
FileParsingParameters params = cache.getFileParsingParams();
-
- params.setLoadChemCompInfo(true);
StructureIO.setAtomCache(cache);
@@ -78,10 +78,7 @@ The AtomCache not only provides access to PDB, it can also fetch Structure repre
There are quite a number of external database IDs that are supported here. See the
AtomCache documentation for more details on the supported options.
-
-
-
-
+The non-PDB files can be cached at a different location by setting the `PDB_CACHE_DIR` property (with `java -DPDB_CACHE_DIR=...`) or environment variable.
diff --git a/structure/chemcomp.md b/structure/chemcomp.md
index fb4bb2a..92f7538 100644
--- a/structure/chemcomp.md
+++ b/structure/chemcomp.md
@@ -1,7 +1,7 @@
The Chemical Component Dictionary
=================================
-The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules.
+The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules.
### How Does BioJava Decide what Groups Are Amino Acids?
@@ -33,55 +33,28 @@ HOH is a group of type hetatm
As you can see, although MSE is flaged as HETATM in the PDB file, BioJava still represents it correctly as an amino acid. They key is that the [definition file for MSE](http://www.rcsb.org/pdb/files/ligand/MSE.cif) flags it as "L-PEPTIDE LINKING", which is being used by BioJava.
-
-
-
-
-
-
-
-
-
- Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary. (image source: wikipedia)
-
-
-
-
-
+Note: Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary.
### How to Access Chemical Component Definitions
-By default BioJava ships with a minimal representation of standard amino acids, which is useful when you just want to work with atoms and a basic data representation. However if you want to work with a correct representation (e.g. distinguish ligands from the polypeptide chain, correctly resolve chemically modified residues), it is good to tell the library to either
-
-1. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found), or
-2. Load all **Chemical Component Definitions** at startup (slow startup, but then no further delays later on, requires more memory)
+By default, BioJava retrieves the full chemical component definitions provided by the PDB. This ensures that the user gets a correct representation, e.g. ligands are distinguished from the polypeptide chain and chemically modified residues are resolved correctly.
-You can enable the first behaviour by doing using the [FileParsingParameters](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/FileParsingParameters.html) class:
+The behaviour is configurable by setting a property in the `ChemCompGroupFactory` singleton:
+1. Use a minimal built-in set of **Chemical Component Definitions**. This covers only the most frequent chemical components and does not guarantee a correct representation, but it is fast and does not require network access.
```java
- AtomCache cache = new AtomCache();
-
- // by default all files are stored at a temporary location.
- // you can set this either via at startup with -DPDB_DIR=/path/to/files/
- // or hard code it this way:
- cache.setPath("/tmp/");
-
- FileParsingParameters params = new FileParsingParameters();
-
- params.setLoadChemCompInfo(true);
- cache.setFileParsingParams(params);
-
- StructureIO.setAtomCache(cache);
-
- Structure structure = StructureIO.getStructure(...);
+ ChemCompGroupFactory.setChemCompProvider(new ReducedChemCompProvider());
```
-
-If you want to enable the second behaviour (slow loading of all chem comps at startup, but no further small delays later on) you can use the same code but change the behaviour by switching the [ChemCompProvider](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/ChemCompProvider.html) implementation in the [ChemCompGroupFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/ChemCompGroupFactory.html)
-
+2. Load all **Chemical Component Definitions** at startup (slow startup, but then no further delays later on, requires more memory)
```java
ChemCompGroupFactory.setChemCompProvider(new AllChemCompProvider());
```
+3. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found). Default behaviour since 4.2.0. Note that the chemical component files are cached in the local file system for subsequent uses.
+```java
+ ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider());
+```
+
diff --git a/structure/contact-map.md b/structure/contact-map.md
index 57b6818..bb9236d 100644
--- a/structure/contact-map.md
+++ b/structure/contact-map.md
@@ -9,7 +9,7 @@ Contacts are a useful tool to analyse protein structures. They simplify the 3-Di
## Getting the contact map of a protein chain
-This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT):
+This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT):
```java
AtomCache cache = new AtomCache();
@@ -51,7 +51,7 @@ One can also find the contacting atoms between two protein chains. For instance
```
-See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above.
+See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above.
diff --git a/structure/crystal-contacts.md b/structure/crystal-contacts.md
index cf1fcbe..f610610 100644
--- a/structure/crystal-contacts.md
+++ b/structure/crystal-contacts.md
@@ -11,7 +11,7 @@ Looking at crystal contacts can also be important in order to assess the quality
## Getting the set of unique contacts in the crystal lattice
-This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT):
+This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT):
```java
AtomCache cache = new AtomCache();
@@ -42,7 +42,7 @@ The algorithm to find all unique interfaces in the crystal works roughly like th
+ Searches all cells around the original one by applying crystal translations, if any 2 chains in that search is found to contact then the new contact is added to the final list.
+ The search is performed without repeating redundant symmetry operators, making sure that if a contact is found then it is a unique contact.
-See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above.
+See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above.
## Clustering the interfaces
One can also cluster the interfaces based on their similarity. The similarity is measured through contact overlap: the number of common contacts divided by the average number of contacts in both chains. The clustering can be done as follows:
diff --git a/structure/firststeps.md b/structure/firststeps.md
index 8effe51..ef13be2 100644
--- a/structure/firststeps.md
+++ b/structure/firststeps.md
@@ -6,14 +6,10 @@ First Steps
The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
```java
- public static void main(String[] args){
- try {
- Structure structure = StructureIO.getStructure("4HHB");
- // and let's print out how many atoms are in this structure
- System.out.println(StructureTools.getNrAtoms(structure));
- } catch (Exception e){
- e.printStackTrace();
- }
+ public static void main(String[] args) throws Exception {
+ Structure structure = StructureIO.getStructure("4HHB");
+ // and let's print out how many atoms are in this structure
+ System.out.println(StructureTools.getNrAtoms(structure));
}
```
@@ -53,23 +49,17 @@ Talking about startup properties, it is also good to mention the fact that many
If you have the *biojava-structure-gui* module installed, you can quickly visualise a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) via this:
```java
- public static void main(String[] args){
- try {
-
- Structure struc = StructureIO.getStructure("4hhb");
-
- StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol();
-
- jmolPanel.setStructure(struc);
-
- // send some commands to Jmol
- jmolPanel.evalString("select * ; color chain;");
- jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; ");
- jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;");
-
- } catch (Exception e){
- e.printStackTrace();
- }
+ public static void main(String[] args) throws Exception {
+ Structure struc = StructureIO.getStructure("4hhb");
+
+ StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol();
+
+ jmolPanel.setStructure(struc);
+
+ // send some commands to Jmol
+ jmolPanel.evalString("select * ; color chain;");
+ jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; ");
+ jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;");
}
```
@@ -91,15 +81,10 @@ This will result in the following view:
By default many people work with the *asymmetric unit* of a protein. However for many studies the correct representation to look at is the *biological assembly* of a protein. You can request it by calling
```java
- public static void main(String[] args){
-
- try {
- Structure structure = StructureIO.getBiologicalAssembly("1GAV");
- // and let's print out how many atoms are in this structure
- System.out.println(StructureTools.getNrAtoms(structure));
- } catch (Exception e){
- e.printStackTrace();
- }
+ public static void main(String[] args) throws Exception {
+ Structure structure = StructureIO.getBiologicalAssembly("1GAV");
+ // and let's print out how many atoms are in this structure
+ System.out.println(StructureTools.getNrAtoms(structure));
}
```
diff --git a/structure/installation.md b/structure/installation.md
index 536d764..e585df8 100644
--- a/structure/installation.md
+++ b/structure/installation.md
@@ -36,6 +36,25 @@ If you run
on your project, the BioJava dependencies will be automatically downloaded and installed for you.
+### (Optional) Configuration
+
+BioJava can be configured through several properties:
+
+| Property | Description |
+| --- | --- |
+| `PDB_DIR` | Directory for caching structure files from the PDB. Mirrors the PDB's FTP server directory structure, with `PDB_DIR` equivalent to ftp://ftp.wwpdb.org/pub/pdb/. Default: temp directory |
+| `PDB_CACHE_DIR` | Cache directory for other files related to the structure package. Default: temp directory |
+
+These can be set either as Java system properties or as environment variables. For example:
+
+```
+# This could be added to .bashrc
+export PDB_DIR=...
+# Or override for a particular execution
+java -DPDB_DIR=... -cp ...
+```
+
+Note that your IDE may ignore `.bashrc` settings, but should have a preference for passing VM arguments.
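
The lookup described above can be sketched in plain Java. This is only an illustration of the idea, not BioJava's own code: the class and method names are hypothetical, and the precedence shown (system property first, then environment variable, then the temp directory) is an assumption.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of the BioJava API.
final class PdbDirLookupSketch {

    // Returns the first non-blank value found in: system properties,
    // environment variables, then the supplied fallback.
    // The order of precedence is an assumption made for this sketch.
    static String resolve(String key, Properties props, Map<String, String> env, String fallback) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            value = env.get(key);
        }
        if (value == null || value.trim().isEmpty()) {
            value = fallback;
        }
        return value;
    }

    public static void main(String[] args) {
        String dir = resolve("PDB_DIR", System.getProperties(), System.getenv(),
                System.getProperty("java.io.tmpdir"));
        System.out.println("Structure files would be cached under: " + dir);
    }
}
```

Running it with `java -DPDB_DIR=/some/path PdbDirLookupSketch` should print the path given on the command line, which is a quick way to check that an IDE run configuration really passes the VM argument.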
diff --git a/structure/mmcif.md b/structure/mmcif.md
index 230488e..769b851 100644
--- a/structure/mmcif.md
+++ b/structure/mmcif.md
@@ -12,12 +12,15 @@ The mmCIF file format has been around for some time (see [Westbrook 2000][] and
## The Basics
-BioJava provides you with both a mmCIF parser and a data model that reads PDB and mmCIF files into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](mmcif.md)). If you don't want to use that data model, you can still use BioJava's file parsers, and more on that later, let's start first with the most basic way of loading a protein structure.
+BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then reads PDB and mmCIF files
+into its own biologically and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](chemcomp.md)).
+If you don't want to use that data model, you can still use the CIFTools-java parser directly; please refer to its documentation.
+Let's start with the most basic way of loading a protein structure.
## First Steps
-The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
+The simplest way to load a PDBx/mmCIF file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
```java
Structure structure = StructureIO.getStructure("4HHB");
@@ -25,9 +28,7 @@ The simplest way to load a PDB file is by using the [StructureIO](http://www.bio
System.out.println(StructureTools.getNrAtoms(structure));
```
-
-
-BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things:
+BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things:
+ BioJava can automatically download and install files locally
+ BioJava by default writes those files into a temporary location (the system temp directory, "java.io.tmpdir").
@@ -38,14 +39,16 @@ If you already have a local PDB installation, you can configure where BioJava sh
-DPDB_DIR=/wherever/you/want/
-## From PDB to mmCIF
+## Switching AtomCache to use different file types
-By default BioJava is using the PDB file format for parsing data. In order to switch it to use mmCIF, we can take control over the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations.
+By default BioJava is using the BCIF file format for parsing data. In order to switch it to use mmCIF, we can take control over
+the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which
+manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations.
```java
AtomCache cache = new AtomCache();
-
- cache.setUseMmCif(true);
+
+ cache.setFiletype(StructureFiletype.CIF);
// if you struggled to set the PDB_DIR property correctly in the previous step,
// you could set it manually like this:
@@ -59,7 +62,7 @@ By default BioJava is using the PDB file format for parsing data. In order to sw
System.out.println(structure.getChains().size());
```
-As you can see, the AtomCache will again download the missing mmCIF file for 4HHB in the background.
+See other supported file types in the `StructureFiletype` enum.
## URL based parsing of files
@@ -67,13 +70,8 @@ StructureIO can also access files via URLs and fetch the data dynamically. E.g.
```java
String u = "http://ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/divided/nw/4nwr-assembly1.cif.gz";
- try {
- Structure s = StructureIO.getStructure(u);
-
- System.out.println(s);
- } catch (Exception e) {
- e.printStackTrace();
- }
+ Structure s = StructureIO.getStructure(u);
+ System.out.println(s);
```
### Local URLs
@@ -86,34 +84,12 @@ BioJava can also access local files, by specifying the URL as
## Low Level Access
-If you want to learn how to use the BioJava mmCIF parser to populate your own data structure, let's first take a look this lower-level code:
+You can load a BioJava `Structure` object using the ciftools-java parser with:
```java
InputStream inStream = new FileInputStream(fileName);
-
- MMcifParser parser = new SimpleMMcifParser();
-
- SimpleMMcifConsumer consumer = new SimpleMMcifConsumer();
-
- // The Consumer builds up the BioJava - structure object.
- // you could also hook in your own and build up you own data model.
- parser.addMMcifConsumer(consumer);
-
- try {
- parser.parse(new BufferedReader(new InputStreamReader(inStream)));
- } catch (IOException e){
- e.printStackTrace();
- }
-
// now get the protein structure.
- Structure cifStructure = consumer.getStructure();
-```
-
-The parser operates similar to a XML parser by triggering "events". The [SimpleMMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifConsumer.html) listens to new categories being read from the file and then builds up the BioJava data model.
-
-To re-use the parser for your own datamodel, just implement the [MMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/MMcifConsumer.html) interface and add it to the [SimpleMMcifParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifParser.html).
-```java
- parser.addMMcifConsumer(myOwnConsumerImplementation);
+ Structure cifStructure = CifStructureConverter.fromInputStream(inStream);
```
## I Loaded a Structure Object, What Now?
diff --git a/structure/secstruc.md b/structure/secstruc.md
index 7216d84..fbd0f94 100644
--- a/structure/secstruc.md
+++ b/structure/secstruc.md
@@ -10,8 +10,8 @@ Secondary structure can be formally defined by the pattern of hydrogen bonds of
More specifically, the secondary structure is defined by the patterns of hydrogen bonds formed between
amine hydrogen (-NH) and carbonyl oxygen (C=O) atoms contained in the backbone peptide bonds of the protein.
-For more info see the Wikipedia article on [protein secondary structure]
-(https://en.wikipedia.org/wiki/Protein_secondary_structure).
+For more info see the Wikipedia article
+on [protein secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure).
## Secondary Structure Annotation
@@ -106,8 +106,8 @@ input Structure overriding any previous annotation, like in the DSSPParser. An e
ssp.calculate(s, true); //true assigns the SS to the Structure
```
-BioJava Class: [org.biojava.nbio.structure.secstruc.SecStrucCalc]
-(http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html)
+BioJava Class:
+[org.biojava.nbio.structure.secstruc.SecStrucCalc](http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html)
### Storage and Data Structures
diff --git a/structure/seqres.md b/structure/seqres.md
index db64971..2d03e04 100644
--- a/structure/seqres.md
+++ b/structure/seqres.md
@@ -5,12 +5,11 @@ How molecular sequences are linked to experimentally observed atoms.
## Sequences and Atoms
-In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB oftein contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).
+In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB often contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).
-Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt.
+Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt.
-![Screenshot of Protein Feature View at RCSB]
-(https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")
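+![Screenshot of Protein Feature View at RCSB](https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")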
As you can see, there are three PDB entries (PDB IDs [3LOH](http://www.rcsb.org/pdb/explore.do?structureId=3LOH), [2HR7](http://www.rcsb.org/pdb/explore.do?structureId=2HR7), [3BU3](http://www.rcsb.org/pdb/explore.do?structureId=3BU3)) that cover different regions of the UniProt sequence for the insulin receptor.
@@ -18,7 +17,7 @@ The blue-boxes are regions for which atoms records are available. For the grey r
## Seqres and Atom Records
-The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequences as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure.
+The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequence as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure.
The **Atom** records provide coordinates where it was possible to observe them.
diff --git a/structure/structure-data-model.md b/structure/structure-data-model.md
index edfd882..6ea6ce4 100644
--- a/structure/structure-data-model.md
+++ b/structure/structure-data-model.md
@@ -28,7 +28,7 @@ Structure
All `Structure` objects contain one or more `Models`. That means X-ray structures also contain a "virtual" model which serves as a container for the chains. This makes it possible to represent multi-model X-ray structures, e.g. from time-series analysis. The most common way to access chains is via:
```java
- List<Chain> chains = structure.getChains();
+ List<Chain> chains = structure.getChains();
```
This works for both NMR and X-ray based structures and by default the first `Model` is getting accessed.
@@ -58,7 +58,7 @@ Here an example that loops over the whole data model and prints out the HEM grou
for (Chain c : chains) {
- System.out.println(" Chain: " + c.getChainID() + " # groups with atoms: " + c.getAtomGroups().size());
+ System.out.println(" Chain: " + c.getId() + " # groups with atoms: " + c.getAtomGroups().size());
for (Group g: c.getAtomGroups()){
@@ -87,24 +87,24 @@ The [Group](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Group.htm
In order to get all amino acids that have been observed in a PDB chain, you can use the following utility method:
```java
- Chain chain = s.getChainByPDB("A");
- List<Group> groups = chain.getAtomGroups("amino");
+ Chain chain = structure.getPolyChainByPDB("A");
+ List<Group> groups = chain.getAtomGroups(GroupType.AMINOACID);
for (Group group : groups) {
- AminoAcid aa = (AminoAcid) group;
+ SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC);
- // do something amino acid specific, e.g. print the secondary structure assignment
- System.out.println(aa + " " + aa.getSecStruc());
+ // print the secondary structure assignment
+ System.out.println(group + " -- " + secStrucInfo);
}
```
In a similar way you can access all nucleotide groups by
```java
- chain.getAtomGroups("nucleotide");
+ chain.getAtomGroups(GroupType.NUCLEOTIDE);
```
The Hetatom groups are accessed in a similar fashion:
```java
- chain.getAtomGroups("hetatm");
+ chain.getAtomGroups(GroupType.HETATM);
```
@@ -112,10 +112,10 @@ Since all 3 types of groups are implementing the Group interface, you can also i
```java
List<Group> allgroups = chain.getAtomGroups();
- for (Group group : groups) {
- if ( group instanceof AminoAcid) {
- AminoAcid aa = (AminoAcid) group;
- System.out.println(aa.getSecStruc());
+ for (Group group : allgroups) {
+ if (group.isAminoAcid()) {
+ SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC);
+ System.out.println(group + " -- " + secStrucInfo);
}
}
```
@@ -126,7 +126,7 @@ The detection of the groups works really well in connection with the [Chemical C
## Entities and Chains
-Entities (in the BioJava API called compounds) are the distinct chemical components of structures in the PDB.
+Entities are the distinct chemical components of structures in the PDB.
Unlike chains, entities do not include duplicate copies and each entity is different from every other
entity in the structure. There are different types of entities. Polymer entities include Protein, DNA,
and RNA. Ligands are smaller chemical components that are not part of a polymer entity.
@@ -140,15 +140,15 @@ and beta. Each of the entities has two copies (= chains) in the structure. IN 4H
has the two chains with the IDs A, and C and beta the chains B, and D. In total, hemoglobin is
built up out of four chains.
-This prints all the compounds/entities in a structure
+This prints all the entities in a structure
```java
Structure structure = StructureIO.getStructure("4hhb");
System.out.println(structure);
- System.out.println(" # of compounds (entities) " + structure.getCompounds().size());
+ System.out.println(" # of compounds (entities) " + structure.getEntityInfos().size());
- for ( Compound entity: structure.getCompounds()) {
+ for ( EntityInfo entity: structure.getEntityInfos()) {
System.out.println(" " + entity);
}
```
diff --git a/structure/symmetry.md b/structure/symmetry.md
index 7816bac..cfe5186 100644
--- a/structure/symmetry.md
+++ b/structure/symmetry.md
@@ -1,64 +1,63 @@
Protein Symmetry using BioJava
================================================================
-BioJava can be used to detect, analyze, and visualize **symmetry** and
-**pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary
-(**internal**) structural levels.
+BioJava can be used to detect, analyze, and visualize **symmetry** and
+**pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary
+(**internal**) structural levels of proteins.
## Quaternary Symmetry
-The **quaternary symmetry** of a structure defines the relations between
-its individual chains or groups of chains. For a more extensive explanation
-about symmetery visit the [PDB help page]
-(http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html).
+The **quaternary symmetry** of a structure defines the relation and arrangement of the individual chains or groups of chains that are part of a biological assembly.
+For a more exhaustive explanation of protein quaternary symmetry and its different types, visit the [PDB help page](http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html).
-In the **quaternary symmetry** detection problem, we are given a set of chains
-with its `Atom` coordinates and we are asked to find the higest overall symmetry that
-relates them. The solution is divided into the following steps:
+In the **quaternary symmetry** detection problem, we are given as input a set of chains (subunits) that are part of a biological assembly, defined by their atomic coordinates, and we are required to find as output the highest overall symmetry group that
+relates them.
+The solution is divided into the following steps:
1. First, we need to identify the chains that are identical (or similar
-in the pseudo-symmetry case). For that, we perform a pairwise alignment of all
-chains and determine **clusters of identical chains**.
-2. Next, we reduce the each chains to a single point, its **centroid** (center of mass).
-3. After that, we try different **symmetry relations** to superimpose the chain centroids
-and obtain their RMSD.
-4. At last, based on the parameters (cutoffs), we determine the **overall symmetry** of the
+in the pseudo-symmetry case). For that purpose, we perform a pairwise alignment of all
+chains and identify **clusters of identical or similar subunits**.
+2. Next, we reduce each of the polypeptide chains to a single point, their **centroid** (center of mass).
+3. Afterwards, we try different **symmetry operations** using a grid search to superimpose the chain centroids
+and score them using the RMSD.
+4. Finally, based on the parameters (cutoffs), we determine the **overall symmetry** of the
structure, with the symmetry relations obtained in the previous step.
5. For an asymmetric structure, we combinatorially discard a number of chains and try
-to detect any **local symmetries** present.
+to detect any **local symmetries** present (symmetry that does not involve all subunits of the biological assembly).
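
Steps 2 and 3 can be illustrated with a minimal, self-contained sketch. It is not BioJava's implementation: the class and method names are hypothetical, the centroid is an unweighted mean of the coordinates, only rotations about the z axis are tried (instead of a grid search over orientations), and each rotated centroid is simply matched to its nearest original centroid.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of the BioJava API.
final class CentroidSymmetrySketch {

    // Unweighted mean of a set of 3D points (a stand-in for the center of mass).
    static double[] centroid(List<double[]> atoms) {
        double[] c = new double[3];
        for (double[] a : atoms) {
            for (int i = 0; i < 3; i++) {
                c[i] += a[i];
            }
        }
        for (int i = 0; i < 3; i++) {
            c[i] /= atoms.size();
        }
        return c;
    }

    // Rotates a point around the z axis by the given angle in radians.
    static double[] rotateZ(double[] p, double angle) {
        double cos = Math.cos(angle);
        double sin = Math.sin(angle);
        return new double[] { cos * p[0] - sin * p[1], sin * p[0] + cos * p[1], p[2] };
    }

    // Scores an n-fold rotation about z: every centroid is rotated by 360/n degrees
    // and compared with the closest original centroid; the result is the RMSD.
    static double scoreCn(List<double[]> centroids, int n) {
        double angle = 2 * Math.PI / n;
        double sumSq = 0;
        for (double[] c : centroids) {
            double[] r = rotateZ(c, angle);
            double best = Double.MAX_VALUE;
            for (double[] other : centroids) {
                double d = 0;
                for (int i = 0; i < 3; i++) {
                    d += (r[i] - other[i]) * (r[i] - other[i]);
                }
                best = Math.min(best, d);
            }
            sumSq += best;
        }
        return Math.sqrt(sumSq / centroids.size());
    }

    public static void main(String[] args) {
        // Four subunit centroids arranged on a square in the xy plane.
        List<double[]> centroids = new ArrayList<>();
        centroids.add(new double[] { 1, 0, 0 });
        centroids.add(new double[] { 0, 1, 0 });
        centroids.add(new double[] { -1, 0, 0 });
        centroids.add(new double[] { 0, -1, 0 });
        for (int n = 2; n <= 4; n++) {
            System.out.printf("C%d RMSD: %.3f%n", n, scoreCn(centroids, n));
        }
    }
}
```

For the square arrangement the RMSD is close to zero for the 2-fold and 4-fold rotations and clearly non-zero for the 3-fold one, which is the kind of signal that an RMSD cutoff in step 4 would act on.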
The **quaternary symmetry** detection algorithm is implemented in the biojava class
[QuatSymmetryDetector](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/core/QuatSymmetryDetector).
An example of how to use it programmatically is shown below:
```java
-//First download the structure in the biological assembly form
+// First download the structure in the biological assembly form
Structure s;
-//Set some parameters if needed different than DEFAULT - see descriptions
+// Set some parameters if needed different than DEFAULT - see descriptions
QuatSymmetryParameters parameters = new QuatSymmetryParameters();
-parameters.setVerbose(true); //print information
+SubunitClustererParameters clusterParams = new SubunitClustererParameters();
-//Instantiate the detector
-QuatSymmetryDetector detector = QuatSymmetryDetector(structure, parameters);
+// No detector instance is needed: QuatSymmetryDetector is used through its static methods (see below)
-//The getters calculate the quaternary symmetry automatically
-List<QuatSymmetryResults> globalResults = detector.getGlobalSymmetry();
-List<List<QuatSymmetryResults>> localResults = detector.getLocalSymmetries();
+// Static methods in QuatSymmetryDetector perform the calculation
+QuatSymmetryResults globalResults = QuatSymmetryDetector.getGlobalSymmetry(s, parameters, clusterParams);
+List<QuatSymmetryResults> localResults = QuatSymmetryDetector.getLocalSymmetries(s, parameters, clusterParams);
```
-The return type are `List` because there can be multiple valid options for the
-quaternary symmetry. The local results `List` is empty if there exist no local
-symmetry in the structure, and the global results `List` has always size bigger
-than 1, returning a C1 point group in the case of asymmetric structure.
+See also the [demo](https://github.com/biojava/biojava/blob/885600670be75b7f6bc5216bff52a93f43fff09e/biojava-structure/src/main/java/demo/DemoSymmetry.java#L37-L59) provided in **BioJava** for a real case working example.
+
+The returned `QuatSymmetryResults` object contains all the information about the subunit clustering and the structural symmetry.
+This object will be used later to obtain axes of symmetry, point group name, stoichiometry or even display the results in Jmol.
+For an asymmetric structure, the global result is a C1 point group.
+The return type of the local symmetry calculation is a `List` because there can be multiple valid local symmetries.
+The list will be empty if there are no local symmetries in the structure.
-The `QuatSymmetryResults` object contains all the information of the symmetry.
-This object will be used later to obtain axes of symmetry, point group name,
-stoichiometry or even display the results in Jmol.
### Global Symmetry
-In **global symmetry** all chains have to be part of the symmetry description.
+In the **global symmetry** mode all chains have to be part of the symmetry result.
#### Point Group
@@ -76,51 +75,50 @@ components.
### Local Symmetry
-In **local symmetry** a number of chains is left out, so that the symmetry
-only applies to a subset of chains.
+In **local symmetry** a number of chains is left out, so that the symmetry only applies to a subset of chains.

### Pseudo-Symmetry
In **pseudo-symmetry** the chains related by the symmetry are not completely
-identical, but they share a sequence similarity above the pseudo-symmetry
+identical, but they share a sequence or structural similarity above the pseudo-symmetry
similarity threshold.
-If we consider hemoglobin, at a 95% sequence identity threshold the alpha and
-beta subunits are considered different, which correspond to an A2B2 stoichiometry
-and a C2 point group. At the structural similarity level, all four chains are
-considered homologous (~45% sequence identity) with an A4 pseudostoichiometry and
-D2 pseudosymmetry.
+If we consider hemoglobin, at a 95% sequence identity threshold the alpha and
+beta subunits are considered different, which corresponds to an A2B2 stoichiometry
+and a C2 point group. At the structural similarity level, all four chains are
+considered homologous (~45% sequence identity) with an A4 pseudostoichiometry and
+D2 pseudosymmetry.
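
The effect of the similarity threshold on the stoichiometry can be sketched with a small, self-contained example. It is a simplified illustration and not BioJava's clustering code: the class and method names are hypothetical, subunits are merged by single linkage on a pairwise identity matrix, and a lowered sequence identity threshold stands in for the structural similarity criterion. The identity values (1.0 within alpha or beta, 0.45 between them) follow the hemoglobin figures quoted above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of the BioJava API.
final class StoichiometrySketch {

    // Finds the representative of the cluster that subunit i belongs to.
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    // Merges subunits whose pairwise identity reaches the threshold and
    // formats the cluster sizes as a stoichiometry string such as "A2B2".
    // Letters run from 'A', so this sketch is only meant for a few clusters.
    static String stoichiometry(double[][] identity, double threshold) {
        int n = identity.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (identity[i][j] >= threshold) {
                    parent[find(parent, i)] = find(parent, j);
                }
            }
        }
        Map<Integer, Integer> sizes = new HashMap<>();
        for (int i = 0; i < n; i++) {
            sizes.merge(find(parent, i), 1, Integer::sum);
        }
        List<Integer> sorted = new ArrayList<>(sizes.values());
        sorted.sort(Collections.reverseOrder());
        StringBuilder sb = new StringBuilder();
        char label = 'A';
        for (int size : sorted) {
            sb.append(label++).append(size);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Chains A, B, C, D of hemoglobin: A and C are alpha, B and D are beta.
        double[][] identity = {
            { 1.00, 0.45, 1.00, 0.45 },
            { 0.45, 1.00, 0.45, 1.00 },
            { 1.00, 0.45, 1.00, 0.45 },
            { 0.45, 1.00, 0.45, 1.00 }
        };
        System.out.println(stoichiometry(identity, 0.95));
        System.out.println(stoichiometry(identity, 0.40));
    }
}
```

With the strict threshold the alpha and beta chains stay in separate clusters, while the permissive threshold merges all four chains into one cluster, mirroring the A2B2 versus A4 distinction described above.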

## Internal Symmetry
-**Internal symmetry** refers to the symmetry present in a single chain, that is,
-the tertiary structure. The algorithm implemented in biojava to detect internal
+**Internal symmetry** refers to the symmetry present in a single chain, that is,
+the tertiary structure. The algorithm implemented in biojava to detect internal
symmetry is called **CE-Symm**.
### CE-Symm
-The **CE-Symm** algorithm was originally developed by [Myers-Turnbull D., Bliven SE.,
+The **CE-Symm** algorithm was originally developed by [Myers-Turnbull D., Bliven SE.,
Rose PW., Aziz ZK., Youkharibache P., Bourne PE. & Prlić A. in 2014](http://www.sciencedirect.com/science/article/pii/S0022283614001557) ([PubMed](http://www.ncbi.nlm.nih.gov/pubmed/24681267)).
As the name of the algorithm explicitly states, **CE-Symm** uses the Combinatorial
-Extension (**CE**) algorithm to generate an alignment of the structure chain to itself,
-disabling the identity alignment (the diagonal of the **DotPlot** representation of a
-structure alignment). This allows the identification of alternative self-alignments,
+Extension (**CE**) algorithm to generate an alignment of the structure chain to itself,
+disabling the identity alignment (the diagonal of the **DotPlot** representation of a
+structure alignment). This allows the identification of alternative self-alignments,
which are related to symmetry and/or structural repeats inside the chain.
-By a procedure called **refinement**, the subunits of the chain that are part of the symmetry
+By a procedure called **refinement**, the subunits of the chain that are part of the symmetry
are defined and a **multiple alignment** is created. This process can be thought of as
dividing the chain into subchains and then superimposing the subchains on each other to
create a multiple alignment of the subunits, respecting the symmetry axes.
The **internal symmetry** detection algorithm is implemented in the biojava class
[CeSymm](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/internal/CeSymm).
-It returns a MultipleAlignment, see the explanation of the model in [Data Models](alignment-data-model.md),
-that describes the internal subunits multiple alignment. In case of no symmetry detected, the
+It returns a `MultipleAlignment` object, see the explanation of the model in [Data Models](alignment-data-model.md),
+that describes the similarity of the internal repeats. In case of no symmetry detected, the
returned alignment represents the optimal self-alignment produced by the first step of the **CE-Symm**
algorithm.
@@ -157,7 +155,7 @@ System.out.println(pg.getSymmetry());
To enable some extra features in the display, a `SymmetryDisplay`
class has been created, although the `MultipleAlignmentDisplay` method
-can also be used for that purpose (it will not show symmetry axes or
+can also be used for that purpose (it will not show symmetry axes or
symmetry menus).
Lastly, the `SymmetryGUI` class in the **structure-gui** package
@@ -167,7 +165,7 @@ to the GUI to trigger structure alignments.
### Symmetry Display
The symmetry display is similar to the **quaternary symmetry** display, because
-part of the code is shared. See for example this beta-propeller (1U6D),
+part of the code is shared. See for example this beta-propeller (1U6D),
where the repeated beta-sheets are connected by a linker forming a C6
point group internal symmetry:
@@ -176,10 +174,10 @@ point group internal symmetry:
#### Hierarchical Symmetry
One additional feature of the **internal symmetry** display is the representation
-of hierarchical symmetries and repeats. Contrary to point groups, some structures
-have different **levels** of symmetry. That is, the whole strucutre has, e.g. C2
-symmetry and, at the same time, each of the two parts has C2 symmetry, but the axes
-of both levels are not related by a point group (i.e. they do not cross to a single
+of hierarchical symmetries and repeats. Contrary to point groups, some structures
+have different **levels** of symmetry. That is, the whole structure has, e.g. C2
+symmetry and, at the same time, each of the two parts has C2 symmetry, but the axes
+of both levels are not related by a point group (i.e. they do not cross at a single
point).
Very clear examples are the beta-gamma-crystallins, like 4GCR:
@@ -188,33 +186,63 @@ A very clear example are the beta-gamma-crystallins, like 4GCR:
#### Subunit Multiple Alignment
-Another feature of the display is the option to show the **multiple alignment** of
+Another feature of the display is the option to show the **multiple alignment** of
the symmetry related subunits created during the **refinement** process. Search for
-the option *Subunit Superposition* in the *symmetry* menu of the Jmol window. For
+the option *Subunit Superposition* in the *symmetry* menu of the Jmol window. For
the previous example the display looks like this:

-The subunit display highlights the differences and similarities between the symmetry
+The subunit display highlights the differences and similarities between the symmetry
related subunits of the chain, and helps the user to identify conserved and divergent
regions, with the help of the *Sequence Alignment Panel*.
-## Combined Global Symmetry
+## Quaternary + Internal Overall Symmetry
-Finally, the internal and quaternary symmetries can be combined to obtain the global
+Finally, the internal and quaternary symmetries can be merged to obtain the
overall combined symmetry. As we have seen before, the protein 1VYM is a DNA-clamp that
-has three chains relates by C3 symmetry. Each chain is internally C2 symmetric, and each
-part of the C2 internal symmetry is C2 symmetric, so a case of **hierarchical symmetry**
-(C2 + C2). Once we have divided the whole structure into its asymmetric parts, we can
-analyze the global symmetry that related each one of them. The interesting result is that
-in some cases, the internal symmetry **multiplies** the point group of the quaternary symmetry.
-What seemed a C3 + C2 + C2 is combined into a D6 overall symmetry, as we can see in the figure
-below:
+has three chains arranged in a C3 symmetry.
+Each chain is internally fourfold symmetric with two levels of symmetry. We can analyze the overall symmetry of the structure by considering together the C3 quaternary symmetry and the fourfold internal symmetry.
+In this case, the internal symmetry **augments** the point group of the quaternary symmetry to a D6 overall symmetry, as we can see in the figure below:

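The point-group arithmetic behind this result can be checked with a short, self-contained sketch. It does not use BioJava: the class `PointGroupOrder` and its methods are hypothetical names introduced only for illustration. It relies on the standard facts that a cyclic group Cn has order n and a dihedral group Dn has order 2n, and it assumes that each repeat corresponds to one symmetry operation of the overall group.

```java
// Illustrative sketch only; not part of the BioJava API.
class PointGroupOrder {

    /** Order (number of symmetry operations) of the cyclic point group Cn. */
    static int cyclicOrder(int n) {
        return n;
    }

    /** Order of the dihedral point group Dn: n rotations about the main axis plus n twofold rotations. */
    static int dihedralOrder(int n) {
        return 2 * n;
    }

    /** Total number of equivalent repeats: chains times internal repeats per chain. */
    static int totalRepeats(int chains, int repeatsPerChain) {
        return chains * repeatsPerChain;
    }

    public static void main(String[] args) {
        int chains = cyclicOrder(3);   // C3 quaternary symmetry: three chains
        int repeatsPerChain = 4;       // fourfold internal symmetry (two levels of C2)
        int repeats = totalRepeats(chains, repeatsPerChain);

        System.out.println("Equivalent repeats: " + repeats);
        System.out.println("Order of D6: " + dihedralOrder(6));
        System.out.println("Counts match: " + (repeats == dihedralOrder(6)));
    }
}
```

The matching counts only show that D6 is compatible with the number of repeats. Whether the repeats are actually arranged with D6 symmetry is what the symmetry detection determines from the coordinates.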
-These results can give hints about the function and evolution of proteins and biological
-structures.
+An example of how to enable the **combined symmetry** (quaternary + internal symmetries) programmatically is shown below:
+
+```java
+// First download the structure in the biological assembly form (here 1VYM, the example above)
+Structure s = StructureIO.getBiologicalAssembly("1VYM");
+
+// Initialize default parameters
+QuatSymmetryParameters parameters = new QuatSymmetryParameters();
+SubunitClustererParameters clusterParams = new SubunitClustererParameters();
+
+// In SubunitClustererParameters set the clustering method to STRUCTURE and the internal symmetry option to true
+clusterParams.setClustererMethod(SubunitClustererMethod.STRUCTURE);
+clusterParams.setInternalSymmetry(true);
+
+// You can lower the default structural coverage to improve the recall
+clusterParams.setStructureCoverageThreshold(0.75);
+
+// QuatSymmetryDetector is not instantiated directly;
+// its static methods perform the calculation
+
+// Compute the overall (quaternary + internal) symmetry
+QuatSymmetryResults overallResults = QuatSymmetryDetector.calcGlobalSymmetry(s, parameters, clusterParams);
+
+```
+
+See also the [test](https://github.com/biocryst/biojava/blob/df22da37a86a0dba3fb35bee7e17300d402ab469/biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/symmetry/TestQuatSymmetryDetectorExamples.java#L167-L192) provided in **BioJava** for a real working example.
+
+
+## Please Cite
+
+**Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm**
+*Spencer E Bliven, Aleix Lafita, Peter W Rose, Guido Capitani, Andreas Prlić, & Philip E Bourne*
+[PLOS Computational Biology (2019) 15 (4):e1006842.](https://journals.plos.org/ploscompbiol/article/citation?id=10.1371/journal.pcbi.1006842)
+[DOI: 10.1371/journal.pcbi.1006842](https://doi.org/10.1371/journal.pcbi.1006842) [PubMed: 31009453](http://www.ncbi.nlm.nih.gov/pubmed/31009453)
+
+