From fad66650c3593a7fa72f19962e0726f80a58e0a9 Mon Sep 17 00:00:00 2001 From: Spencer Bliven Date: Wed, 26 Jul 2017 16:35:51 +0200 Subject: [PATCH 01/21] Add PDB_CACHE_DIR --- structure/caching.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/structure/caching.md b/structure/caching.md index fafec7d..e2da072 100644 --- a/structure/caching.md +++ b/structure/caching.md @@ -31,6 +31,8 @@ you can configure the AtomCache by setting the PDB_DIR system property -DPDB_DIR=/wherever/you/want/ +BioJava will also check for a `PDB_DIR` environmental variable. If you launch BioJava from the command line, it can be useful to include `export PDB_DIR=/wherever/you/want` in your `.bashrc` file. + An alternative is to hard-code the path in this way (but setting it as a property is better style) ```java @@ -78,10 +80,7 @@ The AtomCache not only provides access to PDB, it can also fetch Structure repre There are quite a number of external database IDs that are supported here. See the AtomCache documentation for more details on the supported options. - - - - +The non-PDB files can be cached at a different location by setting the `PDB_CACHE_DIR` property (with `java -DPDB_CACHE_DIR=...`) or environmental variable. From 99ab7b5a18cfb399650365dbfb5181841cfadb08 Mon Sep 17 00:00:00 2001 From: Spencer Bliven Date: Wed, 26 Jul 2017 17:04:05 +0200 Subject: [PATCH 02/21] Add configuration section --- structure/installation.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/structure/installation.md b/structure/installation.md index 536d764..e585df8 100644 --- a/structure/installation.md +++ b/structure/installation.md @@ -36,6 +36,25 @@ If you run on your project, the BioJava dependencies will be automatically downloaded and installed for you. +### (Optional) Configuration + +BioJava can be configured through several properties: + +| Property | Description | +| --- | --- | +| `PDB_DIR` | Directory for caching structure files from the PDB. 
Mirrors the PDB's FTP server directory structure, with `PDB_DIR` equivalent to ftp://ftp.wwpdb.org/pub/pdb/. Default: temp directory | +| `PDB_CACHE_DIR` | Cache directory for other files related to the structure package. Default: temp directory | + +These can be set either as java properties or as environmental variables. For example: + +``` +# This could be added to .bashrc +export PDB_DIR=... +# Or override for a particular execution +java -DPDB_DIR=... -cp ... +``` + +Note that your IDE may ignore `.bashrc` settings, but should have a preference for passing VM arguments. From b2a9031866e612d23f466cbd9931779c1152954d Mon Sep 17 00:00:00 2001 From: Jose Manuel Duarte Date: Thu, 5 Oct 2017 11:20:28 -0700 Subject: [PATCH 03/21] Update symmetry.md --- structure/symmetry.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/structure/symmetry.md b/structure/symmetry.md index 7816bac..1940106 100644 --- a/structure/symmetry.md +++ b/structure/symmetry.md @@ -9,8 +9,7 @@ BioJava can be used to detect, analyze, and visualize **symmetry** and The **quaternary symmetry** of a structure defines the relations between its individual chains or groups of chains. For a more extensive explanation -about symmetery visit the [PDB help page] -(http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html). +about symmetery visit the [PDB help page](http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html). 
In the **quaternary symmetry** detection problem, we are given a set of chains with its `Atom` coordinates and we are asked to find the higest overall symmetry that From 1091686e5fb5c7727da5c06db8ee702d5fb17c41 Mon Sep 17 00:00:00 2001 From: Jose Manuel Duarte Date: Thu, 5 Oct 2017 15:45:24 -0700 Subject: [PATCH 04/21] Updating symmetry section --- structure/symmetry.md | 74 +++++++++++++++++++++++-------------------- 1 file changed, 39 insertions(+), 35 deletions(-) diff --git a/structure/symmetry.md b/structure/symmetry.md index 1940106..ef6a8f8 100644 --- a/structure/symmetry.md +++ b/structure/symmetry.md @@ -1,14 +1,14 @@ Protein Symmetry using BioJava ================================================================ -BioJava can be used to detect, analyze, and visualize **symmetry** and -**pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary +BioJava can be used to detect, analyze, and visualize **symmetry** and +**pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary (**internal**) structural levels. ## Quaternary Symmetry -The **quaternary symmetry** of a structure defines the relations between -its individual chains or groups of chains. For a more extensive explanation +The **quaternary symmetry** of a structure defines the relations between +its individual chains or groups of chains. For a more extensive explanation about symmetery visit the [PDB help page](http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html). In the **quaternary symmetry** detection problem, we are given a set of chains @@ -19,7 +19,7 @@ relates them. The solution is divided into the following steps: in the pseudo-symmetry case). For that, we perform a pairwise alignment of all chains and determine **clusters of identical chains**. 2. Next, we reduce the each chains to a single point, its **centroid** (center of mass). -3. After that, we try different **symmetry relations** to superimpose the chain centroids +3. 
After that, we try different **symmetry relations** to superimpose the chain centroids and obtain their RMSD. 4. At last, based on the parameters (cutoffs), we determine the **overall symmetry** of the structure, with the symmetry relations obtained in the previous step. @@ -36,16 +36,20 @@ Structure s; //Set some parameters if needed different than DEFAULT - see descriptions QuatSymmetryParameters parameters = new QuatSymmetryParameters(); -parameters.setVerbose(true); //print information +SubunitClustererParameters clusterParams = new SubunitClustererParameters(); //Instantiate the detector -QuatSymmetryDetector detector = QuatSymmetryDetector(structure, parameters); +QuatSymmetryDetector detector = QuatSymmetryDetector(s, parameters, clusterParams); -//The getters calculate the quaternary symmetry automatically -List<QuatSymmetryResults> globalResults = detector.getGlobalSymmetry(); -List<List<QuatSymmetryResults>> localResults = detector.getLocalSymmetries(); +//Static methods in QuatSymmetryDetector perform the calculation +QuatSymmetryResults globalResults = QuatSymmetryDetector.getGlobalSymmetry(s, parameters, clusterParams); +List<QuatSymmetryResults> localResults = QuatSymmetryDetector.getLocalSymmetries(s, parameters, clusterParams); ``` +See also the demo in the BioJava repo: + +https://github.com/biojava/biojava/blob/885600670be75b7f6bc5216bff52a93f43fff09e/biojava-structure/src/main/java/demo/DemoSymmetry.java#L37-L59 + The return type are `List` because there can be multiple valid options for the quaternary symmetry. The local results `List` is empty if there exist no local symmetry in the structure, and the global results `List` has always size bigger @@ -83,35 +87,35 @@ only applies to a subset of chains. ### Pseudo-Symmetry In **pseudo-symmetry** the chains related by the symmetry are not completely -identical, but they share a sequence similarity above the pseudo-symmetry +identical, but they share a sequence similarity above the pseudo-symmetry similarity threshold. 
-If we consider hemoglobin, at a 95% sequence identity threshold the alpha and -beta subunits are considered different, which correspond to an A2B2 stoichiometry -and a C2 point group. At the structural similarity level, all four chains are -considered homologous (~45% sequence identity) with an A4 pseudostoichiometry and -D2 pseudosymmetry. +If we consider hemoglobin, at a 95% sequence identity threshold the alpha and +beta subunits are considered different, which correspond to an A2B2 stoichiometry +and a C2 point group. At the structural similarity level, all four chains are +considered homologous (~45% sequence identity) with an A4 pseudostoichiometry and +D2 pseudosymmetry. ![PDB ID 4HHB](img/symm_pseudo.png) ## Internal Symmetry -**Internal symmetry** refers to the symmetry present in a single chain, that is, -the tertiary structure. The algorithm implemented in biojava to detect internal +**Internal symmetry** refers to the symmetry present in a single chain, that is, +the tertiary structure. The algorithm implemented in biojava to detect internal symmetry is called **CE-Symm**. ### CE-Symm -The **CE-Symm** algorithm was originally developed by [Myers-Turnbull D., Bliven SE., +The **CE-Symm** algorithm was originally developed by [Myers-Turnbull D., Bliven SE., Rose PW., Aziz ZK., Youkharibache P., Bourne PE. & Prlić A. in 2014] (http://www.sciencedirect.com/science/article/pii/S0022283614001557) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/24681267). As the name of the algorithm explicitly states, **CE-Symm** uses the Combinatorial -Extension (**CE**) algorithm to generate an alignment of the structure chain to itself, -disabling the identity alignment (the diagonal of the **DotPlot** representation of a -structure alignment). 
This allows the identification of alternative self-alignments, +Extension (**CE**) algorithm to generate an alignment of the structure chain to itself, +disabling the identity alignment (the diagonal of the **DotPlot** representation of a +structure alignment). This allows the identification of alternative self-alignments, which are related to symmetry and/or structural repeats inside the chain. -By a procedure called **refinement**, the subunits of the chain that are part of the symmetry +By a procedure called **refinement**, the subunits of the chain that are part of the symmetry are defined and a **multiple alignment** is created. This process can be thought as to divide the chain into other subchains, and then superimposing each subchain to each other to create a multiple alignment of the subunits, respecting the symmetry axes. @@ -156,7 +160,7 @@ System.out.println(pg.getSymmetry()); To enable some extra features in the display, a `SymmetryDisplay` class has been created, although the `MultipleAlignmentDisplay` method -can also be used for that purpose (it will not show symmetry axes or +can also be used for that purpose (it will not show symmetry axes or symmetry menus). Lastly, the `SymmetryGUI` class in the **structure-gui** package @@ -166,7 +170,7 @@ to the GUI to trigger structure alignments. ### Symmetry Display The symmetry display is similar to the **quaternary symmetry**, because -part of the code is shared. See for example this beta-propeller (1U6D), +part of the code is shared. See for example this beta-propeller (1U6D), where the repeated beta-sheets are connected by a linker forming a C6 point group internal symmetry: @@ -175,10 +179,10 @@ point group internal symmetry: #### Hierarchical Symmetry One additional feature of the **internal symmetry** display is the representation -of hierarchical symmetries and repeats. Contrary to point groups, some structures -have different **levels** of symmetry. That is, the whole strucutre has, e.g. 
C2 -symmetry and, at the same time, each of the two parts has C2 symmetry, but the axes -of both levels are not related by a point group (i.e. they do not cross to a single +of hierarchical symmetries and repeats. Contrary to point groups, some structures +have different **levels** of symmetry. That is, the whole strucutre has, e.g. C2 +symmetry and, at the same time, each of the two parts has C2 symmetry, but the axes +of both levels are not related by a point group (i.e. they do not cross to a single point). A very clear example are the beta-gamma-crystallins, like 4GCR: @@ -187,14 +191,14 @@ A very clear example are the beta-gamma-crystallins, like 4GCR: #### Subunit Multiple Alignment -Another feature of the display is the option to show the **multiple alignment** of +Another feature of the display is the option to show the **multiple alignment** of the symmetry related subunits created during the **refinement** process. Search for -the option *Subunit Superposition* in the *symmetry* menu of the Jmol window. For +the option *Subunit Superposition* in the *symmetry* menu of the Jmol window. For the previous example the display looks like that: ![PDB ID 4GCR](img/symm_subunits.png) -The subunit display highlights the differences and similarities between the symmetry +The subunit display highlights the differences and similarities between the symmetry related subunits of the chain, and helps the user to identify conseved and divergent regions, with the help of the *Sequence Alignment Panel*. @@ -202,9 +206,9 @@ regions, with the help of the *Sequence Alignment Panel*. Finally, the internal and quaternary symmetries can be combined to obtain the global overall combined symmetry. As we have seen before, the protein 1VYM is a DNA-clamp that -has three chains relates by C3 symmetry. Each chain is internally C2 symmetric, and each -part of the C2 internal symmetry is C2 symmetric, so a case of **hierarchical symmetry** -(C2 + C2). 
Once we have divided the whole structure into its asymmetric parts, we can +has three chains relates by C3 symmetry. Each chain is internally C2 symmetric, and each +part of the C2 internal symmetry is C2 symmetric, so a case of **hierarchical symmetry** +(C2 + C2). Once we have divided the whole structure into its asymmetric parts, we can analyze the global symmetry that related each one of them. The interesting result is that in some cases, the internal symmetry **multiplies** the point group of the quaternary symmetry. What seemed a C3 + C2 + C2 is combined into a D6 overall symmetry, as we can see in the figure From ffc8f4bad71d7a3834c8059130236f6f11b9f9c5 Mon Sep 17 00:00:00 2001 From: Aleix Lafita Date: Fri, 2 Feb 2018 10:21:35 +0000 Subject: [PATCH 05/21] Explain parameter options for combined symmetry In order to visualize point groups formed by quaternary and internal symmetry. --- structure/symmetry.md | 106 ++++++++++++++++++++++++------------------ 1 file changed, 61 insertions(+), 45 deletions(-) diff --git a/structure/symmetry.md b/structure/symmetry.md index ef6a8f8..7404392 100644 --- a/structure/symmetry.md +++ b/structure/symmetry.md @@ -3,65 +3,61 @@ Protein Symmetry using BioJava BioJava can be used to detect, analyze, and visualize **symmetry** and **pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary -(**internal**) structural levels. +(**internal**) structural levels of proteins. ## Quaternary Symmetry -The **quaternary symmetry** of a structure defines the relations between -its individual chains or groups of chains. For a more extensive explanation -about symmetery visit the [PDB help page](http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html). +The **quaternary symmetry** of a structure defines the relation and arrangement of the individual chains or groups of chains that are part of a biological assembly. 
+For a more exhaustive explanation of protein quaternary symmetry and its different types, visit the [PDB help page](http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html). -In the **quaternary symmetry** detection problem, we are given a set of chains -with its `Atom` coordinates and we are asked to find the higest overall symmetry that -relates them. The solution is divided into the following steps: +In the **quaternary symmetry** detection problem, we are given a set of chains (subunits) that are part of a biological assembly as input, defined by their atomic coordinates, and we are required to find the highest overall symmetry group that +relates them as output. +The solution is divided into the following steps: 1. First, we need to identify the chains that are identical (or similar -in the pseudo-symmetry case). For that, we perform a pairwise alignment of all -chains and determine **clusters of identical chains**. -2. Next, we reduce the each chains to a single point, its **centroid** (center of mass). -3. After that, we try different **symmetry relations** to superimpose the chain centroids -and obtain their RMSD. -4. At last, based on the parameters (cutoffs), we determine the **overall symmetry** of the +in the pseudo-symmetry case). For that purpose, we perform a pairwise alignment of all +chains and identify **clusters of identical or similar subunits**. +2. Next, we reduce each of the polypeptide chains to a single point, its **centroid** (center of mass). +3. Afterwards, we try different **symmetry operations** using a grid search to superimpose the chain centroids +and score them using the RMSD. +4. Finally, based on the parameters (cutoffs), we determine the **overall symmetry** of the structure, with the symmetry relations obtained in the previous step. 5. In case of asymmetric structure, we discard combinatorially a number of chains and try -to detect any **local symmetries** present. 
+to detect any **local symmetries** present (symmetry that does not involve all subunits of the biological assembly). The **quaternary symmetry** detection algorithm is implemented in the biojava class [QuatSymmetryDetector](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/core/QuatSymmetryDetector). An example of how to use it programatically is shown below: ```java -//First download the structure in the biological assembly form +// First download the structure in the biological assembly form Structure s; -//Set some parameters if needed different than DEFAULT - see descriptions +// Set some parameters if needed different than DEFAULT - see descriptions QuatSymmetryParameters parameters = new QuatSymmetryParameters(); SubunitClustererParameters clusterParams = new SubunitClustererParameters(); -//Instantiate the detector +// Instantiate the detector QuatSymmetryDetector detector = QuatSymmetryDetector(s, parameters, clusterParams); -//Static methods in QuatSymmetryDetector perform the calculation +// Static methods in QuatSymmetryDetector perform the calculation QuatSymmetryResults globalResults = QuatSymmetryDetector.getGlobalSymmetry(s, parameters, clusterParams); List<QuatSymmetryResults> localResults = QuatSymmetryDetector.getLocalSymmetries(s, parameters, clusterParams); ``` -See also the demo in the BioJava repo: +See also the [demo](https://github.com/biojava/biojava/blob/885600670be75b7f6bc5216bff52a93f43fff09e/biojava-structure/src/main/java/demo/DemoSymmetry.java#L37-L59) provided in **BioJava** for a real case working example. -https://github.com/biojava/biojava/blob/885600670be75b7f6bc5216bff52a93f43fff09e/biojava-structure/src/main/java/demo/DemoSymmetry.java#L37-L59 +The returned `QuatSymmetryResults` object contains all the information of the subunit clustering and structural symmetry. +This object will be used later to obtain axes of symmetry, point group name, stoichiometry or even display the results in Jmol. 
+The return object of quaternary symmetry (`QuatSymmetryResults`) contains the +In case of asymmetrical structure, the result is a C1 point group. +The return type of the local symmetry is a `List` because there can be multiple valid options of local symmetry. +The list will be empty if there exist no local symmetries in the structure. -The return type are `List` because there can be multiple valid options for the -quaternary symmetry. The local results `List` is empty if there exist no local -symmetry in the structure, and the global results `List` has always size bigger -than 1, returning a C1 point group in the case of asymmetric structure. - -The `QuatSymmetryResults` object contains all the information of the symmetry. -This object will be used later to obtain axes of symmetry, point group name, -stoichiometry or even display the results in Jmol. ### Global Symmetry -In **global symmetry** all chains have to be part of the symmetry description. +In the **global symmetry** mode all chains have to be part of the symmetry result. #### Point Group @@ -79,15 +75,14 @@ components. ### Local Symmetry -In **local symmetry** a number of chains is left out, so that the symmetry -only applies to a subset of chains. +In **local symmetry** a number of chains is left out, so that the symmetry only applies to a subset of chains. ![PDB ID 4F88](img/symm_local.png) ### Pseudo-Symmetry In **pseudo-symmetry** the chains related by the symmetry are not completely -identical, but they share a sequence similarity above the pseudo-symmetry +identical, but they share a sequence or structural similarity above the pseudo-symmetry similarity threshold. If we consider hemoglobin, at a 95% sequence identity threshold the alpha and @@ -122,8 +117,8 @@ create a multiple alignment of the subunits, respecting the symmetry axes. 
The **internal symmetry** detection algorithm is implemented in the biojava class [CeSymm](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/internal/CeSymm). -It returns a MultipleAlignment, see the explanation of the model in [Data Models](alignment-data-model.md), -that describes the internal subunits multiple alignment. In case of no symmetry detected, the +It returns a `MultipleAlignment` object, see the explanation of the model in [Data Models](alignment-data-model.md), +that describes the similarity of the internal repeats. In case of no symmetry detected, the returned alignment represents the optimal self-alignment produced by the first step of the **CE-Symm** algorithm. @@ -202,22 +197,43 @@ The subunit display highlights the differences and similarities between the symm related subunits of the chain, and helps the user to identify conseved and divergent regions, with the help of the *Sequence Alignment Panel*. -## Combined Global Symmetry +## Quaternary + Internal Overall Symmetry -Finally, the internal and quaternary symmetries can be combined to obtain the global +Finally, the internal and quaternary symmetries can be merged to obtain the overall combined symmetry. As we have seen before, the protein 1VYM is a DNA-clamp that -has three chains relates by C3 symmetry. Each chain is internally C2 symmetric, and each -part of the C2 internal symmetry is C2 symmetric, so a case of **hierarchical symmetry** -(C2 + C2). Once we have divided the whole structure into its asymmetric parts, we can -analyze the global symmetry that related each one of them. The interesting result is that -in some cases, the internal symmetry **multiplies** the point group of the quaternary symmetry. -What seemed a C3 + C2 + C2 is combined into a D6 overall symmetry, as we can see in the figure -below: +has three chains arranged in a C3 symmetry. +Each chain is internally fourfold symmetric with two levels of symmetry. 
We can analyze the overall symmetry of the structure by considering the C3 quaternary symmetry and the fourfold internal symmetry together. +In this case, the internal symmetry **augments** the point group of the quaternary symmetry to a D6 overall symmetry, as we can see in the figure below: ![PDB ID 1VYM](img/symm_combined.png) -These results can give hints about the function and evolution of proteins and biological -structures. +An example of how to enable the **combined symmetry** (quaternary + internal symmetries) programmatically is shown below: + +```java +// First download the structure in the biological assembly form +Structure s; + +// Initialize default parameters +QuatSymmetryParameters parameters = new QuatSymmetryParameters(); +SubunitClustererParameters clusterParams = new SubunitClustererParameters(); + +// In SubunitClustererParameters set the clustering method to STRUCTURE and the internal symmetry option to true +clusterParams.setClustererMethod(SubunitClustererMethod.STRUCTURE); +clusterParams.setInternalSymmetry(true); + +// You can lower the default structural coverage to improve the recall +clusterParams.setStructureCoverageThreshold(0.75); + +// Instantiate the detector +QuatSymmetryDetector detector = QuatSymmetryDetector(s, parameters, clusterParams); + +// Static methods in QuatSymmetryDetector perform the calculation +QuatSymmetryResults overallResults = QuatSymmetryDetector.getGlobalSymmetry(s, parameters, clusterParams); + +``` + +See also the [test](https://github.com/biocryst/biojava/blob/df22da37a86a0dba3fb35bee7e17300d402ab469/biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/symmetry/TestQuatSymmetryDetectorExamples.java#L167-L192) provided in **BioJava** for a working real-case example. 
+ From 592f1fe1842dba3dac14ad15f30ffff517ea55eb Mon Sep 17 00:00:00 2001 From: Jose Manuel Duarte Date: Tue, 20 Feb 2018 11:48:11 -0800 Subject: [PATCH 06/21] Changing docs to solve issue #27 --- structure/chemcomp.md | 35 +++++++++++------------------------ 1 file changed, 11 insertions(+), 24 deletions(-) diff --git a/structure/chemcomp.md b/structure/chemcomp.md index fb4bb2a..8b665c8 100644 --- a/structure/chemcomp.md +++ b/structure/chemcomp.md @@ -1,7 +1,7 @@ The Chemical Component Dictionary ================================= -The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules. +The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules. ### How Does BioJava Decide what Groups Are Amino Acids? @@ -52,36 +52,23 @@ As you can see, although MSE is flaged as HETATM in the PDB file, BioJava still ### How to Access Chemical Component Definitions -By default BioJava ships with a minimal representation of standard amino acids, which is useful when you just want to work with atoms and a basic data representation. However if you want to work with a correct representation (e.g. distinguish ligands from the polypeptide chain, correctly resolve chemically modified residues), it is good to tell the library to either +By default BioJava will retrieve the full chemical component definitions provided by the Protein Data Bank (see http://www.wwpdb.org/data/ccd). That way BioJava makes sure that the user gets a correct representation e.g. 
distinguish ligands from the polypeptide chain, correctly resolve chemically modified residues, etc. -1. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found), or -2. Load all **Chemical Component Definitions** at startup (slow startup, but then no further delays later on, requires more memory) - -You can enable the first behaviour by doing using the [FileParsingParameters](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/FileParsingParameters.html) class: +The behaviour is configurable by setting a property in the `ChemCompGroupFactory` singleton: +1. Use a minimal built-in set of **Chemical Component Definitions**. Will only deal with most frequent cases of chemical components. Does not guarantee a correct representation, but it is fast and does not require network access. ```java - AtomCache cache = new AtomCache(); - - // by default all files are stored at a temporary location. - // you can set this either via at startup with -DPDB_DIR=/path/to/files/ - // or hard code it this way: - cache.setPath("/tmp/"); - - FileParsingParameters params = new FileParsingParameters(); - - params.setLoadChemCompInfo(true); - cache.setFileParsingParams(params); - - StructureIO.setAtomCache(cache); - - Structure structure = StructureIO.getStructure(...); + ChemCompGroupFactory.setChemCompProvider(new ReducedChemCompProvider()); ``` - -If you want to enable the second behaviour (slow loading of all chem comps at startup, but no further small delays later on) you can use the same code but change the behaviour by switching the [ChemCompProvider](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/ChemCompProvider.html) implementation in the [ChemCompGroupFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/ChemCompGroupFactory.html) - +2. 
Load all **Chemical Component Definitions** at startup (slow startup, but then no further delays later on, requires more memory) ```java ChemCompGroupFactory.setChemCompProvider(new AllChemCompProvider()); ``` +3. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found). Default behaviour since 4.2.0. +```java + ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider()); +``` + From 12a993ee0cf8406785b4fed59236fe5f3b4112fc Mon Sep 17 00:00:00 2001 From: Jose Manuel Duarte Date: Tue, 20 Feb 2018 11:54:43 -0800 Subject: [PATCH 07/21] Another fix --- structure/chemcomp.md | 20 +++----------------- 1 file changed, 3 insertions(+), 17 deletions(-) diff --git a/structure/chemcomp.md b/structure/chemcomp.md index 8b665c8..92f7538 100644 --- a/structure/chemcomp.md +++ b/structure/chemcomp.md @@ -33,26 +33,12 @@ HOH is a group of type hetatm As you can see, although MSE is flaged as HETATM in the PDB file, BioJava still represents it correctly as an amino acid. They key is that the [definition file for MSE](http://www.rcsb.org/pdb/files/ligand/MSE.cif) flags it as "L-PEPTIDE LINKING", which is being used by BioJava. - - - - -
- -Selenomethionine is a naturally occurring amino acid containing selenium - - - - - Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary. (image source: wikipedia) - - -
+Note: Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary. ### How to Access Chemical Component Definitions -By default BioJava will retrieve the full chemical component definitions provided by the Protein Data Bank (see http://www.wwpdb.org/data/ccd). That way BioJava makes sure that the user gets a correct representation e.g. distinguish ligands from the polypeptide chain, correctly resolve chemically modified residues, etc. +By default BioJava will retrieve the full chemical component definitions provided by the PDB. That way BioJava makes sure that the user gets a correct representation, e.g. it distinguishes ligands from the polypeptide chain and correctly resolves chemically modified residues. The behaviour is configurable by setting a property in the `ChemCompGroupFactory` singleton: @@ -64,7 +50,7 @@ The behaviour is configurable by setting a property in the `ChemCompGroupFactory ```java ChemCompGroupFactory.setChemCompProvider(new AllChemCompProvider()); ``` -3. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found). Default behaviour since 4.2.0. +3. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical component is found). Default behaviour since 4.2.0. Note that the chemical component files are cached in the local file system for subsequent use. 
```java ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider()); ``` From 24ba882918116aa96fcebd98dd74082b83251268 Mon Sep 17 00:00:00 2001 From: Jose Manuel Duarte Date: Fri, 22 Jun 2018 19:09:40 -0700 Subject: [PATCH 08/21] Update structure-data-model.md --- structure/structure-data-model.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/structure/structure-data-model.md b/structure/structure-data-model.md index edfd882..c8db2c0 100644 --- a/structure/structure-data-model.md +++ b/structure/structure-data-model.md @@ -126,7 +126,7 @@ The detection of the groups works really well in connection with the [Chemical C ## Entities and Chains -Entities (in the BioJava API called compounds) are the distinct chemical components of structures in the PDB. +Entities are the distinct chemical components of structures in the PDB. Unlike chains, entities do not include duplicate copies and each entity is different from every other entity in the structure. There are different types of entities. Polymer entities include Protein, DNA, and RNA. Ligands are smaller chemical components that are not part of a polymer entity. @@ -140,15 +140,15 @@ and beta. Each of the entities has two copies (= chains) in the structure. IN 4H has the two chains with the IDs A, and C and beta the chains B, and D. In total, hemoglobin is built up out of four chains. 
-This prints all the compounds/entities in a structure +This prints all the entities in a structure ```java Structure structure = StructureIO.getStructure("4hhb"); System.out.println(structure); - System.out.println(" # of compounds (entities) " + structure.getCompounds().size()); + System.out.println(" # of compounds (entities) " + structure.getEntityInfos().size()); - for ( Compound entity: structure.getCompounds()) { + for ( EntityInfo entity: structure.getEntityInfos()) { System.out.println(" " + entity); } ``` From 3552d2d41f2c98c33a6ead8c28e45bcd49b5d36a Mon Sep 17 00:00:00 2001 From: Aleix Lafita Date: Tue, 7 May 2019 18:01:38 +0100 Subject: [PATCH 09/21] Update BioJava citations everywhere --- README.md | 16 +++++++--------- alignment/README.md | 17 +++++++---------- core/README.md | 17 +++++++---------- genomics/README.md | 17 +++++++---------- modfinder/README.md | 21 +++++++++------------ protein-disorder/README.md | 16 +++++++--------- structure/README.md | 20 +++++++------------- 7 files changed, 51 insertions(+), 73 deletions(-) diff --git a/README.md b/README.md index a97ff6e..dba04c8 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ Tutorial === -A brief introduction into [BioJava](https://github.com/biojava/biojava). +A brief introduction into [BioJava](https://www.biojava.org). ----- The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava. @@ -12,7 +12,7 @@ The tutorial is intended to work with the most recent version of BioJava, althou ## Index -Quick [Installation](installation.md) +[Quick Installation](installation.md) Book 1: [The Core Module](core/README.md), basic working with sequences. @@ -28,16 +28,14 @@ Book 6: [The ModFinder Module](modfinder/README.md), identifying potein modifica ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. 
- -[view license](license.md) +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). ## Please Cite -**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) +**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) diff --git a/alignment/README.md b/alignment/README.md index 3ea8858..3f093fe 100644 --- a/alignment/README.md +++ b/alignment/README.md @@ -36,19 +36,16 @@ Chapter 5 - Reading and writing of multiple alignments Chapter 6 - BLAST - why you don't need BioJava for parsing BLAST -## Please cite - -**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) - - ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). + +## Please cite -[view license](../license.md) +**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) diff --git a/core/README.md b/core/README.md index 3638712..0fd20de 100644 --- a/core/README.md +++ b/core/README.md @@ -32,19 +32,16 @@ Chapter 3 - [Reading and Writing sequences](readwrite.md) Chapter 4 - [Translating](translating.md) DNA and protein sequences. -## Please cite - -**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) - - ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). + +## Please Cite -[view license](../license.md) +**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) diff --git a/genomics/README.md b/genomics/README.md index d5a8470..37c44d8 100644 --- a/genomics/README.md +++ b/genomics/README.md @@ -39,19 +39,16 @@ Chapter 5 - Reading [karyotype (cytoband)](karyotype.md) files Chapter 6 - Reading genomic DNA sequences using UCSC's [.2bit file format](twobit.md) -## Please cite - -**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) - - ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). + +## Please Cite -[view license](../license.md) +**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) diff --git a/modfinder/README.md b/modfinder/README.md index 202ff31..4226061 100644 --- a/modfinder/README.md +++ b/modfinder/README.md @@ -27,24 +27,21 @@ Chapter 3 - [How to identify protein modifications in a structure](identify-prot Chapter 4 - [How to define a new protein modification](add-protein-modification.md) -## Please cite +## License + +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). + +## Please Cite **BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank**
*Jianjiong Gao; Andreas Prlic; Chunxiao Bi; Wolfgang F. Bluhm; Dimitris Dimitropoulos; Dong Xu; Philip E. Bourne; Peter W. Rose*
[Bioinformatics. 2017 Feb 17.](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx101)
[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbtx101-blue.svg?style=flat)](https://doi.org/10.1093/bioinformatics/btx101) [![pubmed](http://img.shields.io/badge/pubmed-28334105-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/28334105) -**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) - - -## License - -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. - -[view license](../license.md) +**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) diff --git a/protein-disorder/README.md b/protein-disorder/README.md index 2238bb6..888ae04 100644 --- a/protein-disorder/README.md +++ b/protein-disorder/README.md @@ -92,18 +92,16 @@ Map ranges = Jronn.getDisorder(sequences); ``` -## Please cite - -**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) - ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). + +## Please Cite -[view license](../license.md) +**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) diff --git a/structure/README.md b/structure/README.md index 84df6be..03af437 100644 --- a/structure/README.md +++ b/structure/README.md @@ -64,22 +64,16 @@ Chapter 17 - [Special Cases](special.md) Chapter 18 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md) -### Author: - -[Andreas Prlić](https://github.com/andreasprlic) - -## Please cite - -**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) - ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). + +## Please Cite -[view license](../license.md) +**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) From abbe4c924a7815bc84bcaecdb9a40575c5753f64 Mon Sep 17 00:00:00 2001 From: Aleix Lafita Date: Tue, 7 May 2019 18:11:43 +0100 Subject: [PATCH 10/21] Add symmetry citation --- structure/symmetry.md | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/structure/symmetry.md b/structure/symmetry.md index 7404392..cfe5186 100644 --- a/structure/symmetry.md +++ b/structure/symmetry.md @@ -235,6 +235,15 @@ QuatSymmetryResults overallResults = QuatSymmetryDetector.getGlobalSymmetry(s, p See also the [test](https://github.com/biocryst/biojava/blob/df22da37a86a0dba3fb35bee7e17300d402ab469/biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/symmetry/TestQuatSymmetryDetectorExamples.java#L167-L192) provided in **BioJava** for a real case working example. +## Please Cite + +**Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm**
+*Spencer E Bliven, Aleix Lafita, Peter W Rose, Guido Capitani, Andreas Prlić, & Philip E Bourne*
+[PLOS Computational Biology (2019) 15 (4):e1006842.](https://journals.plos.org/ploscompbiol/article/citation?id=10.1371/journal.pcbi.1006842)
+[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006842-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006842) [![pubmed](https://img.shields.io/badge/pubmed-31009453-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/31009453) + + + --- From 85d620d06f30119e7558df44cefdb124a0fda3cd Mon Sep 17 00:00:00 2001 From: Aleix Lafita Date: Tue, 7 May 2019 18:25:59 +0100 Subject: [PATCH 11/21] Modify README --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index dba04c8..13a8f30 100644 --- a/README.md +++ b/README.md @@ -4,11 +4,11 @@ A brief introduction into [BioJava](https://www.biojava.org). ----- -The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava. +The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava. This tutorial is still under development, hence not yet comprehensive for the entire library. Please also check other sources of [documentation](https://biojava.org/wiki/Documentation). -At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wikis/BioJava:CookBook4.0) for a more comprehensive collection of examples about what is possible with BioJava and how to do things. +The examples within the tutorial are intended to work with the most recent version of BioJava, although most examples will work with BioJava 3.0 and higher. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems. -The tutorial is intended to work with the most recent version of BioJava, although most examples will work with BioJava 3.0 and higher. +The tutorial is subdivided into several books, corresponding to the respective BioJava modules. 
Each book is further subdivided into several chapters that intend to describe the main functionality of the module in order of increasing complexity. ## Index From fc9f288c58ca1d6eb0a0f33e90df86ded94aa801 Mon Sep 17 00:00:00 2001 From: Jose Duarte Date: Fri, 29 Oct 2021 16:49:01 -0700 Subject: [PATCH 12/21] Updates for biojava 6.0.0 release --- structure/firststeps.md | 53 +++++++++++++++-------------------------- structure/mmcif.md | 39 +++++------------------------- 2 files changed, 25 insertions(+), 67 deletions(-) diff --git a/structure/firststeps.md b/structure/firststeps.md index 8effe51..ef13be2 100644 --- a/structure/firststeps.md +++ b/structure/firststeps.md @@ -6,14 +6,10 @@ First Steps The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. ```java - public static void main(String[] args){ - try { - Structure structure = StructureIO.getStructure("4HHB"); - // and let's print out how many atoms are in this structure - System.out.println(StructureTools.getNrAtoms(structure)); - } catch (Exception e){ - e.printStackTrace(); - } + public static void main(String[] args) throws Exception { + Structure structure = StructureIO.getStructure("4HHB"); + // and let's print out how many atoms are in this structure + System.out.println(StructureTools.getNrAtoms(structure)); } ``` @@ -53,23 +49,17 @@ Talking about startup properties, it is also good to mention the fact that many If you have the *biojava-structure-gui* module installed, you can quickly visualise a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) via this: ```java - public static void main(String[] args){ - try { - - Structure struc = StructureIO.getStructure("4hhb"); - - StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol(); - - jmolPanel.setStructure(struc); - - // send some commands to Jmol - jmolPanel.evalString("select * ; color chain;"); - 
jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; "); - jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;"); - - } catch (Exception e){ - e.printStackTrace(); - } + public static void main(String[] args) throws Exception { + Structure struc = StructureIO.getStructure("4hhb"); + + StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol(); + + jmolPanel.setStructure(struc); + + // send some commands to Jmol + jmolPanel.evalString("select * ; color chain;"); + jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; "); + jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;"); } ``` @@ -91,15 +81,10 @@ This will result in the following view: By default many people work with the *asymmetric unit* of a protein. However for many studies the correct representation to look at is the *biological assembly* of a protein. You can request it by calling ```java - public static void main(String[] args){ - - try { - Structure structure = StructureIO.getBiologicalAssembly("1GAV"); - // and let's print out how many atoms are in this structure - System.out.println(StructureTools.getNrAtoms(structure)); - } catch (Exception e){ - e.printStackTrace(); - } + public static void main(String[] args) throws Exception { + Structure structure = StructureIO.getBiologicalAssembly("1GAV"); + // and let's print out how many atoms are in this structure + System.out.println(StructureTools.getNrAtoms(structure)); } ``` diff --git a/structure/mmcif.md b/structure/mmcif.md index 230488e..fc1b94d 100644 --- a/structure/mmcif.md +++ b/structure/mmcif.md @@ -44,8 +44,8 @@ By default BioJava is using the PDB file format for parsing data. 
In order to sw ```java AtomCache cache = new AtomCache(); - - cache.setUseMmCif(true); + + cache.setFiletype(StructureFiletype.CIF); // if you struggled to set the PDB_DIR property correctly in the previous step, // you could set it manually like this: @@ -67,13 +67,8 @@ StructureIO can also access files via URLs and fetch the data dynamically. E.g. ```java String u = "http://ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/divided/nw/4nwr-assembly1.cif.gz"; - try { - Structure s = StructureIO.getStructure(u); - - System.out.println(s); - } catch (Exception e) { - e.printStackTrace(); - } + Structure s = StructureIO.getStructure(u); + System.out.println(s); ``` ### Local URLs @@ -86,34 +81,12 @@ BioJava can also access local files, by specifying the URL as ## Low Level Access -If you want to learn how to use the BioJava mmCIF parser to populate your own data structure, let's first take a look this lower-level code: +You can load a BioJava `Structure` object using the ciftools-java parser with: ```java InputStream inStream = new FileInputStream(fileName); - - MMcifParser parser = new SimpleMMcifParser(); - - SimpleMMcifConsumer consumer = new SimpleMMcifConsumer(); - - // The Consumer builds up the BioJava - structure object. - // you could also hook in your own and build up you own data model. - parser.addMMcifConsumer(consumer); - - try { - parser.parse(new BufferedReader(new InputStreamReader(inStream))); - } catch (IOException e){ - e.printStackTrace(); - } - // now get the protein structure. - Structure cifStructure = consumer.getStructure(); -``` - -The parser operates similar to a XML parser by triggering "events". The [SimpleMMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifConsumer.html) listens to new categories being read from the file and then builds up the BioJava data model. 
- -To re-use the parser for your own datamodel, just implement the [MMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/MMcifConsumer.html) interface and add it to the [SimpleMMcifParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifParser.html). -```java - parser.addMMcifConsumer(myOwnConsumerImplementation); + Structure cifStructure = CifStructureConverter.fromInputStream(inStream); ``` ## I Loaded a Structure Object, What Now? From fcea68b6606a86e0a34e2d6c483c092c7f404748 Mon Sep 17 00:00:00 2001 From: Jose Duarte Date: Fri, 29 Oct 2021 16:59:11 -0700 Subject: [PATCH 13/21] Another update --- structure/mmcif.md | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/structure/mmcif.md b/structure/mmcif.md index fc1b94d..9fcd6a8 100644 --- a/structure/mmcif.md +++ b/structure/mmcif.md @@ -12,12 +12,15 @@ The mmCIF file format has been around for some time (see [Westbrook 2000][] and ## The Basics -BioJava provides you with both a mmCIF parser and a data model that reads PDB and mmCIF files into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](mmcif.md)). If you don't want to use that data model, you can still use BioJava's file parsers, and more on that later, let's start first with the most basic way of loading a protein structure. +BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then has its own data model that reads PDB and mmCIF files +into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](mmcif.md)). +If you don't want to use that data model, you can still use the CIFTools-java parser, please refer to its documentation. +Let's start first with the most basic way of loading a protein structure. 
## First Steps -The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. +The simplest way to load a PDBx/mmCIF file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. ```java Structure structure = StructureIO.getStructure("4HHB"); @@ -25,9 +28,7 @@ The simplest way to load a PDB file is by using the [StructureIO](http://www.bio System.out.println(StructureTools.getNrAtoms(structure)); ``` - - -BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things: +BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things: + BioJava can automatically download and install files locally + BioJava by default writes those files into a temporary location (The system temp directory "java.io.tmpdir"). @@ -38,9 +39,11 @@ If you already have a local PDB installation, you can configure where BioJava sh -DPDB_DIR=/wherever/you/want/ -## From PDB to mmCIF +## Switching AtomCache to use different file types -By default BioJava is using the PDB file format for parsing data. In order to switch it to use mmCIF, we can take control over the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations. +By default BioJava is using the BCIF file format for parsing data. In order to switch it to use mmCIF, we can take control over +the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which +manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations.
```java AtomCache cache = new AtomCache(); @@ -59,7 +62,7 @@ By default BioJava is using the PDB file format for parsing data. In order to sw System.out.println(structure.getChains().size()); ``` -As you can see, the AtomCache will again download the missing mmCIF file for 4HHB in the background. +See other supported file types in the `StructureFiletype` enum. ## URL based parsing of files From bd69ee89af8c36acae65ccc83ce9fcb6fea704a7 Mon Sep 17 00:00:00 2001 From: Jose Manuel Duarte Date: Fri, 29 Oct 2021 17:01:03 -0700 Subject: [PATCH 14/21] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 13a8f30..4f2cde8 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ A brief introduction into [BioJava](https://www.biojava.org). The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava. This tutorial is still under development, hence not yet comprehensive for the entire library. Please also check other sources of [documentation](https://biojava.org/wiki/Documentation). -The examples within the tutorial are intended to work with the most recent version of BioJava, although most examples will work with BioJava 3.0 and higher. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems. +The examples within the tutorial are intended to work with the most recent version of BioJava. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems. The tutorial is subdivided into several books, corresponding to the respective BioJava modules. Each book is further subdivided into several chapters that intend to describe the main functionality of the module in order of increasing complexity.
From 72eba19d108581389a17aa4968c2fedd309bae68 Mon Sep 17 00:00:00 2001 From: lemorai Date: Sat, 6 Nov 2021 15:16:38 +0100 Subject: [PATCH 15/21] Fix README license links --- core/README.md | 2 +- genomics/README.md | 2 +- modfinder/README.md | 2 +- protein-disorder/README.md | 2 +- structure/README.md | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) diff --git a/core/README.md b/core/README.md index 0fd20de..7995c81 100644 --- a/core/README.md +++ b/core/README.md @@ -34,7 +34,7 @@ Chapter 4 - [Translating](translating.md) DNA and protein sequences. ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). ## Please Cite diff --git a/genomics/README.md b/genomics/README.md index 37c44d8..a7ff27e 100644 --- a/genomics/README.md +++ b/genomics/README.md @@ -41,7 +41,7 @@ Chapter 6 - Reading genomic DNA sequences using UCSC's [.2bit file format](twobi ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). ## Please Cite diff --git a/modfinder/README.md b/modfinder/README.md index 4226061..ec8ed8c 100644 --- a/modfinder/README.md +++ b/modfinder/README.md @@ -29,7 +29,7 @@ Chapter 4 - [How to define a new protein modification](add-protein-modification. ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). 
## Please Cite diff --git a/protein-disorder/README.md b/protein-disorder/README.md index 888ae04..7bee8c3 100644 --- a/protein-disorder/README.md +++ b/protein-disorder/README.md @@ -94,7 +94,7 @@ Map ranges = Jronn.getDisorder(sequences); ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). ## Please Cite diff --git a/structure/README.md b/structure/README.md index 03af437..9552ebc 100644 --- a/structure/README.md +++ b/structure/README.md @@ -66,7 +66,7 @@ Chapter 18 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md) ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). 
## Please Cite From 59e83651b9ae998c22d54f1aae6f567af337258c Mon Sep 17 00:00:00 2001 From: lemorai Date: Sat, 6 Nov 2021 21:37:50 +0100 Subject: [PATCH 16/21] Fix broken examples --- alignment/smithwaterman.md | 2 +- core/readwrite.md | 2 +- core/translating.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/alignment/smithwaterman.md b/alignment/smithwaterman.md index 0f38bf6..5de8acf 100644 --- a/alignment/smithwaterman.md +++ b/alignment/smithwaterman.md @@ -36,7 +36,7 @@ public static void main(String[] args) throws Exception { } private static ProteinSequence getSequenceForId(String uniProtId) throws Exception { - URL uniprotFasta = new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtId)); + URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId)); ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId); System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader()); System.out.println(); diff --git a/core/readwrite.md b/core/readwrite.md index 1ab278b..81898b0 100644 --- a/core/readwrite.md +++ b/core/readwrite.md @@ -13,7 +13,7 @@ Here an example that parses a UniProt FASTA file into a protein sequence. 
```java public static ProteinSequence getSequenceForId(String uniProtId) throws Exception { - URL uniprotFasta = new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtId)); + URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId)); ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId); System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader()); System.out.println(); diff --git a/core/translating.md b/core/translating.md index 9b83643..10b953a 100644 --- a/core/translating.md +++ b/core/translating.md @@ -63,7 +63,7 @@ An example for how to parse a sequence from a String and using the Translation e // define the Ambiguity Compound Sets AmbiguityDNACompoundSet ambiguityDNACompoundSet = AmbiguityDNACompoundSet.getDNACompoundSet(); - CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getDNACompoundSet(); + CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getRNACompoundSet(); FastaReader proxy = new FastaReader( From 22cfb22e732306c5ef1014ae197b39911e2fe087 Mon Sep 17 00:00:00 2001 From: lemorai Date: Sun, 7 Nov 2021 17:13:12 +0100 Subject: [PATCH 17/21] Update/fix examples in structure-data-model --- structure/structure-data-model.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/structure/structure-data-model.md b/structure/structure-data-model.md index c8db2c0..6ea6ce4 100644 --- a/structure/structure-data-model.md +++ b/structure/structure-data-model.md @@ -28,7 +28,7 @@ Structure All `Structure` objects contain one or more `Models`. That means also X-ray structures contain a "virtual" model which serves as a container for the chains. This allows to represent multi-model X-ray structures, e.g. from time-series analysis. 
The most common way to access chains is via: ```java - List chains = structure.getChains(); + List chains = structure.getChains(); ``` This works for both NMR and X-ray based structures and by default the first `Model` is getting accessed. @@ -58,7 +58,7 @@ Here an example that loops over the whole data model and prints out the HEM grou for (Chain c : chains) { - System.out.println(" Chain: " + c.getChainID() + " # groups with atoms: " + c.getAtomGroups().size()); + System.out.println(" Chain: " + c.getId() + " # groups with atoms: " + c.getAtomGroups().size()); for (Group g: c.getAtomGroups()){ @@ -87,24 +87,24 @@ The [Group](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Group.htm In order to get all amino acids that have been observed in a PDB chain, you can use the following utility method: ```java - Chain chain = s.getChainByPDB("A"); - List groups = chain.getAtomGroups("amino"); + Chain chain = structure.getPolyChainByPDB("A"); + List groups = chain.getAtomGroups(GroupType.AMINOACID); for (Group group : groups) { - AminoAcid aa = (AminoAcid) group; + SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC); - // do something amino acid specific, e.g. 
print the secondary structure assignment - System.out.println(aa + " " + aa.getSecStruc()); + // print the secondary structure assignment + System.out.println(group + " -- " + secStrucInfo); } ``` In a similar way you can access all nucleotide groups by ```java - chain.getAtomGroups("nucleotide"); + chain.getAtomGroups(GroupType.NUCLEOTIDE); ``` The Hetatom groups are access in a similar fashion: ```java - chain.getAtomGroups("hetatm"); + chain.getAtomGroups(GroupType.HETATM); ``` @@ -112,10 +112,10 @@ Since all 3 types of groups are implementing the Group interface, you can also i ```java List allgroups = chain.getAtomGroups(); - for (Group group : groups) { - if ( group instanceof AminoAcid) { - AminoAcid aa = (AminoAcid) group; - System.out.println(aa.getSecStruc()); + for (Group group : allgroups) { + if (group.isAminoAcid()) { + SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC); + System.out.println(group + " -- " + secStrucInfo); } } ``` From ac793c4dfb0cade9e562476072df05466e2fade7 Mon Sep 17 00:00:00 2001 From: lemorai Date: Sun, 14 Nov 2021 19:35:09 +0100 Subject: [PATCH 18/21] Fixed documentation for parts of the structure chapter - Corrected typos - Fixed several links with a whitespace between their descriptions and link references - Removed description of PDB-wide database searches: its support was removed with commit biojava/biojava@13716a375d7eb1211ceb7cf6d43e35aaed252c9c because the RCSB PDB search API was discontinued in 2020 --- structure/alignment.md | 75 +++++++++--------------------------------- structure/caching.md | 4 +-- structure/mmcif.md | 2 +- structure/seqres.md | 9 +++-- 4 files changed, 22 insertions(+), 68 deletions(-) diff --git a/structure/alignment.md b/structure/alignment.md index 4f11c54..6053e4a 100644 --- a/structure/alignment.md +++ b/structure/alignment.md @@ -20,12 +20,12 @@ acid sequences converge on a common tertiary structure. 
A **structural alignment** of other biological polymers can also be made in BioJava. For example, nucleic acids can be structurally aligned to find common structural motifs, -independent of sequence simililarity. This is specially important for RNAs, because their +independent of sequence similarity. This is especially important for RNAs, because their 3D structure arrangement is important for their function. For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment). -## Alignment Algorithms supported by BioJava +## Alignment Algorithms Supported by BioJava BioJava comes with a number of algorithms for aligning structures. The following five options are displayed by default in the graphical user interface (GUI), @@ -45,9 +45,9 @@ in 3D. See below for descriptions of the algorithms. Since BioJava version 4.1.0, multiple structures can be compared at the same time in a **multiple structure alignment**, that can later be visualized in Jmol. The algorithm is described in detail below. As an overview, it uses any pairwise alignment -algorithm and a **reference** structure to per perform an alignment of all the structures. +algorithm and a **reference** structure to perform an alignment of all the structures. Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among -all the strucutures, identifying conserved **structural motifs**. +all the structures, identifying conserved **structural motifs**. ## Alignment User Interface @@ -91,7 +91,7 @@ This code shows the following user interface: ![Multiple Alignment GUI](img/multiple_gui.png) The input format is a free text field, where the structure identifiers are -indidcated, space separated. A **structure identifier** is a String that +indicated, space separated. A **structure identifier** is a String that uniquely identifies a structure. It is basically composed of the pdbID, the chain letters and the ranges of residues of each chain.
For the formal description visit [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html). @@ -125,12 +125,12 @@ The Combinatorial Extension (CE) algorithm was originally developed by 1998](http://peds.oxfordjournals.org/content/11/9/739.short) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/9796821). It works by identifying segments of the two structures with similar local structure, and then combining those to try to align the most residues possible -while keeping the overall RMSD of the superposition low. +while keeping the overall root-mean-square deviation (RMSD) of the superposition low. CE is a rigid-body alignment algorithm, which means that the structures being compared are kept fixed during superposition. In some cases it may be desirable to break large proteins up into domains prior to aligning them (by manually -inputing a subrange, using the [SCOP or CATH databases](externaldb.md), or by +inputting a subrange, using the [SCOP or CATH databases](externaldb.md), or by decomposing the protein automatically using the [Protein Domain Parser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/domain/LocalProteinDomainParser.html) algorithm). @@ -146,10 +146,8 @@ to the C-terminal part of the other, and vice versa. CE-CP allows circularly permuted proteins to be compared. For more information on circular permutations, see the [Wikipedia](http://en.wikipedia.org/wiki/Circular_permutation_in_proteins) or -[Molecule of the Month] -(http://www.pdb.org/pdb/101/motm.do?momID=124&evtc=Suggest&evta=Moleculeof%20the%20Month&evtl=TopBar) -articles [![pubmed] -(http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628). +[Molecule of the Month](https://pdb101.rcsb.org/motm/124) +articles [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628). 
For proteins without a circular permutation, CE-CP results look very similar to @@ -173,8 +171,7 @@ It performs similarly to CE for most structures. The 'rigid' flavor uses a rigid-body superposition and only considers alignments with matching sequence order. -BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid] -(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html) +BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html) ### FATCAT - flexible @@ -186,11 +183,9 @@ calmodulin with and without calcium bound can be much better aligned with FATCAT-flexible than with one of the rigid alignment algorithms. The downside of this is that it can lead to additional false positives in unrelated structures. -![(Left) Rigid and (Right) flexible alignments of -calmodulin](img/1cfd_1cll_fatcat.png) +![(Left) Rigid and (Right) flexible alignments of calmodulin](img/1cfd_1cll_fatcat.png) -BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible] -(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html) +BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html) ### Smith-Waterman @@ -204,8 +199,7 @@ locating gaps can lead to high RMSD in the resulting superposition due to a small number of badly aligned residues. However, this method is faster than the structure-based methods. -BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain] -(http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html) +BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html) ### Other methods @@ -250,43 +244,7 @@ by the pairwise alignment algorithm limitations. 
The algorithm performs similarly to other multiple structure alignment algorithms for most protein families. The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific use case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing any structure alignment. -BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain] -(www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html) - -## PDB-wide Database Searches - -The Alignment GUI also provides functionality for PDB-wide structural searches. -This systematically compares a structure against a non-redundant set of all -other structures in the PDB at either a chain or a domain level. Representatives -are selected using the RCSB's clustering of proteins with 40% sequence identity, -as described -[here](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp). -Domains are selected using either SCOP (when available) or the -ProteinDomainParser algorithm. - -![Database Search GUI](img/database_search.png) - -To perform a database search, select the 'Database Search' tab, then choose a -query structure based on PDB ID, SCOP domain id, or from a custom file. The -output directory will be used to store results. These consist of individual -alignments in compressed XML format, as well as a tab-delimited file of -similarity scores and statistics. The statistics are displayed in an interactive -results table, which allows the alignments to be sorted. The 'Align' column -allows individual alignments to be visualized with the alignment GUI. - -![Database Search Results](img/database_search_results.png) - -Be aware that this process can be very time consuming. 
Before -starting a manual search, it is worth considering whether a pre-computed result -may be available online, for instance for -[FATCAT-rigid](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp) -or [DALI](http://ekhidna.biocenter.helsinki.fi/dali/start). For custom files or -specific domains, a few optimizations can reduce the time for a database search. -Downloading PDB files is a considerable bottleneck. This can be solved by -downloading all PDB files from the [FTP -server](ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) and setting -the `PDB_DIR` environmental variable. This operation sped up the search from -about 30 hours to less than 4 hours. +BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html) ## Creating Alignments Programmatically @@ -363,8 +321,7 @@ MultipleAlignmentJmolDisplay.display(result); Many of the alignment algorithms are available in the form of command line tools. These can be accessed through the main methods of the StructureAlignment -classes. Tar bundles are also available with scripts for running -[CE and FATCAT](http://source.rcsb.org/jfatcatserver/download.jsp). +classes. Example: ```bash @@ -378,7 +335,7 @@ file in various formats. ## Alignment Data Model -For details about the structure alignment data models in biojava, see [Structure Alignment Data Model](alignment-data-model.md) +For details about the structure alignment data models in BioJava, see [Structure Alignment Data Model](alignment-data-model.md) ## Acknowledgements diff --git a/structure/caching.md b/structure/caching.md index e2da072..7be2be1 100644 --- a/structure/caching.md +++ b/structure/caching.md @@ -53,10 +53,8 @@ This example turns on the use of chemical components when loading a `Structure`. 
AtomCache cache = new AtomCache(); cache.setPath("/tmp/"); - + FileParsingParameters params = cache.getFileParsingParams(); - - params.setLoadChemCompInfo(true); StructureIO.setAtomCache(cache); diff --git a/structure/mmcif.md b/structure/mmcif.md index 9fcd6a8..769b851 100644 --- a/structure/mmcif.md +++ b/structure/mmcif.md @@ -13,7 +13,7 @@ The mmCIF file format has been around for some time (see [Westbrook 2000][] and ## The Basics BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then has its own data model that reads PDB and mmCIF files -into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](mmcif.md)). +into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](chemcomp.md)). If you don't want to use that data model, you can still use the CIFTools-java parser, please refer to its documentation. Let's start first with the most basic way of loading a protein structure. diff --git a/structure/seqres.md b/structure/seqres.md index db64971..2d03e04 100644 --- a/structure/seqres.md +++ b/structure/seqres.md @@ -5,12 +5,11 @@ How molecular sequences are linked to experimentally observed atoms. ## Sequences and Atoms -In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB oftein contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments). +In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB often contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments). 
-Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt. +Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of the regions that have been observed in an experiment and are available in the PDB, mapped to UniProt. -![Screenshot of Protein Feature View at RCSB] -(https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)") +![Screenshot of Protein Feature View at RCSB](https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)") As you can see, there are three PDB entries (PDB IDs [3LOH](http://www.rcsb.org/pdb/explore.do?structureId=3LOH), [2HR7](http://www.rcsb.org/pdb/explore.do?structureId=2RH7), [3BU3](http://www.rcsb.org/pdb/explore.do?structureId=3BU3)) that cover different regions of the UniProt sequence for the insulin receptor. @@ -18,7 +17,7 @@ The blue-boxes are regions for which atoms records are available. For the grey r ## Seqres and Atom Records -The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequences as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure. +The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequence as can be found in UniProt, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure. The **Atom** records provide coordinates where it was possible to observe them.
From 405084a0a3f34b8153f502692dd591caa4f7b816 Mon Sep 17 00:00:00 2001 From: josemduarte Date: Tue, 12 Dec 2023 09:41:57 -0800 Subject: [PATCH 19/21] Fixing links --- structure/contact-map.md | 4 ++-- structure/crystal-contacts.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/structure/contact-map.md b/structure/contact-map.md index 57b6818..bb9236d 100644 --- a/structure/contact-map.md +++ b/structure/contact-map.md @@ -9,7 +9,7 @@ Contacts are a useful tool to analyse protein structures. They simplify the 3-Di ## Getting the contact map of a protein chain -This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT): +This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT): ```java AtomCache cache = new AtomCache(); @@ -51,7 +51,7 @@ One can also find the contacting atoms between two protein chains. For instance ``` -See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above. +See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above. 
diff --git a/structure/crystal-contacts.md b/structure/crystal-contacts.md index cf1fcbe..f610610 100644 --- a/structure/crystal-contacts.md +++ b/structure/crystal-contacts.md @@ -11,7 +11,7 @@ Looking at crystal contacts can also be important in order to assess the quality ## Getting the set of unique contacts in the crystal lattice -This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT): +This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT): ```java AtomCache cache = new AtomCache(); @@ -42,7 +42,7 @@ The algorithm to find all unique interfaces in the crystal works roughly like th + Searches all cells around the original one by applying crystal translations, if any 2 chains in that search is found to contact then the new contact is added to the final list. + The search is performend without repeating redundant symmetry operators, making sure that if a contact is found then it is a unique contact. -See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above. +See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above. ## Clustering the interfaces One can also cluster the interfaces based on their similarity. The similarity is measured through contact overlap: number of common contacts over average number of contact in both chains. 
The clustering can be done as following: From 219af9a6b87b7b06e99a890b086442d3086825c6 Mon Sep 17 00:00:00 2001 From: Gary Murphy Date: Tue, 9 Jan 2024 12:00:32 -0600 Subject: [PATCH 20/21] Fixed some typos and broken links --- README.md | 2 +- structure/bioassembly.md | 2 +- structure/secstruc.md | 8 ++++---- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 4f2cde8..12924e3 100644 --- a/README.md +++ b/README.md @@ -24,7 +24,7 @@ Book 4: [The Genomics Module](genomics/README.md), working with genomic data. Book 5: [The Protein-Disorder Module](protein-disorder/README.md), predicting protein-disorder. -Book 6: [The ModFinder Module](modfinder/README.md), identifying potein modifications in 3D structures +Book 6: [The ModFinder Module](modfinder/README.md), identifying protein modifications in 3D structures ## License diff --git a/structure/bioassembly.md b/structure/bioassembly.md index ab667e5..de2c2c5 100644 --- a/structure/bioassembly.md +++ b/structure/bioassembly.md @@ -153,7 +153,7 @@ List bioAssemblies = StructureIO.getBiologicalAssemblies(pdbId); ## Further Reading -The RCSB PDB web site has a great [tutorial on Biological Assemblies](http://www.rcsb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/bioassembly_tutorial.html). +The RCSB PDB web site has a great [tutorial on Biological Assemblies](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies). diff --git a/structure/secstruc.md b/structure/secstruc.md index 7216d84..fbd0f94 100644 --- a/structure/secstruc.md +++ b/structure/secstruc.md @@ -10,8 +10,8 @@ Secondary structure can be formally defined by the pattern of hydrogen bonds of More specifically, the secondary structure is defined by the patterns of hydrogen bonds formed between amine hydrogen (-NH) and carbonyl oxygen (C=O) atoms contained in the backbone peptide bonds of the protein. 
-For more info see the Wikipedia article on [protein secondary structure] -(https://en.wikipedia.org/wiki/Protein_secondary_structure). +For more info see the Wikipedia article +on [protein secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure). ## Secondary Structure Annotation @@ -106,8 +106,8 @@ input Structure overriding any previous annotation, like in the DSSPParser. An e ssp.calculate(s, true); //true assigns the SS to the Structure ``` -BioJava Class: [org.biojava.nbio.structure.secstruc.SecStrucCalc] -(http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html) +BioJava Class: +[org.biojava.nbio.structure.secstruc.SecStrucCalc](http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html) ### Storage and Data Structures From 4ad8eb6c8267a6c2ae53e40651bbc0e9d0e8c82e Mon Sep 17 00:00:00 2001 From: Gary Murphy Date: Tue, 30 Jan 2024 07:35:49 -0600 Subject: [PATCH 21/21] Added an example for FastaStreamer --- core/readwrite.md | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/core/readwrite.md b/core/readwrite.md index 81898b0..432a419 100644 --- a/core/readwrite.md +++ b/core/readwrite.md @@ -79,6 +79,27 @@ BioJava can also be used to parse large FASTA files. The example below can parse } ``` +BioJava can also process large FASTA files using the Java streams API. 
+ +```java + FastaStreamer + .from(path) + .stream() + .forEach(sequence -> System.out.printf("%s -> %s\n", sequence.getOriginalHeader(), sequence.getSequenceAsString())); +``` + +If you need to specify a header parser other than `GenericFastaHeaderParser` or a sequence creator other than a +`ProteinSequenceCreator`, these can be specified before streaming the contents as follows: + +```java + FastaStreamer + .from(path) + .withHeaderParser(new PlainFastaHeaderParser<>()) + .withSequenceCreator(new CasePreservingProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())) + .stream() + .forEach(sequence -> System.out.printf("%s -> %s\n", sequence.getOriginalHeader(), sequence.getSequenceAsString())); +``` +