Trajectory Export: Multiple Steps#1176
Trajectory Export: Multiple Steps#1176dwhswenson wants to merge 6 commits intoopenpathsampling:masterfrom
Conversation
Dump steps as files, organized by raw (real files) and trial and active (symlinks). Works in principle as of this commit, but need to try it out in reality.
… into dump-trajectories
There was a problem hiding this comment.
Pull request overview
This PR implements trajectory export functionality for multiple Monte Carlo steps using symlinks to avoid data duplication. The key changes include adding file extension tracking to trajectory writers and creating a new SymLinkStepExporter class that exports raw trajectory data once and creates symlinks for trial and active samples.
- Added
extproperty to trajectory writer classes to track file extensions - Created
SymLinkStepExporterclass for exporting MC steps with symlink-based deduplication - Updated tests to pass the required
extparameter toMDTrajTrajectoryWriter
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| openpathsampling/exports/trajectories/core.py | Added abstract ext property to TrajectoryWriter base class and concrete implementation in SimStoreTrajectoryWriter |
| openpathsampling/exports/trajectories/mdtrajtrajectorywriter.py | Made ext a required parameter and added ext property to return it |
| openpathsampling/exports/trajectories/trrtrajectorywriter.py | Added ext property returning "trr" |
| openpathsampling/exports/steps/symlink_step_exporter.py | New class implementing step export with symlink deduplication and helper function export_steps |
| openpathsampling/tests/exports/steps/test_symlink_step_exporter.py | Comprehensive test suite for SymLinkStepExporter covering various MC move types |
| openpathsampling/tests/exports/steps/conftest.py | Import fixture from analysis tests for use in step export tests |
| openpathsampling/tests/exports/trajectories/test_mdtrajtrajectorywriter.py | Updated test calls to pass required ext="xtc" parameter |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ---------- | ||
| step : Step | ||
| The step containing the sample. | ||
| sample : Sample |
There was a problem hiding this comment.
The docstring parameter has inconsistent indentation. Line 115 "sample : Sample" has 7 leading spaces instead of 8 spaces to match line 113 "step : Step". The indentation should be consistent for all parameters in the docstring.
| sample : Sample | |
| sample : Sample |
| engines = collections.Counter([s.engine for s in | ||
| sample.trajectory]) |
There was a problem hiding this comment.
When self.writer is None and sample.trajectory is empty, this code will raise an IndexError on line 76 when trying to access engines.most_common(1)[0][0]. Consider adding validation to handle empty trajectories or documenting that trajectories must not be empty.
| engines = collections.Counter([s.engine for s in | |
| sample.trajectory]) | |
| if not sample.trajectory: | |
| raise ValueError( | |
| "Cannot determine writer: sample.trajectory is empty. " | |
| "Please provide a writer explicitly or ensure the trajectory is not empty." | |
| ) | |
| engines = collections.Counter([s.engine for s in sample.trajectory]) |
| for saving data that could be used in a restart. | ||
| """ | ||
| def __init__(self, mdtraj_selection=None): | ||
| def __init__(self, ext, mdtraj_selection=None): |
There was a problem hiding this comment.
The ext parameter in MDTrajTrajectoryWriter is now a required positional argument. Consider making it a keyword-only argument or providing a default value to maintain backward compatibility and improve API clarity. For example: def init(self, *, ext, mdtraj_selection=None) would make it clear that ext must be explicitly specified.
| def __init__(self, ext, mdtraj_selection=None): | |
| def __init__(self, *, ext, mdtraj_selection=None): |
There was a problem hiding this comment.
In the same vein; would this change break loading files with old writers? and should OPS deal with that?
| base_dir='.', | ||
| # TODO: add in a base using storage handlers, default to cwd | ||
| raw_data_pattern=_DEFAULT_RAW_DATA_PATTERN, | ||
| trial_pattern=_DEFAULT_TRIAL_PATTERN, | ||
| active_pattern=_DEFAULT_ACTIVE_PATTERN, | ||
| ): | ||
| self.writer = writer | ||
| self.base_dir = pathlib.Path(base_dir) |
There was a problem hiding this comment.
The base_dir parameter is stored in init but never used anywhere in the class. Either this parameter should be used to construct absolute paths (e.g., combining it with the patterns), or it should be removed if it's not needed. This could be confusing for users who expect base_dir to affect where files are written.
|
|
||
| from openpathsampling.exports.steps.symlink_step_exporter import ( | ||
| SymLinkStepExporter, | ||
| _DEFAULT_TRIAL_PATTERN, _DEFAULT_ACTIVE_PATTERN, |
There was a problem hiding this comment.
Import of '_DEFAULT_TRIAL_PATTERN' is not used.
Import of '_DEFAULT_ACTIVE_PATTERN' is not used.
| _DEFAULT_TRIAL_PATTERN, _DEFAULT_ACTIVE_PATTERN, |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1176 +/- ##
==========================================
+ Coverage 81.28% 81.37% +0.08%
==========================================
Files 147 148 +1
Lines 15807 15886 +79
==========================================
+ Hits 12849 12927 +78
- Misses 2958 2959 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
sroet
left a comment
There was a problem hiding this comment.
I am not into the code enough to do a proper review, but I did note some things that might behave wonky from what I could remember from OPS 😅
Other than the review comments; from the test coverage it seems like the ext property isn't actually used at the moment by the writers)
Feel free to of course ignore any/all of these comments, especially as I now see you didn't actually asked for review on this PR yet, but it felt like a waste to just remove the review comments...
| if sample.ensemble.is_named: | ||
| ensemble_id = sample.ensemble.name |
There was a problem hiding this comment.
Do we guarantee that ensemble names are unique and would that break the file storage? (sorry, it's been a while since I've looked at/used OPS)
| for saving data that could be used in a restart. | ||
| """ | ||
| def __init__(self, mdtraj_selection=None): | ||
| def __init__(self, ext, mdtraj_selection=None): |
There was a problem hiding this comment.
In the same vein; would this change break loading files with old writers? and should OPS deal with that?
This exports trajectories from a set of MC steps, using symlinks to prevent duplication. There will be snapshot duplication when two trajectories share snapshots (one-way-shooting) and coordinate duplication when two trajectories are reversed (path reversal), but full trajectories, which may not change across MC steps (e.g., shooting in only one ensemble leaves the others unchanged), are not duplicated.
The default file structure is, e.g.:
where, in that example, 100 is the step number. The files in
activeandtrialsare named by the friendly name of the ensemble, and are symlinks to data inraw_data, named by the UUID.I've also changed it so that tracking the extension is the job of the writer. This does mean that the writer is more complicated than just a bare function, but it feels to me like it makes more sense.