SimStore: New storage subsystem (part 1)#928

Merged
dwhswenson merged 176 commits into openpathsampling:master from dwhswenson:storage
Oct 15, 2020

Conversation

@dwhswenson
Member

For a while, I've been trying to rework our storage. This is because of several problems with the netcdfplus package:

  1. It doesn't seem to play nicely with dask, which I intend to use for parallelization
  2. It is heavily tied to netCDF, and kept breaking because netCDF changed its internals frequently.
  3. It has poor performance when loading data.
  4. @jhprinz is no longer actively developing it, so it is becoming hard to maintain.

This PR gives the first work toward SimStore, a new storage subsystem. For now, I'm developing this in the experimental subpackage, because the API may change. This will replace netcdfplus in OPS 2.0. The last release before OPS 2.0 will include tools to facilitate transition from netcdfplus to SimStore, although I don't guarantee that all data will be transferable.

Some advantages of the new approach:

  • Faster and more efficient storage, and easier debugging. Instead of recursive algorithms, I use iterative algorithms that group everything to be written to/loaded from a given table into a minimal number of calls to disk. I also distinguish between storage of data objects (of which there are many, but which are of predictable format) and storage of simulation objects (which are few in number, but of a wider variety). I use faster algorithms to store data objects, and more generic (but slower) algorithms to store simulation objects.
  • More flexible with regards to backends. For now, the new storage uses SQL as a backend, but changing that is as easy as creating a new backend model (and I intend to do that by adding a pytables-based hdf5 backend in the future.) The backend is completely independent from the rest of the storage subsystem. In netcdfplus, storage inherits from netcdf4. That makes changing the backend much more difficult.
  • Easier to add new types for JSON-based storage. The new storage uses a registration-based system for adding JSON types, whereas netcdfplus would require actually editing the core netcdfplus code. Now, a user can add another type (e.g., simtk.unit.Quantity) without having to edit the core code.
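As an illustration of the registration idea in the last bullet, here is a minimal sketch of a registry-based JSON encoder. The names (`register_codec`, `RegistryEncoder`, the stand-in `Quantity` class) are hypothetical, not the actual SimStore API; the point is that a user can register a new type without touching the core code:

```python
import json

CODECS = {}  # maps a Python type to (encode, decode) callables

def register_codec(cls, encode, decode):
    """Register to/from-dict converters for a custom type."""
    CODECS[cls] = (encode, decode)

class RegistryEncoder(json.JSONEncoder):
    def default(self, obj):
        # consult the registry before giving up on an unknown type
        for cls, (encode, _) in CODECS.items():
            if isinstance(obj, cls):
                return {"__class__": cls.__name__, **encode(obj)}
        return super().default(obj)

# A user adds support for a new type without editing core code:
class Quantity:  # stand-in for e.g. simtk.unit.Quantity
    def __init__(self, value, unit):
        self.value, self.unit = value, unit

register_codec(Quantity,
               encode=lambda q: {"value": q.value, "unit": q.unit},
               decode=lambda d: Quantity(d["value"], d["unit"]))

print(json.dumps(Quantity(1.5, "nanometer"), cls=RegistryEncoder))
# {"__class__": "Quantity", "value": 1.5, "unit": "nanometer"}
```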

In addition, as I go forward, I'm planning to implement it in a way that is designed to work with dask, and I'm also planning to focus on analysis performance.

This is a major undertaking, so in addition to this huge PR, there will be a number of PRs following to complete the storage.

  • Ability to store/reload arbitrary simulation objects
  • Ability to store/reload all data objects from a toy simulation
  • Custom tables for ensembles, volumes, etc.
  • SQL backend support
  • General framework for extensible JSON support; support for numpy arrays
  • Support for automatic registration of new snapshot types
  • Lazy proxy objects
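As background on the last item: a lazy proxy stands in for a stored object and defers loading it from disk until an attribute is first accessed. This is a generic sketch of the pattern, not the actual SimStore implementation (`storage.load` here is an assumed interface):

```python
class LazyProxy:
    """Placeholder for a stored object; loads it on first attribute access."""
    def __init__(self, storage, uuid):
        self._storage = storage
        self._uuid = uuid
        self._obj = None  # the real object, once loaded

    def _load(self):
        if self._obj is None:
            self._obj = self._storage.load(self._uuid)
        return self._obj

    def __getattr__(self, name):
        # only called for attributes not found on the proxy itself,
        # i.e., attributes belonging to the real object
        return getattr(self._load(), name)
```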

Coming up in future PRs:

  • New base class for CVs (which will give better analysis performance and will enable parallelization)
  • Support for OpenMM (static/kinetic containers; simtk.unit.Quantity)
  • Generic "tagged" objects
  • Complete unit tests

This will be the first package released in the experimental subpackage. The policy for the experimental subpackage is that any unit tests must run and pass, but they aren't counted toward overall coverage requirements. In addition, things in experimental and beta are not tested on Python 2. Unit tests have to be complete (nearly 100% coverage) before we move this into the beta subpackage.

As I've been developing this, I've used a number of notebooks as additional tests (mainly to ensure that I don't break things that had been working.) Those notebooks are now online in their own repo: https://github.com/dwhswenson/ops-storage-notebooks

I'm planning on putting better examples in that repo soon so that users have the necessary information to start playing with this.

@dwhswenson
Member Author

I'm about to do a bit of renaming here. Essentially, I want the experimental subpackage to somewhat mimic what the openpathsampling package will look like in 2.0, so that all you need to do to go from experimental to 2.0 is remove the experimental from the imports, even for things not in the root openpathsampling namespace.

Of the user-facing things, I believe this would just mean the following changes:

  • experimental.storage.sql_backend.SQLStorageBackend => experimental.simstore.SQLStorageBackend
  • experimental.storage.ops_storage.OPSStorage => experimental.storage.Storage (will be in root namespace in OPS 2.0, but experimental only holds subpackages)
  • experimental.storage.monkey_patches.monkey_patch_all available at experimental.storage.monkey_patch_all (etc for other monkey patches; also, the existing imports will still work, but a simpler one will be available.)

@sroet : Mainly, I want to warn you this change is coming. If there's a reason it would be a huge hassle, I can hold off for a bit. After that change, I think I'll mark this ready for review.

@sroet
Member

sroet commented Sep 24, 2020

Mainly, I want to warn you this change is coming. If there's a reason it would be a huge hassle, I can hold off for a bit. After that change, I think I'll mark this ready for review.

Don't worry about it. I think I should be proficient enough with git to lock to a specific commit hash if it would be a hassle ;)

@dwhswenson
Member Author

I've completed the renaming/shifting imports as proposed in #928 (comment).

As far as I'm concerned, this is now ready for review. It's not expected to be the complete implementation of the new storage, but a lot of first steps. It's already a large enough PR that it'll be hard for anyone to thoroughly review it, so I want this merged in so that subsequent steps can come in smaller doses.

That said, I'll leave this up for at least 10 days, just in case someone is brave (foolish) enough to want to give it a full review, and to provide sufficient time for other feedback. I will not merge this before Monday, 5 October, 2020.

@dwhswenson dwhswenson marked this pull request as ready for review September 25, 2020 10:41
@dwhswenson dwhswenson changed the title from "[WIP] SimStore: New storage subsystem (part 1)" to "SimStore: New storage subsystem (part 1)" Sep 25, 2020
@sroet
Member

sroet commented Sep 25, 2020

From me this is a +1,

I have not looked at the code, but:

  • I have used this code (specifically dwhswenson@f80eb26) to run multiple 100_000-step toy system calculations with custom objects.
  • I have used c3fbbe0 to analyze these simulations at reasonable speeds (for me it loads at about 22 steps/s, ~1.25 hours for 100_000 steps, while I never got the old (netcdf) storage to even open.)
  • To get to that point I had to manually create an SQL index, which was the reason for this comment; it was not too painful if required (takes about 15 min)
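For reference, the manual index creation mentioned above takes only a few lines with Python's built-in sqlite3. This demo runs on an in-memory database; for a real SimStore file you would connect to the .db path instead, and the table/column names below (`uuid`, `table_name`, `row`) are assumptions to check against the actual schema first:

```python
import sqlite3

# Demo on an in-memory DB; for a real file, use sqlite3.connect("path/to.db")
conn = sqlite3.connect(":memory:")
# Hypothetical UUID lookup table mimicking the schema being indexed:
conn.execute("CREATE TABLE uuid (uuid TEXT, table_name INTEGER, row INTEGER)")
# The actual fix: an index on the UUID column used for lookups
conn.execute("CREATE INDEX IF NOT EXISTS ix_uuid ON uuid (uuid)")
conn.commit()
# Confirm the index exists:
idx = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='index'"
).fetchall()
print(idx)  # [('ix_uuid',)]
```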

The two things that I have not checked and am a bit worried about:

  • I have not run a simulation with any code after dwhswenson@f80eb26, so I don't know if the "UUID is not an SQLIndex" issue is actually fixed.

  • I don't know if there is performance degradation when running long simulations because the storage cache is not cleared while running these simulations, but I do expect there to be (It used to scale as O(N), but might now be O(log(N)), with N being the number of objects in the cache).

Known issues that will be left for future PRs (@dwhswenson, please correct me here if I am promising things that you will not do 😉):

  • Loaded PathSimulator objects don't get initialized with the storage they are loaded from
  • Mutable attributes are not saved (whole different issue)
  • While indexing now works for both positive and negative indices (for example storage.steps[-1]), it can't be sliced yet (for example storage.steps[:5000])

@dwhswenson
Member Author

dwhswenson commented Sep 25, 2020

Known issues that will be left for future PRs

These are correct. The first one, storable storage, is already addressed in #929. Proper slicing of tables will probably come after #929, in a separate PR (keeping things smaller and easier to review). Storing mutable data is a larger problem, and may not happen until after 2.0 (but can be done in a way that is backward-compatible, so that's fine.)

Also, since this is in the experimental subpackage (where SemVer does not apply), let me take a moment to state a few ways to future-proof for anyone using this. I'll focus on creating/using the Storage object, since that's the main thing users will care about (i.e., only devs will care about adding new JSONCodecs or about how storage and backend communicate).

  • The examples currently show an approach of creating the backend directly with SQLStorageBackend(filename, mode) and then creating the storage with Storage.from_backend. This will probably remain the same. If you want to be even more future-proofed, starting with SimStore part 2: Storable functions (and more) #929, I think the approach of creating an SQLAlchemy engine => SQLBackend.from_engine => Storage.from_backend will be guaranteed. (In the future, there will be a single-step Storage(filename, mode) initialization.)
  • Saving objects, either singly or as a list, with storage.save() is guaranteed
  • Accessing true tables by single element (e.g., storage.steps[0]) or by iteration (e.g., for step in storage.steps) is guaranteed
  • Accessing pseudo-tables (views on the simulation objects table) by number (e.g., storage.cvs[0]), by name (e.g., storage.cvs['cv_x']) or by iteration (e.g., for cv in storage.cvs) is guaranteed. Note that not all pseudo-tables have been added yet; if you find one missing, let me know.
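Putting those guaranteed patterns together, typical usage would look like the following sketch (not runnable as-is; the filename, CV name, and saved objects are placeholders, and the import paths follow the experimental layout described earlier in this thread):

```python
from openpathsampling.experimental.simstore import SQLStorageBackend
from openpathsampling.experimental.storage import Storage

backend = SQLStorageBackend("output.db", mode='w')  # placeholder filename
storage = Storage.from_backend(backend)

storage.save(obj)            # save a single object...
storage.save([obj1, obj2])   # ...or a list of objects

step = storage.steps[0]      # true table: access by single element
for step in storage.steps:   # ...or by iteration
    ...

cv = storage.cvs[0]          # pseudo-table: by number
cv = storage.cvs['cv_x']     # ...by name (placeholder CV name)
for cv in storage.cvs:       # ...or by iteration
    ...
```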

I think these are the main ways people tend to interact with the storage. If there are other ways, please ask about them!

@sroet
Member

sroet commented Sep 30, 2020

Alright, I ran another simulation with 100_000 steps on commit fbe2b05.

I have not run a simulation with any code after dwhswenson/openpathsampling@f80eb26, so I don't know if the "UUID is not an SQLIndex" is actually fixed.

This has not been fixed and can severely degrade analysis performance, as mentioned in this comment.

Non-indexed SQL loads at 29.17 s/it, while the indexed one runs at 17.69 it/s (notice the swap of "it" and "s" in those units).
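To make the magnitude of that unit swap concrete, a quick calculation using the two numbers above:

```python
non_indexed_s_per_it = 29.17   # non-indexed: seconds per iteration
indexed_it_per_s = 17.69       # indexed: iterations per second

# speedup = (non-indexed time per step) / (indexed time per step)
speedup = non_indexed_s_per_it / (1 / indexed_it_per_s)
print(round(speedup))  # ~516x faster with the index
```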

I don't know if there is performance degradation when running long simulations because the storage cache is not cleared while running these simulations, but I do expect there to be (It used to scale as O(N), but might now be O(log(N)), with N being the number of objects in the cache).

The performance is still degrading somewhat, from 0.19 s/step on average at the start to 1.02 s/step on average at 100_000 steps.

@sroet
Member

sroet commented Oct 15, 2020

OK, I completed a new run with a2706cc. It finished 100_000 steps with an average speed of 0.45 s/step, which is already more than 2x faster than the previous run's final average of 1.02 s/step.

Working on Monte Carlo cycle number 10000
Running for 29 minutes 36 seconds -  0.18 seconds per step
--
Working on Monte Carlo cycle number 20000
Running for 1 hour 9 minutes 40 seconds -  0.21 seconds per step
--
Working on Monte Carlo cycle number 30000
Running for 2 hours 3 minutes 14 seconds -  0.25 seconds per step
--
Working on Monte Carlo cycle number 40000
Running for 3 hours 3 minutes 21 seconds -  0.28 seconds per step
--
Working on Monte Carlo cycle number 50000
Running for 4 hours 16 minutes 47 seconds -  0.31 seconds per step
--
Working on Monte Carlo cycle number 60000
Running for 5 hours 37 minutes 18 seconds -  0.34 seconds per step
--
Working on Monte Carlo cycle number 70000
Running for 7 hours 5 minutes 29 seconds -  0.36 seconds per step
--
Working on Monte Carlo cycle number 80000
Running for 8 hours 39 minutes 29 seconds -  0.39 seconds per step
--
Working on Monte Carlo cycle number 90000
Running for 10 hours 41 minutes 45 seconds -  0.43 seconds per step
--
Working on Monte Carlo cycle number 100000
Running for 12 hours 33 minutes 34 seconds -  0.45 seconds per step

It is actually still growing linearly (on average), but at such a slow rate that I think it is acceptable.

If we look at the average time over the last 10_000 steps, the speeds are (with slightly different rounding than above):

| steps (×10_000) | average (s/step) | average, last 10_000 | diff, last 10_000 |
|---|---|---|---|
| 1 | 0.18 | 0.18 | - |
| 2 | 0.21 | 0.24 | 0.06 |
| 3 | 0.25 | 0.33 | 0.09 |
| 4 | 0.28 | 0.37 | 0.04 |
| 5 | 0.31 | 0.43 | 0.06 |
| 6 | 0.34 | 0.49 | 0.06 |
| 7 | 0.37 | 0.55 | 0.06 |
| 8 | 0.40 | 0.61 | 0.06 |
| 9 | 0.43 | 0.67 | 0.06 |
| 10 | 0.46 | 0.73 | 0.06 |

This seems to indicate that the average time to add a step grows by 0.06 s for every 10_000 steps added. Given that this should still keep things sub-second for up to 130_000 steps, I am totally fine with this.
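That extrapolation can be sanity-checked with a quick sketch, using the ~0.18 s baseline and ~0.06 s growth per 10_000 steps from the table above (a rough linear model, not an exact fit):

```python
def last_10k_speed(steps):
    """Approximate s/step over the most recent 10_000 steps."""
    baseline = 0.18   # speed over the first 10_000 steps
    growth = 0.06     # added per additional 10_000 steps
    return baseline + growth * (steps // 10_000 - 1)

print(last_10k_speed(100_000))  # ~0.72, close to the measured 0.73
print(last_10k_speed(130_000))  # ~0.90, still sub-second
```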

  • Analysis runs as fast as you would expect with an indexed SQL file (36.38 it/s)

  • Appending seems to be as fast as before with an indexed SQL file (and in line with the last-10_000 speeds mentioned above)

Working on Monte Carlo cycle number 103100
Running for 38 minutes 24 seconds -  0.74 seconds per step
Estimated time remaining: 23 minutes 33 seconds

This solves all the issues for me.

@dwhswenson
Member Author

Thanks @sroet: I added an index on the UUIDs table, so this confirms that my approach works and that the index persists.

It may be worth indexing other tables as well, especially snapshots and (in #929) saved CV results.

IIRC your trajectories are also rather long, right? Order of 1000 snapshots? Your analysis speed is probably limited by needing to load all snapshots (from accepted trajectories, at least) -- once I finish #929, that will no longer be necessary (if you run the simulation with the stuff in #929).

With this performance issue dealt with, I'll go ahead and merge this in, and future performance improvements can be in other PRs.

@dwhswenson dwhswenson merged commit ace9d7e into openpathsampling:master Oct 15, 2020
@sroet
Member

sroet commented Oct 15, 2020

IIRC your trajectories are also rather long, right? Order of 1000 snapshots? Your analysis speed is probably limited by needing to load all snapshots (from accepted trajectories, at least) -- once I finish #929, that will no longer be necessary (if you run the simulation with the stuff in #929).

Normally between 100-1000 frames per trajectory.
