SimStore: New storage subsystem (part 1)#928

Merged
dwhswenson merged 176 commits into openpathsampling:master from dwhswenson:storage
Oct 15, 2020

Conversation

@dwhswenson
Member

For a while, I've been trying to rework our storage. This is because of several problems with the netcdfplus package:

  1. It doesn't seem to play nicely with dask, which I intend to use for parallelization
  2. It is heavily tied to netCDF, and kept breaking because netCDF changed its internals frequently.
  3. It has poor performance when loading data.
  4. @jhprinz is no longer actively developing it, so it is becoming hard to maintain.

This PR gives the first work toward SimStore, a new storage subsystem. For now, I'm developing this in the experimental subpackage, because the API may change. This will replace netcdfplus in OPS 2.0. The last release before OPS 2.0 will include tools to facilitate transition from netcdfplus to SimStore, although I don't guarantee that all data will be transferable.

Some advantages of the new approach:

  • Faster and more efficient storage, and easier debugging. Instead of recursive algorithms, I use iterative algorithms that group everything to be written to/loaded from a given table into a minimal number of calls to disk. I also distinguish between storage of data objects (of which there are many, but which are of predictable format) and storage of simulation objects (which are few in number, but of a wider variety). I use faster algorithms to store data objects, and more generic (but slower) algorithms to store simulation objects.
  • More flexible with regards to backends. For now, the new storage uses SQL as a backend, but changing that is as easy as creating a new backend model (and I intend to do that by adding a pytables-based hdf5 backend in the future.) The backend is completely independent from the rest of the storage subsystem. In netcdfplus, storage inherits from netcdf4. That makes changing the backend much more difficult.
  • Easier to add new types for JSON-based storage. The new storage uses a registration-based system for adding JSON types, whereas netcdfplus would require actually editing the core netcdfplus code. Now, a user can add another type (e.g., simtk.unit.Quantity) without having to edit the core code.
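As an illustration of the registration idea in the last bullet, here is a minimal sketch of a registry-based JSON encoder. The names (`register_codec`, `RegistryEncoder`, the stand-in `Quantity` class) are hypothetical, not the actual SimStore API; the point is that a user can register a new type without touching the core code:

```python
import json

CODECS = {}  # maps a Python type to (encode, decode) callables

def register_codec(cls, encode, decode):
    """Register to/from-dict converters for a custom type."""
    CODECS[cls] = (encode, decode)

class RegistryEncoder(json.JSONEncoder):
    def default(self, obj):
        # consult the registry before giving up on an unknown type
        for cls, (encode, _) in CODECS.items():
            if isinstance(obj, cls):
                return {"__class__": cls.__name__, **encode(obj)}
        return super().default(obj)

# A user adds support for a new type without editing core code:
class Quantity:  # stand-in for e.g. simtk.unit.Quantity
    def __init__(self, value, unit):
        self.value, self.unit = value, unit

register_codec(Quantity,
               encode=lambda q: {"value": q.value, "unit": q.unit},
               decode=lambda d: Quantity(d["value"], d["unit"]))

print(json.dumps(Quantity(1.5, "nanometer"), cls=RegistryEncoder))
# {"__class__": "Quantity", "value": 1.5, "unit": "nanometer"}
```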

In addition, as I go forward, I'm planning to implement it in a way that is designed to work with dask, and I'm also planning to focus on analysis performance.

This is a major undertaking, so in addition to this huge PR, there will be a number of PRs following to complete the storage.

  • Ability to store/reload arbitrary simulation objects
  • Ability to store/reload all data objects from a toy simulation
  • Custom tables for ensembles, volumes, etc.
  • SQL backend support
  • General framework for extensible JSON support; support for numpy arrays
  • Support for automatic registration of new snapshot types
  • Lazy proxy objects
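As background on the last item: a lazy proxy stands in for a stored object and defers loading it from disk until an attribute is first accessed. This is a generic sketch of the pattern, not the actual SimStore implementation (`storage.load` here is an assumed interface):

```python
class LazyProxy:
    """Placeholder for a stored object; loads it on first attribute access."""
    def __init__(self, storage, uuid):
        self._storage = storage
        self._uuid = uuid
        self._obj = None  # the real object, once loaded

    def _load(self):
        if self._obj is None:
            self._obj = self._storage.load(self._uuid)
        return self._obj

    def __getattr__(self, name):
        # only called for attributes not found on the proxy itself,
        # i.e., attributes belonging to the real object
        return getattr(self._load(), name)
```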

Coming up in future PRs:

  • New base class for CVs (which will give better analysis performance and will enable parallelization)
  • Support for OpenMM (static/kinetic containers; simtk.unit.Quantity)
  • Generic "tagged" objects
  • Complete unit tests

This will be the first package released in the experimental subpackage. The policy for the experimental subpackage is that any unit tests must run and pass, but they aren't counted toward overall coverage requirements. In addition, things in experimental and beta are not tested on Python 2. Unit tests have to be complete (nearly 100% coverage) before we move this into the beta subpackage.

As I've been developing this, I've used a number of notebooks as additional tests (mainly to ensure that I don't break things that had been working.) Those notebooks are now online in their own repo: https://github.com/dwhswenson/ops-storage-notebooks

I'm planning on putting better examples in that repo soon so that users have the necessary information to start playing with this.

@dwhswenson
Member Author

I'm about to do a bit of renaming here. Essentially, I want the experimental subpackage to somewhat mimic what the openpathsampling package will look like in 2.0, so that all you need to do to go from experimental to 2.0 is remove the experimental from the imports, even for things not in the root openpathsampling namespace.

Of the user-facing things, I believe this would just mean the following changes:

  • experimental.storage.sql_backend.SQLStorageBackend => experimental.simstore.SQLStorageBackend
  • experimental.storage.ops_storage.OPSStorage => experimental.storage.Storage (will be in root namespace in OPS 2.0, but experimental only holds subpackages)
  • experimental.storage.monkey_patches.monkey_patch_all available at experimental.storage.monkey_patch_all (etc for other monkey patches; also, the existing imports will still work, but a simpler one will be available.)

@sroet : Mainly, I want to warn you this change is coming. If there's a reason it would be a huge hassle, I can hold off for a bit. After that change, I think I'll mark this ready for review.

@sroet
Member

sroet commented Sep 24, 2020

Mainly, I want to warn you this change is coming. If there's a reason it would be a huge hassle, I can hold off for a bit. After that change, I think I'll mark this ready for review.

Don't worry about it. I think I should be proficient enough with git to lock to a specific commit hash if it would be a hassle ;)

@dwhswenson
Member Author

I've completed the renaming/shifting imports as proposed in #928 (comment).

As far as I'm concerned, this is now ready for review. It's not expected to be the complete implementation of the new storage, but a lot of first steps. It's already a large enough PR that it'll be hard for anyone to thoroughly review it, so I want this merged in so that subsequent steps can come in smaller doses.

That said, I'll leave this up for at least 10 days, just in case someone is brave (foolish) enough to want to give it a full review, and to provide sufficient time for other feedback. I will not merge this before Monday, 5 October, 2020.

@dwhswenson dwhswenson marked this pull request as ready for review September 25, 2020 10:41
@dwhswenson dwhswenson changed the title from "[WIP] SimStore: New storage subsystem (part 1)" to "SimStore: New storage subsystem (part 1)" Sep 25, 2020
@sroet
Member

sroet commented Sep 25, 2020

From me this is a +1,

I have not looked at the code, but:

  • I have used this code (specifically dwhswenson@f80eb26) to run multiple 100_000-step toy system calculations with custom objects.
  • I have used c3fbbe0 to analyze these simulations at reasonable speeds (for me it loads at about 22 steps/s, ~1.25 hours for 100_000 steps, while I never got the old (netcdf) storage to even open.)
  • To get to that point I had to manually create an SQL index, which was the reason for this comment; it was not too painful if required (takes about 15 min)
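For reference, the manual index creation mentioned above takes only a few lines with Python's built-in sqlite3. This demo runs on an in-memory database; for a real SimStore file you would connect to the .db path instead, and the table/column names below (`uuid`, `table_name`, `row`) are assumptions to check against the actual schema first:

```python
import sqlite3

# Demo on an in-memory DB; for a real file, use sqlite3.connect("path/to.db")
conn = sqlite3.connect(":memory:")
# Hypothetical UUID lookup table mimicking the schema being indexed:
conn.execute("CREATE TABLE uuid (uuid TEXT, table_name INTEGER, row INTEGER)")
# The actual fix: an index on the UUID column used for lookups
conn.execute("CREATE INDEX IF NOT EXISTS ix_uuid ON uuid (uuid)")
conn.commit()
# Confirm the index exists:
idx = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='index'"
).fetchall()
print(idx)  # [('ix_uuid',)]
```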

The two things that I have not checked and am a bit worried about:

  • I have not run a simulation with any code after dwhswenson@f80eb26, so I don't know if the "UUID is not an SQLIndex" issue is actually fixed.

  • I don't know if there is performance degradation when running long simulations because the storage cache is not cleared while running these simulations, but I do expect there to be (It used to scale as O(N), but might now be O(log(N)), with N being the number of objects in the cache).

Known issues that will be left for future PRs (@dwhswenson, please correct me here if I am promising things that you will not do 😉):

  • Loaded PathSimulator objects don't get initialized with the storage they are loaded from
  • Mutable attributes are not saved (whole different issue)
  • While indexing now works for both positive and negative indices (for example storage.steps[-1]), it can't be sliced yet (for example storage.steps[:5000])

@dwhswenson
Member Author

dwhswenson commented Sep 25, 2020

Known issues that will be left for future PRs

These are correct. The first one, storable storage, is already addressed in #929. Proper slicing of tables will probably come after #929, in a separate PR (keeping things smaller and easier to review). Storing mutable data is a larger problem, and may not happen until after 2.0 (but can be done in a way that is backward-compatible, so that's fine.)

Also, since this is in the experimental subpackage (where SemVer does not apply), let me take a moment to state a few ways to future-proof for anyone using this. I'll focus on creating/using the Storage object, since that's the main thing users will care about (i.e., only devs will care about adding new JSONCodecs or about how storage and backend communicate).

  • The examples currently show an approach of creating the backend directly with SQLStorageBackend(filename, mode) and then creating the storage with Storage.from_backend. This will probably remain the same. If you want to be even more future-proofed, starting with SimStore part 2: Storable functions (and more) #929, I think the approach of creating an SQLAlchemy engine => SQLBackend.from_engine => Storage.from_backend will be guaranteed. (In the future, there will be a single-step Storage(filename, mode) initialization.)
  • Saving objects, either singly or as a list, with storage.save() is guaranteed
  • Accessing true tables by single element (e.g., storage.steps[0]) or by iteration (e.g., for step in storage.steps) is guaranteed
  • Accessing pseudo-tables (views on the simulation objects table) by number (e.g., storage.cvs[0]), by name (e.g., storage.cvs['cv_x']) or by iteration (e.g., for cv in storage.cvs) is guaranteed. Note that not all pseudo-tables have been added yet; if you find one missing, let me know.
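Putting those guaranteed patterns together, typical usage would look like the following sketch (not runnable as-is; the filename, CV name, and saved objects are placeholders, and the import paths follow the experimental layout described earlier in this thread):

```python
from openpathsampling.experimental.simstore import SQLStorageBackend
from openpathsampling.experimental.storage import Storage

backend = SQLStorageBackend("output.db", mode='w')  # placeholder filename
storage = Storage.from_backend(backend)

storage.save(obj)            # save a single object...
storage.save([obj1, obj2])   # ...or a list of objects

step = storage.steps[0]      # true table: access by single element
for step in storage.steps:   # ...or by iteration
    ...

cv = storage.cvs[0]          # pseudo-table: by number
cv = storage.cvs['cv_x']     # ...by name (placeholder CV name)
for cv in storage.cvs:       # ...or by iteration
    ...
```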

I think these are the main ways people tend to interact with the storage. If there are other ways, please ask about them!

@sroet
Member

sroet commented Sep 30, 2020

Alright, I ran another simulation with 100_000 steps on commit fbe2b05.

I have not run a simulation with any code after dwhswenson/openpathsampling@f80eb26, so I don't know if the "UUID is not an SQLIndex" is actually fixed.

This has not been fixed and can severely degrade analysis performance, as mentioned in this comment.

Non-indexed SQL loads at 29.17 s/it, while the indexed one runs at 17.69 it/s (notice the swap of "it" and "s" in those units).
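To make the magnitude of that unit swap concrete, a quick calculation using the two numbers above:

```python
non_indexed_s_per_it = 29.17   # non-indexed: seconds per iteration
indexed_it_per_s = 17.69       # indexed: iterations per second

# speedup = (non-indexed time per step) / (indexed time per step)
speedup = non_indexed_s_per_it / (1 / indexed_it_per_s)
print(round(speedup))  # ~516x faster with the index
```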

I don't know if there is performance degradation when running long simulations because the storage cache is not cleared while running these simulations, but I do expect there to be (It used to scale as O(N), but might now be O(log(N)), with N being the number of objects in the cache).

The performance is still degrading somewhat, from 0.19 s/step on average at the start to 1.02 s/step on average at 100_000 steps.

@sroet
Member

sroet commented Oct 15, 2020

OK, I completed a new run with a2706cc. It finished 100_000 steps with an average speed of 0.45 s/step, which is already more than 2x faster than the previous run's final average of 1.02 s/step.

Working on Monte Carlo cycle number 10000
Running for 29 minutes 36 seconds -  0.18 seconds per step
--
Working on Monte Carlo cycle number 20000
Running for 1 hour 9 minutes 40 seconds -  0.21 seconds per step
--
Working on Monte Carlo cycle number 30000
Running for 2 hours 3 minutes 14 seconds -  0.25 seconds per step
--
Working on Monte Carlo cycle number 40000
Running for 3 hours 3 minutes 21 seconds -  0.28 seconds per step
--
Working on Monte Carlo cycle number 50000
Running for 4 hours 16 minutes 47 seconds -  0.31 seconds per step
--
Working on Monte Carlo cycle number 60000
Running for 5 hours 37 minutes 18 seconds -  0.34 seconds per step
--
Working on Monte Carlo cycle number 70000
Running for 7 hours 5 minutes 29 seconds -  0.36 seconds per step
--
Working on Monte Carlo cycle number 80000
Running for 8 hours 39 minutes 29 seconds -  0.39 seconds per step
--
Working on Monte Carlo cycle number 90000
Running for 10 hours 41 minutes 45 seconds -  0.43 seconds per step
--
Working on Monte Carlo cycle number 100000
Running for 12 hours 33 minutes 34 seconds -  0.45 seconds per step

It is actually still growing linearly (on average), but at such a slow rate that I think it is acceptable.

If we look at the average time over the last 10_000 steps, the speeds are (with slightly different rounding than above):

| steps (×10_000) | average (s/step) | average, last 10_000 | diff, last 10_000 |
|---|---|---|---|
| 1 | 0.18 | 0.18 | - |
| 2 | 0.21 | 0.24 | 0.06 |
| 3 | 0.25 | 0.33 | 0.09 |
| 4 | 0.28 | 0.37 | 0.04 |
| 5 | 0.31 | 0.43 | 0.06 |
| 6 | 0.34 | 0.49 | 0.06 |
| 7 | 0.37 | 0.55 | 0.06 |
| 8 | 0.40 | 0.61 | 0.06 |
| 9 | 0.43 | 0.67 | 0.06 |
| 10 | 0.46 | 0.73 | 0.06 |

This seems to indicate that the average time to add a step grows by 0.06 s for every 10_000 steps added. Given that this should still keep things sub-second for up to 130_000 steps, I am totally fine with this.
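That extrapolation can be sanity-checked with a quick sketch, using the ~0.18 s baseline and ~0.06 s growth per 10_000 steps from the table above (a rough linear model, not an exact fit):

```python
def last_10k_speed(steps):
    """Approximate s/step over the most recent 10_000 steps."""
    baseline = 0.18   # speed over the first 10_000 steps
    growth = 0.06     # added per additional 10_000 steps
    return baseline + growth * (steps // 10_000 - 1)

print(last_10k_speed(100_000))  # ~0.72, close to the measured 0.73
print(last_10k_speed(130_000))  # ~0.90, still sub-second
```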

  • Analysis runs as fast as you would expect with an indexed SQL file (36.38 it/s)

  • Appending seems to be as fast as before with an indexed SQL file (and in line with the last-10_000 speeds mentioned above)

Working on Monte Carlo cycle number 103100
Running for 38 minutes 24 seconds -  0.74 seconds per step
Estimated time remaining: 23 minutes 33 seconds

This solves all the issues for me.

@dwhswenson
Member Author

Thanks @sroet: I added an index on the UUIDs table, so this confirms that my approach works and that the index persists.

It may be worth indexing other tables as well, especially snapshots and (in #929) saved CV results.

IIRC your trajectories are also rather long, right? Order of 1000 snapshots? Your analysis speed is probably limited by needing to load all snapshots (from accepted trajectories, at least) -- once I finish #929, that will no longer be necessary (if you run the simulation with the stuff in #929).

With this performance issue dealt with, I'll go ahead and merge this in, and future performance improvements can be in other PRs.

@dwhswenson dwhswenson merged commit ace9d7e into openpathsampling:master Oct 15, 2020
@sroet
Member

sroet commented Oct 15, 2020

IIRC your trajectories are also rather long, right? Order of 1000 snapshots? Your analysis speed is probably limited by needing to load all snapshots (from accepted trajectories, at least) -- once I finish #929, that will no longer be necessary (if you run the simulation with the stuff in #929).

Normally between 100-1000 frames per trajectory.
