SimStore: New storage subsystem (part 1)#928
SimStore: New storage subsystem (part 1)#928dwhswenson merged 176 commits intoopenpathsampling:masterfrom
Conversation
can set up a Storage; still trouble storing objects to DB
|
I'm about to do a bit of renaming here. Essentially, I want the Of the user-facing things, I believe this would just mean the following changes:
@sroet : Mainly, I want to warn you this change is coming. If there's a reason it would be a huge hassle, I can hold off for a bit. After that change, I think I'll mark this ready for review. |
Don't worry about it. I think I should be proficient enough with |
Co-authored-by: sroet <sanderroet@hotmail.com>
|
I've completed the renaming/shifting imports as proposed in #928 (comment). As far as I'm concerned, this is now ready for review. It's not expected to be the complete implementation of the new storage, but a lot of first steps. It's already a large enough PR that it'll be hard for anyone to thoroughly review it, so I want this merged in so that subsequent steps can come in smaller doses. That said, I'll leave this up for at least 10 days, just in case someone is brave (foolish) enough to want to give it a full review, and to provide sufficient time for other feedback. I will not merge this before Monday, 5 October, 2020. |
|
From me this is a +1, I have not looked at the code, but:
The two things that I have not checked and am a bit worried about:
Know issues that will be left for future PRs (@dwhswenson please correct me here if I am promising things that you will not do 😉 ) :
|
These are correct. The first one, storable storage, is already addressed in #929. Proper slicing of tables will probably come after #929, in a separate PR (keeping things smaller and easier to review). Storing mutable data is a larger problem, and may not happen until after 2.0 (but can be done in a way that is backward-compatible, so that's fine.) Also, since this is in the
I think these are the main ways people tend to interact with the storage. If there are other ways, please ask about them! |
|
alright I ran another simulation with
This has not been fixed and can lead to detrimental analysis performance as mentioned in this comment Non-indexed sql loads with
The performance is still degrading somewhat, from on average |
|
Ok I completed a new run with It is actually still growing linearly (on average), but at such a slow rate I think it is acceptable. The if we look at the average time of the last
This seems to indicate that the average time to add a step grows by
This solves all the issues for me. |
|
Thanks @sroet : I added an index on the UUIDs table, so this confirms my approach for that works and that the index persists. It may be worth indexing other tables as well, especially snapshots and (in #929) saved CV results. IIRC your trajectories are also rather long, right? Order of 1000 snapshots? Your analysis speed is probably limited by needing to load all snapshots (from accepted trajectories, at least) -- once I finish #929, that will no longer be necessary (if you run the simulation with the stuff in #929). With this performance issue dealt with, I'll go ahead and merge this in, and future performance improvements can be in other PRs. |
Normally between |
For a while, I've been trying to rework our storage. This is because of several problems with the netcdfplus package:
This PR gives the first work toward SimStore, a new storage subsystem. For now, I'm developing this in the
experimentalsubpackage, because the API may change. This will replace netcdfplus in OPS 2.0. The last release before OPS 2.0 will include tools to facilitate transition from netcdfplus to SimStore, although I don't guarantee that all data will be transferable.Some advantages of the new approach:
In addition, as I go forward, I'm planning to implement it in a way that is designed to work with dask, and I'm also planning to focus on analysis performance.
This is a major undertaking, so in addition to this huge PR, there will be a number of PRs following to complete the storage.
ensembles,volumes, etc.Coming up in future PRs:
This will be the first package released in the
experimentalsubpackage. The policy for theexperimentalsubpackage is that any unit tests must run and pass, but they aren't counted toward overall coverage requirements. In addition, things inexperimentalandbetaare not tested on Python 2. Unit tests have to be complete (nearly 100% coverage) before we move this into thebetasubpackage.As I've been developing this, I've used a number of notebooks as additional tests (mainly to ensure that I don't break things that had been working.) Those notebooks are now online in their own repo: https://github.com/dwhswenson/ops-storage-notebooks
I'm planning on putting better examples in that repo soon so that users have the necessary information to start playing with this.