[ENH] V1 → V2 API Migration - core structure #1576
geetu040 wants to merge 93 commits into openml:main
Conversation
Codecov Report — ❌ Patch coverage report, with details and impacted files:

```
@@           Coverage Diff            @@
##             main    #1576     +/- ##
==========================================
+ Coverage   52.79%   53.29%   +0.49%
==========================================
  Files          37       65      +28
  Lines        4362     5070     +708
==========================================
+ Hits         2303     2702     +399
- Misses       2059     2368     +309
```

View full report in Codecov by Sentry.
openml/_api/config.py
Outdated
```python
cache: CacheConfig

settings = Settings(
```
I would move the settings to the individual classes. This design introduces too high a coupling of the classes to this file: you cannot move the classes around, or add a new API version, without making non-extensible changes here, because APISettings will require a constructor change for every new class it accepts.
Instead, a better design is to apply the strategy pattern cleanly to the different API definitions - v1 and v2 - and move the config either to their __init__, or a set_config (or similar) method.
I have updated the Config class, which can handle multiple API versions in the future. A constructor change can still be expected, not for API versions but for new config values; let me know if that is still problematic.
fkiraly left a comment
Overall really great, I have a design suggestion related to the configs.
The config.py file and the coupling on it breaks an otherwise nice strategy pattern.
I recommend following the strategy pattern cleanly instead, and moving the configs into the class instances, see above.
This will make the backend API much more extensible and cohesive.
Overall I agree with the suggested changes. This seems like a reasonable way to provide a unified interface for two different backends, and also separate out some concerns that were previously coupled or scattered more than they should (e.g., caching, configurations).
My main concern is with the change to caching behavior. I have a minor concern over the indirection APIContext introduces (perhaps I misunderstood its purpose), and the introduction of allowing Response return values.
In my comments you will also find some things that may already have been "wrong" in the old implementation. In those cases, I think it simply makes sense to make the change now, so I repeat them here for convenience.
is this ready to review or to merge from your side, @geetu040? This is a large PR and it seems to be moving still.
This PR is ready for review. I was incorporating feedback from the stacked PRs, but it's complete now, though it may be updated if new comments come in.
PGijsbers left a comment
Sorry, this is taking a bit longer than I had wanted, so I am just leaving my first remarks here. Also left a reply on the ttl/cache discussion.
openml/_api/clients/http.py
Outdated
```python
use_cache : bool, optional
    Whether to load/store responses from cache.
reset_cache : bool, optional
    If True, bypass existing cache entries.
```
Please rename the parameters. The description is incomplete though more accurate than the current name reset_cache: the cache is bypassed on load, not reset. The current cache is unaffected, which is unexpected to me for a parameter "reset_cache".
I would find it much more intuitive if the parameters are store_to_cache and load_from_cache, to indicate if a response will be attempted to be saved to or loaded from the cache.
I agree that the naming and description is not very clean, I will update. How about this:
```python
enable_cache : bool, optional
    Whether to load/store the response from cache.
refresh_cache : bool, optional
    If True, bypass the existing cache entry and overwrite it with the new response.
    This parameter is ignored if `enable_cache=False`.
```
And the internal implementation stays the same. This fulfills the SDK requirements for cache handling.
I still think store_to_cache and load_from_cache are not well suited for the SDK: that shifts some logic handling to the SDK, whereas here the logic stays with the client and the SDK only requests a specific behaviour.
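For illustration, a minimal sketch of the proposed `enable_cache` / `refresh_cache` semantics (hypothetical names and a plain dict standing in for the real cache; the actual client is more involved):

```python
def get(url, cache, fetch, enable_cache=True, refresh_cache=False):
    """Sketch: the client decides when to load/store; callers only set flags."""
    if enable_cache and not refresh_cache:
        if url in cache:           # load from cache unless a refresh is forced
            return cache[url]
    response = fetch(url)
    if enable_cache:
        cache[url] = response      # a refresh overwrites the stale entry
    return response

cache, calls = {}, []

def fetch(url):
    calls.append(url)              # record real "network" calls
    return f"body:{url}"

get("/a", cache, fetch)                       # miss: fetch + store
get("/a", cache, fetch)                       # hit: no network call
get("/a", cache, fetch, refresh_cache=True)   # bypass + overwrite
print(len(calls))  # → 2
```

The point of the design is visible in the call sites: the caller never touches the cache directly, it only requests a behaviour.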
```python
@@ -0,0 +1,33 @@
from __future__ import annotations

from enum import Enum
```
Any reason for using StrEnum?
I am not using StrEnum, are you asking why not?
I switched to this only because of the mypy errors in pre-commit checks.
```python
from enum import StrEnum
# error: Module "enum" has no attribute "StrEnum"  [attr-defined]

api_version: APIVersion = APIVersion.V1
# error: incompatible type "str"; expected "APIVersion"
```

I can ignore the [attr-defined] error and change the type hints to str, but it did not feel good to have str in the type hints instead of the enum.
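For context: enum.StrEnum was only added in Python 3.11, which is why mypy flags it when targeting older versions. A minimal sketch of the common workaround, a str-mixin Enum whose members compare equal to plain strings:

```python
from enum import Enum

class APIVersion(str, Enum):
    """Behaves like StrEnum on Python < 3.11: members are real strings."""
    V1 = "v1"
    V2 = "v2"

print(APIVersion.V1 == "v1")            # → True
print(isinstance(APIVersion.V2, str))   # → True
```

With this mixin, annotations like `api_version: APIVersion` type-check while string comparisons keep working.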
Previously, multiple tests were publishing the same task concurrently, which increased the likelihood of race conditions and flaky failures. This update replaces real HTTP post calls with mocks, making the tests deterministic and isolated from the server.
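A minimal sketch of that mocking pattern (hypothetical function and client names, using unittest.mock):

```python
from unittest import mock

def publish_task(client):
    # stand-in for the code under test; posts a task to the server
    return client.post("/task", data={"name": "demo"})

def test_publish_task_uses_mock():
    client = mock.Mock()
    client.post.return_value = {"id": 42}   # no real HTTP request is made
    result = publish_task(client)
    client.post.assert_called_once_with("/task", data={"name": "demo"})
    assert result == {"id": 42}

test_publish_task_uses_mock()
```

Because the mock replaces the network boundary, concurrent test runs can no longer race on a shared server-side task.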
Pull request overview
This PR establishes the foundational architecture for migrating the OpenML Python SDK from API v1 to v2. It introduces a modular, abstraction-based structure that supports both API versions simultaneously with transparent fallback capabilities.
Changes:
- New HTTP client with caching, retry logic, and configurable retry policies
- Abstract resource interfaces (DatasetAPI, TaskAPI, etc.) with versioned implementations (V1API, V2API) and FallbackProxy for automatic fallback
- APIBackend singleton for centralized resource access with configuration synchronization between legacy and new API systems
Reviewed changes
Copilot reviewed 36 out of 37 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| openml/enums.py | New enums for API versions, resource types, and retry policies |
| openml/exceptions.py | Added OpenMLNotSupportedError for unsupported v2 operations |
| openml/_api/__init__.py | Main exports for new API module (contains duplicate entries) |
| openml/_api/clients/http.py | HTTP client with caching, retries, checksum validation (has duplicate methods and TTL doc issues) |
| openml/_api/clients/minio.py | MinIO client configuration stub |
| openml/_api/clients/utils.py | Retry delay calculation utilities |
| openml/_api/resources/base/base.py | Abstract ResourceAPI base class (doc/signature mismatch) |
| openml/_api/resources/base/fallback.py | FallbackProxy for automatic API fallback (attribute resolution issue) |
| openml/_api/resources/base/resources.py | Resource-specific abstract API classes |
| openml/_api/resources/base/versions.py | V1 and V2 API implementations (return type issues) |
| openml/_api/resources/*.py | Resource-specific V1/V2 implementation stubs |
| openml/_api/resources/_registry.py | Registry mapping API versions to resource implementations |
| openml/_api/setup/config.py | Configuration dataclasses for API setup |
| openml/_api/setup/backend.py | APIBackend singleton (thread-safety issue) |
| openml/_api/setup/builder.py | Builder for constructing API backend from config |
| openml/_api/setup/_utils.py | Cache directory resolution logic (tilde expansion and path return issues) |
| openml/config.py | Added _sync_api_config for synchronizing legacy config with new API |
| openml/testing.py | New TestAPIBase class for API testing infrastructure |
| tests/test_api/*.py | Tests for HTTP client, resource versioning, and fallback |
| tests/conftest.py, tests/test_datasets/*.py | Updated tests to call _sync_api_config after config changes |
```python
if cls._instance is None:
    cls._instance = cls()
return cls._instance
```
The singleton implementation in get_instance is not thread-safe. In a multi-threaded environment, multiple threads could simultaneously check cls._instance is None and create multiple instances. Consider using a threading lock or the @functools.lru_cache decorator with maxsize=1 to ensure thread-safe singleton behavior.
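A sketch of the double-checked-locking variant the comment suggests (class and method names assumed from the PR; the real class has more state):

```python
import threading

class APIBackend:
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get_instance(cls):
        if cls._instance is None:              # fast path, avoids the lock
            with cls._lock:
                if cls._instance is None:      # re-check while holding the lock
                    cls._instance = cls()
        return cls._instance

print(APIBackend.get_instance() is APIBackend.get_instance())  # → True
```

The second check inside the lock is what prevents two threads that both passed the first check from each creating an instance.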
```python
    root_dir_to_delete,
    cache_dir,
)
return Path(xdg_cache_home)
```
Line 73 returns Path(xdg_cache_home) instead of cache_dir when an old cache directory is found. This is inconsistent with the warning message which says files will be located in cache_dir (line 71), but the function actually returns the parent directory. This should likely return cache_dir to match the documented behavior.
Suggested change:

```diff
- return Path(xdg_cache_home)
+ return cache_dir
```
```python
api_version: APIVersion
resource_type: ResourceType

def __init__(self, http: HTTPClient, minio: MinIOClient):
```
The docstring for ResourceAPI.__init__ states that the minio parameter is "optional" (line 29), but the method signature on line 47 requires a MinIOClient instance (no None allowed). Either update the signature to accept MinIOClient | None to match the documentation, or remove "optional" from the docstring. The attribute type on line 40 already suggests it could be None.
Suggested change:

```diff
- def __init__(self, http: HTTPClient, minio: MinIOClient):
+ def __init__(self, http: HTTPClient, minio: MinIOClient | None = None):
```
| "FallbackProxy", | ||
| "FlowAPI", | ||
| "FlowV1API", | ||
| "FlowV2API", | ||
| "HTTPCache", | ||
| "HTTPClient", | ||
| "MinIOClient", | ||
| "ResourceAPI", | ||
| "ResourceAPI", |
The __all__ list contains duplicate entries: "FallbackProxy" appears on lines 68 and 69, and "ResourceAPI" appears on lines 76 and 77. Remove the duplicate entries to maintain clean exports.
| "FallbackProxy", | |
| "FlowAPI", | |
| "FlowV1API", | |
| "FlowV2API", | |
| "HTTPCache", | |
| "HTTPClient", | |
| "MinIOClient", | |
| "ResourceAPI", | |
| "ResourceAPI", | |
| "FlowAPI", | |
| "FlowV1API", | |
| "FlowV2API", | |
| "HTTPCache", | |
| "HTTPClient", | |
| "MinIOClient", | |
| "ResourceAPI", |
```python
xdg_cache_home = os.environ.get("XDG_CACHE_HOME")
if xdg_cache_home is None:
    return Path("~", ".cache", "openml")
```
Line 48 returns Path("~", ".cache", "openml") without calling .expanduser(). This means the tilde character will be treated literally rather than being expanded to the user's home directory. Add .expanduser() to ensure proper path expansion, like: return Path("~", ".cache", "openml").expanduser().
| return Path("~", ".cache", "openml") | |
| return Path("~", ".cache", "openml").expanduser() |
```python
def publish(self, path: str, files: Mapping[str, Any] | None) -> int:  # noqa: ARG002
    self._not_supported(method="publish")

def delete(self, resource_id: int) -> bool:  # noqa: ARG002
    self._not_supported(method="delete")

def tag(self, resource_id: int, tag: str) -> list[str]:  # noqa: ARG002
    self._not_supported(method="tag")

def untag(self, resource_id: int, tag: str) -> list[str]:  # noqa: ARG002
    self._not_supported(method="untag")
```
The methods publish, delete, tag, and untag in ResourceV2API have return type annotations (int, bool, list[str]) but they call _not_supported which always raises an exception and never returns. These methods should have NoReturn as their return type annotation to accurately reflect that they always raise an exception.
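A sketch of the suggested annotation. NotImplementedError stands in for the PR's actual exception type here, and the class is reduced to the relevant methods:

```python
from typing import NoReturn

class ResourceV2API:
    def _not_supported(self, method: str) -> NoReturn:
        # always raises, so NoReturn is the accurate annotation
        raise NotImplementedError(f"'{method}' is not supported by API v2")

    # was: -> bool; NoReturn tells type checkers this call never returns
    def delete(self, resource_id: int) -> NoReturn:
        self._not_supported(method="delete")

try:
    ResourceV2API().delete(1)
except NotImplementedError as exc:
    print(exc)  # → 'delete' is not supported by API v2
```

One caveat worth checking before applying this: if these methods override an abstract signature that promises a value, some type checkers treat NoReturn overrides differently, so the change should be verified against the base class.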
```python
created_at = meta.get("created_at")
if created_at is None:
    raise ValueError("Cache metadata missing 'created_at'")
```
The HTTPCache class docstring mentions "Entries are considered valid until their time-to-live (TTL) expires" (line 39-40), and the load method's docstring states it may raise TimeoutError if "the cached entry has expired based on the configured TTL" (line 119). However, there's no TTL implementation in the code - the created_at timestamp is loaded (line 138) but never checked against any TTL value. Either implement TTL checking or remove the misleading documentation about TTL expiration.
```python
# Enforce TTL-based expiration if a TTL is configured on this cache.
ttl = getattr(self, "ttl", None)
if ttl is not None:
    try:
        created_at_ts = float(created_at)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            "Cache metadata field 'created_at' must be a numeric timestamp"
        ) from exc
    if created_at_ts + ttl < time.time():
        raise TimeoutError("Cached entry has expired based on the configured TTL.")
```
```python
self.retry_func = human_delay if retry_policy == RetryPolicy.HUMAN else robot_delay
self.headers: dict[str, str] = {"user-agent": f"openml-python/{__version__}"}

def _robot_delay(self, n: int) -> float:
    """
    Compute delay for automated retry policy.

    Parameters
    ----------
    n : int
        Current retry attempt number (1-based).

    Returns
    -------
    float
        Number of seconds to wait before the next retry.

    Notes
    -----
    Uses a sigmoid-based growth curve with Gaussian noise to gradually
    increase waiting time.
    """
    wait = (1 / (1 + math.exp(-(n * 0.5 - 4)))) * 60
    variation = random.gauss(0, wait / 10)
    return max(1.0, wait + variation)

def _human_delay(self, n: int) -> float:
    """
    Compute delay for human-like retry policy.

    Parameters
    ----------
    n : int
        Current retry attempt number (1-based).

    Returns
    -------
    float
        Number of seconds to wait before the next retry.
    """
    return max(1.0, n)

def _parse_exception_response(
    self,
    response: Response,
) -> tuple[int | None, str]:
    """
```
The _robot_delay and _human_delay methods in the HTTPClient class are never used. Line 242 uses human_delay and robot_delay functions imported from utils module instead. Remove these unused methods to avoid confusion and reduce code duplication, or update line 242 to use these instance methods if they were intended to be used.
Suggested change (the unused delay methods removed):

```python
self.retry_func = human_delay if retry_policy == RetryPolicy.HUMAN else robot_delay
self.headers: dict[str, str] = {"user-agent": f"openml-python/{__version__}"}

def _parse_exception_response(
    self,
    response: Response,
) -> tuple[int | None, str]:
    """
```
```python
attr = getattr(api, name, None)
if attr is not None:
```
The _find_attr method returns the first attribute that is not None, which could cause issues if an API implementation legitimately has an attribute set to None. Consider using hasattr instead, or checking if the attribute exists in the object's dictionary rather than checking if its value is not None.
Suggested change:

```diff
- attr = getattr(api, name, None)
- if attr is not None:
+ if hasattr(api, name):
+     attr = getattr(api, name)
```
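The pitfall in a few lines (illustrative stub, not the PR's actual classes):

```python
class V2Stub:
    checksum = None   # attribute exists, legitimately set to None

api = V2Stub()

# the value-based check skips the attribute even though it exists
print(getattr(api, "checksum", None) is not None)  # → False

# the existence-based check finds it
print(hasattr(api, "checksum"))                    # → True
```

With the value-based check, FallbackProxy would fall through to the other API version even though the resolved API defines the attribute.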
mfeurer left a comment
Hey, I did a partial review. I will review the remaining files later today or tomorrow.
Please also have a look at the comments from GitHub Copilot.
| """ | ||
| content_type = response.headers.get("Content-Type", "").lower() | ||
|
|
||
| if "json" in content_type: |
```python
import random


def robot_delay(n: int) -> float:
```
The functions in this file are duplicates of the functions in http.py. Please de-duplicate (I would suggest removing this file).
```python
    Base server URL (e.g., ``https://www.openml.org``).
base_url : str
    Base API path appended to the server URL.
api_key : str
```
In practice, the api_key can be None. For example, get requests do not require it, and some tests explicitly set it to None. I think this should be changed.
Following up on that, we should probably add a test to ensure we raise an exception when use_api_key=True and api_key=None.
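A sketch of such a test, with a fake client standing in for the real HTTPClient (the actual signature and exception type may differ):

```python
class FakeHTTPClient:
    """Minimal stand-in: rejects authenticated calls without a key."""

    def __init__(self, api_key=None):
        self.api_key = api_key

    def get(self, path, use_api_key=False):
        if use_api_key and self.api_key is None:
            raise ValueError("api_key is required when use_api_key=True")
        return {"path": path}

client = FakeHTTPClient(api_key=None)
try:
    client.get("/data/1", use_api_key=True)
    raised = False
except ValueError:
    raised = True
print(raised)  # → True

# anonymous calls still work without a key
print(client.get("/data/1"))  # → {'path': '/data/1'}
```

The same shape works as a pytest test with `pytest.raises` around the authenticated call.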
```python
@dataclass
class APIConfig:
```
I don't understand why we need this in addition to the global config? Is it so that different API versions can have different URLs? If this is the case, should different API versions also have different cache directories, or is the API version automatically part of the URL, and thereby, part of the cache directory?
```python
@dataclass
class ConnectionConfig:
```
Why is this a separate class?
| """ | ||
| resource_type = self._get_endpoint_name() | ||
|
|
||
| legal_resources = {"data", "flow", "task", "run", "study", "user"} |
Don't we have an enum that maps from their Python equivalent to the actual name? Shouldn't we use that here?
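A sketch of deriving the legal set from such an enum (member names assumed for illustration; the PR's ResourceType may differ):

```python
from enum import Enum

class ResourceType(Enum):
    DATA = "data"
    FLOW = "flow"
    TASK = "task"
    RUN = "run"
    STUDY = "study"
    USER = "user"

# derive the legal set from the enum instead of a hand-written literal,
# so new resource types cannot drift out of sync with the check
legal_resources = {r.value for r in ResourceType}
print(sorted(legal_resources))  # → ['data', 'flow', 'run', 'study', 'task', 'user']
```

This also removes the risk visible below, where different files hard-code slightly different literal sets.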
| """ | ||
| resource_type = self._get_endpoint_name() | ||
|
|
||
| legal_resources = {"data", "task", "flow", "setup", "run"} |
| """ | ||
| resource_type = self._get_endpoint_name() | ||
|
|
||
| legal_resources = {"data", "task", "flow", "setup", "run"} |
| data = {f"{resource_type}_id": resource_id, "tag": tag} | ||
| response = self._http.post(path, data=data) | ||
|
|
||
| main_tag = f"oml:{resource_type}_untag" |
Why do we need a separate variable for this?
| data = {f"{resource_type}_id": resource_id, "tag": tag} | ||
| response = self._http.post(path, data=data) | ||
|
|
||
| main_tag = f"oml:{resource_type}_tag" |
Why do we need a separate variable for this?
Towards #1575
This PR sets up the core folder and file structure along with base scaffolding for the API v1 → v2 migration.
It includes versioned resource implementation classes (*V1, *V2).

No functional endpoints are migrated yet. This PR establishes a stable foundation for subsequent migration and refactor work.