[ENH] V1 → V2 API Migration - core structure#1576

Open
geetu040 wants to merge 93 commits into openml:main from geetu040:migration

Conversation

@geetu040
Collaborator

Towards #1575

This PR sets up the core folder and file structure along with base scaffolding for the API v1 → v2 migration.

It includes:

  • Skeleton for the HTTP client, backend, and API context
  • Abstract resource interfaces and versioned stubs (*V1, *V2)
  • Minimal wiring to allow future version switching and fallback support

No functional endpoints are migrated yet. This PR establishes a stable foundation for subsequent migration and refactor work.
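The versioned stubs and fallback wiring described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the class bodies are invented for the example, and only names like `DatasetAPI` and `FallbackProxy` come from the PR description.

```python
from abc import ABC, abstractmethod


class DatasetAPI(ABC):
    """Abstract resource interface shared by all API versions."""

    @abstractmethod
    def get(self, dataset_id: int) -> dict: ...


class DatasetV1(DatasetAPI):
    def get(self, dataset_id: int) -> dict:
        return {"id": dataset_id, "source": "v1"}


class DatasetV2(DatasetAPI):
    def get(self, dataset_id: int) -> dict:
        # v2 endpoint not migrated yet in this sketch
        raise NotImplementedError("not migrated yet")


class FallbackProxy:
    """Try the preferred version first, fall back to the other on failure."""

    def __init__(self, primary: DatasetAPI, fallback: DatasetAPI):
        self._primary, self._fallback = primary, fallback

    def get(self, dataset_id: int) -> dict:
        try:
            return self._primary.get(dataset_id)
        except NotImplementedError:
            return self._fallback.get(dataset_id)


proxy = FallbackProxy(DatasetV2(), DatasetV1())
print(proxy.get(42))  # → {'id': 42, 'source': 'v1'}
```

This keeps version switching a one-line change (swap which implementation is `primary`) while callers only ever see the abstract interface.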

@geetu040 geetu040 mentioned this pull request Dec 30, 2025
25 tasks
@codecov-commenter

codecov-commenter commented Dec 31, 2025

Codecov Report

❌ Patch coverage is 53.95480% with 326 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.29%. Comparing base (fefea59) to head (ab3c1eb).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
openml/_api/clients/http.py 23.44% 160 Missing ⚠️
openml/_api/resources/base/versions.py 24.71% 67 Missing ⚠️
openml/_api/setup/backend.py 64.04% 32 Missing ⚠️
openml/_api/resources/base/fallback.py 26.31% 28 Missing ⚠️
openml/testing.py 45.83% 13 Missing ⚠️
openml/_api/setup/_utils.py 56.00% 11 Missing ⚠️
openml/_api/setup/builder.py 81.08% 7 Missing ⚠️
openml/_api/clients/utils.py 55.55% 4 Missing ⚠️
openml/_api/resources/base/base.py 86.36% 3 Missing ⚠️
openml/config.py 93.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1576      +/-   ##
==========================================
+ Coverage   52.79%   53.29%   +0.49%     
==========================================
  Files          37       65      +28     
  Lines        4362     5070     +708     
==========================================
+ Hits         2303     2702     +399     
- Misses       2059     2368     +309     

☔ View full report in Codecov by Sentry.

cache: CacheConfig


settings = Settings(
Collaborator


I would move the settings to the individual classes. I think this design introduces too high coupling of the classes to this file. You cannot move the classes around, or add a new API version without making non-extensible changes to this file here - because APISettings will require a constructor change and new classes it accepts.

Instead, a better design is to apply the strategy pattern cleanly to the different API definitions - v1 and v2 - and move the config either to their __init__, or a set_config (or similar) method.
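Concretely, the suggested strategy pattern might look like this minimal sketch (class and attribute names here are illustrative assumptions, not a proposal for the actual API):

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class APIConfig:
    server: str
    api_key: str | None = None


class APIStrategy(ABC):
    """Each API version owns its own configuration."""

    def __init__(self, config: APIConfig):
        self.config = config

    @abstractmethod
    def base_url(self) -> str: ...


class V1Strategy(APIStrategy):
    def base_url(self) -> str:
        return f"{self.config.server}/api/v1/xml"


class V2Strategy(APIStrategy):
    def base_url(self) -> str:
        return f"{self.config.server}/api/v2"


# A new API version is added by defining a new strategy class;
# no central settings object needs a constructor change.
backend = V2Strategy(APIConfig(server="https://www.openml.org"))
print(backend.base_url())
```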

Collaborator Author


I have updated the Config class so it can handle multiple API versions in the future. Constructor changes can still be expected, not for new API versions but for new config values; let me know if that is still problematic.

Collaborator

@fkiraly fkiraly left a comment


Overall really great, I have a design suggestion related to the configs.

The config.py file and the coupling on it breaks an otherwise nice strategy pattern.

I recommend to follow the strategy pattern cleanly instead, and move the configs into the class instances, see above.

This will make the backend API much more extensible and cohesive.

@geetu040 geetu040 changed the title [ENH] Migration: set up core/base structure [ENH] V1 → V2 API Migration - core structure Jan 9, 2026
@geetu040 geetu040 marked this pull request as draft January 12, 2026 18:47
Collaborator

@PGijsbers PGijsbers left a comment


Overall I agree with the suggested changes. This seems like a reasonable way to provide a unified interface for two different backends, and also separate out some concerns that were previously coupled or scattered more than they should (e.g., caching, configurations).

My main concern is with the change to caching behavior. I have a minor concern over the indirection APIContext introduces (perhaps I misunderstood its purpose), and the introduction of allowing Response return values.

In my comments you will also find some things that may already have been "wrong" in the old implementation. In that case, I think it simply makes sense to make the change now so I repeat it here for convenience.

@fkiraly
Collaborator

fkiraly commented Feb 10, 2026

is this ready to review or to merge from your side, @geetu040? This is a large PR and it seems to be moving still.

@geetu040
Copy link
Collaborator Author

> is this ready to review or to merge from your side, @geetu040? This is a large PR and it seems to be moving still.

This PR is ready for review. I was incorporating feedback from the stacked PRs, but it's complete now, though it may be updated if new comments come in.

Collaborator

@PGijsbers PGijsbers left a comment


Sorry, this is taking a bit longer than I had wanted, so I am just leaving my first remarks here. Also left a reply on the ttl/cache discussion.

Comment on lines 576 to 579
use_cache : bool, optional
Whether to load/store responses from cache.
reset_cache : bool, optional
If True, bypass existing cache entries.
Collaborator


Please rename the parameters. The description is incomplete though more accurate than the current name reset_cache: the cache is bypassed on load, not reset. The current cache is unaffected, which is unexpected to me for a parameter "reset_cache".

I would find it much more intuitive if the parameters are store_to_cache and load_from_cache, to indicate if a response will be attempted to be saved to or loaded from the cache.

Collaborator Author


I agree that the naming and description are not very clean, I will update. How about this:

enable_cache : bool, optional
    Whether to load/store the response from cache.
refresh_cache : bool, optional
    If True, bypass the existing cache entry and overwrite it with the new response.
    This parameter is ignored if `enable_cache=False`.

And the internal implementation stays the same. This fulfills the SDK requirements for cache handling.

I still think store_to_cache and load_from_cache are not well suited for the SDK; that would shift some logic handling to the SDK, whereas here the logic stays with the client and the SDK only requests a specific behaviour.
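The proposed semantics can be written out as a small decision table. The function name `resolve_cache_actions` is hypothetical and exists only to make the mapping explicit:

```python
from __future__ import annotations


def resolve_cache_actions(enable_cache: bool, refresh_cache: bool) -> tuple[bool, bool]:
    """Map the proposed parameters onto (load_from_cache, store_to_cache) decisions."""
    if not enable_cache:
        return False, False  # refresh_cache is ignored entirely
    if refresh_cache:
        return False, True   # bypass the existing entry, overwrite with the new response
    return True, True        # normal cached operation
```

Under this mapping the client keeps the load/store logic internal, and the SDK only ever passes the two high-level flags.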

@@ -0,0 +1,33 @@
from __future__ import annotations

from enum import Enum
Collaborator


Any reason for using StrEnum?

Collaborator Author


I am not using StrEnum; are you asking why not?

I switched to this only because of mypy errors in the pre-commit checks:

from enum import StrEnum
# error: Module "enum" has no attribute "StrEnum"  [attr-defined]

api_version: APIVersion = APIVersion.V1
# error: incompatible type "str"; expected "APIVersion"

I could ignore the [attr-defined] error and change the type hints to str, but it did not feel right to have str in the type hints instead of the enum.
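For context, enum.StrEnum was only added in Python 3.11, which explains the [attr-defined] error on older versions. A common workaround that keeps the enum in type hints while staying string-comparable is mixing in str (this is a general-purpose sketch, not necessarily what the PR does):

```python
from enum import Enum


class APIVersion(str, Enum):
    """Behaves like StrEnum on Python versions before 3.11."""

    V1 = "v1"
    V2 = "v2"


# Type hints keep the enum, yet plain string comparisons still work.
api_version: APIVersion = APIVersion.V1
assert api_version == "v1"
```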


Copilot AI left a comment


Pull request overview

This PR establishes the foundational architecture for migrating the OpenML Python SDK from API v1 to v2. It introduces a modular, abstraction-based structure that supports both API versions simultaneously with transparent fallback capabilities.

Changes:

  • New HTTP client with caching, retry logic, and configurable retry policies
  • Abstract resource interfaces (DatasetAPI, TaskAPI, etc.) with versioned implementations (V1API, V2API) and FallbackProxy for automatic fallback
  • APIBackend singleton for centralized resource access with configuration synchronization between legacy and new API systems

Reviewed changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
openml/enums.py New enums for API versions, resource types, and retry policies
openml/exceptions.py Added OpenMLNotSupportedError for unsupported v2 operations
openml/_api/__init__.py Main exports for new API module (contains duplicate entries)
openml/_api/clients/http.py HTTP client with caching, retries, checksum validation (has duplicate methods and TTL doc issues)
openml/_api/clients/minio.py MinIO client configuration stub
openml/_api/clients/utils.py Retry delay calculation utilities
openml/_api/resources/base/base.py Abstract ResourceAPI base class (doc/signature mismatch)
openml/_api/resources/base/fallback.py FallbackProxy for automatic API fallback (attribute resolution issue)
openml/_api/resources/base/resources.py Resource-specific abstract API classes
openml/_api/resources/base/versions.py V1 and V2 API implementations (return type issues)
openml/_api/resources/*.py Resource-specific V1/V2 implementation stubs
openml/_api/resources/_registry.py Registry mapping API versions to resource implementations
openml/_api/setup/config.py Configuration dataclasses for API setup
openml/_api/setup/backend.py APIBackend singleton (thread-safety issue)
openml/_api/setup/builder.py Builder for constructing API backend from config
openml/_api/setup/_utils.py Cache directory resolution logic (tilde expansion and path return issues)
openml/config.py Added _sync_api_config for synchronizing legacy config with new API
openml/testing.py New TestAPIBase class for API testing infrastructure
tests/test_api/*.py Tests for HTTP client, resource versioning, and fallback
tests/conftest.py, tests/test_datasets/*.py Updated tests to call _sync_api_config after config changes


Comment on lines +112 to +114
if cls._instance is None:
    cls._instance = cls()
return cls._instance

Copilot AI Feb 17, 2026


The singleton implementation in get_instance is not thread-safe. In a multi-threaded environment, multiple threads could simultaneously check cls._instance is None and create multiple instances. Consider using a threading lock or the @functools.lru_cache decorator with maxsize=1 to ensure thread-safe singleton behavior.
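A thread-safe variant using double-checked locking could look like the following sketch (the class body is elided; this is illustrative, not the PR's code):

```python
import threading


class APIBackend:
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get_instance(cls) -> "APIBackend":
        if cls._instance is None:          # fast path, avoids locking once created
            with cls._lock:
                if cls._instance is None:  # re-check while holding the lock
                    cls._instance = cls()
        return cls._instance
```

The second `is None` check is what prevents two threads that both passed the fast path from each constructing an instance.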

root_dir_to_delete,
cache_dir,
)
return Path(xdg_cache_home)

Copilot AI Feb 17, 2026


Line 73 returns Path(xdg_cache_home) instead of cache_dir when an old cache directory is found. This is inconsistent with the warning message which says files will be located in cache_dir (line 71), but the function actually returns the parent directory. This should likely return cache_dir to match the documented behavior.

Suggested change
return Path(xdg_cache_home)
return cache_dir

api_version: APIVersion
resource_type: ResourceType

def __init__(self, http: HTTPClient, minio: MinIOClient):

Copilot AI Feb 17, 2026


The docstring for ResourceAPI.__init__ states that the minio parameter is "optional" (line 29), but the method signature on line 47 requires a MinIOClient instance (no None allowed). Either update the signature to accept MinIOClient | None to match the documentation, or remove "optional" from the docstring. The attribute type on line 40 already suggests it could be None.

Suggested change
def __init__(self, http: HTTPClient, minio: MinIOClient):
def __init__(self, http: HTTPClient, minio: MinIOClient | None = None):

Comment on lines +69 to +77
"FallbackProxy",
"FlowAPI",
"FlowV1API",
"FlowV2API",
"HTTPCache",
"HTTPClient",
"MinIOClient",
"ResourceAPI",
"ResourceAPI",

Copilot AI Feb 17, 2026


The __all__ list contains duplicate entries: "FallbackProxy" appears on lines 68 and 69, and "ResourceAPI" appears on lines 76 and 77. Remove the duplicate entries to maintain clean exports.

Suggested change
"FallbackProxy",
"FlowAPI",
"FlowV1API",
"FlowV2API",
"HTTPCache",
"HTTPClient",
"MinIOClient",
"ResourceAPI",
"ResourceAPI",
"FlowAPI",
"FlowV1API",
"FlowV2API",
"HTTPCache",
"HTTPClient",
"MinIOClient",
"ResourceAPI",


xdg_cache_home = os.environ.get("XDG_CACHE_HOME")
if xdg_cache_home is None:
return Path("~", ".cache", "openml")

Copilot AI Feb 17, 2026


Line 48 returns Path("~", ".cache", "openml") without calling .expanduser(). This means the tilde character will be treated literally rather than being expanded to the user's home directory. Add .expanduser() to ensure proper path expansion, like: return Path("~", ".cache", "openml").expanduser().

Suggested change
return Path("~", ".cache", "openml")
return Path("~", ".cache", "openml").expanduser()

Comment on lines +309 to +319
def publish(self, path: str, files: Mapping[str, Any] | None) -> int:  # noqa: ARG002
    self._not_supported(method="publish")

def delete(self, resource_id: int) -> bool:  # noqa: ARG002
    self._not_supported(method="delete")

def tag(self, resource_id: int, tag: str) -> list[str]:  # noqa: ARG002
    self._not_supported(method="tag")

def untag(self, resource_id: int, tag: str) -> list[str]:  # noqa: ARG002
    self._not_supported(method="untag")

Copilot AI Feb 17, 2026


The methods publish, delete, tag, and untag in ResourceV2API have return type annotations (int, bool, list[str]) but they call _not_supported which always raises an exception and never returns. These methods should have NoReturn as their return type annotation to accurately reflect that they always raise an exception.
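A minimal sketch of the NoReturn pattern; the OpenMLNotSupportedError stand-in is defined locally for illustration (the PR defines the real one in openml/exceptions.py), and since NoReturn is the bottom type, type checkers still accept these overrides against base-class signatures like `-> bool`:

```python
from typing import NoReturn


class OpenMLNotSupportedError(Exception):
    """Local stand-in for the exception class the PR adds in openml/exceptions.py."""


class ResourceV2API:
    def _not_supported(self, method: str) -> NoReturn:
        # Always raises, so NoReturn accurately describes the behaviour.
        raise OpenMLNotSupportedError(f"'{method}' is not supported by the v2 API")

    def delete(self, resource_id: int) -> NoReturn:  # was annotated `-> bool`
        self._not_supported(method="delete")
```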

created_at = meta.get("created_at")
if created_at is None:
    raise ValueError("Cache metadata missing 'created_at'")


Copilot AI Feb 17, 2026


The HTTPCache class docstring mentions "Entries are considered valid until their time-to-live (TTL) expires" (line 39-40), and the load method's docstring states it may raise TimeoutError if "the cached entry has expired based on the configured TTL" (line 119). However, there's no TTL implementation in the code - the created_at timestamp is loaded (line 138) but never checked against any TTL value. Either implement TTL checking or remove the misleading documentation about TTL expiration.

Suggested change

# Enforce TTL-based expiration if a TTL is configured on this cache.
ttl = getattr(self, "ttl", None)
if ttl is not None:
    try:
        created_at_ts = float(created_at)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            "Cache metadata field 'created_at' must be a numeric timestamp"
        ) from exc
    if created_at_ts + ttl < time.time():
        raise TimeoutError("Cached entry has expired based on the configured TTL.")

Comment on lines +242 to +288
self.retry_func = human_delay if retry_policy == RetryPolicy.HUMAN else robot_delay
self.headers: dict[str, str] = {"user-agent": f"openml-python/{__version__}"}

def _robot_delay(self, n: int) -> float:
"""
Compute delay for automated retry policy.

Parameters
----------
n : int
Current retry attempt number (1-based).

Returns
-------
float
Number of seconds to wait before the next retry.

Notes
-----
Uses a sigmoid-based growth curve with Gaussian noise to gradually
increase waiting time.
"""
wait = (1 / (1 + math.exp(-(n * 0.5 - 4)))) * 60
variation = random.gauss(0, wait / 10)
return max(1.0, wait + variation)

def _human_delay(self, n: int) -> float:
"""
Compute delay for human-like retry policy.

Parameters
----------
n : int
Current retry attempt number (1-based).

Returns
-------
float
Number of seconds to wait before the next retry.
"""
return max(1.0, n)

def _parse_exception_response(
self,
response: Response,
) -> tuple[int | None, str]:
"""

Copilot AI Feb 17, 2026


The _robot_delay and _human_delay methods in the HTTPClient class are never used. Line 242 uses human_delay and robot_delay functions imported from utils module instead. Remove these unused methods to avoid confusion and reduce code duplication, or update line 242 to use these instance methods if they were intended to be used.

Suggested change (removes the unused _robot_delay and _human_delay methods, whose bodies duplicate the robot_delay and human_delay functions in utils):

self.retry_func = human_delay if retry_policy == RetryPolicy.HUMAN else robot_delay
self.headers: dict[str, str] = {"user-agent": f"openml-python/{__version__}"}

def _parse_exception_response(
    self,
    response: Response,
) -> tuple[int | None, str]:
    """

Comment on lines +87 to +88
attr = getattr(api, name, None)
if attr is not None:

Copilot AI Feb 17, 2026


The _find_attr method returns the first attribute that is not None, which could cause issues if an API implementation legitimately has an attribute set to None. Consider using hasattr instead, or checking if the attribute exists in the object's dictionary rather than checking if its value is not None.

Suggested change
attr = getattr(api, name, None)
if attr is not None:
if hasattr(api, name):
attr = getattr(api, name)

Collaborator

@mfeurer mfeurer left a comment


Hey, I did a partial review. I will review the remaining files later today or tomorrow.

Please also have a look at the comments from GitHub Copilot.

"""
content_type = response.headers.get("Content-Type", "").lower()

if "json" in content_type:
Collaborator


Why not ==?

import random


def robot_delay(n: int) -> float:
Collaborator


The functions in this file are duplicate from the functions in http.py. Please de-duplicate (I would suggest removing this file).

Base server URL (e.g., ``https://www.openml.org``).
base_url : str
Base API path appended to the server URL.
api_key : str
Collaborator


In practice, the api_key can be None. For example, get requests do not require it, and some tests explicitly set it to None. I think this should be changed.

Collaborator


Following up on that, we should probably add a test to ensure we raise an exception when use_api_key=True and api_key=None.
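Such a test might look like the following sketch, using a minimal stand-in rather than the PR's actual HTTPClient (the class, parameter names, and raised exception type here are assumptions for illustration):

```python
from __future__ import annotations


class HTTPClient:
    """Minimal stand-in for the PR's HTTP client."""

    def __init__(self, api_key: str | None = None):
        self.api_key = api_key

    def request(self, path: str, use_api_key: bool = False) -> str:
        if use_api_key and self.api_key is None:
            raise ValueError("An API key is required for this call but none is configured")
        return f"GET {path}"


def test_missing_api_key_raises():
    client = HTTPClient(api_key=None)
    try:
        client.request("/datasets/list", use_api_key=True)
    except ValueError:
        pass  # expected: authenticated call without a key must fail loudly
    else:
        raise AssertionError("expected ValueError when use_api_key=True and api_key=None")


test_missing_api_key_raises()
```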



@dataclass
class APIConfig:
Collaborator


I don't understand why we need this in addition to the global config? Is it so that different API versions can have different URLs? If this is the case, should different API versions also have different cache directories, or is the API version automatically part of the URL, and thereby, part of the cache directory?



@dataclass
class ConnectionConfig:
Collaborator


Why is this a separate class?

"""
resource_type = self._get_endpoint_name()

legal_resources = {"data", "flow", "task", "run", "study", "user"}
Collaborator


Don't we have an enum that maps from their Python equivalent to the actual name? Shouldn't we use that here?
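Deriving the set from the enum might look like this sketch, assuming a ResourceType enum along the lines of the one the PR adds in openml/enums.py (the member names and values here are assumptions):

```python
from enum import Enum


class ResourceType(Enum):
    """Assumed shape of the enum added in openml/enums.py."""

    DATA = "data"
    FLOW = "flow"
    TASK = "task"
    RUN = "run"
    STUDY = "study"
    USER = "user"


# Derive the legal set from the enum instead of hard-coding the names,
# so new resource types only need to be added in one place.
legal_resources = {rt.value for rt in ResourceType}
```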

"""
resource_type = self._get_endpoint_name()

legal_resources = {"data", "task", "flow", "setup", "run"}
Collaborator


Same as above.

"""
resource_type = self._get_endpoint_name()

legal_resources = {"data", "task", "flow", "setup", "run"}
Collaborator


Same as above.

data = {f"{resource_type}_id": resource_id, "tag": tag}
response = self._http.post(path, data=data)

main_tag = f"oml:{resource_type}_untag"
Collaborator


why do we need a separate variable for this?

data = {f"{resource_type}_id": resource_id, "tag": tag}
response = self._http.post(path, data=data)

main_tag = f"oml:{resource_type}_tag"
Collaborator


Why do we need a separate variable for this?
