diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md
index 84e656fcb..bfc9ef6a1 100644
--- a/pgml-cms/docs/SUMMARY.md
+++ b/pgml-cms/docs/SUMMARY.md
@@ -36,7 +36,7 @@
 * [pgml.tune()](introduction/apis/sql-extensions/pgml.tune.md)
 * [Client SDKs](introduction/apis/client-sdks/README.md)
   * [Overview](introduction/apis/client-sdks/getting-started.md)
-  * [Collections](../../pgml-docs/docs/guides/sdks/collections.md)
+  * [Collections](introduction/apis/client-sdks/collections.md)
   * [Pipelines](introduction/apis/client-sdks/pipelines.md)
   * [Search](introduction/apis/client-sdks/search.md)
 * [Tutorials](introduction/apis/client-sdks/tutorials/README.md)
diff --git a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md
index e24dabf05..e5c52f793 100644
--- a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md
+++ b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md
@@ -26,11 +26,11 @@ pgml.deploy(
 
 There are 3 different deployment strategies available:
 
-| Strategy      | Description                                                                                                           |
-| ------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `most_recent` | The most recently trained model for this project is immediately deployed, regardless of metrics.                      |
-| `best_score`  | The model that achieved the best key metric score is immediately deployed.                                            |
-| `rollback`    | The model that was last deployed for this project is immediately redeployed, overriding the currently deployed model. |
+| Strategy      | Description                                                                                      |
+| ------------- |--------------------------------------------------------------------------------------------------|
+| `most_recent` | The most recently trained model for this project is immediately deployed, regardless of metrics. |
+| `best_score`  | The model that achieved the best key metric score is immediately deployed.                       |
+| `rollback`    | The model that was deployed prior to the current one is redeployed.                              |
 
 The default deployment behavior allows any algorithm to qualify. It's automatically used during training, but can be manually executed as well:
 
@@ -40,11 +40,12 @@ The default deployment behavior allows any algorithm to qualify. It's automatica
 
 #### SQL
 
-SELECT * FROM pgml.deploy(
-    'Handwritten Digit Image Classifier',
+```sql
+SELECT * FROM pgml.deploy(
+    'Handwritten Digit Image Classifier',
     strategy => 'best_score'
 );
-
+```
 
 #### Output
 
@@ -121,3 +122,22 @@ SELECT * FROM pgml.deploy(
 Handwritten Digit Image Classifier | rollback | xgboost
 (1 row)
 ```
+
+### Specific Model IDs
+
+If you need to deploy an exact model that is not the `most_recent` or `best_score`, you may deploy a model by ID. Model IDs can be found in the `pgml.models` table.
+
+#### SQL
+
+```sql
+SELECT * FROM pgml.deploy(12);
+```
+
+#### Output
+
+```sql
+              project               | strategy | algorithm
+------------------------------------+----------+-----------
+ Handwritten Digit Image Classifier | specific | xgboost
+(1 row)
+```
diff --git a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/data-pre-processing.md b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/data-pre-processing.md
index 8d4aeb222..3362c99bd 100644
--- a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/data-pre-processing.md
+++ b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/data-pre-processing.md
@@ -25,11 +25,11 @@ In this example:
 
 There are 3 steps to preprocessing data:
 
-* [Encoding](data-pre-processing.md#ordinal-encoding) categorical values into quantitative values
-* [Imputing](data-pre-processing.md#imputing-missing-values) NULL values to some quantitative value
-* [Scaling](data-pre-processing.md#scaling-values) quantitative values across all variables to similar ranges
+* [Encoding](../../../../../../pgml-dashboard/content/docs/training/preprocessing.md#categorical-encodings) categorical values into quantitative values
+* [Imputing](../../../../../../pgml-dashboard/content/docs/training/preprocessing.md#imputing-missing-values) NULL values to some quantitative value
+* [Scaling](../../../../../../pgml-dashboard/content/docs/training/preprocessing.md#scaling-values) quantitative values across all variables to similar ranges
 
-These preprocessing steps may be specified on a per-column basis to the [train()](./) function. By default, PostgresML does minimal preprocessing on training data, and will raise an error during analysis if NULL values are encountered without a preprocessor. All types other than `TEXT` are treated as quantitative variables and cast to floating point representations before passing them to the underlying algorithm implementations.
+These preprocessing steps may be specified on a per-column basis to the [train()](../../../../../../docs/training/overview/) function. By default, PostgresML does minimal preprocessing on training data, and will raise an error during analysis if NULL values are encountered without a preprocessor. All types other than `TEXT` are treated as quantitative variables and cast to floating point representations before passing them to the underlying algorithm implementations.
 
 ```sql
 SELECT pgml.train(
diff --git a/pgml-cms/docs/resources/developer-docs/contributing.md b/pgml-cms/docs/resources/developer-docs/contributing.md
index 38688dc26..3648acbe3 100644
--- a/pgml-cms/docs/resources/developer-docs/contributing.md
+++ b/pgml-cms/docs/resources/developer-docs/contributing.md
@@ -67,7 +67,7 @@ Once there, you can initialize `pgrx` and get going:
 #### Pgrx command line and environments
 
 ```commandline
-cargo install cargo-pgrx --version "0.9.8" --locked && \
+cargo install cargo-pgrx --version "0.11.2" --locked && \
 cargo pgrx init # This will take a few minutes
 ```
diff --git a/pgml-cms/docs/resources/developer-docs/installation.md b/pgml-cms/docs/resources/developer-docs/installation.md
index 990cec5a8..119080bf2 100644
--- a/pgml-cms/docs/resources/developer-docs/installation.md
+++ b/pgml-cms/docs/resources/developer-docs/installation.md
@@ -36,7 +36,7 @@ brew bundle
 
 PostgresML is written in Rust, so you'll need to install the latest compiler from [rust-lang.org](https://rust-lang.org). Additionally, we use the Rust PostgreSQL extension framework `pgrx`, which requires some initialization steps:
 
 ```bash
-cargo install cargo-pgrx --version 0.9.8 && \
+cargo install cargo-pgrx --version 0.11.2 && \
 cargo pgrx init
 ```
 
@@ -63,8 +63,7 @@ To install the necessary Python packages into a virtual environment, use the `vi
 ```bash
 virtualenv pgml-venv && \
 source pgml-venv/bin/activate && \
-pip install -r requirements.txt && \
-pip install -r requirements-xformers.txt --no-dependencies
+pip install -r requirements.txt
 ```
 
 {% endtab %}
@@ -146,7 +145,7 @@ pgml_test=# SELECT pgml.version();
 We like and use pgvector a lot, as documented in our blog posts and examples, to store and search embeddings. You can install pgvector from source pretty easily:
 
 ```bash
-git clone --branch v0.4.4 https://github.com/pgvector/pgvector && \
+git clone --branch v0.5.0 https://github.com/pgvector/pgvector && \
 cd pgvector && \
 echo "trusted = true" >> vector.control && \
 make && \
@@ -288,7 +287,7 @@ We use the `pgrx` Postgres Rust extension framework, which comes with its own in
 
 ```bash
 cd pgml-extension && \
-cargo install cargo-pgrx --version 0.9.8 && \
+cargo install cargo-pgrx --version 0.11.2 && \
 cargo pgrx init
 ```
 
diff --git a/pgml-docs/docs/guides/sdks/collections.md b/pgml-docs/docs/guides/sdks/collections.md
deleted file mode 100644
index 2ebc415d5..000000000
--- a/pgml-docs/docs/guides/sdks/collections.md
+++ /dev/null
@@ -1,349 +0,0 @@
-# Collections
-
-Collections are the organizational building blocks of the SDK. They manage all documents and related chunks, embeddings, tsvectors, and pipelines.
-
-## Creating Collections
-
-By default, collections will read and write to the database specified by `DATABASE_URL` environment variable.
- -### **Default `DATABASE_URL`** - -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const collection = pgml.newCollection("test_collection") -``` -{% endtab %} - -{% tab title="Python" %} -```python -collection = Collection("test_collection") -``` -{% endtab %} -{% endtabs %} - -### **Custom DATABASE\_URL** - -Create a Collection that reads from a different database than that set by the environment variable `DATABASE_URL`. - -{% tabs %} -{% tab title="Javascript" %} -```javascript -const collection = pgml.newCollection("test_collection", CUSTOM_DATABASE_URL) -``` -{% endtab %} - -{% tab title="Python" %} -```python -collection = Collection("test_collection", CUSTOM_DATABASE_URL) -``` -{% endtab %} -{% endtabs %} - -## Upserting Documents - -Documents are dictionaries with two required keys: `id` and `text`. All other keys/value pairs are stored as metadata for the document. - -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const documents = [ - { - id: "Document One", - text: "document one contents...", - random_key: "this will be metadata for the document", - }, - { - id: "Document Two", - text: "document two contents...", - random_key: "this will be metadata for the document", - }, -]; -await collection.upsert_documents(documents); -``` -{% endtab %} - -{% tab title="Python" %} -```python -documents = [ - { - "id": "Document 1", - "text": "Here are the contents of Document 1", - "random_key": "this will be metadata for the document" - }, - { - "id": "Document 2", - "text": "Here are the contents of Document 2", - "random_key": "this will be metadata for the document" - } -] -collection = Collection("test_collection") -await collection.upsert_documents(documents) -``` -{% endtab %} -{% endtabs %} - -Document metadata can be replaced by upserting the document without the `text` key. 
- -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const documents = [ - { - id: "Document One", - random_key: "this will be NEW metadata for the document", - }, - { - id: "Document Two", - random_key: "this will be NEW metadata for the document", - }, -]; -await collection.upsert_documents(documents); -``` -{% endtab %} - -{% tab title="Python" %} -```python -documents = [ - { - "id": "Document 1", - "random_key": "this will be NEW metadata for the document" - }, - { - "id": "Document 2", - "random_key": "this will be NEW metadata for the document" - } -] -collection = Collection("test_collection") -await collection.upsert_documents(documents) -``` -{% endtab %} -{% endtabs %} - -Document metadata can be merged with new metadata by upserting the document without the `text` key and specifying the merge option. - -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const documents = [ - { - id: "Document One", - text: "document one contents...", - }, - { - id: "Document Two", - text: "document two contents...", - }, -]; -await collection.upsert_documents(documents, { - metdata: { - merge: true - } -}); -``` -{% endtab %} - -{% tab title="Python" %} -```python -documents = [ - { - "id": "Document 1", - "random_key": "this will be NEW merged metadata for the document" - }, - { - "id": "Document 2", - "random_key": "this will be NEW merged metadata for the document" - } -] -collection = Collection("test_collection") -await collection.upsert_documents(documents, { - "metadata": { - "merge": True - } -}) -``` -{% endtab %} -{% endtabs %} - -## Getting Documents - -Documents can be retrieved using the `get_documents` method on the collection object. 
- -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const collection = Collection("test_collection") -const documents = await collection.get_documents({limit: 100 }) -``` -{% endtab %} - -{% tab title="Python" %} -```python -collection = Collection("test_collection") -documents = await collection.get_documents({ "limit": 100 }) -``` -{% endtab %} -{% endtabs %} - -### Paginating Documents - -The SDK supports limit-offset pagination and keyset pagination. - -#### Limit-Offset Pagination - -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const collection = pgml.newCollection("test_collection") -const documents = await collection.get_documents({ limit: 100, offset: 10 }) -``` -{% endtab %} - -{% tab title="Python" %} -```python -collection = Collection("test_collection") -documents = await collection.get_documents({ "limit": 100, "offset": 10 }) -``` -{% endtab %} -{% endtabs %} - -#### Keyset Pagination - -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const collection = Collection("test_collection") -const documents = await collection.get_documents({ limit: 100, last_row_id: 10 }) -``` -{% endtab %} - -{% tab title="Python" %} -```python -collection = Collection("test_collection") -documents = await collection.get_documents({ "limit": 100, "last_row_id": 10 }) -``` -{% endtab %} -{% endtabs %} - -The `last_row_id` can be taken from the `row_id` field in the returned document's dictionary. - -### Filtering Documents - -Metadata and full text filtering are supported just like they are in vector recall. 
- -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const collection = pgml.newCollection("test_collection") -const documents = await collection.get_documents({ - limit: 100, - offset: 10, - filter: { - metadata: { - id: { - $eq: 1 - } - }, - full_text_search: { - configuration: "english", - text: "Some full text query" - } - } -}) -``` -{% endtab %} - -{% tab title="Python" %} -```python -collection = Collection("test_collection") -documents = await collection.get_documents({ - "limit": 100, - "offset": 10, - "filter": { - "metadata": { - "id": { - "$eq": 1 - } - }, - "full_text_search": { - "configuration": "english", - "text": "Some full text query" - } - } -}) -``` -{% endtab %} -{% endtabs %} - -### Sorting Documents - -Documents can be sorted on any metadata key. Note that this does not currently work well with Keyset based pagination. If paginating and sorting, use Limit-Offset based pagination. - -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const collection = pgml.newCollection("test_collection") -const documents = await collection.get_documents({ - limit: 100, - offset: 10, - order_by: { - id: "desc" - } -}) -``` -{% endtab %} - -{% tab title="Python" %} -```python -collection = Collection("test_collection") -documents = await collection.get_documents({ - "limit": 100, - "offset": 10, - "order_by": { - "id": "desc" - } -}) -``` -{% endtab %} -{% endtabs %} - -### Deleting Documents - -Documents can be deleted with the `delete_documents` method on the collection object. - -Metadata and full text filtering are supported just like they are in vector recall. 
- -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const collection = pgml.newCollection("test_collection") -const documents = await collection.delete_documents({ - metadata: { - id: { - $eq: 1 - } - }, - full_text_search: { - configuration: "english", - text: "Some full text query" - } -}) -``` -{% endtab %} - -{% tab title="Python" %} -```python -documents = await collection.delete_documents({ - "metadata": { - "id": { - "$eq": 1 - } - }, - "full_text_search": { - "configuration": "english", - "text": "Some full text query" - } -}) -``` -{% endtab %} -{% endtabs %}