Storing embedding in Azure Database for PostgreSQL

 

Image from https://www.pexels.com/@tara-winstead/
Image from https://www.pexels.com/@tara-winstead/

There are many options to store text embeddings. In this blog, we shall look at storing embedding in Azure Database for PostgreSQL.

Installing Azure Database for PostgreSQL

Install Azure Database for PostgreSQL Flexible Server. 


Once it is deployed, enable two extensions.


Install Python modules

python -m venv .venv
poetry init
poetry shell
poetry add openai[datalib]
poetry add python-dotenv
poetry add psycopg2-binary

My Python version is 3.10.

Create .env

OPENAI_API_TYPE="azure"
OPENAI_API_BASE="<HTTP endpoint to Azure OpenAI Service>"
OPENAI_API_KEY="<secret to Azure OpenAI Service>"
OPENAI_API_VERSION="2023-05-15"
TEXT_MODEL="<deployment id of text embedding model>"

PGHOST="<my-service>.postgres.database.azure.com"
PGUSER="<user name>"
PGPASSWORD="<secret>" 
PGPORT="5432"
PGDATABASE="< database name e.g. postgres>"
PGSSL="require"

For the text embedding model, I am using text-embedding-ada-002 (Version 2).

Source Code

For simplicity, we just read a file, generate embedding, and store it. And then do a query.
import os
import psycopg2

from dotenv import load_dotenv


def connect_pgl():
    host = os.getenv("PGHOST")
    dbname = os.getenv("PGDATABASE")
    user = os.getenv("PGUSER")
    password = os.getenv("PGPASSWORD")
    sslmode = os.getenv("PGSSL")

    conn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format(
        host, user, dbname, password, sslmode
    )
    return psycopg2.connect(conn_string)


def get_embedding(text: str):
    response = openai.Embedding.create(
        deployment_id=os.getenv("TEXT_MODEL"), input=text
    )
    return response["data"][0]["embedding"]


if __name__ == "__main__":
    load_dotenv()
    import openai
    from openai.embeddings_utils import cosine_similarity

    with open("transcript.txt", "r") as file:
        transcript = file.read()

        conn = connect_pgl()
        try:
            cursor = conn.cursor()

            cursor.execute("CREATE EXTENSION IF NOT EXISTS vector")
            cursor.execute('CREATE EXTENSION IF NOT EXISTS "uuid-ossp"')
            cursor.execute("DROP TABLE IF EXISTS tbl_embeddings;")
            cursor.execute(
                """CREATE TABLE tbl_embeddings
                (id uuid, text text not null, embedding vector(1536),
                PRIMARY KEY (id))"""
            )
            conn.commit()

	    # insert embedding into table
            cursor.execute(
                """INSERT INTO tbl_embeddings (id, text, embedding)
                VALUES (uuid_generate_v4(), %s, %s)""",
                (transcript, get_embedding(transcript)),
            )
            conn.commit()

            txt = "hospital-based tumor board"  # some random search term
            query_embedding = get_embedding(txt)

    	    # <=> for cosine similiarity
            cursor.execute(
                """SELECT text, embedding <=> %s::vector AS similarity_score
                FROM tbl_embeddings
                ORDER BY embedding <=> %s::vector LIMIT 1;""",
                (query_embedding, query_embedding),
            )
            out = cursor.fetchall()
            # just to print out the score
	    print(out[0][1])
        finally:
            cursor.close()
            conn.close()




Comments

Popular posts from this blog

OpenAI: Functions Feature in 2023-07-01-preview API version

Happy New Year, 2024 from DALL-E