|  | 
| Image from https://www.pexels.com/@tara-winstead/ | 
There are many options to store text embeddings. In this blog, we shall look at storing embedding in Azure Database for PostgreSQL.
Installing Azure Database for PostgreSQL
Install Azure Database for PostgreSQL Flexible Server.
Once it is deployed, enable two extensions.
Install Python modules
python -m venv .venv poetry init poetry shell poetry add openai[datalib] poetry add python-dotenv poetry add psycopg2-binary
My Python version is 3.10.
Create .env
OPENAI_API_TYPE="azure" OPENAI_API_BASE="<HTTP endpoint to Azure OpenAI Service>" OPENAI_API_KEY="<secret to Azure OpenAI Service>" OPENAI_API_VERSION="2023-05-15" TEXT_MODEL="<deployment id of text embedding model>" PGHOST="<my-service>.postgres.database.azure.com" PGUSER="<user name>" PGPASSWORD="<secret>" PGPORT="5432" PGDATABASE="< database name e.g. postgres>" PGSSL="require"
For the text embedding model, I am using text-embedding-ada-002 (Version 2).
Source Code
For simplicity, we just read a file, generate embedding, and store it. And then do a query.
import os
import psycopg2
from dotenv import load_dotenv
def connect_pgl():
    host = os.getenv("PGHOST")
    dbname = os.getenv("PGDATABASE")
    user = os.getenv("PGUSER")
    password = os.getenv("PGPASSWORD")
    sslmode = os.getenv("PGSSL")
    conn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format(
        host, user, dbname, password, sslmode
    )
    return psycopg2.connect(conn_string)
def get_embedding(text: str):
    response = openai.Embedding.create(
        deployment_id=os.getenv("TEXT_MODEL"), input=text
    )
    return response["data"][0]["embedding"]
if __name__ == "__main__":
    load_dotenv()
    import openai
    from openai.embeddings_utils import cosine_similarity
    with open("transcript.txt", "r") as file:
        transcript = file.read()
        conn = connect_pgl()
        try:
            cursor = conn.cursor()
            cursor.execute("CREATE EXTENSION IF NOT EXISTS vector")
            cursor.execute('CREATE EXTENSION IF NOT EXISTS "uuid-ossp"')
            cursor.execute("DROP TABLE IF EXISTS tbl_embeddings;")
            cursor.execute(
                """CREATE TABLE tbl_embeddings
                (id uuid, text text not null, embedding vector(1536),
                PRIMARY KEY (id))"""
            )
            conn.commit()
	    # insert embedding into table
            cursor.execute(
                """INSERT INTO tbl_embeddings (id, text, embedding)
                VALUES (uuid_generate_v4(), %s, %s)""",
                (transcript, get_embedding(transcript)),
            )
            conn.commit()
            txt = "hospital-based tumor board"  # some random search term
            query_embedding = get_embedding(txt)
    	    # <=> for cosine similiarity
            cursor.execute(
                """SELECT text, embedding <=> %s::vector AS similarity_score
                FROM tbl_embeddings
                ORDER BY embedding <=> %s::vector LIMIT 1;""",
                (query_embedding, query_embedding),
            )
            out = cursor.fetchall()
            # just to print out the score
	    print(out[0][1])
        finally:
            cursor.close()
            conn.close()


Comments
Post a Comment