Storing embedding in Azure Database for PostgreSQL
Image from https://www.pexels.com/@tara-winstead/ |
There are many options to store text embeddings. In this blog, we shall look at storing embedding in Azure Database for PostgreSQL.
Installing Azure Database for PostgreSQL
Install Azure Database for PostgreSQL Flexible Server.
Once it is deployed, enable two extensions.
Install Python modules
python -m venv .venv poetry init poetry shell poetry add openai[datalib] poetry add python-dotenv poetry add psycopg2-binary
My Python version is 3.10.
Create .env
OPENAI_API_TYPE="azure" OPENAI_API_BASE="<HTTP endpoint to Azure OpenAI Service>" OPENAI_API_KEY="<secret to Azure OpenAI Service>" OPENAI_API_VERSION="2023-05-15" TEXT_MODEL="<deployment id of text embedding model>" PGHOST="<my-service>.postgres.database.azure.com" PGUSER="<user name>" PGPASSWORD="<secret>" PGPORT="5432" PGDATABASE="< database name e.g. postgres>" PGSSL="require"
For the text embedding model, I am using text-embedding-ada-002 (Version 2).
Source Code
For simplicity, we just read a file, generate embedding, and store it. And then do a query.
import os import psycopg2 from dotenv import load_dotenv def connect_pgl(): host = os.getenv("PGHOST") dbname = os.getenv("PGDATABASE") user = os.getenv("PGUSER") password = os.getenv("PGPASSWORD") sslmode = os.getenv("PGSSL") conn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format( host, user, dbname, password, sslmode ) return psycopg2.connect(conn_string) def get_embedding(text: str): response = openai.Embedding.create( deployment_id=os.getenv("TEXT_MODEL"), input=text ) return response["data"][0]["embedding"] if __name__ == "__main__": load_dotenv() import openai from openai.embeddings_utils import cosine_similarity with open("transcript.txt", "r") as file: transcript = file.read() conn = connect_pgl() try: cursor = conn.cursor() cursor.execute("CREATE EXTENSION IF NOT EXISTS vector") cursor.execute('CREATE EXTENSION IF NOT EXISTS "uuid-ossp"') cursor.execute("DROP TABLE IF EXISTS tbl_embeddings;") cursor.execute( """CREATE TABLE tbl_embeddings (id uuid, text text not null, embedding vector(1536), PRIMARY KEY (id))""" ) conn.commit() # insert embedding into table cursor.execute( """INSERT INTO tbl_embeddings (id, text, embedding) VALUES (uuid_generate_v4(), %s, %s)""", (transcript, get_embedding(transcript)), ) conn.commit() txt = "hospital-based tumor board" # some random search term query_embedding = get_embedding(txt) # <=> for cosine similiarity cursor.execute( """SELECT text, embedding <=> %s::vector AS similarity_score FROM tbl_embeddings ORDER BY embedding <=> %s::vector LIMIT 1;""", (query_embedding, query_embedding), ) out = cursor.fetchall() # just to print out the score print(out[0][1]) finally: cursor.close() conn.close()
Comments
Post a Comment