Azure OpenAI: Similarity Search
Image from https://www.pexels.com/@cottonbro/
In this blog, we look at generating search scores with OpenAI similarity search. We picked a few Yelp reviews at random, mixed in some unrelated text, and then scored how closely each entry matches a restaurant-review query.
I am using an Azure OpenAI service; you will need one (or a regular OpenAI account) if you want to run the code.
Dependencies
python = "^3.10"
bs4 = "^0.0.1"
openai = "^0.27.7"
langchain = "^0.0.189"
python-dotenv = "^1.0.0"
pandas = "^2.0.2"
tiktoken = "^0.4.0"
matplotlib = "^3.7.1"
plotly = "^5.14.1"
scikit-learn = "^1.2.2"
Environment Parameters
OPENAI_API_TYPE="azure"
OPENAI_API_BASE="<azure openai endpoint>"
OPENAI_API_KEY="<azure openai key>"
OPENAI_API_VERSION="<azure openai API version>"
I set OPENAI_API_VERSION to 2023-03-15-preview.
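Before running anything, it can help to confirm the variables were actually loaded. Here is a small sanity check, separate from the script below and purely illustrative:

import os
from dotenv import load_dotenv

load_dotenv()  # read the .env file into the process environment
for key in ("OPENAI_API_TYPE", "OPENAI_API_BASE", "OPENAI_API_KEY", "OPENAI_API_VERSION"):
    print(key, "is set" if os.getenv(key) else "is MISSING")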
Source Code
import pandas as pd
import tiktoken
import time
from functools import wraps
from langchain.embeddings import OpenAIEmbeddings
from openai.embeddings_utils import cosine_similarity
from dotenv import load_dotenv


def exec_time(msg: str):
    # Decorator factory that prints how long the wrapped function took.
    def timer(func):
        @wraps(func)
        def wrapper(*args, **kws):
            start_time = time.time()
            retval = func(*args, **kws)
            print(f"{msg} ends in ", round(time.time() - start_time, 2), "secs")
            return retval
        return wrapper
    return timer


@exec_time("get data and load them into df")
def get_data() -> pd.DataFrame:
    # data.txt holds one review per line.
    with open("data.txt") as fp:
        return pd.DataFrame.from_dict({"review": (fp.readlines())})


@exec_time("tokenize")
def tokenize(df: pd.DataFrame, embeddings: OpenAIEmbeddings) -> pd.DataFrame:
    # Count tokens with the cl100k_base encoding, drop over-long reviews,
    # then embed each remaining review.
    tokenizer = tiktoken.get_encoding("cl100k_base")
    df["n_tokens"] = df["review"].apply(lambda x: len(tokenizer.encode(x)))
    df = df[df.n_tokens < 8192]
    df["ada_v2"] = df["review"].apply(lambda x: embeddings.embed_query(x))
    return df


@exec_time("get search scores")
def get_search_scores(
    embeddings: OpenAIEmbeddings, df: pd.DataFrame, query: str
) -> pd.DataFrame:
    # Embed the query once, then score every review by cosine similarity.
    embedding = embeddings.embed_query(query)
    df["similarities"] = df["ada_v2"].apply(lambda x: cosine_similarity(x, embedding))
    return df.sort_values("similarities", ascending=False)


if __name__ == "__main__":
    load_dotenv()
    df = get_data()
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    df_tokenizer = tokenize(df=df, embeddings=embeddings)
    df_score = get_search_scores(
        embeddings=embeddings,
        df=df_tokenizer,
        query="I love this food in this resturant",
    )
    print(df_score)
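The tokenize function counts tokens with the cl100k_base encoding before embedding and drops anything at or above the 8,192-token cutoff used in the script. If you want to see what the tokenizer actually produces, here is a standalone snippet (the sample sentence is just an illustration):

import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")
sample = "I love this food in this restaurant"
tokens = tokenizer.encode(sample)
print(len(tokens), tokens[:5])  # token count and the first few token ids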
The steps are
- load the environment parameters so that the OpenAIEmbeddings class can pick them up
- load the random reviews data into a pandas data frame
- instantiate the embeddings model (text-embedding-ada-002 model)
- tokenize the rows in the pandas data frame
- generate search scores against a restaurant-related query (a sketch of the cosine similarity behind this step follows the list)
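The scoring step relies on cosine similarity: the dot product of the query embedding and a review embedding, divided by the product of their norms, so that two vectors pointing in the same direction score close to 1. A minimal sketch of the same computation with toy vectors (real text-embedding-ada-002 embeddings have 1,536 dimensions; numpy is assumed to be available since pandas depends on it):

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1.0, 0.0, 1.0], [0.9, 0.1, 0.8]))  # ~0.99, nearly the same direction
print(cosine([1.0, 0.0, 1.0], [0.0, 1.0, 0.0]))  # 0.0, orthogonal vectors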
Results
get data and load them into df ends in 0.0 secs
tokenize ends in 2.2 secs
get search scores ends in 0.12 secs
                                               review  similarities
6   It's my most favourite food and restaurant. Th...      0.881386
9     Absolutely love their food. But haven't been...      0.840172
10  If I'm to keep my reviews honest, DTF received...      0.809663
7   First of all, if you can, go on a weekday and/...      0.801368
8   How would like it if the waiter put your order...      0.774455
2   This place is awesome. The online booking syst...      0.771106
4   Another awesome stop. Bought 4 Continental Con...      0.746466
3   Normally I do go to America's Tire off Hamilto...      0.738993
5   American tire fix my newly bought Tesla flat t...      0.734128
0   Booking online at TireRack and shipping my new...      0.733234
1   Had a flat tire on a 2 month old tire due to n...      0.725734
11  We are introducing embeddings, a new endpoint ...      0.693676
12  To compare the similarity of two pieces of tex...      0.681633
Indexes 6 to 10 are restaurant reviews, 0 to 5 are tire store reviews, and 11 and 12 are texts about embeddings.
Sorted in decreasing order, the scores look right: the restaurant reviews rank highest, followed by the tire store reviews, with the embedding texts at the bottom.
The tokenize function took the most time, since it makes one embedding API call per review.
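If embedding one row at a time becomes a bottleneck on larger datasets, one option (not used in this post) is to batch the calls with OpenAIEmbeddings.embed_documents instead of calling embed_query per row. A rough sketch, assuming df and embeddings are the same objects built in the script above:

reviews = df["review"].tolist()
df["ada_v2"] = embeddings.embed_documents(reviews)  # embeds the reviews in batched requests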