Image from https://www.pexels.com/@cottonbro/
In this blog, we look at generating search scores with OpenAI similarity search. We picked a few Yelp reviews at random, added some unrelated texts, and then scored them against the query "Restaurant Reviews".
I use an Azure OpenAI service; please get one if you wish to run the code, or any standard OpenAI service will also work.
Dependencies
```
python = "^3.10"
bs4 = "^0.0.1"
openai = "^0.27.7"
langchain = "^0.0.189"
python-dotenv = "^1.0.0"
pandas = "^2.0.2"
tiktoken = "^0.4.0"
matplotlib = "^3.7.1"
plotly = "^5.14.1"
scikit-learn = "^1.2.2"
```
Environment Parameters
```
OPENAI_API_TYPE="azure"
OPENAI_API_BASE="<azure openai endpoint>"
OPENAI_API_KEY="<azure openai key>"
OPENAI_API_VERSION="<azure openai API version>"
```
I set OPENAI_API_VERSION to 2023-03-15-preview.
Source Code
```python
import time
from functools import wraps

import pandas as pd
import tiktoken
from dotenv import load_dotenv
from langchain.embeddings import OpenAIEmbeddings
from openai.embeddings_utils import cosine_similarity


def exec_time(msg: str):
    """Decorator factory: print how long the wrapped function took."""
    def timer(func):
        @wraps(func)
        def wrapper(*args, **kws):
            start_time = time.time()
            retval = func(*args, **kws)
            print(f"{msg} ends in", round(time.time() - start_time, 2), "secs")
            return retval
        return wrapper
    return timer


@exec_time("get data and load them into df")
def get_data() -> pd.DataFrame:
    # one review per line in data.txt
    with open("data.txt") as fp:
        return pd.DataFrame.from_dict({"review": fp.readlines()})


@exec_time("tokenize")
def tokenize(df: pd.DataFrame, embeddings: OpenAIEmbeddings) -> pd.DataFrame:
    # cl100k_base is the encoding used by text-embedding-ada-002
    tokenizer = tiktoken.get_encoding("cl100k_base")
    df["n_tokens"] = df["review"].apply(lambda x: len(tokenizer.encode(x)))
    # drop rows over the model's input limit; .copy() avoids SettingWithCopyWarning
    df = df[df.n_tokens < 8192].copy()
    df["ada_v2"] = df["review"].apply(lambda x: embeddings.embed_query(x))
    return df


@exec_time("get search scores")
def get_search_scores(
    embeddings: OpenAIEmbeddings, df: pd.DataFrame, query: str
) -> pd.DataFrame:
    embedding = embeddings.embed_query(query)
    df["similarities"] = df["ada_v2"].apply(lambda x: cosine_similarity(x, embedding))
    return df.sort_values("similarities", ascending=False)


if __name__ == "__main__":
    load_dotenv()
    df = get_data()
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    df_tokenized = tokenize(df=df, embeddings=embeddings)
    df_score = get_search_scores(
        embeddings=embeddings,
        df=df_tokenized,
        query="I love this food in this restaurant",
    )
    print(df_score)
```

The steps are:
- load the environment parameters so that the OpenAIEmbeddings class can pick them up
- load the sample reviews into a pandas DataFrame
- instantiate the embeddings model (text-embedding-ada-002)
- tokenize the rows in the DataFrame, filter out any over the token limit, and embed each review
- generate similarity scores against a restaurant-related query and sort them
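For context on the scoring step: `openai.embeddings_utils.cosine_similarity` is just the normalized dot product of two vectors. Below is a minimal numpy sketch of that formula, using toy 3-dimensional vectors in place of the real 1536-dimensional ada-002 embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product of the vectors divided by the product of their norms
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [0.1, 0.9, 0.2]        # toy "query" embedding
similar = [0.2, 0.8, 0.1]      # points in a similar direction
dissimilar = [0.9, 0.1, 0.0]   # points elsewhere

print(cosine_similarity(query, similar))      # close to 1.0
print(cosine_similarity(query, dissimilar))   # noticeably lower
```

Because embeddings place semantically similar texts in similar directions, a higher cosine score means the review is closer in meaning to the query.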
Results
```
get data and load them into df ends in 0.0 secs
tokenize ends in 2.2 secs
get search scores ends in 0.12 secs
                                               review  similarities
6   It's my most favourite food and restaurant. Th...      0.881386
9     Absolutely love their food. But haven't been...      0.840172
10  If I'm to keep my reviews honest, DTF received...      0.809663
7   First of all, if you can, go on a weekday and/...      0.801368
8   How would like it if the waiter put your order...      0.774455
2   This place is awesome. The online booking syst...      0.771106
4   Another awesome stop. Bought 4 Continental Con...      0.746466
3   Normally I do go to America's Tire off Hamilto...      0.738993
5   American tire fix my newly bought Tesla flat t...      0.734128
0   Booking online at TireRack and shipping my new...      0.733234
1   Had a flat tire on a 2 month old tire due to n...      0.725734
11  We are introducing embeddings, a new endpoint ...      0.693676
12  To compare the similarity of two pieces of tex...      0.681633
```

Rows 6 to 10 are restaurant reviews, rows 0 to 5 are tire store reviews, and rows 11 and 12 are texts about embeddings.
Sorted in decreasing order, the scores look correct: the restaurant reviews rank highest. The tokenize function took the most time, since it calls the embeddings API once per review.
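Once the vectors exist, the scoring and ranking step needs no API at all. Here is a self-contained sketch of the get_search_scores logic, with made-up 3-dimensional vectors standing in for the ada-002 embeddings and a small numpy helper substituting for `openai.embeddings_utils.cosine_similarity`:

```python
import numpy as np
import pandas as pd

def cosine(a, b):
    # normalized dot product, same formula as openai's cosine_similarity
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy rows: two restaurant-like vectors near the query, one tire-like vector far away
df = pd.DataFrame({
    "review": ["great food", "lovely dinner", "flat tire fixed"],
    "ada_v2": [[0.9, 0.1, 0.1], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2]],
})
query_embedding = [0.85, 0.15, 0.1]  # pretend embedding of a restaurant query

df["similarities"] = df["ada_v2"].apply(lambda v: cosine(v, query_embedding))
ranked = df.sort_values("similarities", ascending=False)
print(ranked[["review", "similarities"]])
```

As in the real run, the off-topic row sinks to the bottom of the ranking.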
