Azure OpenAI: Similarity Search

Image from https://www.pexels.com/@cottonbro/

In this blog, we generate search scores with OpenAI embeddings and cosine similarity. We picked a few Yelp reviews at random, added some unrelated texts, and then scored every text against a restaurant-themed query.

I use an Azure OpenAI service, so you need one to run the code as written. Any other OpenAI service also works if you adjust the environment parameters.

Dependencies

python = "^3.10"
bs4 = "^0.0.1"
openai = "^0.27.7"
langchain = "^0.0.189"
python-dotenv = "^1.0.0"
pandas = "^2.0.2"
tiktoken = "^0.4.0"
matplotlib = "^3.7.1"
plotly = "^5.14.1"
scikit-learn = "^1.2.2"

Environment Parameters

OPENAI_API_TYPE="azure"
OPENAI_API_BASE="<azure openai endpoint>"
OPENAI_API_KEY="<azure openai key>"
OPENAI_API_VERSION="<azure openai API version>"

I set OPENAI_API_VERSION to 2023-03-15-preview.
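
A missing variable usually shows up only as an authentication or routing error at request time, so it can help to check the four parameters up front. Below is a minimal sketch using only the standard library; `missing_openai_settings` is a hypothetical helper of mine, not part of openai or langchain.

```python
import os

# The four parameters listed above.
REQUIRED_VARS = (
    "OPENAI_API_TYPE",
    "OPENAI_API_BASE",
    "OPENAI_API_KEY",
    "OPENAI_API_VERSION",
)


def missing_openai_settings(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling it right after load_dotenv() and raising an error when the returned list is not empty gives a clearer message than a failed API call.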

Source Code

import pandas as pd
import tiktoken
import time

from functools import wraps

from langchain.embeddings import OpenAIEmbeddings
from openai.embeddings_utils import cosine_similarity
from dotenv import load_dotenv


def exec_time(msg: str):
    def timer(func):
        @wraps(func)
        def wrapper(*args, **kws):
            start_time = time.time()
            retval = func(*args, **kws)
            print(f"{msg} ends in ", round(time.time() - start_time, 2), "secs")
            return retval

        return wrapper

    return timer


@exec_time("get data and load them into df")
def get_data() -> pd.DataFrame:
    with open("data.txt") as fp:
        return pd.DataFrame.from_dict({"review": (fp.readlines())})


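@exec_time("tokenize")
def tokenize(df: pd.DataFrame, embeddings: OpenAIEmbeddings) -> pd.DataFrame:
    # Count tokens per review with the encoding used by text-embedding-ada-002.
    tokenizer = tiktoken.get_encoding("cl100k_base")
    df["n_tokens"] = df["review"].apply(lambda x: len(tokenizer.encode(x)))
    # Drop reviews over the model's input limit; copy() avoids pandas'
    # SettingWithCopyWarning when the embedding column is added below.
    df = df[df.n_tokens < 8192].copy()
    # One embedding request per row.
    df["ada_v2"] = df["review"].apply(lambda x: embeddings.embed_query(x))
    return df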


@exec_time("get search scores")
def get_search_scores(embeddings: OpenAIEmbeddings, df: pd.DataFrame, query: str) -> pd.DataFrame:
    embedding = embeddings.embed_query(query)
    df["similarities"] = df["ada_v2"].apply(lambda x: cosine_similarity(x, embedding))
    return df.sort_values("similarities", ascending=False)


if __name__ == "__main__":
    load_dotenv()
    df = get_data()

    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    df_tokenizer = tokenize(df=df, embeddings=embeddings)
    df_score = get_search_scores(
        embeddings=embeddings,
        df=df_tokenizer,
        query="I love this food in this resturant",
    )

    print(df_score)

The steps are:

  1. load the environment parameters so that the OpenAIEmbeddings class can pick them up
  2. load the reviews data into a pandas data frame
  3. instantiate the embeddings model (text-embedding-ada-002)
  4. count the tokens in each row of the data frame and embed each row
  5. embed the query, score each row against it with cosine similarity, and sort by score
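
The score in the last step is the cosine of the angle between the query embedding and each review embedding. As a rough illustration of what that number means, here is a standard-library sketch; `cosine` is a name I made up, and I am assuming (not quoting) that the helper in openai.embeddings_utils computes the same quantity with NumPy.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1, so a higher score means the review is closer in meaning to the query.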

Results

get data and load them into df ends in  0.0 secs
tokenize ends in  2.2 secs
get search scores ends in  0.12 secs
                                               review  similarities
6   It's my most favourite food and restaurant. Th...      0.881386
9   Absolutely love their food.   But haven't been...      0.840172
10  If I'm to keep my reviews honest, DTF received...      0.809663
7   First of all, if you can, go on a weekday and/...      0.801368
8   How would like it if the waiter put your order...      0.774455
2   This place is awesome. The online booking syst...      0.771106
4   Another awesome stop. Bought 4 Continental Con...      0.746466
3   Normally I do go to America's Tire off Hamilto...      0.738993
5   American tire fix my newly bought Tesla flat t...      0.734128
0   Booking online at TireRack and shipping my new...      0.733234
1   Had a flat tire on a 2 month old tire due to n...      0.725734
11  We are introducing embeddings, a new endpoint ...      0.693676
12  To compare the similarity of two pieces of tex...      0.681633

Indexes 6 to 10 are restaurant reviews, 0 to 5 are tire store reviews, and 11 and 12 are texts about embeddings.

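When we sort the scores in decreasing order, the ranking looks correct: every restaurant review scores higher than every tire store review, and the two texts about embeddings score lowest.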
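
The ordering can be checked against the numbers in the table. The sketch below copies the scores by hand and groups them by the index ranges given above; `label` and `score_ranges` are hypothetical helpers written for this check only.

```python
# Scores copied from the results table, keyed by data frame index.
scores = {
    6: 0.881386, 9: 0.840172, 10: 0.809663, 7: 0.801368, 8: 0.774455,
    2: 0.771106, 4: 0.746466, 3: 0.738993, 5: 0.734128, 0: 0.733234,
    1: 0.725734, 11: 0.693676, 12: 0.681633,
}


def label(index: int) -> str:
    """Map an index to its group, using the ranges stated in the text."""
    if 6 <= index <= 10:
        return "restaurant"
    if 0 <= index <= 5:
        return "tire store"
    return "embeddings text"


def score_ranges(scores: dict[int, float]) -> dict[str, tuple[float, float]]:
    """Return the (lowest, highest) score seen for each group."""
    ranges: dict[str, tuple[float, float]] = {}
    for index, score in scores.items():
        group = label(index)
        low, high = ranges.get(group, (score, score))
        ranges[group] = (min(low, score), max(high, score))
    return ranges


ranges = score_ranges(scores)
for group, (low, high) in ranges.items():
    print(f"{group}: {low:.3f} to {high:.3f}")
```

The three ranges do not overlap, although the gap between the lowest restaurant review and the highest tire store review is small, so a fixed cut-off score would be fragile on this data.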

The tokenize function took the most time (2.2 of about 2.3 seconds) because it sends one embedding request per review.
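
One way to cut that time might be to send several reviews per request. LangChain's OpenAIEmbeddings has an embed_documents method that accepts a list of texts; the sketch below wraps any object with that method. `embed_in_batches` and the default batch size are my own assumptions, I have not timed this, and an Azure deployment may cap the number of inputs per request, so the gain is not guaranteed.

```python
from typing import Protocol, Sequence


class SupportsEmbedDocuments(Protocol):
    """Anything with an embed_documents method, e.g. OpenAIEmbeddings."""

    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...


def embed_in_batches(
    embedder: SupportsEmbedDocuments,
    texts: Sequence[str],
    batch_size: int = 16,  # assumed value; lower it if the service rejects the request
) -> list[list[float]]:
    """Embed texts in fixed-size batches, preserving the input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embedder.embed_documents(batch))
    return vectors
```

In the tokenize function this would replace the per-row apply with something like df["ada_v2"] = embed_in_batches(embeddings, df["review"].tolist()).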



