Getting Transcripts from a Video File (mp4)

Images by cottonbro from pexels.com

In this post, we share some Python code for getting transcripts from video files with the Microsoft Speech Service. Our input files are mp4 videos, and the Speech Service accepts only audio, so we first convert each file to the Waveform (wav) audio format.

1. Azure Cognitive Service

If you do not have one, create an Azure Cognitive Services account at https://azure.microsoft.com/en-us/free/cognitive-services/. Take note of the account's subscription key and region.

2. Install Dependencies

2.1 moviepy

This library lets us convert an mp4 file to the wav file format.

pip install moviepy

2.2 azure-cognitiveservices-speech

pip install azure-cognitiveservices-speech

3. Configuration

Export these environment variables.

export SPX_SUBSCRIPTION_KEY=<the subscription key of the cognitive service>
export SPX_REGION=<the region in which the cognitive service resides>
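If either variable is missing, os.getenv returns None and the failure only shows up later, inside the Speech SDK. A minimal sketch of an early check follows; the helper name require_env is hypothetical and not part of the original script.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper: it only wraps os.environ and is not part of any SDK.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage in spx_recognize.py:
#   subscription_key = require_env("SPX_SUBSCRIPTION_KEY")
#   region = require_env("SPX_REGION")
```

With this in place, a forgotten export stops the script with a message that names the missing variable.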

4. Python Code

Save the following to a file named spx_recognize.py. The script sets file_path="/Users/joe/video/sample.mp4"; change it to point to an mp4 file on your machine.

import os
import tempfile

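try:
    import moviepy.editor as mp  # moviepy 1.x
except ImportError:
    import moviepy as mp  # moviepy 2.x no longer ships the editor module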
import azure.cognitiveservices.speech as speechsdk
import azure.cognitiveservices.speech.languageconfig as spxlangconfig


def convert_mp4_wav(file_path: str):
    """Convert mp4 to audio wav file in temporary folder.

    Args:
        file_path (str): path to the mp4 file.

    Returns:
        str: path to the newly created wav file.
    """
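    file_name = os.path.basename(file_path)
    # splitext also copes with file names that have no extension.
    output_file_name = os.path.splitext(file_name)[0] + ".wav"

    tmp_folder = tempfile.gettempdir()
    output_path = os.path.join(tmp_folder, output_file_name)
    clip = mp.VideoFileClip(file_path)
    try:
        if clip.audio is None:
            raise ValueError(f"No audio track found in {file_path}")
        clip.audio.write_audiofile(output_path)
    finally:
        # Release the file reader that moviepy keeps open.
        clip.close()
    return output_path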


def recognize(
    languages: list,
    subscription_key: str,
    region: str,
    file_path: str,
):
    """Get the transcript and source audio language.

    Recognize a file and get the transcript and source audio
    language.

    Args:
        languages (list): List of possible languages.
        subscription_key (str): subscription key of Azure Speech Service
        region (str): region where the Azure Speech Service is hosted.
        file_path (str): path to the mp4 file.

    Returns:
        tuple: the recognized text and the detected language code.

    Raises:
        SystemError: when there are no results.
        SystemError: when request is cancelled because subscription key
            and/or region are incorrect.
    """
    wav_file = convert_mp4_wav(file_path)

    auto_detect_language_config = spxlangconfig.AutoDetectSourceLanguageConfig(
        languages=languages,
    )
    speech_config = speechsdk.SpeechConfig(
        subscription=subscription_key,
        region=region,
    )
    audio_input = speechsdk.AudioConfig(filename=wav_file)
    speech_recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        audio_config=audio_input,
        auto_detect_source_language_config=auto_detect_language_config,
    )

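    # recognize_once_async() returns after the first utterance it hears.
    result = speech_recognizer.recognize_once_async().get()

    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        detect_lang_result = speechsdk.AutoDetectSourceLanguageResult(result)
        return result.text, detect_lang_result.language
    if result.reason == speechsdk.ResultReason.NoMatch:
        raise SystemError(
            f"No speech could be recognized: {result.no_match_details}",
        )
    if result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        # error_details carries the service message, e.g. for a bad key or region.
        raise SystemError(
            f"Speech Recognition canceled: {cancellation_details.reason} "
            f"({cancellation_details.error_details})",
        )
    # Any other reason is unexpected; fail loudly instead of returning None.
    raise SystemError(f"Unexpected result reason: {result.reason}")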


if __name__ == "__main__":
    text, lang = recognize(
        languages=["en-US", "es-ES"],
        subscription_key=os.getenv("SPX_SUBSCRIPTION_KEY"),
        region=os.getenv("SPX_REGION"),
        file_path="/Users/joe/video/sample.mp4",
    )
    print(text)
    print(lang)

Run python spx_recognize.py, and the transcript and the detected source language are printed to the terminal.
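One caveat: recognize_once_async() recognizes a single utterance, so for a longer video the script returns only the first phrase. A possible way to collect the whole transcript is the SDK's continuous recognition mode. The sketch below is an illustration, not part of the original script: the function name transcribe_continuously and its timeout_seconds parameter are hypothetical, and it assumes recognizer is a speechsdk.SpeechRecognizer configured as in recognize() above.

```python
import threading


def transcribe_continuously(recognizer, timeout_seconds=None):
    """Collect every recognized phrase until the recognition session ends.

    Assumption: `recognizer` is an azure.cognitiveservices.speech.SpeechRecognizer
    reading from a wav file, so the session stops at the end of the file.
    """
    segments = []
    done = threading.Event()

    def on_recognized(evt):
        # Fired once per finalized phrase; skip empty results.
        if evt.result.text:
            segments.append(evt.result.text)

    def on_stopped(evt):
        # Fired when the session ends or is canceled (including end of file).
        done.set()

    recognizer.recognized.connect(on_recognized)
    recognizer.session_stopped.connect(on_stopped)
    recognizer.canceled.connect(on_stopped)

    recognizer.start_continuous_recognition()
    # Wait for the end of the session, or for the optional timeout.
    done.wait(timeout_seconds)
    recognizer.stop_continuous_recognition()
    return " ".join(segments)
```

Inside recognize(), the call to recognize_once_async().get() could then be swapped for transcribe_continuously(speech_recognizer). Reading the detected language in this mode would have to happen per phrase, inside on_recognized.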


