Lecture Assistant: Enhancing Video Learning with LLMs
Note: Work done in collaboration with @LagosJanos.
Note: If you want to see all of the details, you can take a look at this Kaggle notebook.
In this blog, we present two tools that we developed to assist students in learning from video lectures. To demonstrate them, we used the Machine Learning course we are currently taking, whose lecture videos are publicly available on the website of Professor Fabio Gonzalez; for the remainder of the blog, we will work with one of these lectures, which lasts 1h35m. The two tools are:
- Auto chapters: Segments the video and generates chapters for easier information extraction.
- Question-Answering: Allows users to ask questions directly about the video content.
Video Transcription with Whisper
We use OpenAI’s Whisper model to efficiently transcribe videos:
- Download the audio file from a sample lecture.
- Load the Whisper “small.en” model for a balance of accuracy and speed.
- Transcribe the video (typically takes about 5 minutes for a 1.5-hour video).
import whisper

# "small.en" balances accuracy and speed for English-language audio.
video_path = "/kaggle/input/ml-videos/lecture.mp4"
model = whisper.load_model("small.en")
transcription = model.transcribe(video_path, verbose=False)
The transcription contains 63,125 characters, which corresponds to 5,484 tokens after tokenization. Here is a small sample of the transcription:
Okay. Okay. So last class, last two classes, we were discussing all the story of neural networks. We saw that there had been like three waves of popularity of neural networks. The first one was during the fifties, mainly. The first work on neural networks was from the forties, specifically in 1943, the paper of Matt Coludon-Pitz, then the perceptron of Rosenblatt. Then the criticism of other researchers, in particular, of Narvin Minsky, of the limitations of the perceptrons. Okay. This made the interest on neural networks to fade. Then again, during the eighties, there was a reborn interest on neural networks, thanks to the back propagation algorithm, the one that we discussed last class, that allowed us to train multilayer networks. Okay. That was developed by Romel Harhinton and Williams. Okay. And during the eighties and the nineties, the first half of the nineties, the neural networks dominated. However, other methods, more powerful methods in terms of the mathematical background
Semantic Text Segmentation
To create chapter titles, we first divide the transcription into semantically related segments, a task known as text segmentation. We use the TextTilingTokenizer from NLTK:
from nltk.tokenize import PunktSentenceTokenizer, TextTilingTokenizer

st = PunktSentenceTokenizer()
tt = TextTilingTokenizer(w=50)

# TextTiling expects paragraphs separated by blank lines, so we group the
# transcription into pseudo-paragraphs of three sentences each.
sentences = st.tokenize(transcription["text"])
paragraphs = [
    " ".join(sentences[i : i + 3])
    for i in range(0, len(sentences), 3)
]
segmented_sections = tt.tokenize("\n\n".join(paragraphs))
We end up with about 36 segments.
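Whisper also returns per-segment timestamps, which we attach to each tiled section so that every chapter gets a start time. The snippet below is a minimal sketch of this step: align_sections is a hypothetical helper based on approximate character offsets, and the notebook's actual implementation may differ. It produces the timestamped_sections used in the next section.

def align_sections(segmented_sections, whisper_segments):
    # Approximate alignment: walk the Whisper segments and assign to each
    # tiled section the start time of the segment where its text begins.
    timestamped_sections = []
    seg_idx, leftover = 0, 0
    for section in segmented_sections:
        start = whisper_segments[min(seg_idx, len(whisper_segments) - 1)]["start"]
        # Characters of this section not yet covered by Whisper segments.
        needed = len(section.replace("\n\n", " ")) - leftover
        while needed > 0 and seg_idx < len(whisper_segments):
            needed -= len(whisper_segments[seg_idx]["text"])
            seg_idx += 1
        leftover = -needed if needed < 0 else 0
        timestamped_sections.append({"start": start, "text": section})
    return timestamped_sections

timestamped_sections = align_sections(segmented_sections, transcription["segments"])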
Title Generation with Large Language Models
We first align the timestamps generated by Whisper with the segmented sections. Then, to generate a title for each section, we needed to select a suitable large language model. After exploring the Hugging Face Hub, we considered two models at the time of writing:
- Falcon-7B-Instruct: This is a decoder-only model developed by TII. It is based on Falcon-7B and has been fine-tuned on a combination of chat and instruct datasets. It is available under the Apache 2.0 license.
- Alpaca-LoRA: This is the 7B version of the LLaMA foundation model from Meta, fine-tuned on the Alpaca dataset using the Low-Rank Adaptation (LoRA) technique. It is important to note that both Alpaca-LoRA and LLaMA are intended for research purposes only.
After experimenting with both models and trying out different prompts, we ultimately decided to use the Alpaca-LoRA model. Its inference time was comparatively lower, which made it more suitable for our requirements.
We can load the model and generate the chapter titles with the following snippet:
import torch
import transformers
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

llama_model_name = "huggyllama/llama-7b"
lora_weights = "tloen/alpaca-lora-7b"

tokenizer = LlamaTokenizer.from_pretrained(llama_model_name)

# Load the LLaMA base model in half precision and let Accelerate place it
# on the available device(s).
model = LlamaForCausalLM.from_pretrained(
    llama_model_name,
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Apply the Alpaca-LoRA adapter weights on top of the base model.
model = PeftModel.from_pretrained(
    model,
    lora_weights,
)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=50,
    return_full_text=False,
    torch_dtype=torch.float16,
    device_map="auto",
)
def gen_prompt(text):
    return f"""Here's a section of transcription from a Youtube video. Your task is to create a title that appropriately describes the content presented in the transcription. The transcription is the following:\n\n{text}\n\nThe best title for the previous transcription is:\n\n"""
sample_section = timestamped_sections[0]["text"]
prompt = gen_prompt(sample_section)
title = pipeline(prompt)[0]["generated_text"]
Prompt Engineering
To arrive at good titles, we engaged in prompt engineering: we experimented with various ways of phrasing the task of creating a title for a given transcription section. After thorough testing, we found the most successful phrasing to be the following:
Here's a section of transcription from a Youtube video. Your task is to create a title that appropriately describes the content presented in the transcription. The transcription is the following:
{text}
The best title for the previous transcription is:
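With the prompt settled, producing the full chapter list is a loop over all sections. Below is a minimal sketch; the way the start times are formatted is our assumption.

import time

# Generate a title for every section and print it next to its start time.
for section in timestamped_sections:
    prompt = gen_prompt(section["text"])
    title = pipeline(prompt)[0]["generated_text"].strip()
    timestamp = time.strftime("%H:%M:%S", time.gmtime(section["start"]))
    print(f"{timestamp}  {title}")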
Results for Auto Chapter Tool
Here are some examples of the generated chapters:
========================================
Prompt length in tokens: 455.
Proposed Title : A Brief History of Neural Networks
Timestamp : 00:00:00
Inference time : 1.15 seconds.
========================================
========================================
Prompt length in tokens: 539.
Proposed Title : The DeepMind Challenge for World Go Championship: Lisa Doll vs. AlphaGo
Timestamp : 00:03:07
Inference time : 2.05 seconds.
========================================
========================================
Prompt length in tokens: 361.
Proposed Title : Gary Kasparov's Shocking Defeat to IBM's DeepBlue Computer
Timestamp : 00:07:05
Inference time : 2.02 seconds.
========================================
As we can see, the generated chapters are reasonable, and after reviewing the proposed sections, we found that the video segments indeed correspond to the generated titles.
Question-Answering
For question-answering, we use a CharacterTextSplitter from langchain to split the text into chunks:
from langchain.text_splitter import CharacterTextSplitter

# Split the transcription into overlapping chunks of roughly 3,000 characters.
text_splitter = CharacterTextSplitter(
    separator=".",
    chunk_size=3000,
    chunk_overlap=200,
)
texts = text_splitter.split_text(transcription["text"])
We then use the all-MiniLM-L6-v2 embeddings model to generate embeddings for each chunk:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
Finally, we use the Annoy library to create a vector store for efficient similarity search:
from langchain.vectorstores import Annoy
vector_store = Annoy.from_texts(texts, embeddings_model)
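To answer a question, we retrieve the chunks most similar to it from the vector store and pass them, together with the question, to the language model. The snippet below is a minimal sketch of this step; it assumes we reuse the Alpaca-LoRA pipeline loaded earlier, and the QA prompt wording is our own.

def answer_question(question, k=3):
    # Retrieve the k transcription chunks most similar to the question.
    docs = vector_store.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using the following excerpts from a lecture transcription.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return pipeline(prompt)[0]["generated_text"].strip()

print(answer_question("What is Deep Blue?"))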
Results for Question-Answering
We present some sample questions and answers to showcase the performance of our solution:
Q: What is Deep Blue?
A: Deep Blue is a computer program developed by IBM to play chess. It was the first computer program to defeat a world champion in a match.
Q: Who is Yann LeCun and what do they say about him in the video?
A: Yann LeCun is a French-American computer scientist and artificial intelligence researcher. In the video, he talks about his work in deep learning and how it has revolutionized the field of artificial intelligence.
Conclusion
Our work introduced two tools for enhancing learning from video content: automatic chapter generation and question-answering. Initial results are promising, but further evaluation is needed. We leveraged tools like Hugging Face, LangChain, and Annoy for development.
To improve performance, we plan to annotate the dataset and fine-tune the large language model. While these tools typically require GPU-accelerated hardware, some models can run on less demanding hardware using CPU-optimized implementations.
Currently, the tools only support English-language content, but similar LLMs have been fine-tuned on Spanish datasets, offering potential for multilingual support in the future.