Process YouTube Transcripts with Apify, OpenAI & Pinecone Database
YouTube Transcript Indexing Backend for Pinecone
This tutorial explains how to build the backend workflow in n8n that indexes YouTube video transcripts into a Pinecone vector database. Note: This workflow handles the processing and indexing of transcripts only; the retrieval agent (which searches these embeddings) is implemented separately.
Workflow Overview
This backend workflow performs the following tasks:
- Fetch Video Records from Airtable: Retrieves video URLs and related metadata.
- Scrape YouTube Transcripts Using Apify: Triggers an Apify actor to scrape transcripts with timestamps from each video.
- Update Airtable with Transcript Data: Stores the fetched transcript JSON back in Airtable, linked via video ID.
- Process & Chunk Transcripts: Parses the transcript JSON, converts "mm:ss" timestamps to seconds, and groups entries into meaningful chunks. Each chunk is enriched with metadata such as video title, description, start/end timestamps, and a direct URL linking to that video moment.
- Generate Embeddings & Index in Pinecone: Uses OpenAI to create vector embeddings for each transcript chunk and indexes them in Pinecone, enabling efficient semantic search later by a separate retrieval agent.
Step-by-Step Guide
Step 1: Retrieve Video Records from Airtable
- Airtable Search Node:
  - Setup: Configure the node to fetch video records (with essential fields like url and metadata) from your Airtable base.
- Loop Over Items:
  - Use a SplitInBatches node to process each video record individually.
Step 2: Scrape YouTube Transcripts Using Apify
- Trigger Apify Actor:
  - HTTP Request Node ("Apify NinjaPost"):
    - Method: POST
    - Endpoint: https://api.apify.com/v2/acts/topaz_sharingan~youtube-transcript-scraper-1/runs?token=<YOUR_TOKEN>
    - Payload Example:
      ```json
      {
        "includeTimestamps": "Yes",
        "startUrls": ["{{ $json.url }}"]
      }
      ```
    - Purpose: Initiates transcript scraping for each video URL.
- Wait for Processing:
  - Wait Node:
    - Duration: Approximately 1 minute, to give Apify time to generate the transcript.
- Retrieve Transcript Data:
  - HTTP Request Node ("Get JSON TS"):
    - Method: GET
    - Endpoint: https://api.apify.com/v2/acts/topaz_sharingan~youtube-transcript-scraper-1/runs/last/dataset/items?token=<YOUR_TOKEN>
    - Result: Returns the dataset items for the last run (see the sample item shape after this list).
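The exact JSON returned by the actor depends on its version, so treat the following shape as an assumption to verify against your own run before writing the parsing code in Steps 3 and 4:

```javascript
// Hypothetical shape of one dataset item returned by the "Get JSON TS" request.
// All field names here are assumptions; check them against your actual Apify output.
const exampleItem = {
  videoTitle: 'My n8n Tutorial',
  transcript: [
    { timestamp: '0:00', text: 'Welcome to the channel.' },
    { timestamp: '0:04', text: 'Today we are building an indexing workflow.' },
  ],
};
```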
Step 3: Update Airtable with Transcript Data
- Format Transcript Data:
  - Code Node ("Code"):
    - Task: Convert the fetched transcript JSON into a formatted string.
      ```javascript
      // Stringify the transcript so it can be stored in a long-text Airtable field.
      const jsonObject = items[0].json;
      const jsonString = JSON.stringify(jsonObject, null, 2);
      // Return an array of items, as the n8n Code node expects.
      return [{ json: { stringifiedJson: jsonString } }];
      ```
- Extract the Video ID:
  - Set Node ("Edit Fields"):
    - Expression: {{ $json.url.split('v=')[1].split('&')[0] }}
    - Note: This assumes a standard watch?v= URL; a more forgiving variant is sketched after this list.
- Update Airtable Record:
  - Airtable Update Node ("Airtable1"):
    - Updates:
      - ts: Stores the transcript string.
      - videoid: Uses the extracted video ID to match the record.
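If some of your Airtable rows contain short youtu.be links rather than full watch URLs, the Set-node expression above will fail on them. As an optional hardening (not part of the original workflow), a Code node could extract the ID from either form:

```javascript
// Optional, more forgiving video-ID extraction than the Set-node expression.
// Handles both youtube.com/watch?v=... and youtu.be/... URLs.
function extractVideoId(url) {
  const u = new URL(url);
  if (u.hostname === 'youtu.be') return u.pathname.slice(1);
  return u.searchParams.get('v');
}

// Runs once for all items in an n8n Code node.
return items.map((item) => ({
  json: { ...item.json, videoid: extractVideoId(item.json.url) },
}));
```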
Step 4: Process Transcripts into Semantic Chunks
- Retrieve Updated Records:
  - Airtable Search Node ("Airtable2"):
    - Purpose: Fetch the records that now contain transcript data.
- Parse and Chunk Transcripts:
  - Code Node ("Code4"), sketched after this list:
    - Parses the transcript JSON.
    - Converts "mm:ss" timestamps to seconds.
    - Groups transcript entries into chunks based on a 3-second gap.
    - Creates an object for each chunk that includes:
      - Text: The transcript segment.
      - Video Metadata: Video ID, title, description, published date, thumbnail.
      - Chunk Details: Start and end timestamps.
      - Direct URL: A link to the exact moment in the video (e.g., https://youtube.com/watch?v=VIDEOID&t=XXs).
- Enrich & Split Text:
  - Default Data Loader Node:
    - Attaches additional metadata (e.g., video title, description) to each chunk.
  - Recursive Character Text Splitter Node:
    - Settings: Typically 500-character chunks with a 50-character overlap.
    - Purpose: Ensures long transcript texts are broken into manageable segments for embedding.
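Here is a minimal sketch of what a Code node like "Code4" might contain, assuming the transcript entries have the hypothetical { timestamp: 'mm:ss', text } shape shown in Step 2, and that the Airtable fields are named ts, videoid, title, and description (adjust to your own column names):

```javascript
// Hypothetical implementation of the chunking logic described above.
// Assumes "mm:ss" timestamps (no hour component) and the field names noted in the lead-in.
const toSeconds = (mmss) => {
  const [m, s] = mmss.split(':').map(Number);
  return m * 60 + s;
};

const results = [];

for (const item of items) {
  const record = item.json;
  // The transcript string was stored in Airtable in Step 3.
  const entries = JSON.parse(record.ts).transcript;

  // Group consecutive entries; start a new chunk when the gap exceeds 3 seconds.
  const chunks = [];
  let chunk = null;
  for (const e of entries) {
    const start = toSeconds(e.timestamp);
    if (!chunk || start - chunk.end > 3) {
      if (chunk) chunks.push(chunk);
      chunk = { text: '', start, end: start };
    }
    chunk.text += (chunk.text ? ' ' : '') + e.text;
    chunk.end = start;
  }
  if (chunk) chunks.push(chunk);

  // Enrich each chunk with video metadata and a deep link to the exact moment.
  for (const c of chunks) {
    results.push({
      json: {
        ...c,
        videoId: record.videoid,
        title: record.title,
        description: record.description,
        url: `https://youtube.com/watch?v=${record.videoid}&t=${c.start}s`,
      },
    });
  }
}

return results;
```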
Step 5: Generate Embeddings & Index in Pinecone
- Generate Embeddings:
  - Embeddings OpenAI Node:
    - Task: Convert each transcript chunk into a vector embedding.
    - Tip: Adjust the batch size (e.g., 512) based on your data volume.
- Index in Pinecone:
  - Pinecone Vector Store Node:
    - Configuration:
      - Index: Specify your Pinecone index (e.g., "videos").
      - Namespace: Use a dedicated namespace (e.g., "transcripts").
    - Outcome: Each enriched transcript chunk is stored in Pinecone, ready for semantic retrieval by a separate retrieval agent. (A rough plain-JavaScript equivalent of these two nodes is sketched after this list.)
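Inside n8n, the two nodes above handle the API calls for you. Purely for orientation, here is roughly what they do, sketched with the official openai and @pinecone-database/pinecone JavaScript SDKs; the embedding model and the index/namespace names are assumptions matching the examples above:

```javascript
import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const pinecone = new Pinecone(); // reads PINECONE_API_KEY from the environment

// `chunks` would be the enriched chunk objects produced in Step 4.
async function indexChunks(chunks) {
  // Embed all chunk texts in one batched request.
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small', // assumed model; use whatever your node is set to
    input: chunks.map((c) => c.text),
  });

  // Upsert one vector per chunk, keeping the metadata needed for retrieval.
  const vectors = chunks.map((c, i) => ({
    id: `${c.videoId}-${c.start}`,
    values: data[i].embedding,
    metadata: { title: c.title, url: c.url, start: c.start, end: c.end },
  }));

  await pinecone.index('videos').namespace('transcripts').upsert(vectors);
}
```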
Final Thoughts
This backend workflow is dedicated to processing and indexing YouTube video transcripts so that a separate retrieval agent can perform efficient semantic searches. With this setup:
- Transcripts Are Indexed: Chunks of transcripts are enriched with metadata and stored as vector embeddings.
- Instant Topic Retrieval: A retrieval agent (implemented separately) can later query Pinecone to find the exact moment in a video where a topic is discussed, thanks to the direct URL and metadata stored with each chunk.
- Scalable & Modular: The separation between indexing and retrieval allows for easy updates and scalability.
Happy automating, and enjoy building powerful search capabilities with your YouTube content!