Scrape and summarize posts of a news site without an RSS feed using AI and save them to NocoDB


The news site of Colt, a telecom company, does not offer an RSS feed, so web scraping is used to extract and process the news.

The goal is to extract only the newest posts and to generate, for each of them, a summary and a set of (technical) keywords.
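The summary and keyword step is handled by an AI node in the workflow. Below is a minimal sketch of what that call looks like, assuming the standard OpenAI Chat Completions endpoint; the model name and the prompt are placeholders you would adapt to your own account and needs.

```python
# Sketch of the summarization step (in n8n this is done by an OpenAI node).
import os
import requests

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

def summarize_post(post_text: str) -> str:
    """Ask the model for a short summary plus technical keywords."""
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
        json={
            "model": "gpt-4o-mini",  # assumed model; use whatever your account offers
            "messages": [
                {
                    "role": "system",
                    "content": "Summarize the news post in 3 sentences, then list "
                               "up to 5 technical keywords, comma-separated.",
                },
                {"role": "user", "content": post_text},
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```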

Note that the news site only lists links to each news post, not the full articles. We therefore first collect the links and dates of all posts, then fetch and process only the newest ones.
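A minimal sketch of that first collection pass is shown below, assuming hypothetical CSS selectors and a hypothetical overview URL; the real selectors depend entirely on the markup of the site you scrape.

```python
# Sketch of collecting post links and dates from the news overview page.
import requests
from bs4 import BeautifulSoup

NEWS_URL = "https://www.example.com/news/"  # hypothetical overview page URL

def collect_posts() -> list[dict]:
    html = requests.get(NEWS_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for card in soup.select("article.news-card"):  # assumed selector for one post teaser
        link = card.select_one("a")                # link to the full post
        date = card.select_one("time")             # the date has its own selector
        if link is None or date is None:
            continue  # skip teasers that do not match the expected structure
        posts.append({"url": link["href"], "date": date.get_text(strip=True)})
    return posts
```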

The results are saved to a database, in this case a NocoDB table.
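In the workflow this is done with the n8n NocoDB node, so no code is needed. For reference, a sketch of the equivalent REST call is shown below, assuming NocoDB's v2 records endpoint and hypothetical instance URL, table ID, and field names; check the API documentation of your NocoDB version before relying on it.

```python
# Sketch of writing one summarized post to NocoDB via its REST API.
import os
import requests

NOCODB_URL = "https://nocodb.example.com"      # hypothetical NocoDB instance
NOCODB_TOKEN = os.environ["NOCODB_API_TOKEN"]  # API token from NocoDB
TABLE_ID = "mabcdef123"                        # hypothetical table ID

def save_post(post: dict) -> None:
    resp = requests.post(
        f"{NOCODB_URL}/api/v2/tables/{TABLE_ID}/records",
        headers={"xc-token": NOCODB_TOKEN},
        json={
            "Url": post["url"],          # field names are placeholders;
            "Date": post["date"],        # they must match your table columns
            "Summary": post["summary"],
            "Keywords": post["keywords"],
        },
        timeout=30,
    )
    resp.raise_for_status()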

This process runs once a week via a cron (schedule) trigger.

Requirements:

  • Basic understanding of CSS selectors and how to find them in the browser (usually: right click → Inspect)
  • An OpenAI API account - a regular ChatGPT subscription is not sufficient
  • A NocoDB database - though you may choose any other output target

Assumptions:

  • CSS selectors work on the news site
  • Each post has a date with its own CSS selector - i.e. the date is not part of the news content; this is what allows filtering for the newest posts, as in the sketch after this list
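A minimal sketch of that filtering step, assuming the scraped date string uses a format like "12 May 2025"; adjust the format string to whatever the site actually renders.

```python
# Sketch of keeping only posts published within the last week.
from datetime import datetime, timedelta

def newest_posts(posts: list[dict], days: int = 7) -> list[dict]:
    cutoff = datetime.now() - timedelta(days=days)
    recent = []
    for post in posts:
        try:
            published = datetime.strptime(post["date"], "%d %B %Y")  # assumed date format
        except ValueError:
            continue  # skip posts whose date cannot be parsed
        if published >= cutoff:
            recent.append(post)
    return recent
```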

"Warnings"

  • Not every site likes to be scraped, especially not at high frequency
  • Every website is structured differently, so the workflow may need several adaptations.
