Set up and configure a custom RSS news feed
Deftly skirting the lines of legality, David Rutland reveals how he’s able to scrape all the news that’s good to read direct to his own server.
Back in the early 2000s, RSS was the coolest gimmick in town. It was a way to get the latest headlines from your favourite blogs without needing to wait for your dial-up connection to load the complete webpage at 14.4kbps. A short snippet of text and a headline would let you know which Geocities sites had been updated, and independent bloggers would proudly syndicate their stories to other members of their webring to scroll in an endless ticker-tape parade across the screen.
The garish displays are long gone (thankfully), but RSS is more powerful than ever, and in part three of our virtual private server (VPS) series, we’ll show you how you can use RSS to avoid trackers, blast past paywalls and ensure that your very own VPS is the only site you need to visit.
These days 65 per cent of the world digests news online and many use an aggregator built into their phones and completely under the control of someone else. Whether it’s Google, Facebook or Apple, your reading habits are valuable, and the metrics of what you choose to click can be monetised. We don’t like that idea and, for some reason, we have the feeling that you’re not entirely comfortable with it, either.
For a technology that was birthed in the previous millennium, it may surprise you how widespread the RSS protocol is. Practically every site you visit today will have an RSS feed. From the BBC news pages to the latest acquisition by Future – via Reddit and your favourite cyberpunk fanzine – most web publishers pour out a torrent of unfiltered headlines, links, and a couple of paragraphs to let readers know what to expect should they choose to click through.
These juicy titbits are usually contained in an XML file, which is usually pretty easy to find on the main site. In most cases, they’re automatically generated by the CMS or site generator, without the site owners even knowing they exist. They’re sad, neglected things, and are rarely categorised: a veritable firehose of news, opinions and, well, garbage.
The price of temptation
Should you choose to digest your headlines via RSS, the temptation to click through can be overwhelming. The snippets, intros and excerpts offer just enough to catch your interest and edge your mouse or finger closer to the link. Of course you don’t need us telling you that the internet is a dangerous place. Those same sites that seek to entice you in with witty one-liners, awesome alliteration and titillating titles contain third-party tracking code. This will stalk you across the internet, and enable bad actors to build a profile on you with more detail than you want your spouse, mother or children to be aware of.
Introducing FreshRSS
We’re a paranoid bunch here at Linux Format. We like to do our reading without someone looking over our shoulders, recording how long we looked at each article, which pictures we found particularly interesting, and then figuring out how they can use that information to sell us stuff.
You know what else we don’t like? Paywalls. Sure, it’s easy enough to pop open each news story from a soft-walled site in a new private tab (thereby setting the cookie counter to zero), but then you have to click through the dull old GDPR consent dialogues. Again.
And so we were delighted beyond belief when we discovered FreshRSS. It’s a self-hosted RSS aggregator that will pull down entire articles to your VPS (Virtual Private Server) for you to peruse at your leisure. It cuts past cookie-based access paywalls, and keeps tracking code out of your browser. Best of all, anyone who monitors what sites you connect to on your mobile device (we’re looking at you, Google) will only ever see you connecting to one, and that one is yours.
FreshRSS isn’t a perfect piece of software by any means, but it’s well suited for putting on your VPS if you want to keep nosy parkers out of your business.
If you’ve been following this tutorial series since it started, you’ll already have your VPS set up and ready to go. If not, you’ll find we have graciously hosted the first tutorial as a PDF – see the Quick Tip box (right).
We’ve set up a new subdomain for our FreshRSS instance which you can find at https://fresh.lxf.by.
We set the DNS records with our Belarusian registrar, and pointed it to the IP address assigned to us by our VPS provider.
With that out of the way it’s time to SSH in and get started gathering the day’s news!
Making a Fresh start
Setup of FreshRSS is fairly easy compared to some of the web-facing software that we’ve been configuring recently. Essentially, it comes down to having a directory in which it can live and a conf file to tell Apache where that directory is, but before we get to that point, we need to satisfy a couple of dependencies:
sudo apt-get install php-xml
sudo apt-get install php-curl
And if you think you’re likely to be reading news containing Chinese, Japanese, Korean or Russian characters, then you’ll need sudo apt install php-mbstring, too.
Download the latest version to your home directory with the following:
wget https://github.com/FreshRSS/FreshRSS/archive/master.zip
Then unzip it with unzip master.zip and a new directory named FreshRSS-master will be created. Feel free to change the name if you want – it doesn’t really matter.
Then move the directory to its new home: sudo mv FreshRSS-master /var/www/
And give ownership of it to the server:
sudo chown -R www-data:www-data /var/www/FreshRSS-master
Inside the FreshRSS directory you will find another directory, mysteriously called p, and this is the one you want to expose to the web. Make a note of its complete path, then create a conf file to help Apache find it:
sudo nano /etc/apache2/sites-available/fresh.conf
Here’s what ours looks like:
<VirtualHost *:80>
    ServerName fresh.lxf.by
    DocumentRoot /var/www/FreshRSS-master/p/
</VirtualHost>
Obviously, you’ll need to change the ServerName to the address of your new instance.
Enable the site with sudo a2ensite fresh.conf and restart Apache with sudo service apache2 restart .
Run sudo certbot to enable SSL, select redirection, and then restart Apache once more.
Visiting your domain name will enable you to continue the basic setup. You’ll be able to choose from a fistful of languages before FreshRSS runs the necessary checks to tell you whether any vital or not-so-vital components are missing.
Then it’s on to database configuration, so if you haven’t already set up a new user then now is the time to do it. Pop up a terminal and tell Maria what you need her to do.
sudo mariadb
Enter these commands to create a user called fresh on a new database called fresh and enable the new user to use the new database.
CREATE DATABASE fresh;
CREATE USER fresh IDENTIFIED BY 'secretpassword';
GRANT USAGE ON *.* TO fresh@localhost IDENTIFIED BY 'secretpassword';
GRANT ALL PRIVILEGES ON fresh.* TO fresh@localhost;
FLUSH PRIVILEGES;
quit;
Hop back over to your browser and enter the appropriate responses into the fields, choose a user name and password, and hey presto – you’re done.
Getting the Freshest news
On logging into your FreshRSS instance, you’ll see that there’s only one feed. It’s the FreshRSS changelog, and if that sort of thing floats your boat, then good for you. The rest of us are off to find some even fresher news.
The first thing is to decide what to read. What are your interests and what do you want delivered into your instance on a regular basis for you to digest at leisure?
Here at Linux Format, we’re into caravanning and we love nothing better than dragging a 1,500kg metal box across the windswept Scottish moors. When we’re not caravanning, we read and fantasise about caravanning (sorry, do I know you? – Ed).
Heading over to www.practicalcaravan.com, we didn’t spot the familiar RSS icon, and those three letters appeared nowhere on the front page. Undeterred, and noticing that it was a WordPress site, we typed ‘feed’ after the trailing slash of the base URL. Opening it in Firefox gave us an XML document – nice, but none too easy to read.
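The WordPress guess can also be checked from a terminal before touching FreshRSS. This is only a minimal sketch, assuming curl is installed; feed_url and looks_like_feed are hypothetical helper names of our own, not part of any tool.

```shell
# Build the conventional WordPress feed URL from a site's base URL.
feed_url() {
    printf '%s/feed/\n' "${1%/}"   # strip one trailing slash, then append /feed/
}

# Succeed if the document on stdin looks like an RSS or Atom feed.
looks_like_feed() {
    grep -qiE '<(rss|feed)[ >]'
}

# Example (needs network access):
#   curl -sL "$(feed_url https://www.practicalcaravan.com/)" | looks_like_feed && echo "feed found"
```

If the check fails, the site may simply not be running WordPress, or may publish its feed somewhere else.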
Back over in FreshRSS, we pressed the plus button next to the big blue ‘Subscription Management’ button, created a new category, hobbies, and then added the Practical Caravan magazine’s feed URL to that category.
In the main reading display the 20 most recent posts from the publication appeared – and to our great surprise, they were actually the full text of the articles, beautifully laid out and a pleasure to digest. This wasn’t through any hijinks or subterfuge on our part or that of our shiny new software. The feed from www.practicalcaravan.com is just configured that way – either through negligence or generosity (in which case, cheers guys!). In the interests of guiding our readers, we needed something a little more challenging.
Canine content
We’re also really into dogs and there is a 90 per cent chance of at least two four-legged friends being within touching distance of this author at any time of day. From an avalanche of search results for online canine-oriented publications we pulled The Bark (https://thebark.com).
The Bark doesn’t advertise its RSS feeds either, and it isn’t a WordPress site. But taking a peek at the page source and scouring for either ‘RSS’ or ‘XML’ gave us https://thebark.com/rss.xml as the feed URL.
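Scouring the page source by eye works, but the same hunt can be roughed out in the shell. This sketch assumes the site announces its feed with a link tag of type application/rss+xml or application/atom+xml that sits on a single line with a double-quoted href; not every site does. The name find_feed_links is a hypothetical helper of our own.

```shell
# List candidate feed URLs from an HTML page read on stdin.
# Step 1: keep only <link> tags that advertise an RSS or Atom feed.
# Step 2: pull the href attribute out of each tag.
# Step 3: strip the href="..." wrapper, leaving the bare URL.
find_feed_links() {
    grep -oiE '<link[^>]+type="application/(rss|atom)\+xml"[^>]*>' \
        | grep -oiE 'href="[^"]+"' \
        | sed -E 's/^[Hh][Rr][Ee][Ff]="//; s/"$//'
}

# Example (needs network access):
#   curl -sL https://thebark.com | find_feed_links
```

If nothing is printed, fall back to searching the source for ‘RSS’ or ‘XML’ as described above.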
Excited, we fed the URL into FreshRSS and waited for the furry fables to fill our feed. Disaster! The canine communications were limited to a mere 20 or so words. Would we ever be able to find out the ‘10 Dog Walking Tips Everyone Should Know’?
It turns out that the answer is ‘yes’. And quite easily, too. We had to resist a fist-pump.
Open up an article page on the site that you’re trying to add, and right-click the big block of text. Then hit Inspect. You’ll note that the screen splits horizontally in two: the bottom half will show the source and the top part will show the original site. Certain parts of the page will be highlighted in blue – usually the paragraph that you’re currently reading. Move your mouse up the hierarchy until the entire article body is blue, then right-click again and choose Copy, then CSS Selector.
In this particular case, the most useful selector was #main-section.
Over in your FreshRSS site, click the cog next to the feed name in the left bar and choose Manage to open up the Settings page for the current feed. About halfway down, in the Advanced section, you’ll see an entry marked ‘Article CSS selector on original website.’ Paste the CSS selector into the box and then click the eye to preview the page and check whether the selector is valid. If it works, scroll down and hit the next Submit button, then go right down to the very bottom and press Reload Articles.
When you return to the main FreshRSS interface, you’ll find the full text of every article on the site. In this example, there is a small amount of garbage in the middle of the text where an email newsletter signup should be. But that’s OK. We can live with that.
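FreshRSS retrieves the article from the VPS rather than from your browser, so a selector copied from the inspector only helps if the element is present in the HTML that the site actually sends. A rough sanity check for an id selector such as #main-section might look like this; has_element_id is a hypothetical helper of our own, and it only does a plain text search rather than parsing the page.

```shell
# Succeed if the HTML on stdin contains an element with the given id attribute.
# Rough check only: it searches for the literal text  id="NAME"  and does not parse HTML.
has_element_id() {
    grep -qF " id=\"$1\""
}

# Example (needs network access; replace the URL with a real article page):
#   curl -sL https://thebark.com/some-article | has_element_id main-section && echo "selector should work"
```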
Fresher than Fresh
Very few publications actively curate or even acknowledge the ongoing existence of their RSS feeds. As a result, they’re a mess and often include headlines that are hidden from the general site and only used for the purpose of SEO and very specific natural language queries. The headlines are generic and common to most tech sites: ‘Best VPN deals in 2021’, ‘Working VPNs for China in 2021’, ‘Black Friday microwave deals 2021’. We’re sure you get the picture. Websites need to make money, especially in 2021 – and this is how they do it.
But these headlines are stale. They’re used year after year and have substantially the same content. They’re not aimed at you, so you shouldn’t feel bad about filtering them out.
Head back to the management page of the feed that you want to filter. Near the bottom of the Advanced section, you’ll see Filter actions and a text box which prompts you to enter the criteria to mark an article as one that you’ve read.
The two most useful filters are probably intitle and inurl. These enable users to filter based on strings either in the title or the URL, respectively.
If, for instance, we wanted to avoid seeing articles with 2021 in the title, we would enter: intitle:'2021' into the box.
If we wanted to avoid all sport stories from The Daily Telegraph (sounds reasonable – Ed) we would use: inurl:'/sport/'
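To make the two filters concrete, here is a small illustration of the decision each one makes for a single article. It is not FreshRSS code, and it assumes a plain, case-sensitive substring match; matches_filter is a hypothetical name of our own.

```shell
# Illustration only: decide whether an intitle:/inurl: filter would catch one article.
# Usage: matches_filter FILTER TITLE URL
matches_filter() {
    filter=$1 title=$2 url=$3
    needle=${filter#*:}                        # text after the first colon
    needle=${needle#\'}; needle=${needle%\'}   # drop the optional single quotes
    case $filter in
        intitle:*) haystack=$title ;;
        inurl:*)   haystack=$url ;;
        *)         return 2 ;;                 # filter type not covered by this sketch
    esac
    case $haystack in
        *"$needle"*) return 0 ;;               # match: the article would be marked as read
        *)           return 1 ;;
    esac
}

matches_filter "intitle:'2021'" "Best VPN deals in 2021" "https://example.org/vpn" && echo "filtered"
```

Note that intitle:'2021' leaves alone an article whose URL contains 2021 but whose title does not.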
A complete list of filters is available on the developer site at https://freshrss.org.
And finally…
After gathering the feeds and performing the magic that enables full articles to appear on your server, FreshRSS generates a static web page containing every post up to whatever limit you’ve set. It won’t update the page or check for new articles until you tell it to by pressing the update button.
This can be a pain if you’re accessing your FreshRSS instance through a mobile app, or if you haven’t refreshed in some time.
To force the software to refresh on a schedule without user input, you need to set up a cronjob:
sudo crontab -e
Then add the new line:
40 * * * * php -f /var/www/FreshRSS-master/app/actualize_script.php
This will mean that at 40 minutes past every hour, cron will trigger a script that will update the feeds and grab fresh goodness from the ether, for you to read at leisure.
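The first field of a crontab entry is the minute, so a bare 40 fires once an hour, at 40 minutes past. To refresh more often, that field can hold a comma-separated list. The helper below is a hypothetical convenience of our own that simply prints an entry for pasting into the crontab; it does not install anything.

```shell
# Print a crontab entry that runs the FreshRSS update script at the given minutes.
# $1 = minute field (for example 40 or 0,40), $2 = path to actualize_script.php
cron_line() {
    printf '%s * * * * php -f %s\n' "$1" "$2"
}

# Twice an hour: on the hour and at 40 minutes past.
cron_line "0,40" /var/www/FreshRSS-master/app/actualize_script.php
```

The printed line can then be added with sudo crontab -e in place of the single-minute version.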
FreshRSS also offers a number of customisations to make life easier for seekers of knowledge who would prefer to grow their own walled garden, rather than to venture out and be trapped in someone else’s.
These include the ability to set your own cookies when retrieving content, for example: foo=bar; gdpr_consent=true; cookie=value
This would trick a site into thinking that you have ticked the annoying pop-up box, and prevent it from redirecting you to (yawn) yet another GDPR consent page. We’ve never had to use this, but it’s nice to know we have the option if we need it. You can also set your own user agent string so that it looks like a real person, rather than a bot, is retrieving the page, and you can even retrieve individual feeds through a proxy.
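To see roughly what these settings amount to, the same kind of request can be approximated with curl from the VPS. This is a sketch rather than a description of how FreshRSS works internally; cookie_string is a hypothetical helper of our own, and the URL and user agent in the example are placeholders.

```shell
# Join name=value pairs into the "; "-separated cookie form shown above.
cookie_string() {
    out=""
    for pair in "$@"; do
        out="${out:+$out; }$pair"   # add "; " between pairs, but not before the first
    done
    printf '%s\n' "$out"
}

# Example (needs network access):
#   curl -sL -A "Mozilla/5.0 (X11; Linux x86_64)" \
#        -b "$(cookie_string foo=bar gdpr_consent=true)" https://example.org/article
```

If the page that comes back contains the article rather than a consent form, the same cookie values are worth trying in the feed’s settings.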