How Do I Fetch All Old Items on an RSS Feed?

Rss Problem Overview


I've been experimenting with writing my own RSS reader. I can handle the "parse XML" bit. The thing I'm getting stuck on is "How do I fetch older posts?"

Most RSS feeds only list the 10-25 most recent items in their XML file. How do I get ALL the items in a feed, and not just the most recent ones?

The only solution I could find was using the "unofficial" Google Reader API, which would be something like

http://www.google.com/reader/atom/feed/http://fskrealityguide.blogspot.com/feeds/posts/default?n=1000

I don't want to make my application dependent on Google Reader.

Is there any better way? I noticed that on Blogger, I can do "?start-index=1&max-results=1000", and on WordPress I can do "?paged=5". Is there any general way to fetch an RSS feed so that it gives me everything, and not just the most recent items?

Rss Solutions


Solution 1 - Rss

RSS/Atom feeds do not provide a way to retrieve historical items. It is up to the publisher of the feed to expose them if they choose to, as in the Blogger and WordPress examples you gave above.
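For publishers that do expose paging, the two query styles from the question can be sketched as below. This is only an illustration: the helper names are made up for this example, the total entry count and page count are values you would have to discover yourself (for instance by requesting pages until one comes back empty), and a publisher may cap or ignore these parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(url, params):
    """Return `url` with `params` merged into its existing query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update({key: str(value) for key, value in params.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


def blogger_page_urls(feed_url, total, page_size=100):
    """Yield Blogger-style URLs covering `total` entries (start-index is 1-based)."""
    for start in range(1, total + 1, page_size):
        yield with_query(feed_url, {"start-index": start, "max-results": page_size})


def wordpress_page_urls(feed_url, pages):
    """Yield WordPress-style URLs for pages 1..`pages` using the `paged` parameter."""
    for page in range(1, pages + 1):
        yield with_query(feed_url, {"paged": page})


if __name__ == "__main__":
    feed = "http://fskrealityguide.blogspot.com/feeds/posts/default"
    for url in blogger_page_urls(feed, total=250):
        print(url)
```

Each generated URL still has to be fetched and parsed like any other feed; nothing here works for feeds whose publisher offers no paging at all.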

Google Reader has more items only because it stored each one when it first appeared in the feed.

There has been some discussion of this kind of thing as an extension to the Atom protocol, but I don't know if it is actually implemented anywhere.
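The extension is not named here; assuming it is RFC 5005 ("Feed Paging and Archiving"), a feed that supports it advertises older pages through feed-level link elements with rel values such as "next" or "prev-archive". A minimal sketch of reading those links from an Atom document follows. The function name and the sample feed are hypothetical, and many real feeds carry no such links.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
# Link relations defined by RFC 5005 for paged and archived feeds.
PAGING_RELS = {"first", "last", "previous", "next",
               "prev-archive", "next-archive", "current"}


def paging_links(feed_xml):
    """Return {rel: href} for the feed-level paging links of an Atom document."""
    root = ET.fromstring(feed_xml)
    links = {}
    # Only direct children of <feed> are inspected, not links inside entries.
    for link in root.findall(ATOM_NS + "link"):
        rel, href = link.get("rel"), link.get("href")
        if rel in PAGING_RELS and href:
            links[rel] = href
    return links


if __name__ == "__main__":
    sample = """<feed xmlns="http://www.w3.org/2005/Atom">
      <title>Example</title>
      <link rel="self" href="http://example.org/feed"/>
      <link rel="prev-archive" href="http://example.org/feed/2008-12"/>
    </feed>"""
    print(paging_links(sample))
```

A reader could follow "prev-archive" (or "next") repeatedly until no such link is present, collecting entries from each document along the way.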

Solution 2 - Rss

As the other replies here mentioned, a feed may not provide archival data but historical items may be available from another source.

Archive.org’s Wayback Machine has an API to access historical content, including RSS feeds (if its crawlers have downloaded them). I’ve created the web tool Backfeed, which uses this API to regenerate a feed containing the concatenated historical items. If you'd like to discuss the implementation in detail, please get in touch.
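As a rough illustration of the Wayback Machine approach (not Backfeed's actual implementation, which is not shown here), the sketch below builds a query for the Wayback CDX index and turns its JSON rows into snapshot URLs. The helper names are invented for this example, and the choice of filter and collapse parameters and of the `id_` suffix (which requests the capture without the archive's page rewriting) are assumptions about what is useful for feeds.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(feed_url):
    """Build a CDX index query listing successful, distinct captures of a feed."""
    params = {
        "url": feed_url,
        "output": "json",
        "filter": "statuscode:200",  # skip redirects and error captures
        "collapse": "digest",        # skip consecutive captures with identical content
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_rows):
    """Convert CDX JSON rows (header row first) into raw snapshot URLs."""
    if not cdx_rows:
        return []
    header, *rows = cdx_rows
    ts, original = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}id_/{row[original]}" for row in rows]


if __name__ == "__main__":
    print(cdx_query_url("http://example.org/feed"))
```

Fetching the query URL, decoding the JSON, and downloading each snapshot are left out so that the sketch makes no network calls.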

Solution 3 - Rss

In my experience with RSS, a feed is compiled from the last X items, where X is configurable. Certain feeds may have the full list, but for the sake of bandwidth most sites limit the feed to just the last few items.

The likely reason Google Reader has the old items is that it stores them on its side so users can retrieve them later.

Solution 4 - Rss

Further to what David Dean said, RSS/Atom feeds only contain what the publisher of the feed has up at that moment, so someone needs to be actively collecting this information in order to have any history. Basically, Google Reader was doing this for free, and when you interacted with it you could retrieve the stored information from Google's database servers.

Now that they have retired the service, to my knowledge you have two choices. You either have to start collecting this information from your feeds of interest yourself and store the data as XML or similar, or you could pay for the data from one of the companies that sell this kind of archived feed information.
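The first choice, collecting items yourself, could look something like the sketch below. It uses SQLite rather than XML files purely as one convenient option; the table layout, the function names, and the dictionary keys for an item are all assumptions made for this example.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) a small item archive keyed by feed URL and item guid."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS items (
               feed_url TEXT NOT NULL,
               guid     TEXT NOT NULL,
               title    TEXT,
               link     TEXT,
               pub_date TEXT,
               PRIMARY KEY (feed_url, guid)
           )"""
    )
    return db


def store_items(db, feed_url, items):
    """Insert items not seen before and return how many were new."""
    before = db.total_changes
    db.executemany(
        "INSERT OR IGNORE INTO items VALUES (?, ?, ?, ?, ?)",
        [(feed_url, it["guid"], it.get("title"), it.get("link"), it.get("pub_date"))
         for it in items],
    )
    db.commit()
    return db.total_changes - before


if __name__ == "__main__":
    db = open_archive()
    print(store_items(db, "http://example.org/feed", [{"guid": "1", "title": "First"}]))
```

Calling `store_items` after every poll means the archive keeps growing even as old items rotate out of the live feed.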

I hope this information helps somebody.

Seán

Solution 5 - Rss

Here is another potential solution, which might not have been available when the question was originally asked and shouldn't require any specific service:

  1. Find the URL of the RSS feed you want and use waybackpack to get the archived URLs for that feed.
  2. Use FeedReader or a similar library to pull down the archived RSS feed.
  3. Take the URLs from each feed and scrape them as you wish. If you're going way back in time it's possible there might be some dead links.
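Once the archived copies have been downloaded, the last step can be sketched as follows. This example uses the standard library's XML parser in place of FeedReader, assumes plain RSS 2.0 documents already loaded as strings, and uses a made-up function name. Consecutive snapshots usually overlap, so the links are deduplicated.

```python
import xml.etree.ElementTree as ET


def unique_item_links(snapshot_xmls):
    """Return item links from several RSS snapshots, deduplicated, in first-seen order."""
    links = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for xml_text in snapshot_xmls:
        root = ET.fromstring(xml_text)
        for item in root.iter("item"):
            link = (item.findtext("link") or "").strip()
            if link:
                links.setdefault(link, item.findtext("title"))
    return list(links)


if __name__ == "__main__":
    older = "<rss><channel><item><link>http://example.org/1</link></item></channel></rss>"
    newer = ("<rss><channel><item><link>http://example.org/1</link></item>"
             "<item><link>http://example.org/2</link></item></channel></rss>")
    print(unique_item_links([older, newer]))
```

The resulting list is what you would then scrape, keeping in mind that some of the older links may be dead.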

Solution 6 - Rss

All previous answers more or less rely on an existing service still having a copy of the feed, or on the feed engine being able to provide older items dynamically.

There is, however, another, admittedly proactive and rather theoretical, way to do it: let your feed reader use a caching proxy that semantically understands RSS and/or Atom feeds and caches them on a per-item basis, up to as many items as you configure.
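The per-item caching idea can be illustrated with a small in-memory structure. This is not the sfp code, only a sketch of the concept: the class name, the (item_id, payload) shape, and the eviction policy (drop the oldest-seen item once the configured limit is exceeded) are assumptions for this example.

```python
from collections import OrderedDict


class FeedItemCache:
    """Keep up to `max_items` items per feed, evicting the oldest-seen first."""

    def __init__(self, max_items=1000):
        self.max_items = max_items
        self._feeds = {}

    def update(self, feed_url, items):
        """Merge freshly fetched (item_id, payload) pairs into the cache for a feed."""
        cached = self._feeds.setdefault(feed_url, OrderedDict())
        for item_id, payload in items:
            # A known id keeps its position; only its payload is refreshed.
            cached[item_id] = payload
        while len(cached) > self.max_items:
            cached.popitem(last=False)  # drop the oldest-seen item

    def items(self, feed_url):
        """Return the cached (item_id, payload) pairs, oldest first."""
        return list(self._feeds.get(feed_url, {}).items())


if __name__ == "__main__":
    cache = FeedItemCache(max_items=3)
    # A feed that only ever exposes one item, polled once per day.
    for day in ["mon", "tue", "wed", "thu"]:
        cache.update("http://example.org/daily", [(day, f"strip for {day}")])
    print(cache.items("http://example.org/daily"))
```

A proxy built around such a cache would serve the accumulated items to the feed reader instead of the publisher's current, shorter list.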

If the feed reader doesn't poll feeds regularly, the proxy could fetch known feeds on its own schedule so as not to miss items in highly volatile feeds, such as the one from User Friendly, which has only one item and changes every day (or at least used to). Otherwise, if the feed reader crashes or loses its network connection while you are away for a few days, you might lose items in your feed reader's cache. Having the proxy fetch those feeds regularly (e.g. from a data center instead of from home, or on a server instead of a laptop) lets you run the feed reader only now and then without losing items that were posted after your last fetch but rotated out again before the next one.

I call that concept a Semantic Feed Proxy, and I've written a proof-of-concept implementation called sfp. It is not much more than a proof of concept, though, and I haven't developed it further. (So I'd be happy about hints to projects with similar ideas or purposes. :-)

Solution 7 - Rss

Why does this problem exist?

Most RSS readers can only import a feed from a live URL, which makes things harder for sites that the Wayback Machine has not indexed.

Wayback Machine feeds can be imported because the reader can poll the server for updates at regular intervals, according to its configured TTL. The reader compares the current datetime with the pubDate or lastBuildDate elements in the XML response. We can't change the machine's clock to work around this comparison, because the current datetime is fetched live.

I've outlined an alternative solution without Wayback below. Unfortunately, I have not been able to find a universal solution for all feed sources.

Alternative Solution(s)

In my experience, though, not all feeds are partial. The XML doesn't have to specify a datetime for each post, in which case the RSS reader has no datetime to filter the feed by. An example of this feed type can be found here.

This kind of reading experience is useful when chronological order is irrelevant and the content doesn't need to be sorted. It suits sites where all the content is valuable; the linked essays of Paul Graham are a good example.

  1. If the site has a generic, non-chronological feed option, subscribe to that RSS instead (the preferred option).
  2. Download the linked timestamped .rss file, strip datetimes and host the file on your own server. Note, we can implement this via an AWS Lambda.
    1. Set up a server that fetches the RSS from live.
    2. Strip the pubDate tags from the XML file on fetch.
    3. Host the modified RSS on your own server.
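The stripping step (2.2) could be done along these lines. This is a sketch under a few assumptions: the feed is plain RSS 2.0 already downloaded as a string, only item-level pubDate elements are removed, and the function name is made up. Fetching and hosting (for example in an AWS Lambda) are not shown.

```python
import xml.etree.ElementTree as ET


def strip_pub_dates(rss_xml):
    """Return the RSS document with every item-level <pubDate> removed."""
    root = ET.fromstring(rss_xml)
    # Materialise the item list first so the tree is not walked while being edited.
    for item in list(root.iter("item")):
        for pub_date in item.findall("pubDate"):
            item.remove(pub_date)
    # Note: namespaced elements may be re-serialised with generated prefixes
    # unless their namespaces are registered via ET.register_namespace.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = ("<rss version=\"2.0\"><channel><title>Essays</title>"
              "<item><title>Essay</title><link>http://example.org/essay</link>"
              "<pubDate>Mon, 01 Jan 2018 00:00:00 GMT</pubDate></item>"
              "</channel></rss>")
    print(strip_pub_dates(sample))
```

Everything other than the removed elements is passed through unchanged, so the result can be served in place of the original file.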

Note

These are suboptimal solutions because the original ordering is lost; however, I wanted to provide a potential alternative to the Wayback Machine.

In addition, some existing answers require advanced system-design workarounds or more preparatory work, and some are outdated (Google Reader has been shut down). I hope this is helpful for those who really need a complete feed list. Constructing a new RSS feed from the original RSS file is not too hard.

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | user14834 | View Question on Stack Overflow
Solution 1 - Rss | David Dean | View Answer on Stack Overflow
Solution 2 - Rss | Quinn Comendant | View Answer on Stack Overflow
Solution 3 - Rss | Rob Haupt | View Answer on Stack Overflow
Solution 4 - Rss | Seán O'Sullivan | View Answer on Stack Overflow
Solution 5 - Rss | Alex Klibisz | View Answer on Stack Overflow
Solution 6 - Rss | Axel Beckert | View Answer on Stack Overflow
Solution 7 - Rss | Pranav Kasetti | View Answer on Stack Overflow