What is the advantage to using bloom filters?

Algorithm, Data Structures, Bloom Filter

Algorithm Problem Overview


I am reading up on bloom filters and they just seem silly. Anything you can accomplish with a bloom filter, you could seemingly accomplish in less space and more efficiently using a single hash function rather than multiple. Why would you use a bloom filter, and how is it useful?

Algorithm Solutions


Solution 1 - Algorithm

Alex has explained it pretty well. For those who still haven't quite grasped it, hopefully this example will help:

Let's say I work for Google, on the Chrome team, and I want to add a feature to the browser that notifies the user if the URL they have entered is malicious. I have a dataset of about 1 million malicious URLs, in a file of around 25 MB. Since that is quite big (compared to the size of the browser itself), I store this data on a remote server.

Case 1: I use a hash function with a hash table. I decide on an efficient hash function and run all 1 million URLs through it to get hash keys. I then make a hash table (an array), where the hash key gives me the index at which to place that URL. Once I have hashed and filled the table, it stores all 1 million URLs along with their keys, so its size is at least 25 MB, and because of that size it has to live on the remote server. When a user enters a URL in the address bar, I need to check whether it is malicious. I run the URL through the hash function (the browser itself can do this) and get a hash key for it. I then have to make a request to my remote server with that hash key, to check whether the URL stored in my hash table under that key is the same as what the user entered. If yes, it is malicious; if no, it is not. Thus every time the user enters a URL, a request to the remote server has to be made. This would take a lot of time and make my browser slow.

Case 2: I use a bloom filter. The entire list of 1 million URLs is run through the bloom filter using multiple hash functions, and the corresponding positions in a large array of 0s are set to 1. Say we want a false positive rate of 1%: using a bloom filter calculator (http://hur.st/bloomfilter?n=1000000&p=0.01), the required bloom filter size comes out to only 1.13 MB. This small size is expected: even though the array is large, we are only storing 1s and 0s, not the URLs themselves as in the hash table. The array can be treated as a bit array; since we have only the two values 1 and 0, we can set individual bits instead of bytes, which reduces the space taken by a factor of 8. This 1.13 MB bloom filter, thanks to its small size, can be stored in the web browser itself!

When a user enters a URL, we simply apply the required hash functions (in the browser itself) and check all the corresponding positions in the bloom filter (which is stored in the browser). A value of 0 in any of the positions tells us this URL is DEFINITELY NOT in the list of malicious URLs, and the user can proceed freely; we did not make a call to the server, and hence saved time. A value of 1 in all positions tells us the URL MIGHT be in the list of malicious URLs. In those cases we make a call to the remote server, where we can use some other hash function with a hash table, as in the first case, to check whether the URL is actually present. Since most URLs are not likely to be malicious, the small bloom filter in the browser figures that out and saves time by avoiding calls to the remote server. Only when the bloom filter tells us that the URL MIGHT be malicious do we make a call to the server, and that 'MIGHT' is wrong only about 1% of the time.

So by using a small bloom filter in the browser, we have saved a lot of time as we do not need to make server calls for every URL entered.
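Case 2 can be sketched in a few lines of Python. This is a hedged illustration, not Chrome's actual implementation: the `BloomFilter` class, the example URLs, and the in-memory fallback standing in for the server-side check are all assumptions made for the demo.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch: a bit array plus k salted hashes."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)  # m bits, rounded up to whole bytes

    def _positions(self, item):
        # Derive k bit positions by salting one strong hash with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # Any 0 bit -> definitely absent; all 1s -> possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Hypothetical list of malicious URLs (in reality this lives on the server).
bad_urls = ["http://evil.example/a", "http://evil.example/b"]

bf = BloomFilter(m_bits=10_000, k=7)
for url in bad_urls:
    bf.add(url)

def is_malicious(url):
    if not bf.might_contain(url):
        return False          # definitely safe: no server call needed
    return url in bad_urls    # possible hit: fall back to the authoritative check

print(is_malicious("http://good.example/"))   # False, decided locally
print(is_malicious("http://evil.example/a"))  # True, confirmed by the fallback
```

Deriving the k positions by salting a single cryptographic hash is just one common way to get multiple hash functions; real implementations often use cheaper non-cryptographic hashes.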

We can see that a hash table with a single hash function serves a different purpose altogether than a bloom filter. Hopefully this clears your doubts :)

edit:

I have implemented a bloom filter for the task of malicious URL testing in Python. The code can be found here - https://github.com/tarunsharma1/Bloom-Filter The code is very simple to understand and a detailed description is provided in the readme file.

Solution 2 - Algorithm

From Wikipedia:

> Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.

Pretty clear to me.

A bloom filter doesn't store the elements themselves; that is the crucial point. You don't use a bloom filter to test whether an element is present, you use it to test whether it's certainly not present, since it guarantees no false negatives. This lets you skip extra work for elements that don't exist in the set (such as disk IO to look them up).

And all in significantly less space than something like a hash table (which, for large data sets, is likely to be partially on disk). You may still use a bloom filter in conjunction with a structure like a hash table, once you're certain the element has a chance of being present.

So an example usage pattern might be:

You've got a lot of data on disk. You decide on the error bound you want (e.g. 1%); together with the expected number of elements n, that prescribes the filter size m. The optimal k is then determined (from the formula given in the article). You populate your filter from this disk-bound data once.
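The sizing step follows the standard formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2. A small sketch (the function name is mine):

```python
import math

def bloom_size(n, p):
    """Return (m_bits, k) for n expected elements and false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits in the filter
    k = round((m / n) * math.log(2))                      # optimal hash count
    return m, k

m, k = bloom_size(n=1_000_000, p=0.01)
print(m, k)  # about 9.6 million bits (~1.2 MB) and k = 7
```

Note this reproduces the numbers from the examples above: roughly 9.6 bits per element at a 1% error rate, matching the ~1.13 MB filter for 1 million URLs.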

Now you have the filter in RAM. When you need to process some element, you query the filter to see whether it stands a chance of existing in your data set. If not, no extra work is done: no disk reads, etc. (which you would have to do with a hash table or a tree).

Otherwise, if the filter says "Yes, it's in there", there's a 1% chance that it's wrong, so you do the necessary work to find out. 99% of the time, it really will be there, so the work was not for naught.

Solution 3 - Algorithm

I will start by explaining what a bloom filter is, what it can and can't do, and why we need it; then give an intuitive description of how it works; and then give some examples of when bloom filters can be useful.

So a standard bloom filter is a probabilistic data structure that can*:


  • add element to a set
  • check if an element is in the set, answering either definitely not in the set or possibly in the set

This possibly in the set is exactly why it is called probabilistic. In smarter words: false positives are possible (it can falsely think an element is in the set), but false negatives are impossible.

But it can't *:

  • remove an item from the set
  • give you a list of all elements that are currently in your set

*This set of can/can't is for a basic bloom filter. Because it is a useful data structure created a long time ago, people have found ways to augment it with other useful features.


But wait a minute: we already know a data structure that can answer all of this without the vague 'possibly', and without the limitations (no removal, no listing of elements). It is called a set. And here comes the main advantage of a bloom filter: it is space-efficient and its space is constant.

This means that no matter how many elements we store there, the space stays the same. Yes, a bloom filter with 10^6 elements will take the same amount of space as a bloom filter with 10^20 elements (a useless bloom filter at that point) and the same space as a bloom filter with 0 elements. So how much space will it take? That is up to you to decide, but there is a trade-off: the more elements you store, the more uncertain your possibly in the set answer becomes.

Another cool consequence of this: when you save data to a set, you have to actually store the data itself, so storing a 27-character string takes at least 27 bytes. But with a 1% error rate and an optimal value of k**, a bloom filter needs only ~9.6 bits (< 2 bytes) per element, whether it is a short int or a huge wall of text.

Another property is that all operations take constant time, which is not the same as the amortized constant time of hash sets (remember that with collisions a hash set can deteriorate to O(n) per operation).

**k is the number of hash functions used by the bloom filter


I will not describe in detail how bloom filters work (the Wikipedia article does a very good job of explaining everything); here I will just briefly cover the basics.

  • you initiate an empty bit array of length m
  • you select k different hash functions (the more independent, the better)
  • to add an element, you calculate all k hashes of the value and set the corresponding bits to 1
  • to check whether an element exists, you also calculate all k hashes; if at least one of the corresponding bits is not set, the element is surely not in the set. Otherwise it may be in the set.

Even this description is enough to understand why we can't be sure an element is present: its bits may all have been set by various other values.
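The steps above can be sketched directly. The deliberately tiny bit array below (m = 16, an assumption chosen purely for the demo) makes false positives easy to observe:

```python
import hashlib

M, K = 16, 2  # tiny on purpose, so bit collisions are very likely

def positions(item):
    # k salted hashes -> k bit positions in [0, M)
    return [int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8],
                           "big") % M
            for i in range(K)]

bits = [0] * M

def add(item):
    for p in positions(item):
        bits[p] = 1

def might_contain(item):
    # Any 0 bit -> definitely absent; all 1s -> possibly present.
    return all(bits[p] for p in positions(item))

for word in ["apple", "banana", "cherry", "date", "elderberry"]:
    add(word)

assert might_contain("banana")  # an added element is never reported absent

# With only 16 bits, some never-added words collide on all K positions:
false_positives = [w for w in (f"word{i}" for i in range(1000))
                   if might_contain(w)]
print(len(false_positives), "false positives out of 1000 lookups")
```

A real filter would size m and k for the expected number of elements, driving the false-positive count down to the chosen error rate.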


So when can bloom filters be useful? The short answer: anywhere false positives are acceptable and you want to check whether something is in a set. Even where false positives are not acceptable, a bloom filter can serve as a first line of defense to rule out expensive calls to a verifier.

Here is a list of more concrete descriptions:

  • a standard example of malicious websites and a browser is described in almost any place where people talk about bloom filters
  • weak-password checks: instead of shipping a huge set of all known weak passwords, you can check whether a password is surely not weak with a far smaller bloom filter
  • if you have a list of articles and a list of users, you can use a bloom filter to show users articles they have not read. The interesting thing is that a single filter suffices (you check whether the combination user_id + article_id is present)
  • bitcoin uses bloom filters for wallet synchronization
  • Akamai's web servers use Bloom filters to prevent "one-hit-wonders" from being stored in their disk caches. One-hit-wonders are web objects requested by users just once, something Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object, and caching the object only on that second request, prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates (taken from the examples in the Bloom filter article on Wikipedia)

Solution 4 - Algorithm

Bloom filters are quite useful in bioinformatics. They can be more space-efficient than a regular hash, especially when the strings you are working with can be hundreds of millions of letters long over a very small alphabet, i.e. {A,G,T,C}. They are usually used to assess whether a certain k-mer is present or absent in a genome. There's an example of one used for something relevant here.

EDIT:

The multiple hash functions are used to minimize false positives. The hope is that, across all k hash functions, each value will have a unique signature in the bit array compared to every other possible value. False positives do still exist, but they can be kept at a manageable level. With this technique you hash elements independently of their size; when you look one up, you apply each hash function and check that the corresponding bits are all 1.

Compare this to a dense table over the human genome's k-mers, where an increase in the length of the element increases the table size exponentially (the table has 4^k slots). This assumes you encode the elements using 2 bits per letter.
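A rough back-of-the-envelope comparison, with illustrative numbers (k = 20 and ~3 billion distinct k-mers, roughly the scale of the human genome), shows why the dense 4^k table becomes impractical while the bloom filter does not:

```python
import math

K = 20                  # k-mer length (illustrative)
n = 3_000_000_000       # distinct k-mers, roughly genome scale (illustrative)
p = 0.01                # target false-positive rate

dense_slots = 4 ** K    # one slot per possible k-mer
bloom_bits = math.ceil(-n * math.log(p) / (math.log(2) ** 2))

print(f"dense table: {dense_slots:,} slots")              # ~1.1e12 slots
print(f"bloom filter: {bloom_bits / 8 / 2**30:.1f} GiB")  # only a few GiB
```

Even at one bit per slot, the dense table needs 2^40 bits (128 GiB), while the bloom filter at ~9.6 bits per stored k-mer fits in a few GiB.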

Solution 5 - Algorithm

If a Bloom filter reports that an item is a member of the set, there's a certain probability of a false positive. If only a single hash function were used to indicate membership, the probability of a false positive would be higher than with multiple hash functions.
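This can be made concrete with the standard approximation p ≈ (1 - e^(-k·n/m))^k. At the ~9.6 bits per element quoted in Solution 2, a single hash function gives roughly a 10% false-positive rate, while k = 7 brings it down to about 1%:

```python
import math

def fp_rate(k, bits_per_element):
    # Standard approximation: p ≈ (1 - e^(-k / (m/n)))^k
    return (1 - math.exp(-k / bits_per_element)) ** k

print(f"k=1: {fp_rate(1, 9.6):.3f}")  # ~0.099 (about 10%)
print(f"k=7: {fp_rate(7, 9.6):.3f}")  # ~0.010 (about 1%)
```

More hash functions are not always better, though: past the optimum k (about (m/n)·ln 2), extra hashes set so many bits that the false-positive rate rises again.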

Solution 6 - Algorithm

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It is called probabilistic because, while a negative answer is 100% accurate, there can be false positives: the answer "yes, this element is present in the data set" is not 100% accurate.

Why a Bloom filter is memory-efficient:

Basic Bloom filters don't store the data itself; they just answer the question of whether an element is a member of a set. To implement a bloom filter we need a bit array, which you can think of as a binary number that starts out as all zeroes. (I think the size should be about 2x the size of the data set.)

Then, to add an element, you apply k hash functions (k is determined when you set up the bloom filter, and you store it in memory along with the filter). Each hash function outputs an index between 0 and length-1, and you set the bit at that index to 1.


Eventually, you can read this binary representation as a base-10 number, say 1000003232232. That number is the bloom filter, and you store it in memory, consuming only however many bytes it takes to hold.

Bloom filters are used for caching, but not only there. Once you know some applications of bloom filters, you will see how helpful they are:

Routers

Modern routers have limited space and, with the volume of packets they process per second, they need extremely fast algorithms. They are thus a perfect candidate for Bloom filters, for all those operations that can cope with a small error rate. Besides caching, routers often employ Bloom filters to keep track of forbidden IPs and to maintain statistics that will be used to reveal DoS attacks.

Crawlers

Crawlers are automated software agents scanning a network and looking for content, parsing, and indexing anything they find. When a crawler finds links in a page or document, it is usually programmed to follow them and recursively crawl the link’s destination.

Crawlers ignore links created using tags with the attribute rel="nofollow".

It is actually recommended that you mark in this way any anchor linking to an action with side effects; otherwise search engines' crawlers that don't respect this policy can cause unpredictable behavior.

What can happen is that if you write your own crawler and you are not careful, it might end up in an endless loop between two or more pages with mutual links (or a chain of links) to each other. To avoid such loops, crawlers need to keep track of the pages they have already visited.

Bloom filters are a great fit here because they record visited URLs compactly and perform both the check and the insertion in constant time.

IO Fetcher

Bloom filter-based caching helps in reducing the unnecessary fetching/storage of expensive IO resources. The mechanism is the same as with crawling: the operation is only performed when we have a “miss,” while “hits” usually trigger a more in-depth comparison (for instance, on a hit, retrieving from disk just the first few lines or the first block of a document, and comparing them).

Spell Checker

Simpler versions of spell checkers used to implement Bloom filters as dictionaries. Today, however, spell checkers mostly take advantage of tries: these data structures provide good performance on text searches without false positives.

Distributed databases and file systems

Cassandra uses Bloom filters for index scans to determine whether an SSTable has data for a particular row. Likewise, Apache HBase uses Bloom filters as an efficient mechanism to test whether a StoreFile contains a specific row or row-col cell. This in turn boosts the overall read speed, by filtering out unnecessary disk reads of HFile blocks that don’t contain a particular row or row-column.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | headache | View Question on Stackoverflow
Solution 1 - Algorithm | Tarun | View Answer on Stackoverflow
Solution 2 - Algorithm | Alex Budovski | View Answer on Stackoverflow
Solution 3 - Algorithm | Salvador Dali | View Answer on Stackoverflow
Solution 4 - Algorithm | GWW | View Answer on Stackoverflow
Solution 5 - Algorithm | Michael Burr | View Answer on Stackoverflow
Solution 6 - Algorithm | Yilmaz | View Answer on Stackoverflow