Opposite of Bloom filter?

Data StructuresBloom Filter

Data Structures Problem Overview


I'm trying to optimize a piece of software which is basically running millions of tests. These tests are generated in such a way that there can be some repetitions. Of course, I don't want to spend time running tests which I already ran if I can avoid it efficiently.

So, I'm thinking about using a Bloom filter to store the tests which have been already ran. However, the Bloom filter errs on the unsafe side for me. It gives false positives. That is, it may report that I've ran a test which I haven't. Although this could be acceptable in the scenario I'm working on, I was wondering if there's an equivalent to a Bloom filter, but erring on the opposite side, that is, only giving false negatives.

I've skimmed through the literature without any luck.

Data Structures Solutions


Solution 1 - Data Structures

Yes, a lossy hash table or a LRUCache is a data structure with fast O(1) lookup that will only give false negatives -- if you ask if "Have I run test X", it will tell you either "Yes, you definitely have", or "I can't remember".

Forgive the extremely crude pseudocode:

setup_test_table():
    create test_table( some large number of entries )
    clear each entry( test_table, NEVER )
    return test_table

has_test_been_run_before( new_test_details, test_table ):
    index = hash( test_details , test_table.length )
    old_details = test_table[index].detail
    // unconditionally overwrite old details with new details, LRU fashion.
    // perhaps some other collision resolution technique might be better.
    test_table[index].details = new_test_details
    if ( old_details === test_details ) return YES
    else if ( old_details === NEVER ) return NEVER
    else return PERHAPS    
    
main()
    test_table = setup_test_table();
    loop
        test_details = generate_random_test()
        status = has_test_been_run_before( test_details, test_table )
        case status of
           YES: do nothing;
           NEVER: run test (test_details);
           PERHAPS: if( rand()&1 ) run test (test_details);
    next loop
end.

Solution 2 - Data Structures

The exact data structure that accomplishes this task is a Direct-mapped cache, and is commonly used in CPUs.

function set_member(set, item)
    set[hash(item) % set.length] = item

function is_member(set, item)
    return set[hash(item) % set.length] == item

Solution 3 - Data Structures

Is it possible to store the tests that you did not run? This should inverse the filter's behavior.

Solution 4 - Data Structures

How about an LRUCache?

Solution 5 - Data Structures

I think you're leaving out part of the solution; to avoid false positives entirely you will still have to track which have run, and essentially use the bloom filter as a shortcut to determine the a test definitely has not been run.

That said, since you know the number of tests in advance, you can size the filter in such a way as to provide an acceptable error rate using some well-known formulae; for a 1% probability of returning a false positive you need ~9.5 bits/entry, so for one million entries 1.2 megabytes is sufficient. If you reduce the acceptable error rate to 0.1%, this only increases to 1.8 MB.

The Wikipedia article Bloom Filters gives a great analysis, and an interesting overview of the maths involved.

Solution 6 - Data Structures

  1. Use a bit set, as mentioned above. If you know the no. of tests you are going to run beforehand, you will always get correct results (present, not-present) from the data structure.
  2. Do you know what keys you will be hashing? If so, you should run an experiment to see the distribution of the keys in the BloomFilter so you can fine tune it to reproduce false positives, or what have you.
  3. You might want to checkout HyperLogLog as well.

Solution 7 - Data Structures

No and if you think about it, it wouldn't be very useful. In your case you couldn't be sure that your test run would ever stop, because if there are always 'false negatives' there will always be tests that need to be run...

I would say you just have to use a hash.

Solution 8 - Data Structures

I'm sorry I'm not much help - I don't think its possible. If test execution can't be ordered maybe use a packed format (8 tests per byte!) or a good sparse array library for storing the outcomes in memory.

Solution 9 - Data Structures

The data structure you expect does not exist. Because such data structure must be a many-to-one mapping, or say, a limited state set. There must be at least two different inputs mapping to the same internal state. So you can't tell whether both (or more) of them are in the set, only knowing at least one of such input exists.

EDIT This statement is true only when you are looking for a memory efficient data structure. If memory is unlimited, you can always get a data structure to give 100% accurate results, by storing every member item.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionabcView Question on Stackoverflow
Solution 1 - Data StructuresDavid CaryView Answer on Stackoverflow
Solution 2 - Data Structuresawdz9nldView Answer on Stackoverflow
Solution 3 - Data StructuresTobiesqueView Answer on Stackoverflow
Solution 4 - Data StructuresView Answer on Stackoverflow
Solution 5 - Data StructuresAndy LynchView Answer on Stackoverflow
Solution 6 - Data StructuresMohamed BanaView Answer on Stackoverflow
Solution 7 - Data Structuresf3lixView Answer on Stackoverflow
Solution 8 - Data StructuresEinsteinView Answer on Stackoverflow
Solution 9 - Data StructuresroamView Answer on Stackoverflow