How can I get the most frequent 100 numbers out of 4,000,000,000 numbers?

Tags: Java, Algorithm

Java Problem Overview


Yesterday in a coding interview I was asked how to get the most frequent 100 numbers out of 4,000,000,000 integers (may contain duplicates), for example:

813972066
908187460
365175040
120428932
908187460
504108776

The first approach that came to my mind was using HashMap:

static void printMostFrequent100Numbers() throws FileNotFoundException {
    
    // Group unique numbers, key=number, value=frequency
    Map<String, Integer> unsorted = new HashMap<>();
    try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
        while (scanner.hasNextLine()) {
            String number = scanner.nextLine();
            unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
        }
    }

    // Sort by frequency in descending order
    List<Map.Entry<String, Integer>> sorted = new LinkedList<>(unsorted.entrySet());
    sorted.sort((o1, o2) -> o2.getValue().compareTo(o1.getValue()));

    // Print first 100 numbers
    int count = 0;
    for (Map.Entry<String, Integer> entry : sorted) {
        System.out.println(entry.getKey());
        if (++count == 100) {
            return;
        }
    }
}

But it would probably throw an OutOfMemoryError for a data set of 4,000,000,000 numbers. Moreover, since 4,000,000,000 exceeds the maximum length of a Java array, let's say the numbers are in a text file and are not sorted. I assume multithreading or MapReduce would be more appropriate for a big data set?

How can the top 100 values be calculated when the data does not fit into the available memory?

Java Solutions


Solution 1 - Java

If the data is sorted, you can collect the top 100 in O(n) where n is the data's size. Because the data is sorted, the distinct values are contiguous. Counting them while traversing the data once gives you the global frequency, which is not available to you when the data is not sorted.

See the sample code below on how this can be done. There is also an implementation (in Kotlin) of the entire approach on GitHub.

Note: Sorting is not required. What is required is that distinct values are contiguous and so there is no need for ordering to be defined - we get this from sorting but perhaps there is a way of doing this more efficiently.

You can sort the data file using (external) merge sort in roughly O(n log n) by splitting the input data file into smaller files that fit into your memory, sorting and writing them out into sorted files then merging them.
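As a small illustration of the merge phase (my addition, not part of the original answer), here is a minimal Java sketch where each sorted `long[]` run stands in for a sorted temporary file; a real external sort would stream the runs from disk instead of holding them in memory:

```java
import java.util.*;

class ExternalMergeSketch {
    // k-way merge of sorted runs; each run stands in for a sorted temp file.
    // The heap holds cursors {runIndex, positionInRun}, ordered by the
    // value each cursor currently points at.
    static List<Long> merge(long[][] runs) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparingLong((int[] c) -> runs[c[0]][c[1]]));
        for (int r = 0; r < runs.length; r++) {
            if (runs[r].length > 0) heap.add(new int[]{r, 0});
        }
        List<Long> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cur = heap.poll();
            merged.add(runs[cur[0]][cur[1]]);
            if (++cur[1] < runs[cur[0]].length) heap.add(cur); // advance cursor
        }
        return merged;
    }

    public static void main(String[] args) {
        long[][] runs = {{120428932, 908187460}, {365175040, 504108776}, {813972066}};
        System.out.println(merge(runs)); // globally sorted stream
    }
}
```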



About this code sample:

  • Sorted data is represented by a long[]. Because the logic reads values one by one, it's an OK approximation of reading the data from a sorted file.

  • The OP didn't specify how multiple values with equal frequency should be treated; consequently, the code doesn't do anything beyond ensuring that the result is the top N values, in no particular order, without implying that there aren't other values with the same frequency.

import java.util.*;
import java.util.Map.Entry;

class TopN {
    private final int maxSize;
    private Map<Long, Long> countMap;

    public TopN(int maxSize) {
        this.maxSize = maxSize;
        this.countMap = new HashMap<>(maxSize);
    }

    private void addOrReplace(long value, long count) {
        if (countMap.size() < maxSize) {
            countMap.put(value, count);
        } else {
            Optional<Entry<Long, Long>> opt = countMap.entrySet().stream().min(Entry.comparingByValue());
            Entry<Long, Long> minEntry = opt.get();
            if (minEntry.getValue() < count) {
                countMap.remove(minEntry.getKey());
                countMap.put(value, count);
            }
        }
    }

    public Set<Long> get() {
        return countMap.keySet();
    }

    public void process(long[] data) {
        if (data.length == 0) {
            return; // nothing to do on empty input
        }
        long value = data[0];
        long count = 0;

        for (long current : data) {
            if (current == value) {
                ++count;
            } else {
                addOrReplace(value, count);
                value = current;
                count = 1;
            }
        }
        addOrReplace(value, count);
    }

    public static void main(String[] args) {
        long[] data = {0, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7};
        TopN topMap = new TopN(2);

        topMap.process(data);
        System.out.println(topMap.get()); // [5, 6]
    }
}

Solution 2 - Java

Integers are signed 32 bits, so if only non-negative integers occur, we are looking at 2^31 distinct values at most. An array of 2^31 bytes should stay under the maximum array size.

But that can't hold frequencies higher than 255, you would say? Yes, you're right.

So we add a hashmap for all entries that exceed the maximum value your array can hold (255 unsigned - since Java's bytes are signed, just start counting at -128). There are at most about 16 million entries in this hashmap (4 billion divided by 255), which should be feasible.


We have two data structures:

  • a large byte array, indexed by the number read (0..2^31)
  • a hashmap of (number read, frequency)

Algorithm:

while reading next number 'x'
{
    if (hashmap.contains(x))
    {
        hashmap[x]++;
    }
    else
    {
        bigarray[x]++;
        if (bigarray[x] > 250)
        {
            hashmap[x] = bigarray[x];
        }
    }
}

// when done:
// Look up top-100 in hashmap
// if not 100 yet, add more from bigarray, skipping those already taken from the hashmap

I'm not fluent in Java, so can't give a better code example.
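A scaled-down Java sketch of the idea (my addition, with a toy range of 16 values and a spill threshold of 3 standing in for 2^31 and 250, so it runs anywhere):

```java
import java.util.*;

class SpillCountSketch {
    // One byte counter per possible value; values whose count outgrows the
    // byte are promoted ("spilled") into a HashMap with a long counter.
    static Map<Integer, Long> count(int[] input, int range, int spillThreshold) {
        byte[] big = new byte[range];
        Map<Integer, Long> overflow = new HashMap<>();
        for (int x : input) {
            Long c = overflow.get(x);
            if (c != null) {
                overflow.put(x, c + 1);          // already spilled: count in the map
            } else if (++big[x] > spillThreshold) {
                overflow.put(x, (long) big[x]);  // promote hot value to the map
            }
        }
        // Top candidates live in the small overflow map; if it holds fewer
        // than 100 entries, fill up from `big`, skipping spilled values.
        return overflow;
    }

    public static void main(String[] args) {
        System.out.println(count(new int[]{5, 7, 5, 5, 5, 7, 2}, 16, 3)); // {5=4}
    }
}
```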


Note that this algorithm is single-pass, works on unsorted input, and doesn't use external pre-processing steps.

All it does is assume a maximum for the numbers read. It works if the inputs are non-negative integers, which have a maximum below 2^31. The sample input satisfies that constraint.


The algorithm above should satisfy most interviewers that ask this question. Whether you can code in Java should be established by a different question. This question is about designing data structures and efficient algorithms.

Solution 3 - Java

In pseudocode:

  1. Perform an external sort
  2. Do a pass to collect the top 100 frequencies (not which values have them)
  3. Do another pass to collect the values that have those frequencies

Assumption: There are clear winners - no ties (outside the top 100).

Time complexity: O(n log n) (approx) due to sort. Space complexity: Available memory, again due to sort.

Steps 2 and 3 are both O(n) time and O(1) space.


If there are no ties (outside the top 100), steps 2 and 3 can be combined into one pass, which wouldn’t improve the time complexity, but would improve the run time slightly.

If there are ties that would make the quantity of winners large, you couldn’t discover that and take special action (e.g., throw error or discard all ties) without two passes. You could however find the smallest 100 values from the ties with one pass.
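A possible Java sketch of steps 2 and 3 (my interpretation, with K = 2 and an in-memory array standing in for the externally sorted file):

```java
import java.util.*;

class TwoPass {
    // Pass over sorted data (step 2): keep the K highest run lengths in a min-heap.
    static int cutoff(long[] sorted, int k) {
        PriorityQueue<Integer> topFreqs = new PriorityQueue<>();
        for (int i = 0, j; i < sorted.length; i = j) {
            for (j = i; j < sorted.length && sorted[j] == sorted[i]; j++);
            int freq = j - i;
            if (topFreqs.size() < k) topFreqs.add(freq);
            else if (topFreqs.peek() < freq) { topFreqs.poll(); topFreqs.add(freq); }
        }
        return topFreqs.peek(); // the K-th highest frequency
    }

    // Second pass (step 3): emit every value whose frequency reaches the cutoff.
    static List<Long> winners(long[] sorted, int cutoff) {
        List<Long> out = new ArrayList<>();
        for (int i = 0, j; i < sorted.length; i = j) {
            for (j = i; j < sorted.length && sorted[j] == sorted[i]; j++);
            if (j - i >= cutoff) out.add(sorted[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] data = {0, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7};
        System.out.println(winners(data, cutoff(data, 2))); // [5, 6]
    }
}
```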

Solution 4 - Java

> But it probably would throw an OutOfMemory exception for the data set of 4000000000 numbers. Moreover, since 4000000000 exceeds max length of Java array, let's say numbers are in a text file and they are not sorted.

That depends on the value distribution. If you have 4E9 numbers, but the numbers are integers 1-1000, then you will end up with a map of 1000 entries. If the numbers are doubles or the value space is unrestricted, then you may have an issue.

As noted in another answer, the counting line must default missing keys to 0 (the original code defaulted to 1, counting the first occurrence twice):

unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);

I personally would use an AtomicLong as the map value; it allows incrementing the count without re-inserting entries into the HashMap.
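For example (a sketch of that idea, not code from the original answer):

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicLong;

class AtomicCount {
    // One map lookup per line: computeIfAbsent creates the counter on the
    // first hit, and later hits increment it in place with no further put().
    static Map<String, AtomicLong> count(String[] lines) {
        Map<String, AtomicLong> counts = new HashMap<>();
        for (String n : lines) {
            counts.computeIfAbsent(n, k -> new AtomicLong()).incrementAndGet();
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, AtomicLong> c = count(new String[]{"908187460", "908187460", "120428932"});
        System.out.println(c.get("908187460")); // 2
    }
}
```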

> I assume multithreading or Map Reduce would be more appropriate for big data set?
>
> What would be the most efficient solution for this problem?

This is a typical map-reduce exercise, so in theory you could use a multi-threaded or map-reduce approach. Maybe that is the goal of the exercise and you are supposed to implement multithreaded map-reduce tasks regardless of whether it is the most efficient way.

In reality you should calculate whether it is worth the effort. If you're reading the input serially (as your code does with the Scanner), then definitely not. If you can split the input files and read multiple parts in parallel, then - considering the I/O throughput - it may be worth it.

Or, if the value space is too large to fit into memory and you need to downscale the data set, you may consider a different approach.

Solution 5 - Java

Since the data set is presumably too big for memory, I'd do a hexadecimal radix sort. So the data set would get split between 16 files in each pass with as many passes as needed to get to the largest integer.

The second part would be to combine the files into one large data set.

The third part would be to read the combined file number by number and count the occurrences of each number. Save the number and its occurrence count into a two-dimensional array (the top-100 list), kept sorted by count. If the next number from the file has more occurrences than the entry in the list with the fewest occurrences, replace that entry.
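A minimal sketch of a single hexadecimal radix pass (my addition; in the described algorithm each bucket would be a temporary file on disk rather than an in-memory list):

```java
import java.util.*;

class RadixPass {
    // Distribute the input over 16 buckets by one hex digit; `shift` selects
    // the digit (0 for the lowest, then 4, 8, ... on later passes).
    static List<List<Integer>> pass(int[] input, int shift) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < 16; i++) buckets.add(new ArrayList<>());
        for (int x : input) buckets.get((x >>> shift) & 0xF).add(x);
        return buckets;
    }

    public static void main(String[] args) {
        List<List<Integer>> b = pass(new int[]{0x1A, 0x2B, 0x1C, 0x3A}, 0);
        System.out.println(b.get(0xA)); // [26, 58], i.e. 0x1A and 0x3A
    }
}
```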

Solution 6 - Java

One option is a type of binary search. Consider a binary tree where each split corresponds to a bit in a 32-bit integer. So conceptually we have a binary tree of depth 32. At each node, we can compute the count of numbers in the set that start with the bit sequence for that node. This count is an O(n) operation, so the total cost of finding our most common sequence is going to be O(n * f(n)) where the function depends on how many nodes we need to enumerate.

Let's start by considering a depth-first search. This provides a reasonable upper bound to the stack size during enumeration. A brute force search of all nodes is obviously terrible (in that case, you can ignore the tree concept entirely and just enumerate over all the integers), but we have two things that can prevent us from needing to search all nodes:

  1. If we ever reach a branch where there are 0 numbers in the set starting with that bit sequence, we can prune that branch and stop enumerating.

  2. Once we hit a terminal node, we know how many occurrences of that specific number there are. We add this to our 'top 100' list, removing the lowest if necessary. Once this list fills up, we can start pruning any branches whose total count is lower than the lowest of the 'top 100' counts.

I'm not sure what the average and worst-case performance for this would be. It would tend to perform better for sets with fewer distinct numbers and probably performs worst for sets that approach uniformly distributed, since that implies more nodes will need to be searched.

A few observations:

  1. There are at most N terminal nodes with non-zero counts, but since N is on the order of 2^32 in this specific case, that doesn't help.

  2. The total number of nodes for M leaf nodes (M = 2^32) is 2M - 1. This is still linear in M, so the worst-case running time is bounded above by O(N*M).

  3. This will perform worse than just searching all integers in some cases, but only by a linear scalar factor. Whether it performs better on average depends on the expected data. For uniformly random data sets, my intuitive guess is that you'd be able to prune enough branches once the top-100 list fills up that you would tend to require fewer than M counts, but that would need to be evaluated empirically or proven.

  4. As a practical matter, the fact that this algorithm just requires read-only access to the data set (it only ever performs a count of numbers starting with a certain bit pattern) means it is amenable to parallelization by storing the data across multiple arrays, counting the subsets in parallel, then adding the counts together. This could be a pretty substantial speedup in a practical implementation that's harder to do with an approach that requires sorting.


A concrete example of how this might execute, for a simpler set of 3-bit numbers and only finding the single most frequent. Let's say the set is '000, 001, 100, 001, 100, 010'.

  1. Count all numbers that start with '0'. This count is 4.

  2. Go deeper, count all numbers that start with '00'. This count is 3.

  3. Count all numbers that are '000'. This count is 1. This is our new most frequent.

  4. Count all numbers that are '001'. This count is 2. This is our new most frequent.

  5. Take next deep branch and count all numbers that start with '01'. This count is 1, which is less than our most frequent, so we can stop enumerating this branch.

  6. Count all numbers that start with '1'. This count is 1, which is less than our most frequent, so we can stop enumerating this branch.

  7. We're out of branches, so we're done and '001' is the most frequent.
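The walkthrough above can be sketched in Java roughly as follows (my sketch, not from the original answer; 3-bit values and the single most frequent value only):

```java
class PrefixSearch {
    static int[] data;
    static int bits;
    static int bestValue, bestCount;

    // Count the numbers whose top `depth` bits equal `prefix` - an O(n) scan.
    static int count(int prefix, int depth) {
        int c = 0;
        for (int x : data) if ((x >>> (bits - depth)) == prefix) c++;
        return c;
    }

    // Depth-first search with pruning: abandon any branch whose total
    // count can no longer beat the best terminal count seen so far.
    static void search(int prefix, int depth) {
        int c = count(prefix, depth);
        if (c <= bestCount) return;           // prune this branch
        if (depth == bits) {                  // terminal node: a concrete value
            bestValue = prefix;
            bestCount = c;
            return;
        }
        search(prefix << 1, depth + 1);       // next bit = 0
        search((prefix << 1) | 1, depth + 1); // next bit = 1
    }

    static int mostFrequent(int[] input, int numBits) {
        data = input;
        bits = numBits;
        bestValue = -1;
        bestCount = 0;
        search(0, 0);
        return bestValue;
    }

    public static void main(String[] args) {
        // the set from the walkthrough: 000, 001, 100, 001, 100, 010
        int[] set = {0b000, 0b001, 0b100, 0b001, 0b100, 0b010};
        System.out.println(mostFrequent(set, 3)); // 1, i.e. binary 001
    }
}
```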

Solution 7 - Java

Linux tools

That's simply done in a shell script on Linux/Mac:

sort inputfile | uniq -c | sort -nr | head -n 100

If the data is already sorted, you just use

uniq -c inputfile | sort -nr | head -n 100

File system

Another idea is to use the number as the filename and grow the file by one byte for each hit:

while read -r number;
do
  echo -n "." >> "$number"
done < inputfile

File system constraints could cause trouble with that many files, so you can create a directory tree with the first digits and store the files there.

When finished, you traverse through the tree and remember the 100 highest seen values for file size.

Database

You can use the same approach with a database, so you don't need to actually store the GB of data there (works too), just the counters (needs less space).

Interview

An interesting question would be how you handle edge cases, so what should happen if the 100th, 101st, ... number have the same frequency. Are the integers only positive?

What kind of output do they need, just the numbers or also the frequencies? Just think it through like a real task at work and ask everything you need to know to solve it. It's more about how you think and analyze a problem.

Solution 8 - Java

I have noticed there is a bug in this line.

unsorted.put(number, unsorted.getOrDefault(number, 1) + 1);

You should make the default value 0, as you are then adding 1 to it. Otherwise, a value with only 1 occurrence is recorded with a frequency of 2.

unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);

One downside that I see is that it is unnecessary to keep (and sort) the frequencies of all distinct values when only the top 100 matter.

You can use a PriorityQueue to hold only 100 values.

    Map<String, Integer> unsorted = new HashMap<>();

    // A min-heap on frequency, so the entry with the lowest frequency is at the head
    PriorityQueue<Map.Entry<String, Integer>> highestFrequentValues = new PriorityQueue<>(100,
            (o1, o2) -> o1.getValue().compareTo(o2.getValue()));

    // O(n)
    try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
        while (scanner.hasNextLine()) {
            String number = scanner.nextLine();
            unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
        }
    }

    // O(n log 100)
    for (Map.Entry<String, Integer> stringIntegerEntry : unsorted.entrySet()) {
        if (highestFrequentValues.size() < 100) {
            highestFrequentValues.add(stringIntegerEntry);
        } else if (highestFrequentValues.peek().getValue() < stringIntegerEntry.getValue()) {
            highestFrequentValues.poll(); // evict the current minimum
            highestFrequentValues.add(stringIntegerEntry);
        }
    }

    // O(n)
    for (Map.Entry<String, Integer> frequentValue : highestFrequentValues) {
        System.out.println(frequentValue.getKey());
    }

Solution 9 - Java

OK, I know that the question is about Java and algorithms and solving this problem otherwise is not the point, but I still think this solution must be posted for completeness.

Solution in sh:

 sort FILE | uniq -c | sort -nr | head -n 100

Explanation: sort | uniq -c lists only unique entries and counts the number of their occurrences in the input; sort -nr sorts the output numerically in reverse order (the lines with more occurrences on the top); head -n 100 keeps 100 top lines only. A file with 4,000,000,000 numbers up to 999999999 (as per OP) will take about ~40GB, so fits well on a disk of a single machine, so it is technically possible to use this solution.

Pro: simple, has constant and limited memory usage. Cons: sub-optimal (because of sort), consumes lots of the temporary disk space for the operation, and overall there is no doubt that a solution specifically designed for this problem will have a much better performance. The question remains (in all seriousness): in a general case, will writing (and then debugging and executing) an optimized solution take more or less time than using a sub-optimal one (as above) but available immediately? I ran the solution on a sample file with 400,000,000 lines (10x smaller) and it took about 7 minutes on my computer.


P.S. On a side note, the OP mentions that this question was asked during a programming interview. This is interesting because I think this is the kind of solution worth mentioning in that context before starting to code another program from scratch. When people say "experienced engineers are 10x faster...", I personally don't think it is because experienced engineers code faster or produce optimized algorithms off the top of their heads, but because they explore alternatives that can save time. In the context of an interview, it is an important skill to demonstrate, among others.

Solution 10 - Java

I suppose that 4 billion was chosen to be sure the problem is too large to fit in memory on current desktop machines. So why not rent a large VM from Amazon or Microsoft for the purpose? That's an answer most people don't think of, yet it is valid for real-world solutions.

The way I'd approach it is to start by binning. The range of numbers is presumably all 32-bit unsigned integers (or whatever they said). How large an array fits in RAM? Divide the range into that many equal bins and pass through the data once. Look at the distribution: is it fairly uniform, spiky, or a curve of some kind? If the first/last ranges of bins are all zeros, that gives you the true range of input values, and you can adjust the program to bin over just that range and repeat, for better accuracy.

Then depending on the distribution, decide how to proceed. In general, only the top 100 bins can possibly contain the top 100 values, so you can reconfigure with those ranges and the largest bins you can handle within that excerpted range. If the distribution is too uniform, you might get many many bins with all the same count, so drop the smaller bins even though you have many more than 100 bins remaining -- you still cut it down some.

Worst case is that all the bins come out the same and you can't cut the problem down this way - someone prepared pathological data assuming this kind of approach. So re-arrange the way you do the binning: rather than simply chopping the range into contiguous bins of equal size, use a 1:1 mapping to shuffle the values. However, for large bins this might preserve the property of being fairly uniform, so you don't want a conventional "good" hashing function.

Another approach

If binning works, and rapidly cuts down the problem, it's easy. But the data could be such that it's actually very difficult. So what's a way that always works, regardless of the data? Well, I can assume that the result exists: some 100 values will have more occurrences.

Instead of bins, pick n specific values (however many you can fit in memory). Either choose random numbers, or use the first n distinct values from your input. Count those, and copy the others to another file. That is, the values you don't have room to count get copied to a file smaller than the original.

Now you'll at least have a useful pivot value: the exact cardinality of the 100 distinct top values that you did count exactly. Well, the ones you picked might still all end up with the same count! So you only have 1 distinct cardinality in the worst case. You also know that this is not a "top" value, since there are far more than 100 of them.

Run again on your new (smaller) file, and discard counts that are smaller than the top 100 you already know. Repeat.

This reminds me of something that I might have read in Knuth's TAOCP, but scaled up for modern machine sizes.

Solution 11 - Java

I would just drop all the numbers in a database (SQLite would be my first choice) with a table like

CREATE TABLE tbl (
number INTEGER PRIMARY KEY,
counter INTEGER
)

Then for every number received, just do (MySQL syntax)

INSERT INTO tbl (number,counter) VALUES (:number,1) ON DUPLICATE KEY UPDATE counter=counter+1;

or with SQLite syntax

INSERT INTO tbl (number,counter) VALUES (:number,1) ON CONFLICT(number) DO UPDATE SET counter=counter+1;

Then when all the numbers are accounted for,

SELECT number, counter FROM tbl ORDER BY counter DESC LIMIT 100

... then I would end up with the 100 most common numbers and how often they occurred. This scheme will only break when you run out of disk space (or when you reach ~20,000,000,000,000 (20 trillion) unique numbers, at some ~281 terabytes of disk space...).

Solution 12 - Java

  1. Divide your numbers into two buckets
  2. Find top 100 in each bucket
  3. Merge those top 100 lists.

To divide, do median of medians (which can be modified to make medians of the top/bottom as well).

Each bucket has a distinct range of numbers in it. The initial median split makes 2 buckets, each with half (about) as many elements as the entire list in it.

To find the top 100 in a bucket, first determine whether the bucket is narrow (its minimum and maximum are close), which is O(1), or small (it contains few numbers), which costs O(n) time and O(n * bucket count) memory. If either is true, a simple counting pass (possibly over more than one bucket at once) solves it (you will probably have to do this more than once, as you have memory limits).

If neither is true, recurse and divide that bucket into two.

There are going to be fiddly bits with how you recurse without wasting too much time.

But the idea is that each bucket exponentially gets narrower or smaller. Narrow buckets have a minimum and maximum that is close, and small buckets have few elements.

You merge buckets so that you have enough storage to count the elements in the bucket (either width based, or volume based). Then you do a pass that counts that bucket and finds the top 100, and repeat. Each time you merge the top 100 from the scan into the previous top 100.

In-place, no sorting of the entire list needed, and devolves to simpler and more optimal strategies when the initial "bucket" is narrow or small.

Solution 13 - Java

I assume that the point of the challenge is to process this large amount of data without consuming too much memory, and avoid parsing the input too many times.

Here's an algorithm that requires two not-too-large arrays. I don't know Java well, but I am confident this can be made to run very fast in C:

Create a Count array of size 2^n to count the number of input numbers based on their n most significant bits. That will require a first scan over the input data but is really straightforward to do. I would first try with n=20 (about one million buckets).
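The first scan might look like this in Java (my sketch, with n = 4 prefix bits instead of 20 so the demo stays tiny):

```java
class BucketCount {
    // One pass over the input, counting values per top-n-bit bucket.
    static long[] countBuckets(int[] input, int nBits) {
        long[] counts = new long[1 << nBits];
        for (int x : input) counts[x >>> (32 - nBits)]++;
        return counts;
    }

    public static void main(String[] args) {
        long[] c = countBuckets(new int[]{0x12345678, 0x13572468, 0x9ABCDEF0}, 4);
        System.out.println(c[0x1]); // 2: both 0x1... values land in bucket 1
    }
}
```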

Obviously, we won't process the data one bucket at a time, as that would require reading the input a million times, instead we choose our optimal batch size B and allocate a Batch array of size B. B could be like 40M, so that we aim at reading the input about 100 times. (It all depends on available memory).

Then we iterate over the count array to group the first range of buckets so that the sum is close to, but doesn't exceed B.

For each such range, we parse the input data, look for numbers in range and copy those numbers to the batch array. Since we already know the size of each bucket, we can immediately copy them grouped per bucket, so that we only have to sort them bucket by bucket (you can repurpose the count array to store the indices for where to write the next entry). Next we count the identical items in the sorted batch array and keep track of the top 100 so far.

Proceed the next range of buckets for which the sum of counts is under size B, etc...

Optimizations:

  • Once we have a decent top 100, we can skip entire buckets whose size is below our 100th entry's count. For this we can use a special value (such as -1) in the count array to indicate that the bucket has no index. Depending on the data, this can drastically reduce the number of passes required.
  • When counting identical items in the sorted batch, we can make jumps of the size of our 100th entry's count (and then take a few steps backwards; I can share pseudo-code if needed).

Potential issues with this approach: The input numbers could be concentrated in a small range, then you might get one or more single buckets that are larger than B. Possible solutions:

  1. You could try another selection of n bits instead (e.g. the n least significant bits). Note that this still won't help if the same number appears a billion times.
  2. If the input is 32bit integers, then the range of possible values is limited, and there can only be a few thousand different numbers in each bucket. So if one bucket is really large, then we can process that bucket differently: Just keep a counter for each unique value in that range. We can repurpose the Batch array for that.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: 阿尔曼 (View Question on Stackoverflow)
Solution 1 - Java: David Soroko (View Answer on Stackoverflow)
Solution 2 - Java: Sjoerd (View Answer on Stackoverflow)
Solution 3 - Java: Bohemian (View Answer on Stackoverflow)
Solution 4 - Java: gusto2 (View Answer on Stackoverflow)
Solution 5 - Java: MaxW (View Answer on Stackoverflow)
Solution 6 - Java: Dan Bryant (View Answer on Stackoverflow)
Solution 7 - Java: martymcfly (View Answer on Stackoverflow)
Solution 8 - Java: Jude Niroshan (View Answer on Stackoverflow)
Solution 9 - Java: Timur (View Answer on Stackoverflow)
Solution 10 - Java: JDługosz (View Answer on Stackoverflow)
Solution 11 - Java: Yakk - Adam Nevraumont (View Answer on Stackoverflow)
Solution 12 - Java: hanshenrik (View Answer on Stackoverflow)
Solution 13 - Java: Kris Van Bael (View Answer on Stackoverflow)