Is there an elegant way to process a stream in chunks?

Java, Java 8, Java Stream, Chunking

Java Problem Overview


My exact scenario is inserting data into a database in batches, so I want to accumulate DOM objects and flush them every 1000.

I implemented it by putting code in the accumulator to detect fullness and then flush, but that seems wrong - the flush control should come from the caller.

I could convert the stream to a List and then use subList in an iterative fashion, but that too seems clunky.

Is there a neat way to take action every n elements and then continue with the stream, while only processing the stream once?

Java Solutions


Solution 1 - Java

Elegance is in the eye of the beholder. If you don't mind using a stateful function in groupingBy, you can do this:

AtomicInteger counter = new AtomicInteger();

stream.collect(groupingBy(x->counter.getAndIncrement()/chunkSize))
    .values()
    .forEach(database::flushChunk);

This doesn't win any performance or memory usage points over your original solution because it will still materialize the entire stream before doing anything.

If you want to avoid materializing the list, the Stream API will not help you. You will have to get the stream's iterator or spliterator and do something like this:

Spliterator<Integer> split = stream.spliterator();
int chunkSize = 1000;

while (true) {
    List<Integer> chunk = new ArrayList<>(chunkSize);
    // pull up to chunkSize elements from the source
    for (int i = 0; i < chunkSize && split.tryAdvance(chunk::add); i++);
    if (chunk.isEmpty()) break;
    database.flushChunk(chunk);
}

Solution 2 - Java

Most of the answers above do not preserve one of the stream's benefits: saving memory. You can use an iterator to solve the problem:

Stream<List<T>> chunk(Stream<T> stream, int size) {
  Iterator<T> iterator = stream.iterator();
  Iterator<List<T>> listIterator = new Iterator<List<T>>() {

    @Override
    public boolean hasNext() {
      return iterator.hasNext();
    }

    @Override
    public List<T> next() {
      List<T> result = new ArrayList<>(size);
      for (int i = 0; i < size && iterator.hasNext(); i++) {
        result.add(iterator.next());
      }
      return result;
    }
  };
  return StreamSupport.stream(((Iterable<List<T>>) () -> listIterator).spliterator(), false);
}
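
A minimal usage sketch (not part of the original answer; it assumes the chunk method above is in scope):

chunk(IntStream.range(0, 10).boxed(), 4)
    .forEach(System.out::println);
// prints [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]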

Solution 3 - Java

You can create a stream of chunks (List<T>) from a stream of items and a given chunk size by

  • grouping the items by the chunk index (element index / chunk size)
  • ordering the chunks by their index
  • mapping each entry to its value (the ordered chunk elements)

Code:

public static <T> Stream<List<T>> chunked(Stream<T> stream, int chunkSize) {
    AtomicInteger index = new AtomicInteger(0);

    return stream.collect(Collectors.groupingBy(x -> index.getAndIncrement() / chunkSize))
            .entrySet().stream()
            .sorted(Map.Entry.comparingByKey()).map(Map.Entry::getValue);
}

Example usage:

Stream<Integer> stream = IntStream.range(0, 100).mapToObj(Integer::valueOf);
Stream<List<Integer>> chunked = chunked(stream, 8);
chunked.forEach(chunk -> System.out.println("Chunk: " + chunk));

Output:

Chunk: [0, 1, 2, 3, 4, 5, 6, 7]
Chunk: [8, 9, 10, 11, 12, 13, 14, 15]
Chunk: [16, 17, 18, 19, 20, 21, 22, 23]
Chunk: [24, 25, 26, 27, 28, 29, 30, 31]
Chunk: [32, 33, 34, 35, 36, 37, 38, 39]
Chunk: [40, 41, 42, 43, 44, 45, 46, 47]
Chunk: [48, 49, 50, 51, 52, 53, 54, 55]
Chunk: [56, 57, 58, 59, 60, 61, 62, 63]
Chunk: [64, 65, 66, 67, 68, 69, 70, 71]
Chunk: [72, 73, 74, 75, 76, 77, 78, 79]
Chunk: [80, 81, 82, 83, 84, 85, 86, 87]
Chunk: [88, 89, 90, 91, 92, 93, 94, 95]
Chunk: [96, 97, 98, 99]

Solution 4 - Java

If you have a Guava dependency in your project, you can do this:

StreamSupport.stream(Iterables.partition(simpleList, 1000).spliterator(), false).forEach(...);

See https://google.github.io/guava/releases/23.0/api/docs/com/google/common/collect/Lists.html#partition-java.util.List-int-
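
For the batch-insert scenario from the question, a minimal sketch (an illustration, not from the original answer): Lists.partition returns a lazy view of consecutive sublists, and database::flushChunk stands for the hypothetical flush method used in Solution 1.

List<Integer> data = IntStream.range(0, 2500).boxed().collect(Collectors.toList());
for (List<Integer> chunk : Lists.partition(data, 1000)) {
    database.flushChunk(chunk); // three batches: 1000, 1000 and 500 elements
}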

Solution 5 - Java

Using the StreamEx library, the solution would look like this:

Stream<Integer> stream = IntStream.iterate(0, i -> i + 1).boxed().limit(15);
AtomicInteger counter = new AtomicInteger(0);
int chunkSize = 4;

StreamEx.of(stream)
        .groupRuns((prev, next) -> counter.incrementAndGet() % chunkSize != 0)
        .forEach(chunk -> System.out.println(chunk));

Output:

[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9, 10, 11]
[12, 13, 14]

groupRuns accepts a predicate that decides whether two adjacent elements should be in the same group.

It produces a group as soon as it finds the first element that does not belong to it.
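
For contrast, a sketch of groupRuns with a stateless predicate (an illustration, not from the original answer), grouping adjacent elements that share the same tens digit:

StreamEx.of(1, 2, 11, 12, 13, 21)
        .groupRuns((a, b) -> a / 10 == b / 10)
        .forEach(System.out::println);
// prints [1, 2] then [11, 12, 13] then [21]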

Solution 6 - Java

It looks like no, because creating chunks means reducing the stream, and reduction means termination. If you need to preserve the stream nature and process chunks without collecting all the data first, here is my code (it does not work for parallel streams):

private static <T> BinaryOperator<List<T>> processChunks(Consumer<List<T>> consumer, int chunkSize) {
	return (data, element) -> {
		if (data.size() < chunkSize) {
			data.addAll(element);
			return data;
		} else {
			// the accumulated chunk is full: hand it to the consumer
			consumer.accept(data);
			return element; // in fact it's the new data list
		}
	};
}

private static <T> Function<T, List<T>> createList(int chunkSize) {
	AtomicInteger limiter = new AtomicInteger(0);
	return element -> {
		limiter.incrementAndGet();
		if (limiter.get() == 1) {
			// the first element of each chunk starts a new mutable list
			ArrayList<T> list = new ArrayList<>(chunkSize);
			list.add(element);
			return list;
		} else if (limiter.get() == chunkSize) {
			limiter.set(0);
		}
		return Collections.singletonList(element);
	};
}

and how to use

Consumer<List<Integer>> chunkProcessor = (list) -> list.forEach(System.out::println);

int chunkSize = 3;

// getInt() is the static helper defined below
Stream.generate(StrTokenizer::getInt).limit(13)
		.map(createList(chunkSize))
		.reduce(processChunks(chunkProcessor, chunkSize))
		.ifPresent(chunkProcessor);

static Integer i = 0;

static Integer getInt()
{
	System.out.println("next");
	return i++;
}

It will print:

> next next next next 0 1 2 next next next 3 4 5 next next next 6 7 8 next next next 9 10 11 12

The idea behind it is to create lists in the map operation following the 'pattern'

> [1,,],[2],[3],[4,,]...

and to merge (and process) them with reduce:

> [1,2,3],[4,5,6],...

and don't forget to process the last 'trimmed' chunk with

.ifPresent(chunkProcessor);

Solution 7 - Java

As Misha rightfully said, elegance is in the eye of the beholder. I personally think an elegant solution would be to let the class that inserts into the database do this task, similar to a BufferedWriter. This way it does not depend on your original data structure and it can be used even with multiple streams, one after another.

I am not sure if this is exactly what you mean by having the code in the accumulator, which you thought was wrong. I don't think it is wrong, since existing classes like BufferedWriter work this way. This also gives you some flush control from the caller, by calling flush() on the writer at any point.

Something like the following code.

class BufferedDatabaseWriter implements Flushable {
    private final List<DomObject> buffer = new LinkedList<>();
    public void write(DomObject o) {
        buffer.add(o);
        if (buffer.size() >= 1000)
            flush();
    }
    public void flush() {
        //write buffer to database and clear it
    }
}

Now your stream gets processed like this:

BufferedDatabaseWriter writer = new BufferedDatabaseWriter();
stream.forEach(o -> writer.write(o));
//if you have more streams stream2.forEach(o -> writer.write(o));
writer.flush();

If you want to work multithreaded, you could run the flush asynchronously. Taking elements from the stream can't go in parallel, but I don't think there is a way to count 1000 elements from a stream in parallel anyway.
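
A sketch of the asynchronous variant inside BufferedDatabaseWriter (an illustration, not part of the original answer): flush() copies the buffer, clears it, and hands the copy to an executor, so the writing thread is not blocked by the database.

private final ExecutorService executor = Executors.newSingleThreadExecutor();

public void flush() {
    List<DomObject> batch = new ArrayList<>(buffer);
    buffer.clear();
    executor.submit(() -> database.flushChunk(batch)); // 'database' as in Solution 1, assumed here
}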

You can also extend the writer to allow setting the buffer size in the constructor, or you can make it implement AutoCloseable and run it in a try-with-resources block, and more. These are the nice things you get from a BufferedWriter.

Solution 8 - Java

Here is a simple wrapping spliterator implementation that groups source elements into chunks:

public class ChunkedSpliterator<T> implements Spliterator<List<T>> {
    private static final int PROMOTED_CHARACTERISTICS =
            Spliterator.ORDERED | Spliterator.DISTINCT | Spliterator.SIZED
                    | Spliterator.IMMUTABLE | Spliterator.CONCURRENT;
    private static final int SELF_CHARACTERISTICS = Spliterator.NONNULL;

    private final Spliterator<T> src;
    private final int chunkSize;

    public ChunkedSpliterator(Spliterator<T> src, int chunkSize) {
        if (chunkSize < 1)
            throw new IllegalArgumentException("chunkSize must be at least 1");
        this.src = src;
        this.chunkSize = chunkSize;
    }

    public static <E> Stream<List<E>> chunkify(Stream<E> src, int chunkSize) {
        ChunkedSpliterator<E> wrap = new ChunkedSpliterator<>(src.spliterator(), chunkSize);
        return StreamSupport.stream(wrap, src.isParallel());
    }

    @Override
    public boolean tryAdvance(Consumer<? super List<T>> action) {
        List<T> result = new ArrayList<>((int) Math.min(src.estimateSize(), chunkSize));
        for (int i = 0; i < chunkSize; ++i) {
            if (!src.tryAdvance(result::add))
                break;
        }
        if (result.isEmpty())
            return false;
        action.accept(result);
        return true;
    }

    @Override
    public Spliterator<List<T>> trySplit() {
        Spliterator<T> srcSplit = src.trySplit();
        return srcSplit == null ? null : new ChunkedSpliterator<>(srcSplit, chunkSize);
    }

    @Override
    public long estimateSize() {
        long srcSize = src.estimateSize();
        if (srcSize <= 0L) return 0L;
        if (srcSize == Long.MAX_VALUE) return Long.MAX_VALUE;
        return (srcSize - 1) / chunkSize + 1;
    }

    @Override
    public int characteristics() {
        return (src.characteristics() & PROMOTED_CHARACTERISTICS) | SELF_CHARACTERISTICS;
    }
}

There is a handy chunkify shortcut method to make things easier:

    Stream<T> input = ...;
    Stream<List<T>> chunked = ChunkedSpliterator.chunkify(input, 1000);

Although Stream.spliterator() is a terminal operation, it does not forcibly exhaust the stream's source. So the stream can be processed gradually via its spliterator, without fetching all the data into memory - only one chunk at a time.
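
A small sketch illustrating that laziness (not part of the original answer): even an infinite source can be chunked, because only the consumed chunks are pulled from it.

Stream<Integer> infinite = Stream.iterate(0, i -> i + 1);
ChunkedSpliterator.chunkify(infinite, 1000)
        .limit(3)                                  // pulls only the first 3000 elements
        .forEach(chunk -> System.out.println(chunk.size()));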

This spliterator preserves most of the input's characteristics. However, it is not sub-sized (chunks may be split in the middle), it is not sorted (it is not obvious how to sort chunks even if the elements are sortable), and it produces only non-null chunks (although chunks may still contain null elements). I'm not 100% sure about concurrent/immutable, but it seems it should inherit these without problems. Also, the produced chunks may not be strictly of the requested size, but they never exceed it.

In fact, I'm very surprised such a popular question had no answer introducing a custom spliterator for almost 7 (!) years.

Solution 9 - Java

You can use this class: https://github.com/1wpro2/jdk-patch/blob/main/FixedSizeSpliterator.java

Pass in the chunk size as the THRESHOLD:

new FixedSizeSpliterator(T[] values, int threshold)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Bohemian
Solution 1 - Java: Misha
Solution 2 - Java: dmitryvim
Solution 3 - Java: Peter Walser
Solution 4 - Java: user2814648
Solution 5 - Java: Nazarii Bardiuk
Solution 6 - Java: Yura
Solution 7 - Java: findusl
Solution 8 - Java: Vasily Liaskovsky
Solution 9 - Java: engineer