Concurrency model: Erlang vs Clojure

Concurrency Problem Overview

We are going to write a concurrent program in Clojure that extracts keywords from a huge volume of incoming mail and cross-checks them against a database.

One of my teammates has suggested using Erlang to write this program instead.

I should note that I am new to functional programming, so I am unsure whether Clojure is a good choice for this program or whether Erlang would be more suitable.

Concurrency Solutions

Solution 1 - Concurrency

Do you really mean concurrent or distributed?

If you mean concurrent (multi-threaded, multi-core etc.), then I'd say Clojure is the natural solution.

Clojure's STM model is well suited to multi-core concurrency, since it is very efficient at storing and managing shared state between threads. If you want to understand more, it is well worth looking at this excellent video.
Clojure STM allows safe mutation of data by concurrent threads. Erlang sidesteps this problem by making everything immutable, which is fine in itself but doesn't help when you genuinely need shared mutable state. If you want shared mutable state in Erlang, you have to implement it with a set of message interactions, which is neither efficient nor convenient (that's the price of a shared-nothing model).
You will get inherently better performance with Clojure in a concurrent setting on one large machine, since Clojure doesn't rely on message passing and hence communication between threads can be much more efficient.
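
As a rough illustration of threads sharing state inside one JVM, here is a minimal sketch in Java rather than Clojure. It does not use Clojure's STM; it swaps in a ConcurrentHashMap with atomic per-key updates, which is a simpler shared-memory technique. The class name SharedKeywordCounter, the sample mails and the word-splitting rule are assumptions made for the example, not something from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical example: class name, sample mails and tokenizing rule are assumptions.
final class SharedKeywordCounter {

    // Counts word occurrences across all mails. Every worker thread
    // writes into the same map, so the state is genuinely shared.
    static Map<String, Integer> count(List<String> mails, int threads) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String mail : mails) {
            pool.execute(() -> {
                for (String word : mail.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        // merge() is atomic per key on ConcurrentHashMap
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            });
        }
        pool.shutdown(); // accept no new tasks; queued ones still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> mails = Arrays.asList(
                "Invoice overdue, please pay the invoice",
                "Meeting moved; invoice attached");
        System.out.println(count(mails, 4).get("invoice")); // prints 3
    }
}
```

No messages are exchanged here: the workers communicate only through the shared map, which is the kind of in-process sharing the points above have in mind.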

If you mean distributed (i.e. many different machines sharing work over a network which are effectively running as isolated processes) then I'd say Erlang is the more natural solution:

Erlang's immutable, shared-nothing, message-passing style forces you to write code in a way that can be distributed. So idiomatic Erlang can automatically be distributed across multiple machines and run in a distributed, fault-tolerant setting.
Erlang is therefore very well optimised for this use case, so would be the natural choice and would certainly be the quickest to get working.
Clojure could do it as well, but you will need to do much more work yourself (i.e. you'd either need to implement or choose some form of distributed computing framework) - Clojure does not currently come with such a framework by default.

In the long term, I hope that Clojure develops a distributed computing framework that matches Erlang - then you can have the best of both worlds!

Solution 2 - Concurrency

The two languages and runtimes take different approaches to concurrency:

Erlang structures programs as many lightweight processes communicating with one another. In this case, you will probably have a master process sending jobs and data to many workers, and more processes to handle the resulting data.
Clojure favors a design where several threads share data and state using common data structures. It sounds particularly suitable for cases where many threads access the same data (read-only) and share little mutable state.
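
To make the first design more concrete, here is a minimal sketch of a master handing jobs to workers purely through messages. It is written in Java with blocking queues standing in for mailboxes, so it only imitates the Erlang style inside one JVM and has none of Erlang's distribution or fault tolerance. The class name MailboxWorkers and the word-count "job" are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical example: names and the word-count "job" are assumptions.
final class MailboxWorkers {

    // The master sends each mail as a message; workers reply with a word count.
    // An empty Optional is the "stop" message. Returns the total word count.
    static int totalWords(List<String> mails, int workers) {
        if (workers < 1) {
            throw new IllegalArgumentException("need at least one worker");
        }
        BlockingQueue<Optional<String>> jobs = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> results = new LinkedBlockingQueue<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    Optional<String> msg;
                    while ((msg = jobs.take()).isPresent()) {
                        String text = msg.get().trim();
                        results.put(text.isEmpty() ? 0 : text.split("\\s+").length);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // exit quietly
                }
            });
            t.start();
            threads.add(t);
        }
        try {
            for (String mail : mails) {
                jobs.put(Optional.of(mail));
            }
            for (int i = 0; i < workers; i++) {
                jobs.put(Optional.empty()); // one stop message per worker
            }
            int total = 0;
            for (int i = 0; i < mails.size(); i++) {
                total += results.take(); // exactly one reply per mail
            }
            for (Thread t : threads) {
                t.join();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("master interrupted", e);
        }
    }

    public static void main(String[] args) {
        List<String> mails = Arrays.asList("please pay the invoice", "meeting moved");
        System.out.println(totalWords(mails, 3)); // prints 6
    }
}
```

The workers never touch each other's data; everything they know arrives on the jobs queue and everything they produce leaves on the results queue. That isolation is what makes the same design easy to spread across machines in Erlang.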

You need to analyze your application to determine which model suits you best. This may also depend on the external tools you use -- for example, the ability of the database to handle concurrent requests.

Another practical consideration is that Clojure runs on the JVM, where many open-source libraries are available.

Solution 3 - Concurrency

Clojure is a Lisp running on the JVM. Erlang is designed from the ground up to be highly fault tolerant and concurrent.

I believe the task is doable with either of these languages and many others as well. Your experience will depend on how well you understand the problem and how well you know the language. If you are new to both, I'd say the problem will be challenging no matter which one you choose.

Have you thought about something like Lucene/Solr? It's great software for indexing and searching documents. I don't know what "cross checking" means for your context, but this might be a good solution to consider.

Solution 4 - Concurrency

My approach would be to write a simple test in each language and measure the performance of each one. Both languages are quite different from C-style languages, and if you aren't used to them (and you don't have a team that is used to them) you may end up with a maintenance nightmare.
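
For the JVM side of such a comparison, a crude stopwatch might look like the sketch below. It is only a rough guide, written in Java: the names CrudeTimer, bestNanos and countMatches are made up for the example, the workload (counting mails that contain a keyword) is an assumption, and a proper harness such as JMH would give more trustworthy numbers. The Erlang side would need its own equivalent.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Hypothetical example: all names and the workload are assumptions.
final class CrudeTimer {

    // Runs the task `warmups` times untimed (to let the JIT settle), then
    // returns the fastest of `runs` timed runs in nanoseconds.
    // With runs == 0 nothing is timed and Long.MAX_VALUE is returned.
    static <T> long bestNanos(Supplier<T> task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.get();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.get();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    // Sample workload: how many mails mention the keyword.
    static long countMatches(List<String> mails, String keyword, boolean parallel) {
        Stream<String> stream = parallel ? mails.parallelStream() : mails.stream();
        return stream.filter(mail -> mail.contains(keyword)).count();
    }

    public static void main(String[] args) {
        List<String> mails = Collections.nCopies(100_000, "please pay the invoice");
        long seq = bestNanos(() -> countMatches(mails, "invoice", false), 5, 10);
        long par = bestNanos(() -> countMatches(mails, "invoice", true), 5, 10);
        System.out.printf("sequential: %d us, parallel: %d us%n", seq / 1000, par / 1000);
    }
}
```

Timings from something this simple vary a lot between machines and runs, so treat them as a sanity check rather than a verdict.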

I'd also look at using something like Groovy 1.8. Groovy now includes GPars to enable parallel computing. String and file manipulation in Groovy is very easy indeed.

Solution 5 - Concurrency

It depends what you mean by huge.
Strings in Erlang are painful.

But:

If huge means tens of distributed machines, then go with Erlang and write the workers in text-friendly languages (Python? Perl?). You will have a distributed layer on top with highly concurrent local workers. Each worker would be represented by an Erlang process. If you need more performance, rewrite your worker in C. In Erlang it is super easy to talk to other languages.

If huge still means one strong machine, go with the JVM. That is not really huge.

If huge means hundreds of machines, I think you will need something stronger and Google-like (Bigtable, MapReduce), probably on a C++ stack. Erlang is still OK, but you will need good developers to code it.

Content Type	Original Author	Original Content on Stackoverflow
Question	Quazi Farhan	View Question on Stackoverflow
Solution 1 - Concurrency	mikera	View Answer on Stackoverflow
Solution 2 - Concurrency	nimrodm	View Answer on Stackoverflow
Solution 3 - Concurrency	duffymo	View Answer on Stackoverflow
Solution 4 - Concurrency	Fortyrunner	View Answer on Stackoverflow
Solution 5 - Concurrency	user425720	View Answer on Stackoverflow