Performance of Java matrix math libraries?


Java Problem Overview


We are computing something whose runtime is bound by matrix operations. (Some details below if interested.) This experience prompted the following question:

Do folks have experience with the performance of Java libraries for matrix math (e.g., multiply, inverse, etc.)? For example:

I searched and found nothing.


Details of our speed comparison:

We are using Intel Fortran (ifort (IFORT) 10.1 20070913). We have reimplemented the computation in Java (1.6) using Apache Commons Math 1.2 matrix operations, and it agrees with the Fortran version to all digits of accuracy. (We have reasons for wanting it in Java.) We use Java doubles and Fortran real*8. On the same machine, Fortran takes 6 minutes and Java takes 33 minutes. Profiling with jvisualvm shows much of the time spent in RealMatrixImpl.{getEntry,isValidCoordinate} (these appear to be gone in the unreleased Apache Commons Math 2.0, but 2.0 is no faster). The Fortran version uses ATLAS BLAS routines (dpotrf, etc.).

Obviously this could depend on our code in each language, but we believe most of the time is in equivalent matrix operations.

In several other computations that do not involve libraries, Java has not been much slower, and sometimes much faster.

Java Solutions


Solution 1 - Java

I'm the author of Java Matrix Benchmark (JMatBench) and I'll give my thoughts on this discussion.

There are significant differences between Java libraries, and while there is no clear winner across the whole range of operations, there are a few clear leaders, as can be seen in the latest performance results (October 2013).

If you are working with "large" matrices and can use native libraries, then the clear winner (about 3.5x faster) is MTJ with system optimised netlib. If you need a pure Java solution then MTJ, OjAlgo, EJML and Parallel Colt are good choices. For small matrices EJML is the clear winner.

The libraries I did not mention showed significant performance issues or were missing key features.

Solution 2 - Java

Just to add my 2 cents. I've compared some of these libraries by multiplying a 3000 by 3000 matrix of doubles with itself. The results are as follows.

Using multithreaded ATLAS with C/C++, Octave, Python and R, the time taken was around 4 seconds.

Using Jama with Java, the time taken was 50 seconds.

Using Colt and Parallel Colt with Java, the time taken was 150 seconds!

Using JBLAS with Java, the time taken was again around 4 seconds as JBLAS uses multithreaded ATLAS.

So for me it was clear that the pure Java libraries didn't perform too well. However, if someone has to code in Java, then the best option is jblas. Jama, Colt and Parallel Colt are not fast.
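For reference, a timing like the jblas one above can be reproduced with something along these lines. This is only a minimal sketch, not the original benchmark code; the class and surrounding structure are my own, and only DoubleMatrix.rand and mmul are jblas calls:

import org.jblas.DoubleMatrix;

public class MmulBenchmark {
    public static void main(String[] args) {
        int n = 3000;
        DoubleMatrix a = DoubleMatrix.rand(n, n);   // 3000 x 3000 random doubles

        long start = System.nanoTime();
        DoubleMatrix b = a.mmul(a);                 // A * A, dispatched to the native BLAS/ATLAS backend
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("jblas %dx%d multiply: %.2f s%n", n, n, seconds);
    }
}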

Solution 3 - Java

I'm the main author of jblas and wanted to point out that I released version 1.0 in late December 2009. I worked a lot on the packaging, meaning that you can now just download a "fat jar" with ATLAS and JNI libraries for Windows, Linux, and Mac OS X, 32 and 64 bit (except for Windows). This way you get native performance just by adding the jar file to your classpath. Check it out at http://jblas.org!

Solution 4 - Java

I just compared Apache Commons Math with jlapack.

Test: singular value decomposition of a random 1024x1024 matrix.

Machine: Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz, linux x64

Octave code: A=rand(1024); tic;[U,S,V]=svd(A);toc

results | execution time
Octave | 36.34 sec
JDK 1.7u2 64bit, jlapack dgesvd | 37.78 sec
JDK 1.7u2 64bit, Apache Commons Math SVD | 42.24 sec
JDK 1.6u30 64bit, jlapack dgesvd | 48.68 sec
JDK 1.6u30 64bit, Apache Commons Math SVD | 50.59 sec
Native LAPACK routines invoked from C | 37.64 sec
Intel MKL | 6.89 sec (!)

My conclusion is that jlapack called from JDK 1.7 is very close to the native binary performance of LAPACK. I used the LAPACK binary library that comes with the Linux distro and invoked the dgesvd routine to get the U, S and VT matrices as well. All tests were done using double precision on exactly the same matrix each run (except Octave).
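For comparison with the Octave one-liner above, a home-made test like this one could look roughly as follows on the Commons Math side. This is a sketch written against the commons-math3 package names; the answer doesn't state which Commons Math version was used, so treat the imports and class names as an assumption:

import java.util.Random;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class SvdTiming {
    public static void main(String[] args) {
        int n = 1024;
        Random rnd = new Random(42);
        double[][] data = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                data[i][j] = rnd.nextDouble();   // random 1024x1024 matrix

        RealMatrix a = MatrixUtils.createRealMatrix(data);
        long start = System.nanoTime();
        SingularValueDecomposition svd = new SingularValueDecomposition(a);
        RealMatrix u = svd.getU(), s = svd.getS(), vt = svd.getVT();  // full U, S, V^T
        System.out.printf("SVD took %.2f sec%n", (System.nanoTime() - start) / 1e9);
    }
}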

Disclaimer - I'm not an expert in linear algebra, I'm not affiliated with any of the libraries above, and this is not a rigorous benchmark. It's a 'home-made' test, as I was interested in comparing the performance increase of JDK 1.7 over 1.6, as well as Commons Math SVD to jlapack.

Solution 5 - Java

I can't really comment on specific libraries, but in principle there's little reason for such operations to be slower in Java. Hotspot generally does the kinds of things you'd expect a compiler to do: it compiles basic math operations on Java variables to corresponding machine instructions (it uses SSE instructions, but only one per operation); accesses to elements of an array are compiled to use "raw" MOV instructions as you'd expect; it makes decisions on how to allocate variables to registers when it can; it re-orders instructions to take advantage of processor architecture...

A possible exception is that, as I mentioned, Hotspot will only perform one operation per SSE instruction; in principle you could have a fantastically optimised matrix library that performed multiple operations per instruction, although I don't know if, say, your particular FORTRAN library does so or if such a library even exists. If it does, there's currently no way for Java (or at least, Hotspot) to compete with that (though you could of course write your own native library with those optimisations to call from Java).

So what does all this mean? Well:

  • in principle, it is worth hunting around for a better-performing library, though unfortunately I can't recommend one
  • if performance is really critical to you, I would consider just coding your own matrix operations, because you may then be able to perform certain optimisations that a library generally can't, or that a particular library you're using doesn't (if you have a multiprocessor machine, find out if the library is actually multithreaded)

A hindrance to matrix operations is often data locality: operations such as matrix multiplication need to traverse both row by row and column by column, but you can only store the data in an order that favours one or the other. If you hand-write the code, you can sometimes combine operations to optimise data locality (e.g. if you're multiplying a matrix by its transpose, you can turn a column traversal into a row traversal if you write a dedicated function instead of combining two library functions). As usual in life, a library will give you non-optimal performance in exchange for faster development; you need to decide just how important performance is to you.
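As an illustration of the data-locality point (my own sketch, not from the answer): multiplying a matrix by its transpose can be written so that both operands are read row by row, because the k-th column of A^T is the k-th row of A.

// Computes C = A * A^T using only row-wise traversals, which is
// cache-friendly for Java's row-major double[][] arrays.
static double[][] multiplyByTranspose(double[][] a) {
    int n = a.length;       // rows of A
    int m = a[0].length;    // columns of A
    double[][] c = new double[n][n];
    for (int i = 0; i < n; i++) {
        double[] rowI = a[i];
        for (int j = 0; j < n; j++) {
            double[] rowJ = a[j];
            double sum = 0.0;
            for (int k = 0; k < m; k++) {
                sum += rowI[k] * rowJ[k];   // (A * A^T)[i][j] = dot(row i, row j)
            }
            c[i][j] = sum;
        }
    }
    return c;
}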

Solution 6 - Java

Jeigen https://github.com/hughperkins/jeigen

  • wraps Eigen C++ library http://eigen.tuxfamily.org , which is one of the fastest free C++ libraries available
  • relatively terse syntax, eg 'mmul', 'sub'
  • handles both dense and sparse matrices

A quick test, multiplying two dense matrices:

import static jeigen.MatrixUtil.*;

int K = 100;
int N = 100000;
DenseMatrix A = rand(N, K);        // N x K random matrix
DenseMatrix B = rand(K, N);        // K x N random matrix
Timer timer = new Timer();
DenseMatrix C = B.mmul(A);         // (K x N) * (N x K) => K x K product
timer.printTimeCheckMilliseconds();

Results:

Jama: 4090 ms
Jblas: 1594 ms
Ojalgo: 2381 ms (using two threads)
Jeigen: 2514 ms
  • Compared to jama, everything is faster :-P
  • Compared to jblas, Jeigen is not quite as fast, but it handles sparse matrices.
  • Compared to ojalgo, Jeigen takes about the same amount of elapsed time, but only using one core, so Jeigen uses half the total cpu. Jeigen has a terser syntax, ie 'mmul' versus 'multiplyRight'

Solution 7 - Java

There's a benchmark of various matrix packages available in Java at http://code.google.com/p/java-matrix-benchmark/ for a few different hardware configurations. But it's no substitute for doing your own benchmark.

Performance is going to vary with the type of hardware you've got (cpu, cores, memory, L1-3 cache, bus speed), the size of the matrices and the algorithms you intend to use. Different libraries have different takes on concurrency for different algorithms, so there's no single answer. You may also find that the overhead of translating to the form expected by a native library negates the performance advantage for your use case (some of the java libraries have more flexible options regarding matrix storage, which can be used for further performance optimizations).

Generally though, JAMA, Jampack and COLT are getting old, and do not represent the state of the current performance available in Java for linear algebra. More modern libraries make more effective use of multiple cores and cpu caches. JAMA was a reference implementation, and pretty much implements textbook algorithms with little regard to performance. COLT and IBM Ninja were the first java libraries to show that performance was possible in java, even if they lagged 50% behind native libraries.

Solution 8 - Java

I'm the author of the la4j (Linear Algebra for Java) library, and here is my point. I've been working on la4j for 3 years (the latest release is 0.4.0 [01 Jun 2013]), and only now can I start doing performance analysis and optimizations, since I've only just covered the minimal required functionality. So la4j isn't as fast as I want it to be, but I'm spending a lot of my time changing that.

I'm currently in the middle of porting the new version of la4j to the JMatBench platform. I hope the new version will show better performance than the previous one, since there are several improvements I made in la4j, such as a much faster internal matrix format, unsafe accessors, and a fast blocking algorithm for matrix multiplication.
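To illustrate what a blocking (tiled) matrix multiplication looks like in general (this is a generic sketch of the technique, not la4j's actual implementation): processing fixed-size tiles keeps the working sets of all three matrices in cache.

// Cache-blocked multiplication of n x n matrices stored as flat row-major
// arrays. The caller must pass c zero-initialized; the result is c = a * b.
static void blockedMultiply(double[] a, double[] b, double[] c, int n) {
    final int BLOCK = 64;   // tile size; tune for the target cache
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int i = ii; i < Math.min(ii + BLOCK, n); i++)
                    for (int k = kk; k < Math.min(kk + BLOCK, n); k++) {
                        double aik = a[i * n + k];
                        for (int j = jj; j < Math.min(jj + BLOCK, n); j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}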

Solution 9 - Java

Have you taken a look at the Intel Math Kernel Library? It claims to outperform even ATLAS. MKL can be used in Java through JNI wrappers.

Solution 10 - Java

Linear algebra code that relies heavily on the vector computing capabilities of Pentium and later processors (starting with the MMX extensions), like LAPACK and now ATLAS BLAS, is not "fantastically optimized", but simply industry-standard. To replicate that performance in Java you are going to need native libraries. I have had the same performance problem as you describe (mainly, being able to compute Cholesky decompositions) and have found nothing really efficient: Jama is pure Java, since it is supposed to be just a template and reference kit for implementers to follow... which never happened. You know Apache Commons Math... As for COLT, I have yet to test it, but it seems to rely heavily on the Ninja improvements, most of which were achieved by building an ad-hoc Java compiler, so I doubt it's going to help. At that point, I think we "just" need a collective effort to build a native Jama implementation...

Solution 11 - Java

Building on Varkhan's post that Pentium-specific native code would do better:

Solution 12 - Java

We have used COLT for some pretty large serious financial calculations and have been very happy with it. In our heavily profiled code we have almost never had to replace a COLT implementation with one of our own.

In their own testing (obviously not independent) I think they claim to be within a factor of 2 of Intel's hand-optimised assembler routines. The trick to using it well is making sure that you understand their design philosophy and avoiding extraneous object allocation.
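As an example of avoiding extraneous allocation with Colt (a rough sketch from memory of the Colt API; treat the exact signatures as an assumption): zMult lets you write the product into a preallocated result matrix instead of creating a new one on every call.

import cern.colt.matrix.DoubleMatrix2D;
import cern.colt.matrix.impl.DenseDoubleMatrix2D;

public class ColtReuse {
    public static void main(String[] args) {
        int n = 500;
        DoubleMatrix2D a = new DenseDoubleMatrix2D(n, n);
        DoubleMatrix2D b = new DenseDoubleMatrix2D(n, n);
        DoubleMatrix2D c = new DenseDoubleMatrix2D(n, n);  // result buffer, reused

        for (int iter = 0; iter < 100; iter++) {
            // ... fill or update a and b in place here ...
            a.zMult(b, c);   // c = a * b, written into the existing matrix
        }
    }
}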

Solution 13 - Java

I have found that if you are creating a lot of high-dimensional matrices, you can make Jama about 20% faster by changing it to use a single one-dimensional array instead of a two-dimensional array. This is because Java doesn't support multi-dimensional arrays efficiently: it creates an array of arrays.
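A minimal illustration of the idea (my own sketch, not Jama's code): back the matrix with one flat array and index it as i * cols + j, instead of Java's double[][] array of row arrays.

// One contiguous allocation instead of 'rows' separate row arrays.
class FlatMatrix {
    final int rows, cols;
    final double[] data;

    FlatMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    double get(int i, int j)           { return data[i * cols + j]; }
    void   set(int i, int j, double v) { data[i * cols + j] = v; }
}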

Colt does this already, but I have found it is more complicated and more powerful than Jama, which may explain why simple functions are slower with Colt.

The answer really depends on what you are doing. Jama doesn't support a fraction of the things Colt can do, which may make more of a difference.

Solution 14 - Java

You may want to check out the jblas project. It's a relatively new Java library that uses BLAS, LAPACK and ATLAS for high-performance matrix operations.

The developer has posted some benchmarks in which jblas comes off favourably against MTJ and Colt.

Solution 15 - Java

For 3D graphics applications, the lwjgl.util vector implementation outperformed the above-mentioned jblas by a factor of about 3.

I have done 1 million matrix multiplications of a vec4 with a 4x4 matrix.

lwjgl finished in about 18ms, jblas required about 60ms.

(I assume that the JNI approach is not very suitable for the fast, successive application of relatively small multiplications, since the translation/mapping may take more time than the actual execution of the multiplication.)
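To put the JNI-overhead point in perspective (my own illustration, not the original benchmark code): a vec4-by-4x4 product is only 16 multiplies and 12 adds, so the cost of crossing the JNI boundary can easily dominate. A plain-Java version of the operation:

// m is a 4x4 matrix in row-major order, v is the input vec4, out is the result
static void transform(double[] m, double[] v, double[] out) {
    for (int i = 0; i < 4; i++) {
        out[i] = m[4 * i] * v[0] + m[4 * i + 1] * v[1]
               + m[4 * i + 2] * v[2] + m[4 * i + 3] * v[3];
    }
}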

Solution 16 - Java

Solution 17 - Java

There are many different freely available Java linear algebra libraries; http://www.ujmp.org/java-matrix/benchmark/ has a benchmark of them. Unfortunately, that benchmark only gives you information about matrix multiplication (and with the transposing, the test does not allow the different libraries to exploit their respective design features).

What you should look at is how these linear algebra libraries perform when asked to compute various matrix decompositions. http://ojalgo.org/matrix_compare.html

Solution 18 - Java

Matrix Toolkits Java (MTJ) was already mentioned, but perhaps it's worth mentioning again for anyone else stumbling onto this thread. For those interested, it seems there's also talk of having MTJ replace the linalg library in Apache Commons Math 2.0, though I'm not sure how that's progressing lately.

Solution 19 - Java

You should add Apache Mahout to your shopping list.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | dfrankow | View Question on Stackoverflow
Solution 1 - Java | lessthanoptimal | View Answer on Stackoverflow
Solution 2 - Java | Hamaad Shah | View Answer on Stackoverflow
Solution 3 - Java | Mikio Braun | View Answer on Stackoverflow
Solution 4 - Java | isti_spl | View Answer on Stackoverflow
Solution 5 - Java | Neil Coffey | View Answer on Stackoverflow
Solution 6 - Java | Hugh Perkins | View Answer on Stackoverflow
Solution 7 - Java | culana | View Answer on Stackoverflow
Solution 8 - Java | Vladimir Kostyukov | View Answer on Stackoverflow
Solution 9 - Java | Zach Scrivena | View Answer on Stackoverflow
Solution 10 - Java | Varkhan | View Answer on Stackoverflow
Solution 11 - Java | dfrankow | View Answer on Stackoverflow
Solution 12 - Java | Nick Fortescue | View Answer on Stackoverflow
Solution 13 - Java | Peter Lawrey | View Answer on Stackoverflow
Solution 14 - Java | Mark Reid | View Answer on Stackoverflow
Solution 15 - Java | Necrowizzard | View Answer on Stackoverflow
Solution 16 - Java | Chad Okere | View Answer on Stackoverflow
Solution 17 - Java | Anders Peterson | View Answer on Stackoverflow
Solution 18 - Java | Steve Lianoglou | View Answer on Stackoverflow
Solution 19 - Java | bmargulies | View Answer on Stackoverflow