In Scala how do I remove duplicates from a list?
ScalaScala Problem Overview
Suppose I have
val dirty = List("a", "b", "a", "c")
Is there a list operation that returns "a", "b", "c"
Scala Solutions
Solution 1 - Scala
Have a look at the ScalaDoc for Seq,
scala> dirty.distinct
res0: List[java.lang.String] = List(a, b, c)
Update. Others have suggested using Set
rather than List
. That's fine, but be aware that by default, the Set
interface doesn't preserve element order. You may want to use a Set implementation that explicitly does preserve order, such as collection.mutable.LinkedHashSet.
Solution 2 - Scala
scala.collection.immutable.List
now has a .distinct
method.
So calling dirty.distinct
is now possible without converting to a Set
or Seq
.
Solution 3 - Scala
Before using Kitpon's solution, think about using a Set
rather than a List
, it ensures each element is unique.
As most list operations (foreach
, map
, filter
, ...) are the same for sets and lists, changing collection could be very easy in the code.
Solution 4 - Scala
Using Set in the first place is the right way to do it, of course, but:
scala> List("a", "b", "a", "c").toSet.toList
res1: List[java.lang.String] = List(a, b, c)
Works. Or just toSet
as it supports the Seq Traversable
interface.
Solution 5 - Scala
For Already-Sorted Lists
If you happen to want the distinct items of a list that you know is already sorted, as I have often needed, the following performs about twice the speed as .distinct
:
def distinctOnSorted[V](seq: List[V]): List[V] =
seq.foldLeft(List[V]())((result, v) =>
if (result.isEmpty || v != result.head) v :: result else result)
.reverse
Performance results on a list of 100,000,000 random Ints from 0-99:
distinct : 0.6655373s
distinctOnSorted: 0.2848134s
Performance with MutableList or ListBuffer
While it would seem that a more mutable / non-functional programming approach might be faster than prepending to an immutable list, practice shows otherwise. The immutable implementation consistently performs better. My guess for the reason is that scala focuses its compiler optimizations on immutable collections, and does a good job at it. (I welcome others to submit better implementations.)
List size 1e7, random 0 to 1e6
------------------------------
distinct : 4562.2277ms
distinctOnSorted : 201.9462ms
distinctOnSortedMut1: 4399.7055ms
distinctOnSortedMut2: 246.099ms
distinctOnSortedMut3: 344.0758ms
distinctOnSortedMut4: 247.0685ms
List size 1e7, random 0 to 100
------------------------------
distinct : 88.9158ms
distinctOnSorted : 41.0373ms
distinctOnSortedMut1: 3283.8945ms
distinctOnSortedMut2: 54.4496ms
distinctOnSortedMut3: 58.6073ms
distinctOnSortedMut4: 51.4153ms
Implementations:
object ListUtil {
def distinctOnSorted[V](seq: List[V]): List[V] =
seq.foldLeft(List[V]())((result, v) =>
if (result.isEmpty || v != result.head) v :: result else result)
.reverse
def distinctOnSortedMut1[V](seq: List[V]): Seq[V] = {
if (seq.isEmpty) Nil
else {
val result = mutable.MutableList[V](seq.head)
seq.zip(seq.tail).foreach { case (prev, next) =>
if (prev != next) result += next
}
result //.toList
}
}
def distinctOnSortedMut2[V](seq: List[V]): Seq[V] = {
val result = mutable.MutableList[V]()
if (seq.isEmpty) return Nil
result += seq.head
var prev = seq.head
for (v <- seq.tail) {
if (v != prev) result += v
prev = v
}
result //.toList
}
def distinctOnSortedMut3[V](seq: List[V]): List[V] = {
val result = mutable.MutableList[V]()
if (seq.isEmpty) return Nil
result += seq.head
var prev = seq.head
for (v <- seq.tail) {
if (v != prev) v +=: result
prev = v
}
result.reverse.toList
}
def distinctOnSortedMut4[V](seq: List[V]): Seq[V] = {
val result = ListBuffer[V]()
if (seq.isEmpty) return Nil
result += seq.head
var prev = seq.head
for (v <- seq.tail) {
if (v != prev) result += v
prev = v
}
result //.toList
}
}
Test:
import scala.util.Random
class ListUtilTest extends UnitSpec {
"distinctOnSorted" should "return only the distinct elements in a sorted list" in {
val bigList = List.fill(1e7.toInt)(Random.nextInt(100)).sorted
val t1 = System.nanoTime()
val expected = bigList.distinct
val t2 = System.nanoTime()
val actual = ListUtil.distinctOnSorted[Int](bigList)
val t3 = System.nanoTime()
val actual2 = ListUtil.distinctOnSortedMut1(bigList)
val t4 = System.nanoTime()
val actual3 = ListUtil.distinctOnSortedMut2(bigList)
val t5 = System.nanoTime()
val actual4 = ListUtil.distinctOnSortedMut3(bigList)
val t6 = System.nanoTime()
val actual5 = ListUtil.distinctOnSortedMut4(bigList)
val t7 = System.nanoTime()
actual should be (expected)
actual2 should be (expected)
actual3 should be (expected)
actual4 should be (expected)
actual5 should be (expected)
val distinctDur = t2 - t1
val ourDur = t3 - t2
ourDur should be < (distinctDur)
print(s"distinct : ${distinctDur / 1e6}ms\n")
print(s"distinctOnSorted : ${ourDur / 1e6}ms\n")
print(s"distinctOnSortedMut1: ${(t4 - t3) / 1e6}ms\n")
print(s"distinctOnSortedMut2: ${(t5 - t4) / 1e6}ms\n")
print(s"distinctOnSortedMut3: ${(t6 - t5) / 1e6}ms\n")
print(s"distinctOnSortedMut4: ${(t7 - t6) / 1e6}ms\n")
}
}
Solution 6 - Scala
You can also use recursion and pattern matching:
def removeDuplicates[T](xs: List[T]): List[T] = xs match {
case Nil => xs
case head :: tail => head :: removeDuplicates(for (x <- tail if x != head) yield x)
}
Solution 7 - Scala
inArr.distinct foreach println _
Solution 8 - Scala
The algorithmic way...
def dedupe(str: String): String = {
val words = { str split " " }.toList
val unique = words.foldLeft[List[String]] (Nil) {
(l, s) => {
val test = l find { _.toLowerCase == s.toLowerCase }
if (test == None) s :: l else l
}
}.reverse
unique mkString " "
}