Efficient way to remove half of the duplicate items in a list

PythonAlgorithm

Python Problem Overview


If I have a list say l = [1, 8, 8, 8, 1, 3, 3, 8] and it's guaranteed that every element occurs an even number of times, how do I make a list with all elements of l now occurring n/2 times. So since 1 occurred 2 times, it should now occur once. Since 8 occurs 4 times, it should now occur twice. Since 3 occurred twice, it should occur once.

So the new list will be something like k=[1,8,8,3]

What is the fastest way to do this? I did list.count() for every element but it was very slow.

Python Solutions


Solution 1 - Python

If order isn't important, a way would be to get the odd or even indexes only after a sort. Those lists will be the same so you only need one of them.

l = [1,8,8,8,1,3,3,8]
l.sort()

# Get all odd indexes
odd = l[1::2]

# Get all even indexes
even = l[::2]

print(odd)
print(odd == even)

Result:

[1, 3, 8, 8]
True

Solution 2 - Python

Use a counter to keep track of the count of each element

from collections import Counter
l = [1,8,8,8,1,3,3,8]
res = []
count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
for key, val in count.items():
    res.extend(val//2 * [key])
print(res)
# output
[1, 8, 8, 3]

Solution 3 - Python

Since you guarantee that each element of the list occurs a multiple of 2, then it is faster to build the counter as you build the output list, rather than building a counter (or sort) first and using it later.

l = [1,8,8,8,1,3,3,8]
count={}
res=[]
for i in l:
  if i in count: count[i]+=1
  else: count[i]=1
  if count[i]%2: res.append(i)

print(res)

Output

[1,8,8,3]

EDIT Comparing time/expense of each method

Using the timeit module shows that this approach is 2.7 times faster than using a counter first.

i.e.

def one():
  l = [1,8,8,8,1,3,3,8]
  count={}
  res=[]
  for i in l:
    if i in count: count[i]+=1
    else: count[i]=1
    if count[i]%2: res.append(i)

  #print(res)


def two():
  from collections import Counter
  l = [1,8,8,8,1,3,3,8]
  res = []
  count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
  for key, val in count.items():
    res.extend(val//2 * [key])

o=timeit.Timer(one)

t=timeit.Timer(two)

print(o.timeit(100000))

print(t.timeit(100000))

print(o.timeit(100000))

print(t.timeit(100000))

Output (seconds)

0.28666
0.80822
0.28678
0.80113

If order isn't important, then Wimanicesir's method would be preferred with 4x greater speedup, with result of 0.07037 (~11 times faster than with counter approach).

UPDATE I suspected that using the Counter method in two (unordered) may come with significant bloat or slow down in import, so I tested the "count first, compile result later" method while counting with the simple method here from one (ordered)

count={}
for i in l:
  if i in count: count[i]+=1
  else: count[i]=1

which was much faster than Counter. Replacing Counter in two of the tests defined resulted in a time of 0.31 instead of 0.80. Still slightly faster to compile (ordered) result during counting as in two, however. And much faster for unordered result to use Wimanicesir's method.

Solution 4 - Python

This is a classic use case of sets and I'm quite surprised that no one else tried it out to see how it stacks against the Counter and dict implementations.

I implemented a solution using set instead as follows:

def set_impl(l):
  bag = set()
  res = []
  for i in l:
    if i in bag:
      res.append(i)
      bag.remove(i)
    else:
      bag.add(i)

This implementation is about 28% faster than using Counter and 51% faster than using a dictionary.

The sort and slice implementation given by Wimanicesir is the fastest, giving results 17 times faster than when using set. Note though, that because it sorts the items before removing duplicates, the order of appearance is not preserved unlike the other three.

Here are all the suggested implementations with timing for evaluation of comparative performance.
https://repl.it/@franzalex/StackOverflow-py#removeDuplicateHalf.py

import random
import statistics as stats
from collections import Counter as counter
from timeit import Timer

def slice_impl(l):
  l.sort()
  res = l[::2]

def dict_impl(l):
  count={}
  res=[]
  for i in l:
    if i in count:
      count[i] += 1
    else:
      count[i] = 1
    if count[i] % 2:
      res.append(i)

def counter_impl(l):
  count = counter(l)
  res = []
  for key, val in count.items():
    res.extend(val//2 * [key])

def set_impl(l):
  bag = set()
  res = []
  for i in l:
    if i in bag:
      res.append(i)
      bag.remove(i)
    else:
      bag.add(i)

def timed_run():
  for name, func in {"Sort and Slice": slice_impl, 
                     "Dictionary": dict_impl, 
                     "Counter": counter_impl, 
                     "Set": set_impl}.items():
    seq = list(range(50))*2
    results = []
    print(f"{name} Implementation Results")
    for i in range(50):
      if len(results) % 10: random.shuffle(seq) # shuffle after 10 runs
      results.append(Timer(lambda: func(seq)).timeit(10**4))
      # print(f"Run {i+1:02}: {results[i]:.6f}")
    print("")
    print(f"Median:  {stats.median(results):.6f}")
    print(f"Mean:    {stats.mean(results):.6f}")
    print(f"Std Dev: {stats.stdev(results):.6f}")
    print("\n\n")

timed_run()

Sample run result

Sort and Slice Implementation Results

Median: 0.009686 Mean: 0.009721 Std Dev: 0.000529

Dictionary Implementation Results

Median: 0.230081 Mean: 0.227631 Std Dev: 0.014584

Counter Implementation Results

Median: 0.192730 Mean: 0.194577 Std Dev: 0.008015

Set Implementation Results

Median: 0.149604 Mean: 0.151227 Std Dev: 0.006838

Solution 5 - Python

Instead of using a counter, which keeps track of an integer for each possible element of the list, try mapping elements to booleans using a dictionary. Map to true the first time they're seen, and then every time after that flip the bit, and if it's true skip the element.

Solution 6 - Python

If you are not concerned about preserving relative order, you can first get a count of each element using collections.Counter, then create a new list with each element duplicated half as many times.

>>> from collections import Counter
>>> from itertools import chain
>>> list(chain.from_iterable([key]*(count//2) for key, count in Counter(l).items()))
[1, 8, 8, 3]

Solution 7 - Python

you keep a list of all items that have been visited an uneven number of times. then you iterate over all list items.

in other langauges would probably use some map() or filter() method, but here is some simple code since i don't know python well enough! :)

l = [1,8,8,8,1,3,3,8]
seen = []
result = []
for num in l:
  if num in seen:
    seen.remove(num)
    #result.append(num) #print every even appearance
  else:
    seen.append(num)
    result.append(num) #print every odd appearance

if len(seen)==0:
  print(result)
else:
  print("Error: uneven elements found:", seen)

at the end the visited-array should be empty, so you can use that as a sanity check before returning the result-array.

edit: here's a version with filter that returns the odd appearances

l = [1,8,8,8,1,3,3,8]
seen = []
result = list(filter(lambda x: seen.append(x) is None if x not in seen else not seen.remove(x) is None, l))

if len(seen)==0:
  print(result)
else:
  print("Error: uneven elements found:", seen)

and this one returns the even appearances:

l = [1,8,8,8,1,3,3,8]
seen = []
result = list(filter(lambda x: seen.remove(x) is None if x in seen else not seen.append(x) is None, l))

if len(seen)==0:
  print(result)
else:
  print("Error: uneven elements found:", seen)

Solution 8 - Python

I like using a trie set, as you need to detect duplicates to remove them, or a big hash set (lots of buckets). The trie does not go unbalanced and you do not need to know the size of the final set. An alternative is a very parallel sort -- brute force.

Solution 9 - Python

I know this has been answered and there are some quite lengthy solutions. And it specifically mentioned Python. However, I thought a Powershell solution might be interesting (and simple!) for some:

Version 1 (grouping items - less efficient)

$OriginalArray = @("1","8","8","8","1","3","3","8") 
$NewArray = New-ObjectSystem.Collections.ArrayList
$ArrayGroup = $OriginalArray | Group-Object | Select-Object Count,Name

ForEach ($EachNumber in $ArrayGroup) {
    $HalfTheCount = (1..([Math]::Round($EachNumber.Count / 2)))
    ForEach ($Item in $HalfTheCount) {$NewArray.Add($EachNumber.Name) | Out-Null}   
    } 
$NewArray

Version 2 (picking every other item from a sorted array - more efficient)

$OriginalArray = @("1","8","8","8","1","3","3","8") 

$NewArray = New-Object System.Collections.ArrayList

$OddOrEven = "Even"
ForEach ($SortedItem in ($OriginalArray | Sort-Object)) {
    If ($OddOrEven -eq "Even") {$NewArray.Add($SortedItem);$EvenNumber = $True}
    If ($OddOrEven -eq "Odd") {$EvenNumber = $False}
    If ($EvenNumber -eq $True) {$OddOrEven = "Odd"} Else {$OddOrEven = "Even"} 
}
$NewArray

Solution 10 - Python

import itertools

st=time.time()
lst = [1,8,8,8,1,3,3,8]
list(itertools.chain.from_iterable(itertools.repeat(x, int(lst.count(x)/2)) for x in list(set(lst)) if lst.count(x)%2==0))

This gives sorted list

Solution 11 - Python

If the order matters, following code can work with O(N):

import collections
c = collections.Counter(l)
c2 = collections.Counter()
i, n = 0, len(l)
res=[]
for x in l:
    if i == n//2:break
    if c2[x] < c[x] // 2:
        res.append(x)
        c2[x] += 1
        i += 1

Solution 12 - Python

Maybe this.

    newList = []
    for number in l:
        if(newList.count(number) < l.count(number)/2):
            newList.append(number)

print(newList)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionNePtUnEView Question on Stackoverflow
Solution 1 - PythonWimanicesirView Answer on Stackoverflow
Solution 2 - Pythonuser13392187View Answer on Stackoverflow
Solution 3 - PythonjpfView Answer on Stackoverflow
Solution 4 - PythonAlex EssilfieView Answer on Stackoverflow
Solution 5 - PythonJad GhalayiniView Answer on Stackoverflow
Solution 6 - PythonCory KramerView Answer on Stackoverflow
Solution 7 - Pythonthe great meView Answer on Stackoverflow
Solution 8 - PythonDavid G. PickettView Answer on Stackoverflow
Solution 9 - PythonAndy PyneView Answer on Stackoverflow
Solution 10 - PythonJUGAL KISHOREView Answer on Stackoverflow
Solution 11 - PythonYang LiuView Answer on Stackoverflow
Solution 12 - Pythonfaiz-eView Answer on Stackoverflow