How to find and return a duplicate value in array

Ruby, Arrays

Ruby Problem Overview


arr is an array of strings:

["hello", "world", "stack", "overflow", "hello", "again"]

What would be an easy and elegant way to check if arr has duplicates, and if so, return one of them (no matter which)?

Examples:

["A", "B", "C", "B", "A"]    # => "A" or "B"
["A", "B", "C"]              # => nil

Ruby Solutions


Solution 1 - Ruby

a = ["A", "B", "C", "B", "A"]
a.detect{ |e| a.count(e) > 1 }

I know this isn't a very elegant answer, but I love it. It's a beautiful one-liner, and it works perfectly fine unless you need to process a huge data set (count rescans the whole array for every element, so it is O(n^2)).

Looking for a faster solution? Here you go!

def find_one_using_hash_map(array)
  map = {}
  dup = nil
  array.each do |v|
    map[v] = (map[v] || 0 ) + 1

    if map[v] > 1
      dup = v
      break
    end
  end

  return dup
end

It's linear, O(n), but now you have to manage multiple lines of code, write test cases, and so on.

If you need an even faster solution, maybe try C instead.

And here is the gist comparing different solutions: https://gist.github.com/naveed-ahmad/8f0b926ffccf5fbd206a1cc58ce9743e

Solution 2 - Ruby

You can do this in a few ways, with the first option being the fastest:

ary = ["A", "B", "C", "B", "A"]

ary.group_by{ |e| e }.select { |k, v| v.size > 1 }.map(&:first)

ary.sort.chunk{ |e| e }.select { |e, chunk| chunk.size > 1 }.map(&:first)

And an O(N^2) option (i.e. less efficient):

ary.select{ |e| ary.count(e) > 1 }.uniq

Solution 3 - Ruby

Simply find the first instance where the index of the object (counting from the left) does not equal the index of the object (counting from the right).

arr.detect {|e| arr.rindex(e) != arr.index(e) }

If there are no duplicates, the return value will be nil.
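
To make the index comparison concrete, here is a small sketch using the array from the question. The variable names positions and first_dup are only for illustration.

```ruby
arr = ["hello", "world", "stack", "overflow", "hello", "again"]

# Pair each element with its first and last position in the array.
positions = arr.map { |e| [e, arr.index(e), arr.rindex(e)] }

# An element occurs more than once exactly when the two positions differ.
first_dup = arr.detect { |e| arr.rindex(e) != arr.index(e) }

p positions.first  # => ["hello", 0, 4]
p positions[1]     # => ["world", 1, 1]
p first_dup        # => "hello"
```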

I believe this is the fastest solution posted in the thread so far, as well, since it doesn't rely on the creation of additional objects, and #index and #rindex are implemented in C. The big-O runtime is O(N^2), which is asymptotically slower than Sergio's, but the wall time could be much faster due to the fact that the "slow" parts run in C.

Solution 4 - Ruby

detect only finds one duplicate. find_all will find them all:

a = ["A", "B", "C", "B", "A"]
a.find_all { |e| a.count(e) > 1 }

Solution 5 - Ruby

Here are two more ways of finding a duplicate.

Use a set

require 'set'

def find_a_dup_using_set(arr)
  s = Set.new
  arr.find { |e| !s.add?(e) }
end

find_a_dup_using_set arr
  #=> "hello" 

Use select in place of find to return an array of all duplicates.
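
As a rough sketch of that select variant (the helper name find_all_dups_using_set is made up for this example): every occurrence after the first is returned, so an element that appears three times shows up twice, and the result is an empty array rather than nil when nothing repeats.

```ruby
require 'set'

# Hypothetical helper: collects every repeated occurrence after the first.
def find_all_dups_using_set(arr)
  s = Set.new
  # Set#add? returns nil when the element is already present.
  arr.select { |e| !s.add?(e) }
end

p find_all_dups_using_set(["A", "B", "C", "B", "A", "A"])       # => ["B", "A", "A"]
p find_all_dups_using_set(["A", "B", "C", "B", "A", "A"]).uniq  # => ["B", "A"]
p find_all_dups_using_set(["A", "B", "C"])                      # => []
```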

Use Array#difference

class Array
  def difference(other)
    h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
    reject { |e| h[e] > 0 && h[e] -= 1 }
  end
end

def find_a_dup_using_difference(arr)
  arr.difference(arr.uniq).first
end

find_a_dup_using_difference arr
  #=> "hello" 

Drop .first to return an array of all duplicates.

Both methods return nil if there are no duplicates.
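
One caveat worth knowing: Ruby 2.6 and later ship a core Array#difference with set-like semantics (it behaves like Array#-), so reopening Array as above replaces that method. The sketch below shows the same multiset-style difference as a standalone method instead; the name multiset_difference is hypothetical and not part of Ruby.

```ruby
# Hypothetical standalone helper mirroring the difference method defined above,
# written as a plain method so that no core class is reopened.
def multiset_difference(arr, other)
  counts = other.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  # Drop one occurrence of e for each time e appears in other.
  arr.reject { |e| counts[e] > 0 && counts[e] -= 1 }
end

arr = ["hello", "world", "stack", "overflow", "hello", "again"]

p multiset_difference(arr, arr.uniq)               # => ["hello"]
p multiset_difference(arr, arr.uniq).first         # => "hello"
p multiset_difference(%w[A B C], %w[A B C]).first  # => nil
```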

I proposed that Array#difference be added to the Ruby core. More information is in my answer here.

Benchmark

Let's compare suggested methods. First, we need an array for testing:

CAPS = ('AAA'..'ZZZ').to_a.first(10_000)
def test_array(nelements, ndups)
  arr = CAPS[0, nelements-ndups]
  arr = arr.concat(arr[0,ndups]).shuffle
end

and a method to run the benchmarks for different test arrays:

require 'fruity'

def benchmark(nelements, ndups)
  arr = test_array nelements, ndups
  puts "\n#{ndups} duplicates\n"    
  compare(
    Naveed:    -> {arr.detect{|e| arr.count(e) > 1}},
    Sergio:    -> {(arr.inject(Hash.new(0)) {|h,e| h[e] += 1; h}.find {|k,v| v > 1} ||
                     [nil]).first },
    Ryan:      -> {(arr.group_by{|e| e}.find {|k,v| v.size > 1} ||
                     [nil]).first},
    Chris:     -> {arr.detect {|e| arr.rindex(e) != arr.index(e)} },
    Cary_set:  -> {find_a_dup_using_set(arr)},
    Cary_diff: -> {find_a_dup_using_difference(arr)}
  )
end

I did not include @JjP's answer because only one duplicate is to be returned, and when his/her answer is modified to do that it is the same as @Naveed's earlier answer. Nor did I include @Marin's answer, which, while posted before @Naveed's answer, returned all duplicates rather than just one (a minor point, but there's no point evaluating both, as they are identical when returning just one duplicate).

I also modified other answers that returned all duplicates to return just the first one found, but that should have essentially no effect on performance, as they computed all duplicates before selecting one.

The results for each benchmark are listed from fastest to slowest:

First suppose the array contains 100 elements:

benchmark(100, 0)
0 duplicates
Running each test 64 times. Test will take about 2 seconds.
Cary_set is similar to Cary_diff
Cary_diff is similar to Ryan
Ryan is similar to Sergio
Sergio is faster than Chris by 4x ± 1.0
Chris is faster than Naveed by 2x ± 1.0

benchmark(100, 1)
1 duplicates
Running each test 128 times. Test will take about 2 seconds.
Cary_set is similar to Cary_diff
Cary_diff is faster than Ryan by 2x ± 1.0
Ryan is similar to Sergio
Sergio is faster than Chris by 2x ± 1.0
Chris is faster than Naveed by 2x ± 1.0

benchmark(100, 10)
10 duplicates
Running each test 1024 times. Test will take about 3 seconds.
Chris is faster than Naveed by 2x ± 1.0
Naveed is faster than Cary_diff by 2x ± 1.0 (results differ: AAC vs AAF)
Cary_diff is similar to Cary_set
Cary_set is faster than Sergio by 3x ± 1.0 (results differ: AAF vs AAC)
Sergio is similar to Ryan

Now consider an array with 10,000 elements:

benchmark(10000, 0)
0 duplicates
Running each test once. Test will take about 4 minutes.
Ryan is similar to Sergio
Sergio is similar to Cary_set
Cary_set is similar to Cary_diff
Cary_diff is faster than Chris by 400x ± 100.0
Chris is faster than Naveed by 3x ± 0.1

benchmark(10000, 1)
1 duplicates
Running each test once. Test will take about 1 second.
Cary_set is similar to Cary_diff
Cary_diff is similar to Sergio
Sergio is similar to Ryan
Ryan is faster than Chris by 2x ± 1.0
Chris is faster than Naveed by 2x ± 1.0

benchmark(10000, 10)
10 duplicates
Running each test once. Test will take about 11 seconds.
Cary_set is similar to Cary_diff
Cary_diff is faster than Sergio by 3x ± 1.0 (results differ: AAE vs AAA)
Sergio is similar to Ryan
Ryan is faster than Chris by 20x ± 10.0
Chris is faster than Naveed by 3x ± 1.0

benchmark(10000, 100)
100 duplicates
Cary_set is similar to Cary_diff
Cary_diff is faster than Sergio by 11x ± 10.0 (results differ: ADG vs ACL)
Sergio is similar to Ryan
Ryan is similar to Chris
Chris is faster than Naveed by 3x ± 1.0

Note that find_a_dup_using_difference(arr) would be much more efficient if Array#difference were implemented in C, which would be the case if it were added to the Ruby core.

Conclusion

Many of the answers are reasonable but using a Set is the clear best choice. It is fastest in the medium-hard cases, joint fastest in the hardest and only in computationally trivial cases - when your choice won't matter anyway - can it be beaten.

The one very special case in which you might pick Chris' solution would be if you want to use the method to separately de-duplicate thousands of small arrays and expect to find a duplicate typically less than 10 items in. This will be a bit faster as it avoids the small additional overhead of creating the Set.

Solution 6 - Ruby

Alas most of the answers are O(n^2).

Here is an O(n) solution,

a = %w{the quick brown fox jumps over the lazy dog}
h = Hash.new(0)
a.find { |each| (h[each] += 1) == 2 } # => "the"

What is the complexity of this?

  • Runs in O(n) and breaks on first match
  • Uses O(n) memory, but only the minimal amount

Now, depending on how frequent duplicates are in your array, these bounds might actually become even better. For example, if the array of size n has been sampled from a population of k << n distinct elements, the complexity for both runtime and space becomes O(k). However, it is more likely that the original poster is validating input and wants to make sure there are no duplicates. In that case both runtime and memory complexity are O(n), since we expect most inputs to contain no repetitions.
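
To illustrate the early exit, here is a small sketch that also counts how many elements were examined. The helper first_dup_with_steps and its step counter are additions for this example only.

```ruby
# Hypothetical helper: returns the first duplicate together with the number
# of elements that had to be examined before it was found.
def first_dup_with_steps(array)
  seen = Hash.new(0)
  steps = 0
  dup = array.find do |e|
    steps += 1
    (seen[e] += 1) == 2   # true the second time e is seen
  end
  [dup, steps]
end

p first_dup_with_steps(%w[the quick brown fox jumps over the lazy dog])  # => ["the", 7]
p first_dup_with_steps(%w[a b c d])                                      # => [nil, 4]
```

With a duplicate present the scan stops at the second "the" (7 of 9 elements); without one, every element is visited once.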

Solution 7 - Ruby

Ruby Array objects have a great method, select.

select {|item| block } → new_ary
select → an_enumerator

The first form is what interests you here. It allows you to select objects which pass a test.

Ruby Array objects have another method, count.

count → int
count(obj) → int
count { |item| block } → int

In this case, you are interested in duplicates (objects which appear more than once in the array). The appropriate test is a.count(obj) > 1.

If a = ["A", "B", "C", "B", "A"], then

a.select{|item| a.count(item) > 1}.uniq
=> ["A", "B"]

You state that you only want one object. So pick one.
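
For instance, a minimal way to pick one is to take the first element of the result, which also gives nil when there are no duplicates:

```ruby
a = ["A", "B", "C", "B", "A"]
dups = a.select { |item| a.count(item) > 1 }.uniq

p dups        # => ["A", "B"]
p dups.first  # => "A"

# With no duplicates the selection is empty, so first returns nil.
b = ["A", "B", "C"]
p b.select { |item| b.count(item) > 1 }.uniq.first  # => nil
```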

Solution 8 - Ruby

find_all() returns an array containing all elements of enum for which block is not false.

To get duplicate elements

>> arr = ["A", "B", "C", "B", "A"]
>> arr.find_all { |x| arr.count(x) > 1 }

=> ["A", "B", "B", "A"]

Or only the unique duplicate elements

>> arr.find_all { |x| arr.count(x) > 1 }.uniq
=> ["A", "B"] 

Solution 9 - Ruby

Something like this will work

arr = ["A", "B", "C", "B", "A"]
arr.inject(Hash.new(0)) { |h,e| h[e] += 1; h }.
    select { |k,v| v > 1 }.
    collect { |x| x.first }

That is, put all values into a hash where the key is an element of the array and the value is its number of occurrences. Then select all elements which occur more than once. Easy.
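
Broken into its intermediate steps, the chain looks roughly like this (the variable names counts and repeated are only for illustration):

```ruby
arr = ["A", "B", "C", "B", "A"]

# Step 1: count occurrences; counts maps "A" to 2, "B" to 2 and "C" to 1.
counts = arr.inject(Hash.new(0)) { |h, e| h[e] += 1; h }

# Step 2: keep only the entries seen more than once.
repeated = counts.select { |_, v| v > 1 }

# Step 3: take the keys, i.e. the duplicated elements themselves.
p repeated.collect { |x| x.first }  # => ["A", "B"]
```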

Solution 10 - Ruby

I know this thread is about Ruby specifically, but I landed here looking for how to do this within the context of Ruby on Rails with ActiveRecord and thought I would share my solution too.

class ActiveRecordClass < ActiveRecord::Base
  #has two columns, a primary key (id) and an email_address (string)
end

ActiveRecordClass.group(:email_address).having("count(*) > 1").count.keys

The above returns an array of all email addresses that are duplicated in this example's database table (which in Rails would be "active_record_classes").

Solution 11 - Ruby

a = ["A", "B", "C", "B", "A"]
a.each_with_object(Hash.new(0)) {|i,hash| hash[i] += 1}.select{|_, count| count > 1}.keys

This is an O(n) procedure.

Alternatively, you can use either of the following lines. They are also O(n), but make only one pass over the data:

a.each_with_object(Hash.new(0).merge dup: []){|x,h| h[:dup] << x if (h[x] += 1) == 2}[:dup]

a.inject(Hash.new(0).merge dup: []){|h,x| h[:dup] << x if (h[x] += 1) == 2;h}[:dup]

Solution 12 - Ruby

This code will return a list of duplicated values. Hash keys are used as an efficient way of checking which values have already been seen. Based on whether a value has been seen, the original array ary is partitioned into 2 arrays: the first containing unique values and the second containing duplicates.

ary = ["hello", "world", "stack", "overflow", "hello", "again"]

hash={}
ary.partition { |v| hash.has_key?(v) ? false : hash[v]=0 }.last.uniq

=> ["hello"]

You can shorten it further, albeit at the cost of slightly more complex syntax, to this form:

hash={}
ary.partition { |v| !hash.has_key?(v) && hash[v]=0 }.last.uniq

Solution 13 - Ruby

Ruby 2.7 introduced Enumerable#tally

And you can use it this way:

ary = ["A", "B", "C", "B", "A", "A"]

ary.tally.select { |_, count| count > 1 }.keys
# => ["A", "B"]
ary = ["A", "B", "C"]

ary.tally.select { |_, count| count > 1 }.keys
# => []
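
Since the question asks for just one duplicate (or nil), a small variation could use find on the tally instead of select. The helper name first_dup_using_tally is hypothetical.

```ruby
# Requires Ruby 2.7+ for Enumerable#tally.
# Hypothetical helper: returns one duplicated element, or nil if there is none.
def first_dup_using_tally(ary)
  pair = ary.tally.find { |_, count| count > 1 }
  pair && pair.first
end

p first_dup_using_tally(["A", "B", "C", "B", "A", "A"])  # => "A"
p first_dup_using_tally(["A", "B", "C"])                 # => nil
```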

Solution 14 - Ruby

Here is my take on it for a big data set, such as a legacy dBase table in which you need to find duplicate parts:

# Assuming ps is an array of 20000 part numbers & we want to find duplicates
# (I actually had to do it recently).
# Having a result hash with the part number and the number of times the part is
# duplicated is much more convenient in a real-world application.
# Takes about 6 seconds to run on my data set
# - not too bad for an export script handling 20000 parts

h = {}; ps.select { |e| ct = ps.count(e); h[e] = ct if ct > 1 }; nil

# or for readability

h = {} # result hash
ps.select{ |e| 
  ct = ps.count(e) 
  h[e] = ct if ct > 1
}; nil # so that the huge result of select doesn't print in the console

Solution 15 - Ruby

r = [1, 2, 3, 5, 1, 2, 3, 1, 2, 1]

r.group_by(&:itself).map { |k, v| v.size > 1 ? [k] + [v.size] : nil }.compact.sort_by(&:last).map(&:first)

Solution 16 - Ruby

each_with_object is your friend!

input = [:bla,:blubb,:bleh,:bla,:bleh,:bla,:blubb,:brrr]

# to get the counts of the elements in the array:
> input.each_with_object({}){|x,h| h[x] ||= 0; h[x] += 1}
=> {:bla=>3, :blubb=>2, :bleh=>2, :brrr=>1}

# to get only the counts of the non-unique elements in the array:
> input.each_with_object({}){|x,h| h[x] ||= 0; h[x] += 1}.reject{|k,v| v < 2}
=> {:bla=>3, :blubb=>2, :bleh=>2}

Solution 17 - Ruby

a = ["A", "B", "C", "B", "A"]
b = a.select {|e| a.count(e) > 1}.uniq
c = a - b
d = b + c

Results

 d
=> ["A", "B", "C"]

Solution 18 - Ruby

If you are comparing two different arrays (instead of one against itself), a very fast way is to use the intersection operator & provided by Ruby's Array class.

# Given
a = ['a', 'b', 'c', 'd']
b = ['e', 'f', 'c', 'd']

# Then this...
a & b # => ['c', 'd']
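
One detail to keep in mind, shown in the short sketch below: & returns each common element only once, in the order of the left-hand array, so it finds values shared between two arrays rather than repeats within one array.

```ruby
a = ['a', 'b', 'c', 'd', 'c']
b = ['e', 'f', 'c', 'd', 'd']

# Common elements appear once each, even though 'c' and 'd' repeat in the inputs.
p a & b  # => ["c", "d"]

# Intersecting an array with itself just removes repeats; it does not list them.
p a & a  # => ["a", "b", "c", "d"]
```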

Solution 19 - Ruby

This runs very quickly: iterating through 2.3 million IDs and pushing the duplicates into their own array took less than a second.

I had to do this at work with 2.3 million IDs that I had imported into a file. I imported the list already sorted, but it can also be sorted in Ruby, as below.

require 'csv'

# path is the location of the CSV file holding the IDs
list = CSV.read(path).flatten.sort
dup_list = []
list.each_with_index do |id, index|
  # in a sorted list, equal IDs sit next to each other
  dup_list.push(id) if id == list[index + 1]
end
dup_list.uniq
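
The same sort-then-compare-neighbours idea can be sketched with each_cons. The small in-memory list below is a made-up stand-in for the IDs read from the CSV file.

```ruby
# Made-up sample data standing in for the IDs read from the file.
list = %w[103 101 102 101 104 103 101]

dup_list = list.sort
               .each_cons(2)               # every pair of neighbours in the sorted list
               .select { |a, b| a == b }   # keep pairs whose two members are equal
               .map(&:first)
               .uniq                       # an ID seen 3+ times yields several pairs

p dup_list  # => ["101", "103"]
```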

Solution 20 - Ruby

I needed to find out how many duplicates there were and what they were so I wrote a function building off of what Naveed had posted earlier:

def print_duplicates(array)
  puts "Array count: #{array.count}"
  map = {}
  total_dups = 0
  array.each do |v|
    map[v] = (map[v] || 0 ) + 1
  end

  map.each do |k, v|
    if v != 1
      puts "#{k} appears #{v} times"
      total_dups += 1
    end
  end
  puts "Total items that are duplicated: #{total_dups}"
end

Solution 21 - Ruby

Try this! If you want to find the most frequently duplicated element along with how many times it occurs, you can use:

    def get_maximum_duplicated_element_with_count(input_array)
      max_duplicated_val = max_duplicated_val_count = 0
      input_array.each do |n|
        # count once per element instead of twice
        count = input_array.count(n)
        max_duplicated_val, max_duplicated_val_count = n, count if count > max_duplicated_val_count
      end
      puts "Maximum duplicated element is => #{max_duplicated_val}"
      puts "#{max_duplicated_val} is duplicated #{max_duplicated_val_count} times"
    end
    get_maximum_duplicated_element_with_count([1, 4, 4, 5, 6, 6, 2, 6])

Output will be

Maximum duplicated element is => 6
6 is duplicated 3 times

Solution 22 - Ruby

  1. Let's create a duplication method that takes an array of elements as input.
  2. In the method body, create two new arrays: seen_objects and duplication_objects.
  3. Iterate through each object in the given array and, on every iteration, check whether that object already exists in seen_objects.
  4. If the object exists in seen_objects, it is a duplicate, so push it onto duplication_objects.
  5. In either case, push the object onto seen_objects so that later occurrences are recognised as duplicates.

Here is the implementation:

def duplication given_array
  seen_objects = []
  duplication_objects = []

  given_array.each do |element|
    duplication_objects << element if seen_objects.include?(element)
    seen_objects << element
  end

  duplication_objects
end

Now call the duplication method and print the result:

dup_elements = duplication [1,2,3,4,4,5,6,6]
puts dup_elements.inspect

Solution 23 - Ruby

[1,2,3].uniq!.nil?   # => true
[1,2,3,3].uniq!.nil? # => false

Note that uniq! is destructive: it removes the duplicates from the array in place.
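
If the array must stay untouched, one possible non-destructive check (a sketch, not part of the original answer) is to compare lengths before and after uniq:

```ruby
arr = [1, 2, 3, 3]

# uniq returns a new array, so arr itself is left as it was.
has_dups = arr.uniq.length != arr.length
p has_dups  # => true
p arr       # => [1, 2, 3, 3]

# uniq! on a copy shows the in-place effect without touching arr.
copy = arr.dup
copy.uniq!
p copy      # => [1, 2, 3]
```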

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Misha Moroshko | View Question on Stackoverflow
Solution 1 - Ruby | Naveed | View Answer on Stackoverflow
Solution 2 - Ruby | Ryan LeCompte | View Answer on Stackoverflow
Solution 3 - Ruby | Chris Heald | View Answer on Stackoverflow
Solution 4 - Ruby | JjP | View Answer on Stackoverflow
Solution 5 - Ruby | Cary Swoveland | View Answer on Stackoverflow
Solution 6 - Ruby | akuhn | View Answer on Stackoverflow
Solution 7 - Ruby | Martin Velez | View Answer on Stackoverflow
Solution 8 - Ruby | Rokibul Hasan | View Answer on Stackoverflow
Solution 9 - Ruby | Sergio Tulentsev | View Answer on Stackoverflow
Solution 10 - Ruby | danielricecodes | View Answer on Stackoverflow
Solution 11 - Ruby | benzhang | View Answer on Stackoverflow
Solution 12 - Ruby | cryptogopher | View Answer on Stackoverflow
Solution 13 - Ruby | mechnicov | View Answer on Stackoverflow
Solution 14 - Ruby | konung | View Answer on Stackoverflow
Solution 15 - Ruby | Dorian | View Answer on Stackoverflow
Solution 16 - Ruby | Tilo | View Answer on Stackoverflow
Solution 17 - Ruby | Amrit Dhungana | View Answer on Stackoverflow
Solution 18 - Ruby | IAmNaN | View Answer on Stackoverflow
Solution 19 - Ruby | Charlie | View Answer on Stackoverflow
Solution 20 - Ruby | muneebahmad | View Answer on Stackoverflow
Solution 21 - Ruby | Giridharan | View Answer on Stackoverflow
Solution 22 - Ruby | Yugesh Palvai | View Answer on Stackoverflow
Solution 23 - Ruby | Max | View Answer on Stackoverflow