Function to make a list as unsorted as possible

PythonAlgorithmSorting

Python Problem Overview


I am looking for a function to make the list as unsorted as possible. Preferably in Python.

Backstory:

I want to check URLs statuses and see if URLs give a 404 or not. I just use asyncio and requests modules. Nothing fancy.

Now I don't want to overload servers, so I want to minimize checking URLs which are on the same domain at the same time. I have this idea to sort the URLs in a way that items which are close to one another (having the same sort key = domain name) are placed as far apart from each other in the list as possible.

An example with numbers:

a=[1,1,2,3,3]  # <== sorted list, sortness score = 2
   0,1,2,3,4   # <== positions

could be unsorted as:

b=[1,3,2,1,3]  # <== unsorted list, sortness score = 6
   0,1,2,3,4   # <== positions

I would say that we can compute a sortness score by summing up the distances between equal items (which have the same key = domain name). Higher sortness means better unsorted. Maybe there is a better way for testing unsortness.

The sortness score for list a is 2. The sum of distances for 1 is (1-0)=1, for 2 is 0, for 3 is (4-3)=1.

The sortness score for list b is 6. The sum of distances for 1 is (3-0)=3, for 2 is 0, for 3 is (4-1)=3.

URLs list would look something like a list of (domain, URL) tuples:

[
   ('example.com', 'http://example.com/404'),
   ('test.com', 'http://test.com/404'),
   ('test.com', 'http://test.com/405'),
   ('example.com', 'http://example.com/405'),
   ...
]

I am working on a prototype which works Ok-ish, but not optimal as I can find some variants which are better unsorted by hand.

Anyone wants to give it a go?

This is my code, but it's not great :):

from collections import Counter
from collections import defaultdict
import math


def test_unsortness(lst:list) -> float:
    pos = defaultdict(list)
    score = 0
    # Store positions for each key
    # input = [1,3,2,3,1] => {1: [0, 4], 3: [1, 3], 2: [2]}
    for c,l in enumerate(lst):
        pos[l].append(c)
    for k,poslst in pos.items():
        for i in range(len(poslst)-1):
            score += math.sqrt(poslst[i+1] - poslst[i])
    return score


def unsort(lst:list) -> list:
    free_positions = list(range(0,len(lst)))
    output_list = [None] * len(free_positions)
    for val, count in Counter(lst).most_common():
        pos = 0
        step = len(free_positions) / count
        for i in range(count):
            output_list[free_positions[int(pos)]] = val
            free_positions[int(pos)] = None  # Remove position later
            pos = pos + step
        free_positions = [p for p in free_positions if p]
    return output_list


lsts = list()
lsts.append( [1,1,2,3,3] )
lsts.append( [1,3,2,3,1] )       # This has the worst score after unsort()
lsts.append( [1,2,3,0,1,2,3] )   # This has the worst score after unsort()
lsts.append( [3,2,1,0,1,2,3] )   # This has the worst score after unsort()
lsts.append( [3,2,1,3,1,2,3] )   # This has the worst score after unsort()
lsts.append( [1,2,3,4,5] )

for lst in lsts:
    ulst = unsort(lst)
    print( ( lst, '%.2f'%test_unsortness(lst), '====>', ulst, '%.2f'%test_unsortness(ulst), ) )

#  Original               score             Unsorted               score
#  -------                -----             --------               -----
# ([1, 1, 2, 3, 3],       '2.00',  '====>', [1, 3, 1, 3, 2],       '2.83')
# ([1, 3, 2, 3, 1],       '3.41',  '====>', [1, 3, 1, 3, 2],       '2.83')
# ([1, 2, 3, 0, 1, 2, 3], '6.00',  '====>', [1, 2, 3, 1, 2, 3, 0], '5.20')
# ([3, 2, 1, 0, 1, 2, 3], '5.86',  '====>', [3, 2, 1, 3, 2, 1, 0], '5.20')
# ([3, 2, 1, 3, 1, 2, 3], '6.88',  '====>', [3, 2, 3, 1, 3, 2, 1], '6.56')
# ([1, 2, 3, 4, 5],       '0.00',  '====>', [1, 2, 3, 4, 5],       '0.00')

PS. I am not looking just for a randomize function and I know there are crawlers which can manage domain loads, but this is for the sake of exercise.

Python Solutions


Solution 1 - Python

Instead of unsorting your list of URLs, why not grouping them by domain, each in a queue, then process them asynchronously with a delay (randomised?) in between?

It looks to me less complex than what you're trying to do to achieve the same thing and if you have a lot of domain, you can always throttle the number to run through concurrently at that point.

Solution 2 - Python

I used Google OR Tools to solve this problem. I framed it as a constraint optimization problem and modeled it that way.

from collections import defaultdict
from itertools import chain, combinations
from ortools.sat.python import cp_model

model = cp_model.CpModel()
data = [
   ('example.com', 'http://example.com/404'),
   ('test.com', 'http://test.com/404'),
   ('test.com', 'http://test.com/405'),
   ('example.com', 'http://example.com/405'),
   ('google.com', 'http://google.com/404'),
   ('example.com', 'http://example.com/406'),
   ('stackoverflow.com', 'http://stackoverflow.com/404'),
   ('test.com', 'http://test.com/406'),
   ('example.com', 'http://example.com/407')
]

tmp = defaultdict(list)
for (domain, url) in sorted(data):
    var = model.NewIntVar(0, len(data) - 1, url)
    tmp[domain].append(var)  # store URLs as model variables where the key is the domain

vals = list(chain.from_iterable(tmp.values()))  # create a single list of all variables
model.AddAllDifferent(vals)  # all variables must occupy a unique spot in the output

constraint = []
for urls in tmp.values():
    if len(urls) == 1:  # a single domain does not need a specific constraint
        constraint.append(urls[0])
        continue
    combos = combinations(urls, 2)
    for (x, y) in combos:  # create combinations between each URL of a specific domain
        constraint.append((x - y))

model.Maximize(sum(constraint))  # maximize the distance between similar URLs from our constraint list

solver = cp_model.CpSolver()
status = solver.Solve(model)
output = [None for _ in range(len(data))]

if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
    for val in vals:
        idx = solver.Value(val)
        output[idx] = val.Name()

print(output)
['http://example.com/407',
 'http://test.com/406',
 'http://example.com/406',
 'http://test.com/405',
 'http://example.com/405',
 'http://stackoverflow.com/404',
 'http://google.com/404',
 'http://test.com/404',
 'http://example.com/404']

Solution 3 - Python

There is no obvious definition of unsortedness that would work best for you, but here's something that at least works well:

  1. Sort the list
  2. If the length of the list is not a power of two, then spread the items out evenly in a list with the next power of two size
  3. Find a new index for each item by reversing the bits in its old index.
  4. Remove the gaps to bring the list back to its original size.

In sorted order, the indexes of items that are close together usually differ only in the smallest bits. By reversing the bit order, you make the new indexes for items that are close together differ in the largest bits, so they will end up far apart.

def bitreverse(x, bits):
    # reverse the lower 32 bits
    x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
    x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
    x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
    x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
    x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
    # take only the appropriate length
    return (x>>(32-bits)) & ((1<<bits)-1)

def antisort(inlist): 
    if len(inlist) < 3:
        return inlist
    inlist = sorted(inlist)
    #get the next power of 2 list length
    p2len = 2
    bits = 1
    while p2len < len(inlist):
        p2len *= 2
        bits += 1
    templist = [None] * p2len
    for i in range(len(inlist)):
        newi = i * p2len // len(inlist)
        newi = bitreverse(newi, bits)
        templist[newi] = inlist[i]
    return [item for item in templist if item != None]

print(antisort(["a","b","c","d","e","f","g",
    "h","i","j","k","l","m","n","o","p","q","r",
    "s","t","u","v","w","x","y","z"]))

Output:

['a', 'n', 'h', 'u', 'e', 'r', 'k', 'x', 'c', 'p', 'f', 's', 'm', 'z', 'b', 'o', 'i', 'v', 'l', 'y', 'd', 'q', 'j', 'w', 'g', 't']

Solution 4 - Python

You could implement an inverted binary search.

from typing import Union, List

sorted_int_list = [1, 1, 2, 3, 3]
unsorted_int_list = [1, 3, 2, 1, 3]

sorted_str_list = [
    "example.com",
    "example.com",
    "test.com",
    "stackoverflow.com",
    "stackoverflow.com",
]
unsorted_str_list = [
    "example.com",
    "stackoverflow.com",
    "test.com",
    "example.com",
    "stackoverflow.com",
]


def inverted_binary_search(
    input_list: List[Union[str, int]],
    search_elem: Union[int, str],
    list_selector_start: int,
    list_selector_end: int,
) -> int:
    if list_selector_end - list_selector_start <= 1:
        if search_elem < input_list[list_selector_start]:
            return list_selector_start - 1
        else:
            return list_selector_start

    list_selector_mid = (list_selector_start + list_selector_end) // 2
    if input_list[list_selector_mid] > search_elem:
        return inverted_binary_search(
            input_list=input_list,
            search_elem=search_elem,
            list_selector_start=list_selector_mid,
            list_selector_end=list_selector_end,
        )
    elif input_list[list_selector_mid] < search_elem:
        return inverted_binary_search(
            input_list=input_list,
            search_elem=search_elem,
            list_selector_start=list_selector_start,
            list_selector_end=list_selector_mid,
        )
    else:
        return list_selector_mid


def inverted_binary_insertion_sort(your_list: List[Union[str, int]]):
    for idx in range(1, len(your_list)):
        selected_elem = your_list[idx]
        inverted_binary_search_position = (
            inverted_binary_search(
                input_list=your_list,
                search_elem=selected_elem,
                list_selector_start=0,
                list_selector_end=idx,
            )
            + 1
        )

        for idk in range(idx, inverted_binary_search_position, -1):
            your_list[idk] = your_list[idk - 1]

        your_list[inverted_binary_search_position] = selected_elem
    return your_list

Output

inverted_sorted_int_list = inverted_binary_insertion_sort(sorted_int_list)
print(inverted_sorted_int_list)
>> [1, 3, 3, 2, 1]

inverted_sorted_str_list = inverted_binary_insertion_sort(sorted_str_list)
print(inverted_sorted_str_list)
>> ['example.com', 'stackoverflow.com', 'stackoverflow.com', 'test.com', 'example.com']

Update:

Given the comments, you could also run the function twice. This will untangle duplicates.

inverted_sorted_int_list = inverted_binary_insertion_sort(
    inverted_binary_insertion_sort(sorted_int_list)
)
>> [1, 3, 2, 1, 3]

Solution 5 - Python

Here's a stab at it, but I am not sure it wouldn't degenerate a bit given particular input sets.

We pick the most frequent found item and append its first occurrence to a list. Then same with the 2nd most frequent and so on.

Repeat half the size of the most found item. That's the left half of the list.

Then moving from least frequent to most frequent, pick first item and add its values. When an item is found less than half the max, pick on which side you want to put it.

Essentially, we layer key by key and end up with more frequent items at left-most and right-most positions in the unsorted list, leaving less frequent ones in the middle.

def unsort(lst:list) -> list:
    """
    build a dictionary by frequency first
    then loop thru the keys and append 
    key by key with the other keys in between
    """

    result = []
    
    #dictionary by keys (this would be domains to urls)
    di = defaultdict(list)
    for v in lst:
        di[v].append(v)

    #sort by decreasing dupes length
    li_len = [(len(val),key) for key, val in di.items()]
    li_len.sort(reverse=True)

    #most found count
    max_c = li_len[0][0]
    #halfway point
    odd = max_c % 2
    num = max_c // 2
    if odd:
        num += 1

    #frequency, high to low
    by_freq = [tu[1] for tu in li_len]


    di_side = {}

    #for the first half, pick from frequent to infrequent
    #alternating by keys
    for c in range(num):

        #frequent to less
        for key in by_freq:

            entries = di[key]

            #first pass:  less than half the number of values 
            #we don't want to put all the infrequents first
            #and have a more packed second side
            if not c:
                #pick on way in or out?
                if len(entries) <= num:
                    #might be better to alternate L,R,L
                    di_side[key] = random.choice(["L","R"])
                else:
                    #pick on both
                    di_side[key] = "B"
                

            #put in the left half
            do_it = di_side[key] in ("L","B")
            
            if do_it and entries:
                result.append(entries.pop(0))

    #once you have mid point pick from infrequent to frequent
    for c in range(num):
        #frequent to less
        for key in reversed(by_freq):
            entries = di[key]

            #put in the right half 
            do_it = di_side[key] in ("R","B")
            
            if entries:
                result.append(entries.pop(0)) 

    return result

Running this I got:

([1, 1, 2, 3, 3], '2.00', '====>', [3, 1, 2, 1, 3], '3.41')
([1, 3, 2, 3, 1], '3.41', '====>', [3, 1, 2, 1, 3], '3.41')
([1, 2, 3, 0, 1, 2, 3], '6.00', '====>', [3, 2, 1, 0, 1, 2, 3], '5.86')
([3, 2, 1, 0, 1, 2, 3], '5.86', '====>', [3, 2, 1, 0, 1, 2, 3], '5.86')
([3, 2, 1, 3, 1, 2, 3], '6.88', '====>', [3, 2, 3, 2, 1, 3, 1], '5.97')
([1, 2, 3, 4, 5], '0.00', '====>', [5, 1, 2, 3, 4], '0.00')

Oh, and I also added an assert to check nothing had been dropped or altered by the unsorting:

assert(sorted(lst) == sorted(ulst))
alternate approach?

I'll leave it as a footnote for now, but the general idea of not clustering (not the OP's specific application of not overloading domains) seems like it would be a candidate for a force-repulsive approach, where identical domains would try to keep as far from each other as possible. i.e. 1, 1, 2 => 1, 2, 1 because the 1s would repel each other. That's a wholly different algorithmic approach however.

Solution 6 - Python

When I faced a similar problem, here's how I solved it:

  1. Define the "distance" between two strings (URLs in this case) as their Levenshtein distance (code to compute this value is readily available)
  2. Adopt your favorite travelling-salesman algorithm to find the (approximate) shortest path through your set of strings (finding the exact shortest path isn't computationally feasible but the approximate algorithms are fairly efficient)
  3. Now modify your "distance" metric to be inverted -- i.e. compute the distance between two strings (s1,s2) as MAX_INT - LevenshteinDistance(s1,s2)
  4. With this modification, the "shortest path" through your set is now really the longest path, i.e. the most un-sorted one.

Solution 7 - Python

I hope this algorithm works correctly:

unsorted_list = ['c', 'a', 'a', 'a', 'a', 'b', 'b']

d = {i: unsorted_list.count(i) for i in unsorted_list}
print(d)  # {'c': 1, 'a': 4, 'b': 2}

d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
print(d)  # {'a': 4, 'b': 2, 'c': 1}

result = [None] * len(unsorted_list)
border_index_left = 0
border_index_right = len(unsorted_list) - 1

it = iter(d)


def set_recursively(k, nk):
    set_borders(k)
    set_borders(nk)
    if d[k]:
        set_recursively(k, nk)


def set_borders(key):
    global border_index_left, border_index_right
    if key is not None and d[key]:
        result[border_index_left] = key
        d[key] = d[key] - 1
        border_index_left = border_index_left + 1
    if key is not None and d[key]:
        result[border_index_right] = key
        d[key] = d[key] - 1
        border_index_right = border_index_right - 1


next_element = next(it, None)
for k, v in d.items():
    next_element = next(it, None)
    set_recursively(k, next_element)

print(result)  # ['a', 'b', 'a', 'c', 'a', 'b', 'a']

Visually, it looks as walking from the edge to the middle:

[2, 3, 3, 3, 1, 1, 0]

[None, None, None, None, None, None, None]
[3, None, None, None, None, None, None]
[3, None, None, None, None, None, 3]
[3, 1, None, None, None, None, 3]
[3, 1, None, None, None, 1, 3]
[3, 1, 3, None, None, 1, 3]
[3, 1, 3, 2, None, 1, 3]
[3, 1, 3, 2, 0, 1, 3]

Solution 8 - Python

An easy way to scramble a list is to maximize its "sortness" score using a genetic algorithm with a permutation chromosome. I was able to hack quickly a version in R using the GA package. I'm not a Python user, but I am sure there are GA libraries for Python that include permutation chromosomes. If not, a general GA library with real-valued vector chromosomes can be adapted. You just use a vector with values in [0, 1] as a chromosome and convert each vector to its sort index.

Solution 9 - Python

Just saying, putting a short time delay would work just fine. I think someone mentioned this already. It is very simple and very reliable. You could do something like:

from random import sample
from time import sleep
import requests
intervalList = list(range(0.1, 0.5))
error404 = []
connectionError = []
for i in your_URL_list:
    ststusCode = req.get(str(i)).status_code
    if ststusCode == 404:
        error404.append(i)
    sleep(sample(intervalList,1))

Cheers

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionddofborgView Question on Stackoverflow
Solution 1 - PythongfacheView Answer on Stackoverflow
Solution 2 - Pythongold_cyView Answer on Stackoverflow
Solution 3 - PythonMatt TimmermansView Answer on Stackoverflow
Solution 4 - PythonHommesView Answer on Stackoverflow
Solution 5 - PythonJL PeyretView Answer on Stackoverflow
Solution 6 - PythonJeremy FriesnerView Answer on Stackoverflow
Solution 7 - PythonIgorZView Answer on Stackoverflow
Solution 8 - PythonprubinView Answer on Stackoverflow
Solution 9 - PythonKuba SobolewskiView Answer on Stackoverflow