Welcome to Software Development on Codidact!
Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.
Generating combinations of elements, which are themselves sequences of the same length, where the elements have no value in common at any position
This question is adapted from a question I just helped fix up on Stack Overflow. I'm giving a different motivating example which I think will make it clearer what the requirements are and why a naive approach would take prohibitively long. (Also, I want to attempt an answer, but by personal policy I don't write new answers on Stack Overflow any more.)
I have a list of two-letter English words from the Official Scrabble Players' Dictionary, filtered down to those using only the 8 most common letters among those words (which happen to be all vowels including 'Y', plus 'M' and 'H'):
two = [
'AA', 'AE', 'AH', 'AI', 'AM', 'AY', 'EA', 'EE', 'EH',
'EM', 'HA', 'HE', 'HI', 'HM', 'HO', 'IO', 'MA', 'ME',
'MI', 'MM', 'MO', 'MU', 'MY', 'OE', 'OH', 'OI', 'OM',
'OO', 'OU', 'OY', 'UH', 'UM', 'YA', 'YE', 'YO', 'YU'
]
(There are 36 words, out of 64 possible pairs of letters. I'm showing Python code here, but I'm also interested in language-agnostic discussion of the algorithm.)
More generally: I have a list of sequences that are all n elements long. (Here, the sequences are strings, the elements are English letters, and n=2.
I want to find combinations of k sequences (say k=3 for example), restricted such that, within each combination, for all 0 <= i < n, the sequences in the combination have a distinct value at the ith position. So, for the two-letter words: groups of words that all have a different first letter and all also have a different second letter.
The alphabetically first such combination is ('AA', 'EE', 'HI')
. Note that e.g. the combination ('AA', 'EA', 'HA')
isn't valid because these words have the same second letter 'A'
.
My naive approach is to make all combinations of the words using the standard library itertools.combinations
(which already rejects invalid "combinations" using the same word more than once, and takes order into account), then filter out combinations with a common first word:
from itertools import combinations
word_triples = [
(a, b, c) for a, b, c in combinations(two, 3)
if a[0] != b[0] and a[0] != c[0] and b[0] != c[0]
and a[1] != b[1] and a[1] != c[1] and b[1] != c[1]
]
This gives a correct result in about a tenth of a second on my machine.
A more generalized implementation:
def brute_force(candidates, k):
n = len(candidates[0]) # assume they're all the same length
return [
c for c in combinations(candidates, k)
if all(
all(a[i] != b[i] for a, b in combinations(c, 2))
for i in range(n)
)
]
However, this clearly won't scale. For groups of 3 words it's simple enough: itertools.combinations
generates merely 7140 possibilities, and a fair amount of them (3056) are actually valid. But with more words in a combination, the unfiltered input will scale exponentially, while the desired result will grow much more slowly, and even shrink.
In the extreme: there are over 94 million combinations of 9 words from this list, but clearly none of them satisfy the condition as there are only 8 possible letters for each position. This problem is small enough that it takes less than a minute on my machine to find all 127 combinations of 8 words - but that already seems excessive, and the problem is obviously completely intractable if we were to use the full set of words, let alone words of more than 2 letters.
How can I do this more efficiently and directly? I'm looking for a fully general approach - it should be able to handle arbitrary values for n and k, and it should not be dependent on the type (or other properties) of the input sequences or their elements.
2 answers
Using a recursive generator
There isn't anything like a built-in solution for this algorithm, but it's easy to express the desired output recursively. A valid output combination of k-many inputs looks like an arbitrarily selected first element, followed by a combination of k-1-many elements that:
-
appear after the first element within the original input sequence (to distinguish combinations from permutations)
-
don't share a common ith value, for
0 <= i < n
Thus, we can filter the input sequence according to those criteria, then use recursion to get the possibilities for the remaining elements, and prepend our chosen first element. By iterating over possible first elements within the recursion, we get every desired result.
Accumulating a list of the results would be annoying with recursion, but in Python we can write a recursive generator to avoid the problem. We iterate over possibilities yield
ed from a recursive call to prepend a first element, and then iterate over possible first elements.
To clarify the logic, we can use a helper function to do the filtering.
The resulting implementation:
def no_common_elements(candidates, word):
return [
c for c in candidates
if all(cl != wl for cl, wl in zip(c, word))
]
def distinct_word_combinations(candidates, k):
if len(candidates) < k:
return # avoid doing tons of recursion through hopeless cases
if k == 0: # base case
yield ()
return
for i, first in enumerate(candidates):
new_candidates = no_common_elements(candidates[i+1:], first)
for result in distinct_word_combinations(new_candidates, k-1):
yield (first,) + result
This gives up the performance benefits that itertools.combinations
can offer by iterating and/or recursing at the C level, but it prunes branches ahead of time so that everything is generated directly.
In my testing, this is slower for short combinations, but (as expected) much faster for the hard cases:
>>> timeit.timeit("brute_force(two, 3)", globals=globals(), number=1)
0.011705408978741616
>>> timeit.timeit("list(distinct_word_combinations(two, 3))", globals=globals(), number=1)
0.015882118023000658
>>> timeit.timeit("brute_force(two, 8)", globals=globals(), number=1)
51.998020388011355
>>> timeit.timeit("list(distinct_word_combinations(two, 8))", globals=globals(), number=1)
0.048423305968753994
When given a larger input set, however, this approach gets bogged down very easily. With a full set of two-letter words, the brute force approach is about 6 times as fast, and the original version that's specific to n and k values is much faster still. Even with n=4, the recursive generator still loses handily.
This approach also, unfortunately, can't recognize the case where k exceeds the number of possibilities for one of the elements. With the restricted alphabet, this isn't a problem; but while a human can instantly tell that there are no 27-word combinations meeting the condition among the full set of 127 candidate words, this algorithm will still take impossibly long to reach the same conclusion.
0 comment threads
Using filtered products of candidates, over combinations of first letters
Let's consider the n=2 case first. (For higher n, we can use the same sort of recursive generator approach as in my other answer, but recursing on n instead of k.)
Partition the input words according to the first letter; we can take at most one word from each. If we generate combinations of k of these groups, then for each such combination, we have a list of possibilities for the first, second etc. word in a valid output.
So, we can generate the Cartesian product of such a combination in order to get a bunch of candidates (where the first letter is guaranteed distinct), and then filter them to ensure the second letter is distinct as well.
Thus:
from itertools import combinations, product
def distinct_word_combinations(words, k):
by_first_letter = {}
for word in words:
by_first_letter.setdefault(word[0], []).append(word)
for first_letters in combinations(by_first_letter.keys(), k):
word_lists = [by_first_letter[l] for l in first_letters]
for candidate in product(*word_lists):
if len({word[1] for word in candidate}) == len(candidate):
yield candidate
Also notice the improved test for distinct letters in a given position: instead of doing O(N^2) comparisons, we build a set
and check its length, which is O(N).
This doesn't do as good of a job of avoiding generating useless candidates; but in many ways it's the best of both worlds. Even for small k it's much faster:
>>> timeit.timeit("list(distinct_word_combinations(two, 3))", globals=globals(), number=1)
0.0022177270147949457
and it still does enough pruning to beat the recursive generator approach on this input for 8-word groups:
>>> timeit.timeit("list(distinct_word_combinations(two, 8))", globals=globals(), number=1)
0.030769726959988475
We can easily calculate how many combinations it considers for filtering:
from itertools import combinations
from math import prod
def candidate_count(words, k):
result = 0
by_first_letter = {}
for word in words:
by_first_letter[word[0]] = by_first_letter.get(word[0], 0) + 1
for first_letters in combinations(by_first_letter.keys(), k):
result += prod([by_first_letter[l] for l in first_letters])
return result
We find that, for example, 23437 candidates are generated for k=4 (of which 9698 are valid), compared to 58905 for the brute force approach.
We can also comfortably handle egregious cases where the number of words requested is impossibly high (not enough symbols are used in the first position) - since combinations(by_first_letter.keys(), k)
will promptly generate an empty sequence when k > len(by_first_letter.keys())
.
Of course, the choice of the "key" position to use is arbitrary - for the n=2 case we could equally well group the input according to the second letter, generate candidates, and then make sure the first letter is distinct.
0 comment threads