Generating combinations of elements, which are themselves sequences of the same length, where the elements have no value in common at any position
<section class="notice">
This question is adapted from [a question I just helped fix up on Stack Overflow](https://stackoverflow.com/questions/79182496). I'm giving a different motivating example which I think will make it clearer what the requirements are and why a naive approach would take prohibitively long. (Also, I want to attempt an answer, but by personal policy I don't write new answers on Stack Overflow any more.)
</section>
I have a list of two-letter English words [from the Official Scrabble Players' Dictionary](https://scrabble.collinsdictionary.com/word-lists/two-letter-words-in-scrabble/), filtered down to those using only the 8 most common letters among those words (which happen to be all the vowels including 'Y', plus 'M' and 'H'):

```
two = [
'AA', 'AE', 'AH', 'AI', 'AM', 'AY', 'EA', 'EE', 'EH',
'EM', 'HA', 'HE', 'HI', 'HM', 'HO', 'IO', 'MA', 'ME',
'MI', 'MM', 'MO', 'MU', 'MY', 'OE', 'OH', 'OI', 'OM',
'OO', 'OU', 'OY', 'UH', 'UM', 'YA', 'YE', 'YO', 'YU'
]
```

(There are 36 words, out of 64 possible pairs of letters. I'm showing Python code here, but I'm also interested in language-agnostic discussion of the algorithm.)
More generally: I have a list of sequences that are all *n* elements long. (Here, the sequences are strings, the elements are English letters, and *n*=2.)

I want to find combinations of *k* sequences (say *k*=3 for example), restricted such that, *within each combination*, *for all 0 <= i < n*, the sequences in the combination have a distinct value at the *i*th position. So, for the two-letter words: groups of words that all have a different first letter and also all have a different second letter.

The alphabetically first such combination is `('AA', 'EE', 'HI')`. Note that, for example, the combination `('AA', 'EA', 'HA')` isn't valid because these words all have the same *second* letter `'A'`.
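In code, that constraint on a candidate group can be stated as a small predicate (a sketch of the requirement; the helper name `is_valid` is made up for illustration):

```python
def is_valid(group):
    """True iff the members of `group` all differ at every position."""
    n = len(group[0])  # members are assumed to all have the same length
    return all(
        len({seq[i] for seq in group}) == len(group)  # all distinct at position i
        for i in range(n)
    )

print(is_valid(('AA', 'EE', 'HI')))  # True
print(is_valid(('AA', 'EA', 'HA')))  # False: shared second letter 'A'
```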
My naive approach is to make all combinations of the words using the standard library `itertools.combinations` (which already rejects invalid "combinations" that use the *same word* more than once, and treats reorderings of the same words as duplicates), then filter out combinations in which two words share a first or a second letter:

```python
from itertools import combinations

word_triples = [
    (a, b, c) for a, b, c in combinations(two, 3)
    if a[0] != b[0] and a[0] != c[0] and b[0] != c[0]
    and a[1] != b[1] and a[1] != c[1] and b[1] != c[1]
]
```

This gives a correct result in about a tenth of a second on my machine.
A more generalized implementation:

```python
def brute_force(candidates, k):
    n = len(candidates[0])  # assume they're all the same length
    return [
        c for c in combinations(candidates, k)
        if all(
            all(a[i] != b[i] for a, b in combinations(c, 2))
            for i in range(n)
        )
    ]
```
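To make the counts below concrete, here is a standalone run of the same brute-force filter on the 36-word list (the definitions are repeated so the snippet runs on its own):

```python
from itertools import combinations

two = [
    'AA', 'AE', 'AH', 'AI', 'AM', 'AY', 'EA', 'EE', 'EH',
    'EM', 'HA', 'HE', 'HI', 'HM', 'HO', 'IO', 'MA', 'ME',
    'MI', 'MM', 'MO', 'MU', 'MY', 'OE', 'OH', 'OI', 'OM',
    'OO', 'OU', 'OY', 'UH', 'UM', 'YA', 'YE', 'YO', 'YU'
]

# The same check as brute_force(two, 3), spelled out for a self-contained run.
triples = [
    c for c in combinations(two, 3)
    if all(
        all(a[i] != b[i] for a, b in combinations(c, 2))
        for i in range(2)  # n = 2 positions per word
    )
]
print(len(triples))  # 3056
print(triples[0])    # ('AA', 'EE', 'HI')
```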
However, **this clearly won't scale**. For groups of 3 words it's simple enough: `itertools.combinations` generates merely 7140 possibilities, and a fair number of them (3056) are actually valid. But with more words in a combination, the number of unfiltered candidates grows combinatorially, while the desired result grows much more slowly, and eventually shrinks.

In the extreme: there are over 94 million combinations of 9 words from this list, but clearly *none* of them satisfy the condition, as there are only 8 possible letters for each position. This problem is small enough that it takes less than a minute on my machine to find all 127 combinations of 8 words - but that already seems excessive, and the problem becomes completely intractable with the full set of two-letter words, let alone with words of more than 2 letters.
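The blow-up in raw candidates is easy to quantify with the standard library's `math.comb`:

```python
import math

# Candidates that itertools.combinations generates for k words out of 36:
print(math.comb(36, 3))  # 7140
print(math.comb(36, 8))  # 30260340 -- about 30 million for the k = 8 run
print(math.comb(36, 9))  # 94143280 -- "over 94 million", none of them valid
```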
**How can I do this more efficiently and directly**? I'm looking for a *fully general* approach - it should be able to handle arbitrary values for *n* and *k*, and it should not be dependent on the type (or other properties) of the input sequences or their elements.
#4: Post edited
- <section class="notice">
- This question is adapted from [a question I just helped fix up on Stack Overflow](https://stackoverflow.com/questions/79182496). I'm giving a different motivating example which I think will make it clearer what the requirements are and why a naive approach would take prohibitively long. (Also, I want to attempt an answer, but by personal policy I don't write new answers on Stack Overflow any more.)
- </section>
- I have a list of two-letter English words [from the Official Scrabble Players' Dictionary](https://scrabble.collinsdictionary.com/word-lists/two-letter-words-in-scrabble/):
- ```
- two = [
- 'AA', 'AB', 'AD', 'AE', 'AG', 'AH', 'AI', 'AL', 'AM', 'AN', 'AR', 'AS',
- 'AT', 'AW', 'AX', 'AY', 'BA', 'BE', 'BI', 'BO', 'BY', 'CH', 'DA', 'DE',
- 'DI', 'DO', 'EA', 'ED', 'EE', 'EF', 'EH', 'EL', 'EM', 'EN', 'ER', 'ES',
- 'ET', 'EW', 'EX', 'FA', 'FE', 'FY', 'GI', 'GO', 'GU', 'HA', 'HE', 'HI',
- 'HM', 'HO', 'ID', 'IF', 'IN', 'IO', 'IS', 'IT', 'JA', 'JO', 'KA', 'KI',
- 'KO', 'KY', 'LA', 'LI', 'LO', 'MA', 'ME', 'MI', 'MM', 'MO', 'MU', 'MY',
- 'NA', 'NE', 'NO', 'NU', 'NY', 'OB', 'OD', 'OE', 'OF', 'OH', 'OI', 'OK',
- 'OM', 'ON', 'OO', 'OP', 'OR', 'OS', 'OU', 'OW', 'OX', 'OY', 'PA', 'PE',
- 'PI', 'PO', 'QI', 'RE', 'SH', 'SI', 'SO', 'ST', 'TA', 'TE', 'TI', 'TO',
- 'UG', 'UH', 'UM', 'UN', 'UP', 'UR', 'US', 'UT', 'WE', 'WO', 'XI', 'XU',
- 'YA', 'YE', 'YO', 'YU', 'ZA', 'ZE', 'ZO'
- ]
- ```
- (There are 127 words, out of 676 possible pairs of letters. I'm showing Python code here, but I'm also interested in language-agnostic discussion of the algorithm.)
More generally: I have a list of sequences that are all *n* elements long. (Here, the sequences are strings, the elements are English letters, and n=2.I want to find combinations of *k* sequences (say k=3 for example), restricted such that, *within each combination*, *for all 0 <= i < n*, the sequences in the combination have a distinct value at the *i*th position. So, for the two-letter words: groups of words that all have a different first letter and all also have a different second letter.The alphabetically first such combination is `['AA', 'BE', 'CH']`. Note that the combination `['AA', 'BA', 'CH']` isn't valid because the first two words have the same *second* letter `'A'`.- My naive approach is to make all combinations of the words using the standard library `itertools.combinations` (which already rejects invalid "combinations" using the *same word* more than once, and takes order into account), then filter out combinations with a common first word:
```- from itertools import combinations
- word_triples = [
- (a, b, c) for a, b, c in combinations(two, 3)
- if a[0] != b[0] and a[0] != c[0] and b[0] != c[0]
- and a[1] != b[1] and a[1] != c[1] and b[1] != c[1]
- ]
- ```
- This gives a correct result in about a tenth of a second on my machine.
- However, there are some problems:
- 1. The filtering logic would be hard to generalize.
- 1. **This clearly won't scale**. For groups of 3 words it's simple enough: `itertools.combinations` generates `333375` possibilities, and most of them are actually valid - `len(word_triples)` reports `217503`. But with more words in a combination, the unfiltered input will scale exponentially, while the desired result will grow much more slowly, and even shrink. (In the extreme: there are about 3 * 10<sup>27</sup> possible combinations of 27 words from this list, but clearly *none* of them satisfy the condition as there are only 26 possible letters for each position.)
- I did test this for groups of 4 words, writing out all the conditions manually - and as expected, it's dozens of times slower. It has to do twice as many filtering comparisons, while generating 31 times as many candidates. (A more generalized approach using set comprehensions to check for distinct letters is actually even slower, though that presumably reverses at some point.)
- **How can I do this more efficiently and directly**? I'm looking for a *fully general* approach - it should be able to handle arbitrary values for *n* and *k*, and it should not be dependent on the type (or other properties) of the input sequences or their elements.
- <section class="notice">
- This question is adapted from [a question I just helped fix up on Stack Overflow](https://stackoverflow.com/questions/79182496). I'm giving a different motivating example which I think will make it clearer what the requirements are and why a naive approach would take prohibitively long. (Also, I want to attempt an answer, but by personal policy I don't write new answers on Stack Overflow any more.)
- </section>
- I have a list of two-letter English words [from the Official Scrabble Players' Dictionary](https://scrabble.collinsdictionary.com/word-lists/two-letter-words-in-scrabble/):
- ```
- two = [
- 'AA', 'AB', 'AD', 'AE', 'AG', 'AH', 'AI', 'AL', 'AM', 'AN', 'AR', 'AS',
- 'AT', 'AW', 'AX', 'AY', 'BA', 'BE', 'BI', 'BO', 'BY', 'CH', 'DA', 'DE',
- 'DI', 'DO', 'EA', 'ED', 'EE', 'EF', 'EH', 'EL', 'EM', 'EN', 'ER', 'ES',
- 'ET', 'EW', 'EX', 'FA', 'FE', 'FY', 'GI', 'GO', 'GU', 'HA', 'HE', 'HI',
- 'HM', 'HO', 'ID', 'IF', 'IN', 'IO', 'IS', 'IT', 'JA', 'JO', 'KA', 'KI',
- 'KO', 'KY', 'LA', 'LI', 'LO', 'MA', 'ME', 'MI', 'MM', 'MO', 'MU', 'MY',
- 'NA', 'NE', 'NO', 'NU', 'NY', 'OB', 'OD', 'OE', 'OF', 'OH', 'OI', 'OK',
- 'OM', 'ON', 'OO', 'OP', 'OR', 'OS', 'OU', 'OW', 'OX', 'OY', 'PA', 'PE',
- 'PI', 'PO', 'QI', 'RE', 'SH', 'SI', 'SO', 'ST', 'TA', 'TE', 'TI', 'TO',
- 'UG', 'UH', 'UM', 'UN', 'UP', 'UR', 'US', 'UT', 'WE', 'WO', 'XI', 'XU',
- 'YA', 'YE', 'YO', 'YU', 'ZA', 'ZE', 'ZO'
- ]
- ```
- (There are 127 words, out of 676 possible pairs of letters. I'm showing Python code here, but I'm also interested in language-agnostic discussion of the algorithm.)
- More generally: I have a list of sequences that are all *n* elements long. (Here, the sequences are strings, the elements are English letters, and *n*=2.
- I want to find combinations of *k* sequences (say *k*=3 for example), restricted such that, *within each combination*, *for all 0 <= i < n*, the sequences in the combination have a distinct value at the *i*th position. So, for the two-letter words: groups of words that all have a different first letter and all also have a different second letter.
- The alphabetically first such combination is `('AA', 'BE', 'CH')`. Note that the combination `('AA', 'BA', 'CH')` isn't valid because the first two words have the same *second* letter `'A'`.
- My naive approach is to make all combinations of the words using the standard library `itertools.combinations` (which already rejects invalid "combinations" using the *same word* more than once, and takes order into account), then filter out combinations with a common first word:
- ```python
- from itertools import combinations
- word_triples = [
- (a, b, c) for a, b, c in combinations(two, 3)
- if a[0] != b[0] and a[0] != c[0] and b[0] != c[0]
- and a[1] != b[1] and a[1] != c[1] and b[1] != c[1]
- ]
- ```
- This gives a correct result in about a tenth of a second on my machine.
- However, there are some problems:
- 1. The filtering logic would be hard to generalize.
- 1. **This clearly won't scale**. For groups of 3 words it's simple enough: `itertools.combinations` generates `333375` possibilities, and most of them are actually valid - `len(word_triples)` reports `217503`. But with more words in a combination, the unfiltered input will scale exponentially, while the desired result will grow much more slowly, and even shrink. (In the extreme: there are about 3 * 10<sup>27</sup> possible combinations of 27 words from this list, but clearly *none* of them satisfy the condition as there are only 26 possible letters for each position.)
- I did test this for groups of 4 words, writing out all the conditions manually - and as expected, it's dozens of times slower. It has to do twice as many filtering comparisons, while generating 31 times as many candidates. (A more generalized approach using set comprehensions to check for distinct letters is actually even slower, though that presumably reverses at some point.)
- **How can I do this more efficiently and directly**? I'm looking for a *fully general* approach - it should be able to handle arbitrary values for *n* and *k*, and it should not be dependent on the type (or other properties) of the input sequences or their elements.
#3: Post edited
- <section class="notice">
- This question is adapted from [a question I just helped fix up on Stack Overflow](https://stackoverflow.com/questions/79182496). I'm giving a different motivating example which I think will make it clearer what the requirements are and why a naive approach would take prohibitively long. (Also, I want to attempt an answer, but by personal policy I don't write new answers on Stack Overflow any more.)
- </section>
- I have a list of two-letter English words [from the Official Scrabble Players' Dictionary](https://scrabble.collinsdictionary.com/word-lists/two-letter-words-in-scrabble/):
- ```
- two = [
- 'AA', 'AB', 'AD', 'AE', 'AG', 'AH', 'AI', 'AL', 'AM', 'AN', 'AR', 'AS',
- 'AT', 'AW', 'AX', 'AY', 'BA', 'BE', 'BI', 'BO', 'BY', 'CH', 'DA', 'DE',
- 'DI', 'DO', 'EA', 'ED', 'EE', 'EF', 'EH', 'EL', 'EM', 'EN', 'ER', 'ES',
- 'ET', 'EW', 'EX', 'FA', 'FE', 'FY', 'GI', 'GO', 'GU', 'HA', 'HE', 'HI',
- 'HM', 'HO', 'ID', 'IF', 'IN', 'IO', 'IS', 'IT', 'JA', 'JO', 'KA', 'KI',
- 'KO', 'KY', 'LA', 'LI', 'LO', 'MA', 'ME', 'MI', 'MM', 'MO', 'MU', 'MY',
- 'NA', 'NE', 'NO', 'NU', 'NY', 'OB', 'OD', 'OE', 'OF', 'OH', 'OI', 'OK',
- 'OM', 'ON', 'OO', 'OP', 'OR', 'OS', 'OU', 'OW', 'OX', 'OY', 'PA', 'PE',
- 'PI', 'PO', 'QI', 'RE', 'SH', 'SI', 'SO', 'ST', 'TA', 'TE', 'TI', 'TO',
- 'UG', 'UH', 'UM', 'UN', 'UP', 'UR', 'US', 'UT', 'WE', 'WO', 'XI', 'XU',
- 'YA', 'YE', 'YO', 'YU', 'ZA', 'ZE', 'ZO'
- ]
- ```
- (There are 127 words, out of 676 possible pairs of letters. I'm showing Python code here, but I'm also interested in language-agnostic discussion of the algorithm.)
I want to find *combinations of k words* (say k=3 for example), restricted such that, *within each combination*, all words have a distinct first letter *and* a distinct second letter. More generally, the length of the words isn't an essential element of the problem - sub-elements of the elements chosen should be unique at *each* position.- The alphabetically first such combination is `['AA', 'BE', 'CH']`. Note that the combination `['AA', 'BA', 'CH']` isn't valid because the first two words have the same *second* letter `'A'`.
- My naive approach is to make all combinations of the words using the standard library `itertools.combinations` (which already rejects invalid "combinations" using the *same word* more than once, and takes order into account), then filter out combinations with a common first word:
- ```
- from itertools import combinations
- word_triples = [
- (a, b, c) for a, b, c in combinations(two, 3)
- if a[0] != b[0] and a[0] != c[0] and b[0] != c[0]
- and a[1] != b[1] and a[1] != c[1] and b[1] != c[1]
- ]
- ```
- This gives a correct result in about a tenth of a second on my machine.
- However, there are some problems:
- 1. The filtering logic would be hard to generalize.
- 1. **This clearly won't scale**. For groups of 3 words it's simple enough: `itertools.combinations` generates `333375` possibilities, and most of them are actually valid - `len(word_triples)` reports `217503`. But with more words in a combination, the unfiltered input will scale exponentially, while the desired result will grow much more slowly, and even shrink. (In the extreme: there are about 3 * 10<sup>27</sup> possible combinations of 27 words from this list, but clearly *none* of them satisfy the condition as there are only 26 possible letters for each position.)
- I did test this for groups of 4 words, writing out all the conditions manually - and as expected, it's dozens of times slower. It has to do twice as many filtering comparisons, while generating 31 times as many candidates. (A more generalized approach using set comprehensions to check for distinct letters is actually even slower, though that presumably reverses at some point.)
**How can I do this more efficiently and directly**? I'm looking for a *fully general* approach - it should be able to handle:* Inputs where the elements to combine are longer (as long as they're all the same length)* Inputs which are arbitrary sequences, rather than just strings of English letters* An arbitrary value for `k`
- <section class="notice">
- This question is adapted from [a question I just helped fix up on Stack Overflow](https://stackoverflow.com/questions/79182496). I'm giving a different motivating example which I think will make it clearer what the requirements are and why a naive approach would take prohibitively long. (Also, I want to attempt an answer, but by personal policy I don't write new answers on Stack Overflow any more.)
- </section>
- I have a list of two-letter English words [from the Official Scrabble Players' Dictionary](https://scrabble.collinsdictionary.com/word-lists/two-letter-words-in-scrabble/):
- ```
- two = [
- 'AA', 'AB', 'AD', 'AE', 'AG', 'AH', 'AI', 'AL', 'AM', 'AN', 'AR', 'AS',
- 'AT', 'AW', 'AX', 'AY', 'BA', 'BE', 'BI', 'BO', 'BY', 'CH', 'DA', 'DE',
- 'DI', 'DO', 'EA', 'ED', 'EE', 'EF', 'EH', 'EL', 'EM', 'EN', 'ER', 'ES',
- 'ET', 'EW', 'EX', 'FA', 'FE', 'FY', 'GI', 'GO', 'GU', 'HA', 'HE', 'HI',
- 'HM', 'HO', 'ID', 'IF', 'IN', 'IO', 'IS', 'IT', 'JA', 'JO', 'KA', 'KI',
- 'KO', 'KY', 'LA', 'LI', 'LO', 'MA', 'ME', 'MI', 'MM', 'MO', 'MU', 'MY',
- 'NA', 'NE', 'NO', 'NU', 'NY', 'OB', 'OD', 'OE', 'OF', 'OH', 'OI', 'OK',
- 'OM', 'ON', 'OO', 'OP', 'OR', 'OS', 'OU', 'OW', 'OX', 'OY', 'PA', 'PE',
- 'PI', 'PO', 'QI', 'RE', 'SH', 'SI', 'SO', 'ST', 'TA', 'TE', 'TI', 'TO',
- 'UG', 'UH', 'UM', 'UN', 'UP', 'UR', 'US', 'UT', 'WE', 'WO', 'XI', 'XU',
- 'YA', 'YE', 'YO', 'YU', 'ZA', 'ZE', 'ZO'
- ]
- ```
- (There are 127 words, out of 676 possible pairs of letters. I'm showing Python code here, but I'm also interested in language-agnostic discussion of the algorithm.)
- More generally: I have a list of sequences that are all *n* elements long. (Here, the sequences are strings, the elements are English letters, and n=2.
- I want to find combinations of *k* sequences (say k=3 for example), restricted such that, *within each combination*, *for all 0 <= i < n*, the sequences in the combination have a distinct value at the *i*th position. So, for the two-letter words: groups of words that all have a different first letter and all also have a different second letter.
- The alphabetically first such combination is `['AA', 'BE', 'CH']`. Note that the combination `['AA', 'BA', 'CH']` isn't valid because the first two words have the same *second* letter `'A'`.
- My naive approach is to make all combinations of the words using the standard library `itertools.combinations` (which already rejects invalid "combinations" using the *same word* more than once, and takes order into account), then filter out combinations with a common first word:
- ```
- from itertools import combinations
- word_triples = [
- (a, b, c) for a, b, c in combinations(two, 3)
- if a[0] != b[0] and a[0] != c[0] and b[0] != c[0]
- and a[1] != b[1] and a[1] != c[1] and b[1] != c[1]
- ]
- ```
- This gives a correct result in about a tenth of a second on my machine.
- However, there are some problems:
- 1. The filtering logic would be hard to generalize.
- 1. **This clearly won't scale**. For groups of 3 words it's simple enough: `itertools.combinations` generates `333375` possibilities, and most of them are actually valid - `len(word_triples)` reports `217503`. But with more words in a combination, the unfiltered input will scale exponentially, while the desired result will grow much more slowly, and even shrink. (In the extreme: there are about 3 * 10<sup>27</sup> possible combinations of 27 words from this list, but clearly *none* of them satisfy the condition as there are only 26 possible letters for each position.)
- I did test this for groups of 4 words, writing out all the conditions manually - and as expected, it's dozens of times slower. It has to do twice as many filtering comparisons, while generating 31 times as many candidates. (A more generalized approach using set comprehensions to check for distinct letters is actually even slower, though that presumably reverses at some point.)
- **How can I do this more efficiently and directly**? I'm looking for a *fully general* approach - it should be able to handle arbitrary values for *n* and *k*, and it should not be dependent on the type (or other properties) of the input sequences or their elements.
#2: Post edited
- <section class="notice">
- This question is adapted from [a question I just helped fix up on Stack Overflow](https://stackoverflow.com/questions/79182496). I'm giving a different motivating example which I think will make it clearer what the requirements are and why a naive approach would take prohibitively long. (Also, I want to attempt an answer, but by personal policy I don't write new answers on Stack Overflow any more.)
- </section>
- I have a list of two-letter English words [from the Official Scrabble Players' Dictionary](https://scrabble.collinsdictionary.com/word-lists/two-letter-words-in-scrabble/):
- ```
- two = [
- 'AA', 'AB', 'AD', 'AE', 'AG', 'AH', 'AI', 'AL', 'AM', 'AN', 'AR', 'AS',
- 'AT', 'AW', 'AX', 'AY', 'BA', 'BE', 'BI', 'BO', 'BY', 'CH', 'DA', 'DE',
- 'DI', 'DO', 'EA', 'ED', 'EE', 'EF', 'EH', 'EL', 'EM', 'EN', 'ER', 'ES',
- 'ET', 'EW', 'EX', 'FA', 'FE', 'FY', 'GI', 'GO', 'GU', 'HA', 'HE', 'HI',
- 'HM', 'HO', 'ID', 'IF', 'IN', 'IO', 'IS', 'IT', 'JA', 'JO', 'KA', 'KI',
- 'KO', 'KY', 'LA', 'LI', 'LO', 'MA', 'ME', 'MI', 'MM', 'MO', 'MU', 'MY',
- 'NA', 'NE', 'NO', 'NU', 'NY', 'OB', 'OD', 'OE', 'OF', 'OH', 'OI', 'OK',
- 'OM', 'ON', 'OO', 'OP', 'OR', 'OS', 'OU', 'OW', 'OX', 'OY', 'PA', 'PE',
- 'PI', 'PO', 'QI', 'RE', 'SH', 'SI', 'SO', 'ST', 'TA', 'TE', 'TI', 'TO',
- 'UG', 'UH', 'UM', 'UN', 'UP', 'UR', 'US', 'UT', 'WE', 'WO', 'XI', 'XU',
- 'YA', 'YE', 'YO', 'YU', 'ZA', 'ZE', 'ZO'
- ]
- ```
- (There are 127 words, out of 676 possible pairs of letters. I'm showing Python code here, but I'm also interested in language-agnostic discussion of the algorithm.)
I want to find *combinations of k words* (say k=3 for example), restricted such that, *within each combination*, all words have a distinct first letter *and* a distinct second letter.- The alphabetically first such combination is `['AA', 'BE', 'CH']`. Note that the combination `['AA', 'BA', 'CH']` isn't valid because the first two words have the same *second* letter `'A'`.
- My naive approach is to make all combinations of the words using the standard library `itertools.combinations` (which already rejects invalid "combinations" using the *same word* more than once, and takes order into account), then filter out combinations with a common first word:
- ```
- from itertools import combinations
- word_triples = [
- (a, b, c) for a, b, c in combinations(two, 3)
- if a[0] != b[0] and a[0] != c[0] and b[0] != c[0]
- and a[1] != b[1] and a[1] != c[1] and b[1] != c[1]
- ]
- ```
- This gives a correct result in about a tenth of a second on my machine.
- However, there are some problems:
- 1. The filtering logic would be hard to generalize.
- 1. **This clearly won't scale**. For groups of 3 words it's simple enough: `itertools.combinations` generates `333375` possibilities, and most of them are actually valid - `len(word_triples)` reports `217503`. But with more words in a combination, the unfiltered input will scale exponentially, while the desired result will grow much more slowly, and even shrink. (In the extreme: there are about 3 * 10<sup>27</sup> possible combinations of 27 words from this list, but clearly *none* of them satisfy the condition as there are only 26 possible letters for each position.)
- I did test this for groups of 4 words, writing out all the conditions manually - and as expected, it's dozens of times slower. It has to do twice as many filtering comparisons, while generating 31 times as many candidates. (A more generalized approach using set comprehensions to check for distinct letters is actually even slower, though that presumably reverses at some point.)
- **How can I do this more efficiently and directly**? I'm looking for a *fully general* approach - it should be able to handle:
- * Inputs where the elements to combine are longer (as long as they're all the same length)
- * Inputs which are arbitrary sequences, rather than just strings of English letters
- * An arbitrary value for `k`
- <section class="notice">
- This question is adapted from [a question I just helped fix up on Stack Overflow](https://stackoverflow.com/questions/79182496). I'm giving a different motivating example which I think will make it clearer what the requirements are and why a naive approach would take prohibitively long. (Also, I want to attempt an answer, but by personal policy I don't write new answers on Stack Overflow any more.)
- </section>
- I have a list of two-letter English words [from the Official Scrabble Players' Dictionary](https://scrabble.collinsdictionary.com/word-lists/two-letter-words-in-scrabble/):
- ```
- two = [
- 'AA', 'AB', 'AD', 'AE', 'AG', 'AH', 'AI', 'AL', 'AM', 'AN', 'AR', 'AS',
- 'AT', 'AW', 'AX', 'AY', 'BA', 'BE', 'BI', 'BO', 'BY', 'CH', 'DA', 'DE',
- 'DI', 'DO', 'EA', 'ED', 'EE', 'EF', 'EH', 'EL', 'EM', 'EN', 'ER', 'ES',
- 'ET', 'EW', 'EX', 'FA', 'FE', 'FY', 'GI', 'GO', 'GU', 'HA', 'HE', 'HI',
- 'HM', 'HO', 'ID', 'IF', 'IN', 'IO', 'IS', 'IT', 'JA', 'JO', 'KA', 'KI',
- 'KO', 'KY', 'LA', 'LI', 'LO', 'MA', 'ME', 'MI', 'MM', 'MO', 'MU', 'MY',
- 'NA', 'NE', 'NO', 'NU', 'NY', 'OB', 'OD', 'OE', 'OF', 'OH', 'OI', 'OK',
- 'OM', 'ON', 'OO', 'OP', 'OR', 'OS', 'OU', 'OW', 'OX', 'OY', 'PA', 'PE',
- 'PI', 'PO', 'QI', 'RE', 'SH', 'SI', 'SO', 'ST', 'TA', 'TE', 'TI', 'TO',
- 'UG', 'UH', 'UM', 'UN', 'UP', 'UR', 'US', 'UT', 'WE', 'WO', 'XI', 'XU',
- 'YA', 'YE', 'YO', 'YU', 'ZA', 'ZE', 'ZO'
- ]
- ```
- (There are 127 words, out of 676 possible pairs of letters. I'm showing Python code here, but I'm also interested in language-agnostic discussion of the algorithm.)
- I want to find *combinations of k words* (say k=3 for example), restricted such that, *within each combination*, all words have a distinct first letter *and* a distinct second letter. More generally, the length of the words isn't an essential element of the problem - sub-elements of the elements chosen should be unique at *each* position.
- The alphabetically first such combination is `['AA', 'BE', 'CH']`. Note that the combination `['AA', 'BA', 'CH']` isn't valid because the first two words have the same *second* letter `'A'`.
- My naive approach is to make all combinations of the words using the standard library `itertools.combinations` (which already rejects invalid "combinations" using the *same word* more than once, and takes order into account), then filter out combinations with a common first word:
- ```
- from itertools import combinations
- word_triples = [
- (a, b, c) for a, b, c in combinations(two, 3)
- if a[0] != b[0] and a[0] != c[0] and b[0] != c[0]
- and a[1] != b[1] and a[1] != c[1] and b[1] != c[1]
- ]
- ```
- This gives a correct result in about a tenth of a second on my machine.
- However, there are some problems:
- 1. The filtering logic would be hard to generalize.
- 1. **This clearly won't scale**. For groups of 3 words it's simple enough: `itertools.combinations` generates `333375` possibilities, and most of them are actually valid - `len(word_triples)` reports `217503`. But with more words in a combination, the unfiltered input will scale exponentially, while the desired result will grow much more slowly, and even shrink. (In the extreme: there are about 3 * 10<sup>27</sup> possible combinations of 27 words from this list, but clearly *none* of them satisfy the condition as there are only 26 possible letters for each position.)
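The numbers here are easy to check with `math.comb` from the standard library:

```python
from math import comb

print(comb(127, 3))                  # 333375 unfiltered candidates for k=3
print(comb(127, 4) // comb(127, 3))  # 31 - the growth factor from k=3 to k=4
print(f'{comb(127, 27):.1e}')        # about 3e+27 candidates for k=27
```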
- I did test this for groups of 4 words, writing out all the conditions manually - and as expected, it's dozens of times slower. It has to do twice as many filtering comparisons, while generating 31 times as many candidates. (A more generalized approach using set comprehensions to check for distinct letters is actually even slower, though that presumably reverses at some point.)
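For reference, the more generalized (but slower) filtering approach mentioned above looks roughly like this - `filtered_combinations` is my name for it, and it still examines every candidate that `itertools.combinations` produces:

```python
from itertools import combinations

def filtered_combinations(words, k):
    # Generate-and-filter: keep only combinations whose values are
    # pairwise distinct at every position. Correct and fully general,
    # but it still pays for comb(len(words), k) candidates up front.
    n = len(words[0])
    return [
        combo for combo in combinations(words, k)
        if all(len({w[i] for w in combo}) == k for i in range(n))
    ]

print(filtered_combinations(['AA', 'BA', 'BE', 'CH'], 2))
# [('AA', 'BE'), ('AA', 'CH'), ('BA', 'CH'), ('BE', 'CH')]
```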
- **How can I do this more efficiently and directly**? I'm looking for a *fully general* approach - it should be able to handle:
- * Inputs where the elements to combine are longer (as long as they're all the same length)
- * Inputs which are arbitrary sequences, rather than just strings of English letters
- * An arbitrary value for `k`
#1: Initial revision
Generating combinations of elements, which are themselves sequences of the same length, where the elements have no value in common at any position