How to group a flat list of attributes into nested lists?

+2
−2

I have a flat list where each item is a string holding an attribute name followed by its value. I want to transform this into a nested list where all items belonging to the same attribute are grouped into one sublist.

Example input:

[
  "attr1 apple 1",
  "attr1 banana 2",
  "attr2 grapes 1",
  "attr2 oranges 2",
  "attr3 watermelon 0"
]

The output should be:

[
  [
    "attr1 apple 1",
    "attr1 banana 2",
  ],
  [
    "attr2 grapes 1",
    "attr2 oranges 2",
  ],
  [
    "attr3 watermelon 0"
  ]
]

I tried this program, but the result is incorrect.

import re

# regex pattern definition
pattern = re.compile(r'attr\d+')

# Open the file for reading
with open(r"file path") as file:
    # Initialize an empty list to store matching lines
    matching_lines = []

    # reading each line 
    for line in file:
        # regex pattern match
        if pattern.search(line):
            # matching line append to the list
            matching_lines.append(line.strip())

# Grouping the  elements based on the regex pattern

#The required list
grouped_elements = []

#Temporary list for sublist grouping
current_group = []

for sentence in matching_lines:
    if pattern.search(sentence):
        current_group.append(sentence)
    else:
        if current_group:
            grouped_elements.append(current_group)
        current_group = [sentence]

if current_group:
    grouped_elements.append(current_group)

# Print the grouped elements
for group in grouped_elements:
    print(group)

I am getting this output:

[
    'attr1 apple 1',
    'attr1 banana 2', 
    'attr2 grapes 1',
    'attr2 oranges 2', 
    'attr3 watermelon 0'
]

2 answers

+1
−0

You could create a dictionary to map each attribute to its respective list of items. Then you get the dictionary values to create the final list.

Something like this:

import re
pattern = re.compile(r'attr\d+')

# just to simulate a "file"
file = [ 'attr1 apple 1', 'attr1 banana 2', 'attr2 grapes 1', 'attr2 oranges 2', 'attr3 watermelon 0' ]

##############################################################
all_attrs = {} # dictionary to map each attribute to its items
for line in file:
    # regex pattern match
    if pattern.search(line):
        attr, item = line.strip().split(maxsplit=1)
        # if attr is not in the dictionary, create an empty list for it
        # add item to attr's list
        all_attrs.setdefault(attr, []).append(f'{attr} {item}')

# get all the sub-lists and create a list with them
grouped_elements = list(all_attrs.values())
print(grouped_elements) # [['attr1 apple 1', 'attr1 banana 2'], ['attr2 grapes 1', 'attr2 oranges 2'], ['attr3 watermelon 0']]

When reading the input, you map each attribute to a list. setdefault(attr, []) creates a new list if the attribute is not in the dictionary yet, otherwise it returns the existing list. Then I add the current string ("attribute + item name") to this list.

By the end, the dictionary will have all attributes as keys ("attr1", "attr2", etc), and their respective values will be the lists with the strings associated with that attribute - so "attr1" key will have the list ['attr1 apple 1', 'attr1 banana 2'] as value, and so on.

To get the final list, just take all the dictionary values and convert them to a list.
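
As a variation on the same idea, `collections.defaultdict` from the standard library can stand in for `setdefault`. This is only a minimal sketch: the `group_by_attr` helper and the `lines` data are made up for illustration, and it assumes the attribute is always the first whitespace-separated token of a line.

```python
from collections import defaultdict

def group_by_attr(lines):
    """Group lines by their first whitespace-separated token (hypothetical helper)."""
    groups = defaultdict(list)  # a missing key starts out as an empty list
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        attr = line.split(maxsplit=1)[0]
        groups[attr].append(line)
    # dicts keep insertion order (Python 3.7+), so groups come out in first-seen order
    return list(groups.values())

# sample data, same as the simulated "file" in the question
lines = ['attr1 apple 1', 'attr1 banana 2', 'attr2 grapes 1', 'attr2 oranges 2', 'attr3 watermelon 0']
print(group_by_attr(lines))
```

Because the grouping is keyed on the attribute, this also works when lines for the same attribute are not adjacent in the input.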


As a side note, you can also use the regex to extract the attribute and item names directly, instead of splitting the string:

import re
pattern = re.compile(r'(attr\d+) ([^\n]+)')

all_attrs = {} # dictionary to map each attribute to its items
for line in file:
    match = pattern.match(line)
    if match:
        attr, item = match.group(1, 2)
        all_attrs.setdefault(attr, []).append(f'{attr} {item}')

Now the regex has two capturing groups (each pair of parentheses is a group): the first one has the attribute name, and the second one has the rest of the string, except for the newline at the end (thus eliminating the need to call strip()).

And if you're using Python >= 3.8, you can use an Assignment Expression:

for line in file:
    if match := pattern.match(line): # assignment expression: assigns "match" and tests it on the same line
        attr, item = match.group(1, 2)
        # ... the rest is the same

Of course you can change the regex to match a specific pattern (such as "items must have only letters or numbers", etc). But the exact format wasn't specified, so I'm assuming it's just "everything after the attribute name".
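
For instance, a stricter pattern might look like the sketch below. The format it enforces (an item name made of letters only, followed by a non-negative integer) is an assumption invented for illustration, since the question doesn't define one; `strict_pattern` and `samples` are likewise hypothetical names.

```python
import re

# Assumed format: "attr<digits> <letters> <digits>", e.g. "attr1 apple 1"
strict_pattern = re.compile(r'(attr\d+) ([A-Za-z]+) (\d+)')

samples = ['attr1 apple 1\n', 'attr2 two words 3\n', 'attr3 watermelon 0\n']

accepted = []
for line in samples:
    # fullmatch requires the whole stripped line to fit the pattern
    match = strict_pattern.fullmatch(line.strip())
    if match:
        attr, name, number = match.groups()
        accepted.append((attr, name, int(number)))

print(accepted)  # [('attr1', 'apple', 1), ('attr3', 'watermelon', 0)]
```

The second sample is rejected because its item name contains a space, which the assumed format doesn't allow.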


Finally, to get the formatted output, you can use the json module:

import json
print(json.dumps(grouped_elements, indent=2))

Output:

[
  [
    "attr1 apple 1",
    "attr1 banana 2"
  ],
  [
    "attr2 grapes 1",
    "attr2 oranges 2"
  ],
  [
    "attr3 watermelon 0"
  ]
]

But I guess that's beside the point. Once you have the final list, you can format it any way you want.


Alternative (considering previous edit)

A previous version of the question suggested that the file has blank lines separating each group of items, which means it'd be something like this:

attr1 item 1
attr1 item 2
              <--- blank line separating attr1 from attr2
attr2 item 4
attr2 item 5
              <--- blank line separating attr2 from attr3
attr3 item 5

I'm also assuming (as it wasn't clearly stated in the question) that the attributes are not shuffled - which means that the file has all items related to attr1, then a blank line, then all attr2's items, a blank line, and so on.

If that's the case, you just need to create a new sublist when a blank line is found:

import re
pattern = re.compile(r'(attr\d+) ([^\n]+)')

grouped_elements = []
current_group = []
for line in file:
    if match := pattern.match(line):
        attr, item = match.group(1, 2)
        current_group.append(f'{attr} {item}')
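    elif current_group:
        # a non-matching line (such as a blank one) closes the current group;
        # the check avoids appending empty groups on consecutive blank lines
        grouped_elements.append(current_group)
        current_group = []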

if current_group: # if the current group is not empty
    grouped_elements.append(current_group)

Your code didn't work because when reading the file, you discarded the blank lines, so in the second loop all attributes were considered to be in the same group.

Please note that the code above makes all the assumptions previously mentioned (file has blank lines separating each group). If that's not the case, it won't work, and the first approach using the dictionary is the preferred solution.
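
Under the same "not shuffled" assumption, `itertools.groupby` from the standard library is another option: it groups consecutive items that share a key, so the blank lines aren't even needed as separators. This is a sketch rather than a drop-in replacement; the `lines` data and the `first_token` helper are illustrative, and it assumes the attribute is the first token on each line.

```python
from itertools import groupby

# simulated file contents, with blank lines between groups
lines = ['attr1 item 1\n', 'attr1 item 2\n', '\n',
         'attr2 item 4\n', 'attr2 item 5\n', '\n',
         'attr3 item 5\n']

def first_token(line):
    """Return the attribute name, assumed to be the first token (hypothetical helper)."""
    return line.split(maxsplit=1)[0]

# drop blank lines, then group consecutive lines that share the same first token
cleaned = [line.strip() for line in lines if line.strip()]
grouped_elements = [list(group) for _, group in groupby(cleaned, key=first_token)]
print(grouped_elements)
```

Keep in mind that `groupby` only merges adjacent items, so if the same attribute shows up again later in the file it ends up in a second, separate sublist.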

+1
−0

The obvious way would be to simply start with an empty list of lists, loop through the input, and for each item decide which sublist to put it in.

It's not super dev-friendly to remember which list was which. So instead, I think it's better to construct each sublist separately, and then combine them, so that you never have the problem of digging up a list.

The easy way to do this is to do multiple passes, so you can append each sublist to the main list once it's done:

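# sample input from the question
flat = ['attr1 apple 1', 'attr1 banana 2', 'attr2 grapes 1', 'attr2 oranges 2', 'attr3 watermelon 0']

nested = []
attrs = ['attr1', 'attr2', 'attr3']
for a in attrs:
    sublist = []
    for i in flat:
        # compare against the current attribute, with a trailing space
        # so that 'attr1' doesn't also match 'attr10'
        if i.startswith(a + ' '):
            sublist.append(i)
    if sublist:
        nested.append(sublist)

print(nested)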

This may strike you as wildly inefficient, but it's not so bad. If the input has N items and there are K attributes, it's only O(K*N), which is not terrible. Furthermore, the runtime of each individual iteration is dominated by sublist.append, which is more costly than str.startswith. append is called only O(N) times in total, and only startswith is evaluated O(K*N) times, which is pretty tolerable.
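
To make those counts concrete, here is a small sketch that tallies how many prefix checks and appends the multi-pass loop performs. The `flat` data is the sample input from the question, and the `checks` and `appends` counters are added purely for illustration.

```python
flat = [
    'attr1 apple 1', 'attr1 banana 2',
    'attr2 grapes 1', 'attr2 oranges 2',
    'attr3 watermelon 0',
]
attrs = ['attr1', 'attr2', 'attr3']

checks = 0   # number of startswith calls
appends = 0  # number of append calls on a sublist
nested = []
for a in attrs:
    sublist = []
    for i in flat:
        checks += 1
        if i.startswith(a + ' '):  # trailing space so 'attr1' doesn't match 'attr10'
            sublist.append(i)
            appends += 1
    if sublist:
        nested.append(sublist)

print(checks, appends)  # 15 5, i.e. K*N = 3*5 checks and N = 5 appends
```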


You might not want to provide a hard-coded list of attributes. It is easy to construct one from the input:

attrs = set()
for i in flat:
    a = i.split()[0]
    attrs.add(a)

This is a set, not a list like in my code (to avoid dealing with duplicates), but you can run a for loop on it just the same.
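
One caveat: a set has no guaranteed iteration order, so the sublists may come out in a different order than the attributes appear in the input. If first-seen order matters, a dict can play the same de-duplicating role, since dicts preserve insertion order (Python 3.7+). A minimal sketch, with `flat` being the sample input from the question:

```python
flat = [
    'attr1 apple 1', 'attr1 banana 2',
    'attr2 grapes 1', 'attr2 oranges 2',
    'attr3 watermelon 0',
]

# dict.fromkeys keeps one entry per key, in the order each key is first seen
attrs = list(dict.fromkeys(i.split()[0] for i in flat))
print(attrs)  # ['attr1', 'attr2', 'attr3']
```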


The more natural way to do this would be a dictionary:

nested = {}
for i in flat:
    # Extract attribute (the first whitespace-separated token)
    a = i.split()[0]

    # initialize list if it doesn't exist
    sublist = nested.setdefault(a, [])

    # append the whole item, not just the attribute name
    sublist.append(i)

print(nested)

And the result would look like:

{ 
    "attr1": ["attr1 apple 1", "attr1 banana 2"],
    "attr2": ["attr2 grapes 1", "attr2 oranges 2"],
    "attr3": ["attr3 watermelon 0"]
}

This has the advantage of doing only O(N) iterations. Of course dictionaries themselves have different performance characteristics than lists.
