Welcome to Software Development on Codidact!

How can I write an egrep (grep -E) regexp that matches lines containing two stanzas in arbitrary order?

−1

I have line-based data of the form

x1=y2; a3=b4; c5=d6; ...

Matching this with an extended regular expression is fairly straightforward; for example, one can do something like

^([^;]+; )*x1=y2; ([^;]+; )*c5=d6;

to match the x1=y2 stanza anywhere within the input, with the c5=d6 stanza occurring somewhere after it.
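A quick way to see the limitation is to run that pattern against both orderings. This is only a sketch: the helper name match_ordered and the two sample lines are assumptions made for illustration.

```shell
# Ordered pattern from above: x1=y2 must come before c5=d6.
# match_ordered is a hypothetical helper name used only for this sketch.
match_ordered() {
    printf '%s\n' "$1" | grep -E -q '^([^;]+; )*x1=y2; ([^;]+; )*c5=d6;'
}

match_ordered 'x1=y2; a3=b4; c5=d6;' && echo "first order: match"
match_ordered 'a3=b4; c5=d6; x1=y2;' || echo "second order: no match"
```

The second line is valid data but is rejected, because the pattern hard-codes the order of the two stanzas.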

However, the syntax for this data allows the tuples to be listed in any order, so it's just as valid to have as input

a3=b4; c5=d6; x1=y2; ...

How can I write an extended regular expression that matches either of these inputs, while requiring that both the x1=y2 and c5=d6 stanzas are present (with those respective values)? Can that even be done without having to repeat either or resorting to more advanced processing than pure regular expressions (such as, for example, awk or Perl)?

regex grep

posted over 4 years ago

CC BY-SA 4.0

4y ago by Alexei‭

Canina‭

1499 reputation 2 23 151 31

3 answers

−2

I think such a regexp is overkill and would probably cause confusion later.

If you are using Bash or a similar shell, what about taking advantage of the fact that the file can be sourced, and using its assignments?

So if you have a file with this content:

x1=y2

And you source it (. file or source file), then you have the variable $x1 ready for use.

With that in mind, we can do:

( source file; [ "$x1" = "y2" ] && [ "$c5" = "d6" ] && echo "yes" || echo "no" )

This sources the file and then checks if the values of vars $x1 and $c5 are the ones we want. If so, it prints a plain "yes"; otherwise, a "no".
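As a rough, self-contained illustration of the same idea, the sketch below wraps the check in a function and runs it on a temporary file. The name check_file and the sample data are assumptions; note also that sourcing executes the file as shell code, so this presumes the input is trusted.

```shell
# Sketch only: sourcing runs the file as shell code, so this assumes trusted input.
# check_file is a hypothetical helper; the temp file stands in for the real data.
check_file() {
    # The subshell keeps the sourced variables out of the calling shell.
    # ${var-} expands to empty if the variable was never assigned.
    ( . "$1"; [ "${x1-}" = "y2" ] && [ "${c5-}" = "d6" ] && echo "yes" || echo "no" )
}

tmp=$(mktemp)
printf '%s\n' 'a3=b4; c5=d6; x1=y2;' > "$tmp"
check_file "$tmp"
rm -f "$tmp"
```

Because each assignment is executed independently, the order of the stanzas in the file does not matter.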

posted over 4 years ago

CC BY-SA 4.0

fedorqui‭

27 reputation 1 1 3 0

−0

Can that even be done without having to repeat either or resorting to more advanced processing than pure regular expressions

I don't think it can. If you don't want to repeat x1=y2 and c5=d6, you'll have to use more advanced features, such as lookaheads:

grep -P "^(?=([^;]+; )*x1=y2)(?=([^;]+; )*c5=d6)" your_input

The -P option tells grep to use PCRE, which supports the lookahead feature (not supported by grep's default BRE). You can check all the differences between regex flavors in this table (there are 2 comboboxes at the top, that you can use to choose different regex flavors to be compared).

Anyway, the idea of a lookahead is to... look ahead from the current position, searching for whatever is between (?= and ). So this regex has 2 lookaheads.

The first one: (?=([^;]+; )*x1=y2) searches for zero or more occurrences of ([^;]+; ) (which is one or more characters that are not ;, followed by a ; and a space), and then followed by x1=y2.

The "trick" is that a lookahead only "takes a look", and if it finds the match, it "comes back" to the position where it started (which is, in this case, the anchor ^ - the beginning of the string). So, this lookahead checks if anywhere in the string there's an x1=y2 at the start of a stanza, and then it "comes back" to the beginning, and proceeds to evaluate the rest of the expression.

The next part of the expression is another lookahead, which is very similar to the first and checks if anywhere in the string there's a c5=d6.

If both x1=y2 and c5=d6 exist, their respective lookaheads succeed and the regex reports a match. And this happens regardless of their relative order: x1=y2 can be either before or after c5=d6. That's because both lookaheads start searching from the beginning of the string.

If one of them is not in the string, the respective lookahead fails and the regex doesn't match.
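The behaviour described above can be checked with a small sketch like the following. It assumes a grep built with PCRE support (GNU grep's -P option); the helper name has_both_pcre and the sample lines are made up for illustration.

```shell
# Assumes a grep built with PCRE support (-P), such as GNU grep.
# has_both_pcre is a hypothetical helper name for this sketch.
has_both_pcre() {
    printf '%s\n' "$1" | grep -P -q '^(?=([^;]+; )*x1=y2)(?=([^;]+; )*c5=d6)'
}

for line in 'x1=y2; a3=b4; c5=d6;' 'a3=b4; c5=d6; x1=y2;' 'a3=b4; x1=y2;'; do
    if has_both_pcre "$line"; then
        echo "match:    $line"
    else
        echo "no match: $line"
    fi
done
```

The first two lines match regardless of stanza order; the third lacks c5=d6, so its second lookahead fails.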

Unfortunately, with BRE or ERE, you'll have to repeat x1=y2 and c5=d6 (make one alternative where x1=y2 comes first, and another one where it comes second). Something like this:

grep -E "^(([^;]+; )*x1=y2; ([^;]+; )*c5=d6;|([^;]+; )*c5=d6; ([^;]+; )*x1=y2;)" your_input
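If the repetition is mainly a maintenance concern, one option is to let the shell assemble the two-branch pattern so that each stanza is typed only once in the script. This is a sketch under the assumption that the stanzas contain no regex metacharacters; the names s1, s2, any, pattern and has_both_ere are hypothetical. The resulting regex still repeats both stanzas.

```shell
# Sketch: write each stanza once and let the shell build the two-branch ERE.
# Assumes the stanzas contain no regex metacharacters.
# s1, s2, any, pattern and has_both_ere are hypothetical names.
s1='x1=y2;'
s2='c5=d6;'
any='([^;]+; )*'
pattern="^(${any}${s1} ${any}${s2}|${any}${s2} ${any}${s1})"

has_both_ere() {
    printf '%s\n' "$1" | grep -E -q "$pattern"
}

has_both_ere 'a3=b4; c5=d6; x1=y2;' && echo "match"
has_both_ere 'a3=b4; x1=y2; x1=y2;' || echo "no match"
```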

The regex suggested by the other answer doesn't work, because it doesn't require both x1=y2 and c5=d6 to be in the string: it also matches a line containing just one of them twice, such as a3=b4; x1=y2; x1=y2; ... (see here).
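The false positive can be reproduced with a short sketch. The pattern is the {2} regex from the other answer, written here with c5=d6 as in the question; the names twice_pattern and matches_twice are hypothetical.

```shell
# The {2} pattern from the other answer, written with c5=d6 as in the question.
# twice_pattern and matches_twice are hypothetical names.
twice_pattern='^(([^;]+; )*(x1=y2;|c5=d6;) ?){2}'

matches_twice() {
    printf '%s\n' "$1" | grep -E -q "$twice_pattern"
}

# This line has no c5=d6 stanza at all, yet the pattern accepts it.
matches_twice 'a3=b4; x1=y2; x1=y2;' && echo "matched without c5=d6"
```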

Another solution is to use a script to read the lines and check if they contain everything you want:

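# IFS lists delimiter characters: each line is split on ";" and on spaces.
# The "|| [ -n ... ]" part handles a last line with no trailing newline.
while IFS="; " read -r -a line || [ -n "$line" ]
do
    x=0
    c=0
    # Quote the expansion so each token is kept as a single word
    for i in "${line[@]}"
    do
        if [ "$i" = "x1=y2" ]; then
            x=1
        elif [ "$i" = "c5=d6" ]; then
            c=1
        fi
    done
    # Two separate tests instead of the obsolescent -a operator
    if [ "$x" -eq 1 ] && [ "$c" -eq 1 ]; then
        echo "both were found"
    fi
done < your_input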

It sets IFS to the characters ; and space, so read splits each line on either of them (a ; followed by a space counts as a single delimiter) and creates an array containing all the variable=value tokens. We then loop through this array, checking whether it contains both x1=y2 and c5=d6.
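To see how the splitting behaves in isolation, here is a minimal sketch. It swaps read -a for plain read with named variables, so it also runs in shells without arrays; the function name split_demo and the variable names are assumptions.

```shell
# IFS holds a set of delimiter characters, here ";" and space.
# split_demo and the variable names are hypothetical; only three tokens are shown.
split_demo() {
    IFS='; ' read -r first second third rest <<EOF
$1
EOF
    printf '%s\n' "$first" "$second" "$third"
}

split_demo 'x1=y2; a3=b4; c5=d6;'
```

Each token is printed on its own line, with the ; and the spaces consumed as delimiters.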

Just for the record, I'd use some other programming language to process the lines. Regex is cool, but it's not always the best solution.

posted over 4 years ago

CC BY-SA 4.0

4y ago

hkotsubo‭

5235 reputation 21 70 590 239

−2

This doesn't answer the question in full generality, but the assumption made seems reasonable to me: match lines containing (x1=y2;|c5=d6;) twice. I.e.

^(([^;]+; )*(x1=y2;|c5=d6;) ?){2}

posted over 4 years ago

CC BY-SA 4.0

Peter Taylor‭

1302 reputation 7 34 141 7
