Welcome to Software Development on Codidact!


grep AND search for multiple words in files

+9
−0

I have text (xml actually) files. Some files contain 'foo', some contain 'bar', some contain neither and some contain both. It's the both I'm interested in.

How do I do an AND search on words in files in folders (recursively) with grep? I'm using Git Bash on Windows, so either a Cygwin or a Windows 10 solution works.

I had thought piping to grep would work, as it seems to be the solution for matching multiple strings on a line, but I don't think I've adapted it correctly to work for files.

This is what I tried:

$ grep -rnw . -e 'foo' | grep -e 'bar'

Can someone tell me how to fix my grep call?


5 answers


+7
−0

Your grep invocation will first search for files which contain foo and print a list of the lines from each which contain the word foo; the second grep invocation will take this list and filter it down to only those lines of output which also contain bar.

A complicating factor is that XML syntax is relatively complex and does not really lend itself well to line-based utilities such as grep. But ignoring that caveat...

If you want to know which files contain both foo and bar on the same line, you can use a regular expression which will match the two in either order. This is similar to my question How can I write an egrep (grep -E) regexp that matches lines containing two stanzas in arbitrary order?, but for something like your use case, one can use for example

grep -rnE '\b(foo|bar)\b.*\b(foo|bar)\b' .

This isn't perfect, since it will also match a line containing, say, foo foo. (It's probably possible to use a negative lookaround to match only those lines where the other word occurs later, but lookarounds need Perl-compatible regexes, i.e. grep -P rather than grep -E.)

A more robust solution for finding those files which actually contain both is to search for one, then search each of those files for the other. For this, grep's -l/--files-with-matches comes in handy.

grep -rlw foo .

will print a list of all files under the current directory which contain foo as a whole word. Adding -Z will also cause grep to print the file names separated by NUL characters instead of newlines, which protects against special characters (such as spaces or newlines) in file names. Then use xargs (with its corresponding -0 option, which makes it expect NUL-separated rather than newline-separated input) to pass that list of files to a second grep invocation which looks for bar in each of those files:

Proposed possible solution

grep -rlwZ foo . | xargs -0 grep -lw bar

This should produce a list of all files, and only those files, under the current directory, which contain both foo and bar as whole words, but without regard to their relative location within the file.
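As a rough check of the two-stage approach, here is a minimal sketch run against a throwaway directory. The directory and file names are hypothetical, and the -r option to xargs is a GNU extension (assumed available) that skips the second grep when the first finds nothing. Note that the second grep uses -l so that it prints file names rather than matching lines.

```shell
# Hypothetical fixture: a scratch directory with four small files.
dir=$(mktemp -d)
printf '<a>foo</a>\n<b>bar</b>\n'   > "$dir/both.xml"
printf '<a>foo</a>\n'               > "$dir/only foo.xml"   # name contains a space
printf '<b>bar</b>\n'               > "$dir/only-bar.xml"
printf '<a>food</a> <b>barn</b>\n'  > "$dir/neither.xml"    # substrings, not whole words

# Stage 1 lists files containing the word foo, NUL-separated (-Z).
# Stage 2 reads that list (-0) and lists those that also contain the word bar (-l).
result=$(cd "$dir" && grep -rlwZ foo . | xargs -0 -r grep -lw bar)
echo "$result"   # prints ./both.xml

rm -rf "$dir"
```

The file with a space in its name passes through the pipeline intact because of -Z and -0; with plain newline-separated output, xargs would split it into two arguments.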


+9
−0

Between-lines relations are not easy to look for with grep, which is a line filter. You could use a regex that spans lines, but I find this annoying because of all the flags you have to set.

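Grep has a switch, -l, for printing the names of matching files instead of the matching lines. Run it once per word and save each sorted list of names to a file. Once you have both files, you can use comm -12 to get the intersection, that is, the names that appear in both lists (comm expects sorted input).

grep -rl . -e 'foo' | sort > /tmp/foo.txt
grep -rl . -e 'bar' | sort > /tmp/bar.txt
comm -12 /tmp/foo.txt /tmp/bar.txt

If you don't want the temp files, you can use the command inline: https://linux.codidact.com/posts/288328/288329#answer-288329 However, I simply put the files in /tmp/, where they get automatically wiped at next system shutdown. Keeping them outside the searched directory also stops the second grep from searching the first list.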
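For reference, one way to avoid the temp files is bash process substitution. This is only a sketch under assumptions: it requires bash (it will not work in plain sh), and the scratch directory and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical fixture: three small files in a scratch directory.
dir=$(mktemp -d)
printf 'foo\nbar\n' > "$dir/both.txt"
printf 'foo\n'      > "$dir/only-foo.txt"
printf 'bar\n'      > "$dir/only-bar.txt"

# Each <(...) behaves like a file containing that command's output.
# Both lists are sorted because comm expects sorted input;
# -12 suppresses columns 1 and 2, leaving only names common to both lists.
result=$(cd "$dir" && comm -12 <(grep -rl foo . | sort) <(grep -rl bar . | sort))
echo "$result"   # prints ./both.txt

rm -rf "$dir"
```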

1 comment thread

Likely no need for temporary files if you're using bash (3 comments)
+5
−0

From your description ...

I have text (xml actually) files. Some files contain 'foo', some contain 'bar', some contain neither and some contain both. It's the both I'm interested in.

... I conclude that you are interested in files that contain both 'foo' and 'bar', but not necessarily on the same line. Thus I will discuss that aspect first.

'foo' and 'bar' possibly on different lines:

As an alternative to the approach proposed by @Canina this can be solved by a combination of find and grep in the following way:

find . -type f -exec grep -wq foo {} \; -exec grep -wl bar {} \;

The first -exec condition of the find command (-exec grep -wq foo {} \;) succeeds on files containing 'foo' as a whole word (-w): the -q option instructs grep to be quiet and only report, via its exit status, whether a match was found.

The second -exec condition (-exec grep -wl bar {} \;) operates only on the files for which the first -exec condition was true (i.e. which contained 'foo'). This time grep searches for 'bar' and, due to the -l option, prints the name of the file where a match was found.

Note that I left out the -n option here, as it does not seem to make sense in this case: There are possibly two lines involved (or even more if 'foo' or 'bar' occur more often), but only one would be printed.
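A small sketch of the find approach on a throwaway tree, to show that it recurses and that the order of the words within a file does not matter. The directory layout and file names are hypothetical; the output is piped through sort only because find does not guarantee an order.

```shell
# Hypothetical fixture: files at two directory levels.
dir=$(mktemp -d)
mkdir "$dir/sub"
printf 'foo\nbar\n' > "$dir/both.xml"
printf 'foo\n'      > "$dir/only-foo.xml"
printf 'bar\n'      > "$dir/only-bar.xml"
printf 'bar\nfoo\n' > "$dir/sub/nested.xml"   # bar before foo, one level down

# The second -exec runs only for files where the first one succeeded.
result=$(cd "$dir" &&
  find . -type f -exec grep -wq foo {} \; -exec grep -wl bar {} \; | sort)
echo "$result"

rm -rf "$dir"
```

Because each -exec ... \; starts one grep per file, this can be slower than the xargs variant on large trees, but it needs no special handling for unusual file names.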

'foo' and 'bar' on the same line:

The solution that you have written is able to find lines where 'foo' and 'bar' appear on the same line:

grep -rnw . -e 'foo' | grep -e 'bar'

The first grep searches for all lines with 'foo' and prints the lines that were found to stdout prefixed with the file name and the line number. This output is then piped to the second grep, which filters for those lines containing 'bar'.

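This will (mostly) work for the case where you want to find files with 'foo' and 'bar' appearing on the same line. It gives false positives if a file name contains 'bar', because the second grep also sees the file-name prefix that the first grep adds to each line. And since the second grep is called without -w, it also matches 'bar' inside a longer word such as 'barn'.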

An alternative approach to find lines where 'foo' and 'bar' are expected on the same line is the following, which checks for lines with 'foo' followed by 'bar' or lines with 'bar' followed by 'foo':

grep -rn -E -e '(\<foo\>.*\<bar\>)|(\<bar\>.*\<foo\>)' .
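To compare the two same-line approaches, here is a minimal sketch with a deliberately awkward file name. The fixture is hypothetical; the \< and \> word anchors are assumed to be available, as they are in GNU grep.

```shell
# Hypothetical fixture: one file whose NAME contains bar but whose content does not.
dir=$(mktemp -d)
printf 'only foo here\n' > "$dir/bar.txt"
printf 'bar then foo\n'  > "$dir/match.txt"

# Piped approach: the second grep also sees the "./bar.txt:1:" prefix.
piped=$(cd "$dir" && grep -rnw . -e 'foo' | grep -e 'bar' | sort)

# Single-regex approach: both words must occur in the line content itself.
single=$(cd "$dir" && grep -rn -E -e '(\<foo\>.*\<bar\>)|(\<bar\>.*\<foo\>)' .)

printf 'piped:\n%s\n' "$piped"     # two lines, including the false positive from bar.txt
printf 'single:\n%s\n' "$single"   # only ./match.txt:1:bar then foo

rm -rf "$dir"
```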

+3
−0

An alternative to grep is Awk, which makes this pretty easy.

To find lines which contain both:

find . -type f -exec awk '/foo/ && /bar/' {} +

(Maybe add { print FILENAME ":" FNR ":" $0 } before the closing quote if you want the filename and the line number.)
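Spelled out, the variant with file name and line number might look like the sketch below. The scratch directory and file names are hypothetical. Note that /foo/ and /bar/ are plain substring matches here, so unlike grep -w they would also match words such as food.

```shell
# Hypothetical fixture: only line 2 of a.txt contains both words.
dir=$(mktemp -d)
printf 'foo\nbar and foo\n' > "$dir/a.txt"
printf 'bar\n'              > "$dir/b.txt"

# FNR is the line number within the current file, so the output mimics grep -n.
result=$(cd "$dir" &&
  find . -type f -exec awk '/foo/ && /bar/ { print FILENAME ":" FNR ":" $0 }' {} + | sort)
echo "$result"   # prints ./a.txt:2:bar and foo

rm -rf "$dir"
```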

To find files which contain both:

find . -type f -exec awk 'FNR==1 { foo=0; bar=0 }
  /foo/ { ++foo }
  /bar/ { ++bar }
  # test after updating the counters, so a match on the last line still counts
  foo && bar { print FILENAME; nextfile }' {} +

1 comment thread

I think the "find _files_ which contain both" is the relevant part here, in particular since in case ... (2 comments)
+1
−0

You can use grep's Perl-compatible regular expressions (the -P option), more specifically lookaheads, to check for the presence of two (or more) words on a line.

$ grep -Prn . -e '^(?=.*\bfoo\b)(?=.*\bbar\b)'

Here the regex checks that the beginning of the line (^) is followed, somewhere on that line, by the word foo and by the word bar, in either order.

Note that since the lookaheads do not actually consume the words, the -w option is of no use here; instead, each word has to be wrapped in \b word-boundary assertions.
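The same pattern extends to more words by adding one lookahead per word. A minimal sketch with a third, hypothetical word baz, assuming a grep built with PCRE support (-P is not available in every grep):

```shell
# Hypothetical fixture: three lines, only the first contains all three whole words.
dir=$(mktemp -d)
printf 'baz bar foo\nfoo bar\nfood bar baz\n' > "$dir/sample.txt"

# One (?=...) lookahead per required word; their order in the line is irrelevant.
result=$(cd "$dir" && grep -Prn . -e '^(?=.*\bfoo\b)(?=.*\bbar\b)(?=.*\bbaz\b)')
echo "$result"   # prints ./sample.txt:1:baz bar foo

rm -rf "$dir"
```

Line 3 is rejected because food does not satisfy \bfoo\b, which is the job -w would otherwise have done.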

Demo of similar command online here.

Demo of regex with some additional explanation can be seen here.
