grep AND search for multiple words in files

+9

−0

I have text (xml actually) files. Some files contain 'foo', some contain 'bar', some contain neither and some contain both. It's the both I'm interested in.

How do I do an AND search on words in files in folders (recursively) with grep? I'm using Git Bash on Windows, so either a Cygwin or Windows 10 solution works.

I had thought piping to grep would work, as it seems to be the solution for matching multiple words on a line, but I don't think I've adapted it correctly to work across files.

This is what I tried:

$ grep -rnw . -e 'foo' | grep -e 'bar'

Can someone tell me how to fix my grep call?

grep file-handling

posted almost 2 years ago

CC BY-SA 4.0

2y ago by rene‭

mcalex‭



+9

−0

Between-lines relations are not easy to look for with grep, which is a line filter. You could use a regex that spans lines, but I find this annoying because of all the flags you have to set.

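Grep has a switch (-l) for printing the names of matching files instead of the matching lines. You can write each list to its own file. Once you have both lists, you can use comm to take their intersection. comm expects sorted input, so sort each list first.

grep -rl . -e 'foo' | sort > /tmp/foo.txt
grep -rl . -e 'bar' | sort > /tmp/bar.txt
comm -12 /tmp/foo.txt /tmp/bar.txt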

If you don't want the temp files, you can run the commands inline: https://linux.codidact.com/posts/288328/288329#answer-288329. However, I simply put the files in /tmp/, where they get wiped automatically at the next system shutdown.
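To see what the file-list intersection produces, here is a minimal sketch on a throwaway fixture. The directory layout and the file names (only_foo.xml, both.xml and so on) are made up for illustration, and it assumes a grep that supports -r and -l.

```shell
# Hypothetical fixture: four small files in a temporary directory.
demo=$(mktemp -d)
lists=$(mktemp -d)
printf 'foo\n' > "$demo/only_foo.xml"
printf 'bar\n' > "$demo/only_bar.xml"
printf 'foo\nbar\n' > "$demo/both.xml"
printf 'baz\n' > "$demo/neither.xml"

# -l prints file names only; sort because comm expects sorted input.
grep -rl -e 'foo' "$demo" | sort > "$lists/foo.txt"
grep -rl -e 'bar' "$demo" | sort > "$lists/bar.txt"

# -12 suppresses columns 1 and 2, leaving only names present in both lists.
both=$(comm -12 "$lists/foo.txt" "$lists/bar.txt")
printf '%s\n' "$both"

rm -rf "$demo" "$lists"
```

Keeping the lists outside the searched directory also stops grep from searching its own output.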

posted almost 2 years ago

CC BY-SA 4.0

2y ago

matthewsnyder‭


+7

−0

Your grep invocation will first search for files which contain foo and print a list of the lines from each which contain the word foo; the second grep invocation will take this list and filter it down to only those lines of output which also contain bar.

A complicating factor is that XML syntax is relatively complex and does not really lend itself well to line-based utilities such as grep. But ignoring that caveat...

If you want to know which files contain both foo and bar on the same line, you can use a regular expression which will match the two in either order. This is similar to my question How can I write an egrep (grep -E) regexp that matches lines containing two stanzas in arbitrary order?, but for something like your use case, one can use for example

grep -rnE '\b(foo|bar)\b.*\b(foo|bar)\b' .

This isn't perfect, since it will also match a line containing, say, foo foo. (It's probably possible to use a backreference with a negative lookahead, which would need grep -P, to match only when the second word differs from the first.)

A more robust solution for finding those files which actually contain both is to search for one, then search each of those files for the other. For this, grep's -l/--files-with-matches comes in handy.

grep -rlw foo .

will print a list of all files under the current directory which contain foo as a whole word. Adding -Z will also cause grep to terminate each file name with a NUL character instead of a newline, which protects against special characters in file names. Then use xargs (with its corresponding -0 option, which makes it expect NUL-separated rather than newline-separated input) to pass that list of files to a second grep invocation which looks for bar in each of those files:

Proposed possible solution

grep -rlwZ foo . | xargs -0 grep -lw bar

This should produce a list of all files, and only those files, under the current directory, which contain both foo and bar as whole words, but without regard to their relative location within the file.
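A minimal sketch of that two-stage search on a throwaway fixture, assuming GNU grep and GNU xargs (as shipped with Git Bash). The file names are made up, and two of them contain a space to show why the NUL separators matter. The -r on xargs is a GNU extension added here as a precaution: it skips the second grep when the first one finds nothing.

```shell
# Hypothetical fixture; note the spaces in two of the file names.
demo=$(mktemp -d)
printf 'foo\n' > "$demo/only foo.xml"
printf '<a>foo</a>\n<b>bar</b>\n' > "$demo/has both.xml"
printf 'bar\n' > "$demo/only_bar.xml"

# Stage 1: names of files containing the word foo, NUL-terminated (-Z).
# Stage 2: of those, keep the files that also contain the word bar (-l).
matches=$(grep -rlwZ foo "$demo" | xargs -0 -r grep -lw bar)
printf '%s\n' "$matches"

rm -rf "$demo"
```

Only the file holding both words is printed, and its name survives intact despite the space.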

posted almost 2 years ago

CC BY-SA 4.0

Canina‭


+5

−0

From your description ...

I have text (xml actually) files. Some files contain 'foo', some contain 'bar', some contain neither and some contain both. It's the both I'm interested in.

... I conclude that you are interested in files that contain both 'foo' and 'bar', but not necessarily on the same line. Thus I will discuss that aspect first.

'foo' and 'bar' possibly on different lines:

As an alternative to the approach proposed by @Canina this can be solved by a combination of find and grep in the following way:

find . -type f -exec grep -wq foo {} \; -exec grep -wl bar {} \;

The first -exec condition of the find command (-exec grep -wq foo {} \;) succeeds on files containing 'foo' as a whole word: The -q option instructs grep to be quiet and only report whether a match was found in the return status.

The second -exec condition (-exec grep -wl bar {} \;) operates only on the files for which the first -exec condition was true (i.e. which contained 'foo'). This time grep searches for 'bar', and due to the -l option prints the name of the file where a match was found.

Note that I left out the -n option here, as it does not seem to make sense in this case: There are possibly two lines involved (or even more if 'foo' or 'bar' occur more often), but only one would be printed.
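As a quick check of the chained -exec conditions, here is a small sketch on a throwaway fixture. The directory and file names are made up for illustration.

```shell
# Hypothetical fixture with one nested directory.
demo=$(mktemp -d)
mkdir "$demo/sub"
printf 'foo\n' > "$demo/only_foo.xml"
printf 'bar\nfoo\n' > "$demo/sub/both.xml"
printf 'foobar\n' > "$demo/glued.xml"

# The second -exec only runs for files where the first one succeeded.
found=$(find "$demo" -type f -exec grep -wq foo {} \; -exec grep -wl bar {} \;)
printf '%s\n' "$found"

expected="$demo/sub/both.xml"
rm -rf "$demo"
```

glued.xml is skipped because -w requires 'foo' and 'bar' to appear as whole words, and 'foobar' contains neither.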

'foo' and 'bar' on the same line:

The solution that you have written is able to find lines where 'foo' and 'bar' appear on the same line:

grep -rnw . -e 'foo' | grep -e 'bar'

The first grep searches for all lines with 'foo' and prints the lines that were found to stdout prefixed with the file name and the line number. This output is then piped to the second grep, which filters for those lines containing 'bar'.

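This will (mostly) work for the case where you want to find files with 'foo' and 'bar' appearing on the same line. It will report false matches if a file's path contains 'bar', because the second grep filters the whole output line, including the file name prefix. The second grep also lacks -w, so it matches 'bar' inside longer words as well.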

An alternative approach to find lines where 'foo' and 'bar' are expected on the same line is the following, which checks for lines with 'foo' followed by 'bar' or lines with 'bar' followed by 'foo':

grep -rn -E -e '(\<foo\>.*\<bar\>)|(\<bar\>.*\<foo\>)' .
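To illustrate what this pattern accepts and rejects, here is a small sketch run against a single made-up sample file. It assumes a grep that understands the \< and \> word anchors, as GNU grep does.

```shell
# Hypothetical sample: four lines, only the first two hold both words.
sample=$(mktemp)
printf '%s\n' 'foo then bar' 'bar then foo' 'foo foo' 'foobar' > "$sample"

# Either order is accepted; a repeated word or a glued word is not.
same_line=$(grep -n -E -e '(\<foo\>.*\<bar\>)|(\<bar\>.*\<foo\>)' "$sample")
printf '%s\n' "$same_line"

rm -f "$sample"
```

Lines 1 and 2 are printed with their line numbers; 'foo foo' and 'foobar' are left out.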

posted almost 2 years ago

CC BY-SA 4.0

2y ago

Dirk Herrmann‭


+3

−0

An alternative to grep is Awk, which makes this pretty easy.

To find lines which contain both:

find . -type f -exec awk '/foo/ && /bar/' {} +

(Maybe add { print FILENAME ":" FNR ":" $0 } before the closing quote if you want the filename and the line number.)

To find files which contain both:

find . -type f -exec awk 'FNR==1 { foo=0; bar=0 }
  /foo/ { ++foo }
  /bar/ { ++bar }
  # test after updating the counters, so a match on the last line still counts
  foo && bar { print FILENAME; nextfile }' {} +
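Here is a minimal sketch of the per-file check on a throwaway fixture. Two things in it are assumptions rather than part of the answer: the file names are made up, and a reported flag stands in for nextfile so that the sketch also runs on older awk implementations that lack it. The flags are tested after they are updated, so a file whose last line completes the pair is still reported.

```shell
# Hypothetical fixture: in both.xml the pair is completed on the last line.
demo=$(mktemp -d)
printf 'foo\n' > "$demo/only_foo.xml"
printf 'foo\nbar\n' > "$demo/both.xml"
printf 'bar\n' > "$demo/only_bar.xml"

hits=$(find "$demo" -type f -exec awk '
  FNR == 1 { foo = 0; bar = 0; reported = 0 }   # reset at the start of each file
  /foo/ { foo = 1 }
  /bar/ { bar = 1 }
  foo && bar && !reported { print FILENAME; reported = 1 }
' {} +)
printf '%s\n' "$hits"

expected="$demo/both.xml"
rm -rf "$demo"
```

Unlike nextfile, the flag does not stop awk from reading the rest of the file, so this trades a little speed for portability.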

posted almost 2 years ago

CC BY-SA 4.0

tripleee‭


+1

−0

You can use grep's Perl-compatible regular expressions (-P), more specifically lookaheads, to check for the presence of two (or more) words.

$ grep -Prn . -e '^(?=.*\bfoo\b)(?=.*\bbar\b)'

Here the regex checks that the beginning of the line (^) is followed by the words foo and bar somewhere on that line.

Note that since we are not actually matching the words themselves, the -w option is of no use, and we need to surround the words we filter on with boundary symbols (\b).
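The "or more" case just means one extra lookahead per word. A small sketch with three words on a made-up sample file, assuming a GNU grep built with PCRE support (the third word, baz, is only there for illustration):

```shell
# Hypothetical sample: only the first two lines hold all three words.
sample=$(mktemp)
printf '%s\n' 'foo bar baz' 'baz, then bar, then foo' 'foo bar' 'foobar baz' > "$sample"

# One lookahead per required word; their order on the line does not matter.
three=$(grep -Pn -e '^(?=.*\bfoo\b)(?=.*\bbar\b)(?=.*\bbaz\b)' "$sample")
printf '%s\n' "$three"

rm -f "$sample"
```

Lines 1 and 2 are printed; 'foo bar' lacks baz, and 'foobar baz' has no standalone foo or bar.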

Demo of similar command online here.

Demo of regex with some additional explanation can be seen here.

posted almost 2 years ago

CC BY-SA 4.0

markalex‭
