Lecture 04: Filename Matching with Globbing and find
1 Pattern Matching in Unix
A common programming need is to match a pattern to a string. This could be a simple process, such as checking whether the string is exactly what I'm looking for, as in password checking. Alternatively, it could be more nuanced. For example, we might want to describe a pattern as all strings that start with the letter "a" and end with the sequence "am". In that case, the string "adam" matches the pattern, as does "a man walks to the dam". If you didn't intend to match so much between the "a" and the "am", then you need to specify your pattern differently, such as all strings that start with the letter "a", followed by at most one other letter, followed by the sequence "am".
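As a rough illustration of how the choice of pattern changes what matches, the sketch below tests both example strings against two patterns written in the shell's glob syntax, which is introduced in the next section. The helper name matches is made up for this example. Note that a?am requires exactly one character between the "a" and the "am", so it is slightly stricter than "at most one".

```shell
# matches PATTERN STRING: print "yes" if the whole STRING fits PATTERN.
# (Hypothetical helper for this example, built on the shell's case statement.)
matches() {
    case $2 in
        $1) echo yes ;;
        *)  echo no ;;
    esac
}

matches 'a*am' 'adam'                     # yes
matches 'a*am' 'a man walks to the dam'   # yes: * also spans the spaces
matches 'a?am' 'adam'                     # yes: one character between a and am
matches 'a?am' 'a man walks to the dam'   # no: too much between a and am
```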
The process of describing patterns to match a string is well studied in computer science and is the basis of much of the theory of computation. In this class we will apply pattern matching in the shell to perform simple tasks, and, in fact, you've seen some of this already when using grep in the last lab. Generally, we will use two pattern matching techniques:
- Globbing: A shell-driven pattern matching facility that allows the user to match patterns against the names of files and directories.
- Regular Expressions: An expressive pattern matching language that can match the contents and names of files and directories through other command line tools, like sed and grep.
In this lecture, we will focus primarily on globbing, and in the lab, you will have some basic exposure to regular expressions.
2 Globbing
Globbing is a filename expansion process that gives the user a way to refer to multiple files at the same time. The key idea behind globbing is that you can use a wildcard that can match anything, and the shell then expands the pattern into the names of the matching files. Think of this like a wild card in poker: if you have a wild card, it can become any card you like. In the same way, when you use a wildcard in a pattern, it can stand in for any character you want to match, in some specified way.
There are three primary wildcards used for globbing, and the set wildcard also has a negated form:
- * – match zero or more characters
- ? – match exactly one character
- [] – match exactly one character from the set
- [^ ] – match exactly one character not from the set
2.1 Match zero or more with *
As an example, consider the contents of this directory, which holds the 201 files file.0 through file.200:
#> ls
file.0 file.115 file.132 file.15 file.167 file.184 file.200 file.38 file.55 file.72 file.9
file.1 file.116 file.133 file.150 file.168 file.185 file.21 file.39 file.56 file.73 file.90
file.10 file.117 file.134 file.151 file.169 file.186 file.22 file.4 file.57 file.74 file.91
file.100 file.118 file.135 file.152 file.17 file.187 file.23 file.40 file.58 file.75 file.92
file.101 file.119 file.136 file.153 file.170 file.188 file.24 file.41 file.59 file.76 file.93
file.102 file.12 file.137 file.154 file.171 file.189 file.25 file.42 file.6 file.77 file.94
file.103 file.120 file.138 file.155 file.172 file.19 file.26 file.43 file.60 file.78 file.95
file.104 file.121 file.139 file.156 file.173 file.190 file.27 file.44 file.61 file.79 file.96
file.105 file.122 file.14 file.157 file.174 file.191 file.28 file.45 file.62 file.8 file.97
file.106 file.123 file.140 file.158 file.175 file.192 file.29 file.46 file.63 file.80 file.98
file.107 file.124 file.141 file.159 file.176 file.193 file.3 file.47 file.64 file.81 file.99
file.108 file.125 file.142 file.16 file.177 file.194 file.30 file.48 file.65 file.82
file.109 file.126 file.143 file.160 file.178 file.195 file.31 file.49 file.66 file.83
file.11 file.127 file.144 file.161 file.179 file.196 file.32 file.5 file.67 file.84
file.110 file.128 file.145 file.162 file.18 file.197 file.33 file.50 file.68 file.85
file.111 file.129 file.146 file.163 file.180 file.198 file.34 file.51 file.69 file.86
file.112 file.13 file.147 file.164 file.181 file.199 file.35 file.52 file.7 file.87
file.113 file.130 file.148 file.165 file.182 file.2 file.36 file.53 file.70 file.88
file.114 file.131 file.149 file.166 file.183 file.20 file.37 file.54 file.71 file.89
Now suppose I want to match all files that begin with file.4 but can end in anything. The file file.4 would match that pattern, as would file.44 and file.41. To do so, I can use the * wildcard (the asterisk), like this: file.4*.
#> ls file.4*
file.4 file.40 file.41 file.42 file.43 file.44 file.45 file.46 file.47 file.48 file.49
The * symbol semantically says: "Match zero or more characters." That means file.4 matches, since zero characters follow the "4", and file.41 matches, since one character follows the "4", namely the "1". You can see more clearly that * matches zero or more characters by looking at the glob file.1*:
#> ls file.1*
file.1 file.109 file.119 file.129 file.139 file.149 file.159 file.169 file.179 file.189 file.199
file.10 file.11 file.12 file.13 file.14 file.15 file.16 file.17 file.18 file.19
file.100 file.110 file.120 file.130 file.140 file.150 file.160 file.170 file.180 file.190
file.101 file.111 file.121 file.131 file.141 file.151 file.161 file.171 file.181 file.191
file.102 file.112 file.122 file.132 file.142 file.152 file.162 file.172 file.182 file.192
file.103 file.113 file.123 file.133 file.143 file.153 file.163 file.173 file.183 file.193
file.104 file.114 file.124 file.134 file.144 file.154 file.164 file.174 file.184 file.194
file.105 file.115 file.125 file.135 file.145 file.155 file.165 file.175 file.185 file.195
file.106 file.116 file.126 file.136 file.146 file.156 file.166 file.176 file.186 file.196
file.107 file.117 file.127 file.137 file.147 file.157 file.167 file.177 file.187 file.197
file.108 file.118 file.128 file.138 file.148 file.158 file.168 file.178 file.188 file.198
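If you want to try this yourself, the following sketch recreates the directory in a scratch location and checks what the two globs expand to. It assumes a POSIX-style shell with mktemp available; the variable names demo, four and one_count are only for this example, and it is best run as a script rather than pasted into a login shell.

```shell
# Work in a throwaway directory so the demo touches nothing else.
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1

# Create empty files file.0 through file.200, as in the listing above.
i=0
while [ "$i" -le 200 ]; do
    : > "file.$i"
    i=$((i + 1))
done

# The shell expands the glob before the command runs,
# so echo just prints the argument list it was handed.
four=$(echo file.4*)
echo "$four"

# Count the matches for file.1*: file.1, file.10-19 and file.100-199.
set -- file.1*
one_count=$#
echo "$one_count"    # 111

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```

The count of 111 is one match with nothing after the "1", ten with one more digit, and one hundred with two more digits.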
2.2 Match exactly 1 with ?
There are situations when you want to match exactly one character. For example, suppose we want to list only the files file.40 through file.49, but not file.4. If the only wildcard we had was *, we would not be able to write a glob for that: file.4* also matches file.4, and, worse, if the directory also contained names like file.400, there would be no way to exclude those either.
Instead, we need a glob wildcard that does a limited match. For that we use the ? wildcard, which matches exactly one character. So we can now write the condition for file.40 through file.49 as the glob file.4?.
#> ls file.4?
file.40 file.41 file.42 file.43 file.44 file.45 file.46 file.47 file.48 file.49
Notice that file.4 does not match file.4? because nothing follows the "4", and the ? wildcard requires exactly one more character. You can also place the ? in the middle of a glob.
#> ls file.1?5
file.105 file.115 file.125 file.135 file.145 file.155 file.165 file.175 file.185 file.195
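The difference between * and ? is easiest to see side by side. The sketch below is a small experiment under a few assumptions: it uses a scratch directory from mktemp, only a handful of files, and it adds the hypothetical file.400 mentioned earlier.

```shell
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1

# file.400 is the hypothetical extra file mentioned in the text.
touch file.4 file.40 file.41 file.49 file.400

star=$(echo file.4*)     # zero or more characters after the 4
qmark=$(echo file.4?)    # exactly one character after the 4

echo "$star"     # file.4 file.40 file.400 file.41 file.49
echo "$qmark"    # file.40 file.41 file.49

cd / && rm -rf "$demo"
```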
On your own: what would file.1?? match?
2.3 Match from a set with [] and [^ ]
Finally, to complete the matching capabilities of globs, we need a way to match from a set of choices. Consider a situation where you want to match all files matching file.13? or file.15?. That is, the second digit in the file number can be either a 3 or a 5, and we can describe that using a [] wildcard, like file.1[35]?.
#> ls file.1[35]?
file.130 file.132 file.134 file.136 file.138 file.150 file.152 file.154 file.156 file.158
file.131 file.133 file.135 file.137 file.139 file.151 file.153 file.155 file.157 file.159
You can also negate a set, which matches any single character that is not in the [].
#> ls file.1[^35]?
file.100 file.108 file.116 file.124 file.142 file.160 file.168 file.176 file.184 file.192
file.101 file.109 file.117 file.125 file.143 file.161 file.169 file.177 file.185 file.193
file.102 file.110 file.118 file.126 file.144 file.162 file.170 file.178 file.186 file.194
file.103 file.111 file.119 file.127 file.145 file.163 file.171 file.179 file.187 file.195
file.104 file.112 file.120 file.128 file.146 file.164 file.172 file.180 file.188 file.196
file.105 file.113 file.121 file.129 file.147 file.165 file.173 file.181 file.189 file.197
file.106 file.114 file.122 file.140 file.148 file.166 file.174 file.182 file.190 file.198
file.107 file.115 file.123 file.141 file.149 file.167 file.175 file.183 file.191 file.199
Note that a set glob is like a ? in that it matches exactly one character, so the glob file.[1][1][1] will only match file.111.
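Here is a small sketch of the set wildcards on a handful of files. One assumption to flag: it writes the negated set as [!35], which is the spelling POSIX specifies; bash also accepts the [^35] form used above, but other shells may not.

```shell
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
touch file.111 file.120 file.130 file.135 file.150 file.160

in_set=$(echo file.1[35]?)        # middle digit is 3 or 5
not_in_set=$(echo file.1[!35]?)   # middle digit is anything else
single=$(echo file.[1][1][1])     # each set matches exactly one character

echo "$in_set"        # file.130 file.135 file.150
echo "$not_in_set"    # file.111 file.120 file.160
echo "$single"        # file.111

cd / && rm -rf "$demo"
```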
2.4 Subdirectory Matching
Globs can be used to match subdirectories. Consider the following directory layout:
#> ls
dir.a/ dir.ab/ dir.ad/ dir.ba/ dir.bc/ dir.be/ dir.ca/ dir.cc/ dir.ce/ dir.da/ dir.dc/ dir.de/ dir.eb/ dir.ed/
dir.aa/ dir.ac/ dir.b/ dir.bb/ dir.bd/ dir.c/ dir.cb/ dir.cd/ dir.d/ dir.db/ dir.dd/ dir.ea/ dir.ec/ dir.ee/
Each directory has the following files, which you can explore with ls and a glob.
#> ls *
dir.a:
file.0 file.10 file.12 file.14 file.16 file.18 file.2 file.3 file.5 file.7 file.9
file.1 file.11 file.13 file.15 file.17 file.19 file.20 file.4 file.6 file.8
dir.aa:
file.0 file.10 file.12 file.14 file.16 file.18 file.2 file.3 file.5 file.7 file.9
file.1 file.11 file.13 file.15 file.17 file.19 file.20 file.4 file.6 file.8
(...)
The file expansion mechanism allows the user to match against the entire path. Consider an individual file, file.10. Suppose we wanted to match all instances of that file across the subdirectories. We can use the following pattern:
#> ls */file.10
dir.a/file.10 dir.ad/file.10 dir.bc/file.10 dir.ca/file.10 dir.ce/file.10 dir.dc/file.10 dir.eb/file.10
dir.aa/file.10 dir.b/file.10 dir.bd/file.10 dir.cb/file.10 dir.d/file.10 dir.dd/file.10 dir.ec/file.10
dir.ab/file.10 dir.ba/file.10 dir.be/file.10 dir.cc/file.10 dir.da/file.10 dir.de/file.10 dir.ed/file.10
dir.ac/file.10 dir.bb/file.10 dir.c/file.10 dir.cd/file.10 dir.db/file.10 dir.ea/file.10 dir.ee/file.10
All the directories match the *, and thus the pattern refers to file.10 as it exists in each of the subdirectories. These can be built upon each other in ever more complex ways. For example, here is a pattern to match all files ending with a 5 or a 0, but not file.5 or file.0, and only in directories ending in "a".
#> ls dir.*a/file.?[05]
dir.a/file.10 dir.aa/file.10 dir.ba/file.10 dir.ca/file.10 dir.da/file.10 dir.ea/file.10
dir.a/file.15 dir.aa/file.15 dir.ba/file.15 dir.ca/file.15 dir.da/file.15 dir.ea/file.15
dir.a/file.20 dir.aa/file.20 dir.ba/file.20 dir.ca/file.20 dir.da/file.20 dir.ea/file.20
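To experiment with path globs, the sketch below builds a cut-down version of this layout (five directories with five files each, an assumption made to keep it short) and counts how many paths each pattern expands to. It counts matches rather than printing them because the sort order of expanded paths can vary with the locale.

```shell
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1

# A cut-down version of the layout above: five directories, five files each.
for d in dir.a dir.aa dir.ab dir.b dir.ba; do
    mkdir "$d"
    for n in 0 5 10 15 20; do
        : > "$d/file.$n"
    done
done

# One file.10 per directory.
set -- */file.10
every_dir=$#

# Directories ending in "a" (dir.a, dir.aa, dir.ba), files file.10/15/20.
set -- dir.*a/file.?[05]
picked=$#

echo "$every_dir $picked"    # 5 9

cd / && rm -rf "$demo"
```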
3 File name matching with find
While globbing to search for files is incredibly useful, there are times when you need a more powerful facility to identify which files or directories you are interested in. For that, there is a very powerful Unix tool called find. For a detailed description, you should refer to the man page.
The basic find command is a lot like ls, except that it descends into every directory and prints each file and directory it finds, one per line. For example, using the same directory structure as the example above, we can find all the files and directories under the current directory using find .
#> find .
.
./dir.a
./dir.a/file.0
./dir.a/file.1
./dir.a/file.10
./dir.a/file.11
./dir.a/file.12
./dir.a/file.13
./dir.a/file.14
./dir.a/file.15
./dir.a/file.16
./dir.a/file.17
./dir.a/file.18
./dir.a/file.19
./dir.a/file.2
./dir.a/file.20
./dir.a/file.3
./dir.a/file.4
./dir.a/file.5
./dir.a/file.6
./dir.a/file.7
./dir.a/file.8
./dir.a/file.9
./dir.aa
./dir.aa/file.0
./dir.aa/file.1
./dir.aa/file.10
./dir.aa/file.11
(...)
Note that find lists all the files and directories, including . (the current directory) itself. You can also run find on a specific path:
#> find dir.aa
dir.aa
dir.aa/file.0
dir.aa/file.1
dir.aa/file.10
dir.aa/file.11
dir.aa/file.12
dir.aa/file.13
dir.aa/file.14
dir.aa/file.15
dir.aa/file.16
dir.aa/file.17
dir.aa/file.18
dir.aa/file.19
dir.aa/file.2
dir.aa/file.20
dir.aa/file.3
dir.aa/file.4
dir.aa/file.5
dir.aa/file.6
dir.aa/file.7
dir.aa/file.8
dir.aa/file.9
This may seem like just reinventing the ls wheel, but find can do much more, including matching files. These additional options are referred to as expressions for find, and we'll look at a useful subset of them. You should refer to the manual for many, many more.
3.1 Specifying the matching pattern with -name
find provides an expression for matching a file name or directory name, or, more specifically, the last item in the path. So, suppose you want to match all files whose number starts with 1.
#> find . -name "file.1*"
./dir.a/file.1
./dir.a/file.10
./dir.a/file.11
./dir.a/file.12
./dir.a/file.13
./dir.a/file.14
./dir.a/file.15
./dir.a/file.16
./dir.a/file.17
./dir.a/file.18
./dir.a/file.19
./dir.aa/file.1
./dir.aa/file.10
(...)
The pattern provided to the -name option uses the same syntax as globbing file expansion, but this poses a problem. How do we ensure that bash does not perform file expansion on the pattern before it is passed to find? To prevent bash from expanding the glob, you have to place quotes around the pattern, as in the example above.
The -name option only refers to the last part of the path and not the directories, so to search specific subdirectories you need to list them on the command line.
#> find dir.a dir.b -name "file.1*"
dir.a/file.1
dir.a/file.10
dir.a/file.11
dir.a/file.12
dir.a/file.13
dir.a/file.14
dir.a/file.15
dir.a/file.16
dir.a/file.17
dir.a/file.18
dir.a/file.19
dir.b/file.1
dir.b/file.10
dir.b/file.11
dir.b/file.12
dir.b/file.13
dir.b/file.14
dir.b/file.15
dir.b/file.16
dir.b/file.17
dir.b/file.18
dir.b/file.19
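To see why the quotes matter, the sketch below uses echo to reveal the command line that find would receive when the pattern is left unquoted. The small directory layout is an assumption chosen so that the pattern happens to match files in the current directory.

```shell
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
mkdir dir.a
touch file.1 file.10 dir.a/file.1 dir.a/file.12

# Unquoted: bash expands file.1* against the current directory first.
# echo shows the command line that find would actually receive.
unquoted=$(echo find . -name file.1*)
echo "$unquoted"    # find . -name file.1 file.10

# Quoted: the pattern reaches find intact and is applied at every level.
quoted_count=$(find . -name "file.1*" | wc -l)
echo "$quoted_count"    # 4 (two at the top level, two in dir.a)

cd / && rm -rf "$demo"
```

Note that if nothing in the current directory matches, bash by default passes the unquoted pattern through unchanged, so the mistake can go unnoticed until the day a matching file appears.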
Since find takes multiple directories to search, we can take the next step and use bash's file expansion with a find command to do the same task as before. Recall the pattern dir.*a/file.?[05]. Here's the equivalent find command:
#> find dir.*a -name "file.?[05]"
dir.a/file.10
dir.a/file.15
dir.a/file.20
dir.aa/file.10
dir.aa/file.15
dir.aa/file.20
dir.ba/file.10
dir.ba/file.15
dir.ba/file.20
dir.ca/file.10
dir.ca/file.15
dir.ca/file.20
dir.da/file.10
dir.da/file.15
dir.da/file.20
dir.ea/file.10
dir.ea/file.15
dir.ea/file.20
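One way to convince yourself that the two forms agree is to compare their results directly. The sketch below does that on a small assumed layout (three directories, five files each); note that the equivalence relies on the directories having no nested subdirectories, since find would descend into those while the glob would not.

```shell
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
for d in dir.a dir.ab dir.ba; do
    mkdir "$d"
    for n in 0 5 10 15 20; do
        : > "$d/file.$n"
    done
done

# Pure glob: the shell matches both the directory and the file part.
by_glob=$(printf '%s\n' dir.*a/file.?[05] | sort)

# find: the shell expands dir.*a into starting points (unquoted),
# while the quoted -name pattern is left for find to apply.
by_find=$(find dir.*a -name "file.?[05]" | sort)

[ "$by_glob" = "$by_find" ] && echo "same files"

cd / && rm -rf "$demo"
```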
3.2 Specifying the type with -type
While the behavior of the -name expression can be duplicated with globbing, consider a situation where you need to select only files and not directories, or only directories and not files. These conditions do not exist in file expansion, but they do in find, using the -type option. We will use two of the arguments that can be provided to -type:
- -type f : match only regular files
- -type d : match only directories
For the following example, consider a directory structure that is more complicated with nested subdirectories and files. Each directory and subdirectory is symmetric and looks the same.
#> ls
dir.a/ dir.b/ dir.c/ file.a file.b
#> ls dir.a/
file.a file.b sub.a/ sub.aa/ sub.ab/ sub.b/
#> ls dir.a/sub.a
file.0 file.10 file.15 file.20 file.5 file.a file.b
Try writing a single glob to match just the files or just the directories whose names end in "a" or "b". It's either not possible or becomes complicated very fast; however, with find, you can do so very simply.
#> find . -type d
.
./dir.a
./dir.a/sub.a
./dir.a/sub.aa
./dir.a/sub.ab
./dir.a/sub.b
./dir.b
./dir.b/sub.a
./dir.b/sub.aa
./dir.b/sub.ab
./dir.b/sub.b
./dir.c
./dir.c/sub.a
./dir.c/sub.aa
./dir.c/sub.ab
./dir.c/sub.b
Now consider identifying all directories whose names end in ".a". A -name pattern on its own doesn't work, because it matches files as well:
#> find . -name "*.a"
./dir.a
./dir.a/file.a
./dir.a/sub.a
./dir.a/sub.a/file.a
./dir.a/sub.aa/file.a
./dir.a/sub.ab/file.a
./dir.a/sub.b/file.a
./dir.b/file.a
./dir.b/sub.a
./dir.b/sub.a/file.a
./dir.b/sub.aa/file.a
./dir.b/sub.ab/file.a
./dir.b/sub.b/file.a
./dir.c/file.a
./dir.c/sub.a
./dir.c/sub.a/file.a
./dir.c/sub.aa/file.a
./dir.c/sub.ab/file.a
./dir.c/sub.b/file.a
./file.a
But if we add a -type argument, we get what we want:
#> find . -type d -name "*.a"
./dir.a
./dir.a/sub.a
./dir.b/sub.a
./dir.c/sub.a
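The effect of -type can be checked on a smaller tree. In this sketch the layout is an assumption (two top-level directories rather than three), and the counts show how the same -name pattern splits between directories and regular files.

```shell
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
mkdir -p dir.a/sub.a dir.a/sub.b dir.b/sub.a
touch file.a dir.a/file.a dir.a/sub.a/file.a dir.b/file.b

any=$(find . -name "*.a" | wc -l)             # files and directories mixed
dirs=$(find . -type d -name "*.a" | wc -l)    # dir.a, dir.a/sub.a, dir.b/sub.a
files=$(find . -type f -name "*.a" | wc -l)   # the three file.a copies

echo "$any $dirs $files"

cd / && rm -rf "$demo"
```

The first count is the sum of the other two, since every match here is either a directory or a regular file.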
3.3 Specifying the depth with -maxdepth and -mindepth
The last expression we'll cover matches a file or directory based on how far down the file system hierarchy the item is. You can do this by specifying either a minimum depth or a maximum depth.
For example, in the above directory layout, there exists a file.a at many levels:
#> find . -type f -name "*.a"
./dir.a/file.a
./dir.a/sub.a/file.a
./dir.a/sub.aa/file.a
./dir.a/sub.ab/file.a
./dir.a/sub.b/file.a
./dir.b/file.a
./dir.b/sub.a/file.a
./dir.b/sub.aa/file.a
./dir.b/sub.ab/file.a
./dir.b/sub.b/file.a
./dir.c/file.a
./dir.c/sub.a/file.a
./dir.c/sub.aa/file.a
./dir.c/sub.ab/file.a
./dir.c/sub.b/file.a
./file.a
Suppose we only want to identify those that are on the bottom level, such as ./dir.a/sub.a/file.a. We can use the -mindepth option to state that a match must be at least that deep in the file hierarchy tree.
#> find . -mindepth 3 -type f -name "*.a"
./dir.a/sub.a/file.a
./dir.a/sub.aa/file.a
./dir.a/sub.ab/file.a
./dir.a/sub.b/file.a
./dir.b/sub.a/file.a
./dir.b/sub.aa/file.a
./dir.b/sub.ab/file.a
./dir.b/sub.b/file.a
./dir.c/sub.a/file.a
./dir.c/sub.aa/file.a
./dir.c/sub.ab/file.a
./dir.c/sub.b/file.a
Note that depth is counted in path components below the starting point, which is itself at depth 0:
0 1     2      3       <-- depth
./dir.c/sub.aa/file.a
So if we specify a -maxdepth of 2, find will list only the file.a's that exist directly under the dir.* directories and the . directory.
#> find . -maxdepth 2 -type f -name "*.a"
./dir.a/file.a
./dir.b/file.a
./dir.c/file.a
./file.a
And if you only want to identify the file.a's under the dir.* directories, you can combine -maxdepth with -mindepth:
#> find . -maxdepth 2 -mindepth 2 -type f -name "*.a"
./dir.a/file.a
./dir.b/file.a
./dir.c/file.a
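The depth options can be tried on a smaller tree as well. The sketch below assumes a find that supports -mindepth and -maxdepth (GNU and BSD find both do) and uses a reduced layout with one file.a at depth 1, two at depth 2, and two at depth 3.

```shell
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
mkdir -p dir.a/sub.a dir.b/sub.a
touch file.a dir.a/file.a dir.b/file.a dir.a/sub.a/file.a dir.b/sub.a/file.a

# Only the bottom level (depth 3 and deeper).
deep=$(find . -mindepth 3 -type f -name "*.a" | wc -l)

# The top two levels (depth 2 and shallower).
shallow=$(find . -maxdepth 2 -type f -name "*.a" | wc -l)

# Exactly depth 2: directly inside the dir.* directories.
middle=$(find . -mindepth 2 -maxdepth 2 -type f -name "*.a" | wc -l)

echo "$deep $shallow $middle"

cd / && rm -rf "$demo"
```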
4 Operations per item
Performing an action for each matched file is the first step in command line scripting. In this lab, we will first look at some simple mechanisms for doing so, and later we will expand into writing full-fledged bash scripts.
4.1 xargs
The xargs command line tool is an extremely useful way to apply an operation to the output of another operation. Consider a situation where you have the following files in your directory.
#> ls
file.0 file.13 file.18 file.22 file.27 file.31 file.36 file.40 file.45 file.5 file.9
file.1 file.14 file.19 file.23 file.28 file.32 file.37 file.41 file.46 file.50
file.10 file.15 file.2 file.24 file.29 file.33 file.38 file.42 file.47 file.6
file.11 file.16 file.20 file.25 file.3 file.34 file.39 file.43 file.48 file.7
file.12 file.17 file.21 file.26 file.30 file.35 file.4 file.44 file.49 file.8
And each file is of a different, random size.
#> ls -l
total 1800
-rw-r--r-- 1 aviv staff 15296 Jan 8 09:18 file.0
-rw-r--r-- 1 aviv staff 29476 Jan 8 09:18 file.1
-rw-r--r-- 1 aviv staff  4139 Jan 8 09:18 file.10
-rw-r--r-- 1 aviv staff  4501 Jan 8 09:18 file.11
-rw-r--r-- 1 aviv staff 20022 Jan 8 09:18 file.12
-rw-r--r-- 1 aviv staff 27465 Jan 8 09:18 file.13
-rw-r--r-- 1 aviv staff 16888 Jan 8 09:18 file.14
-rw-r--r-- 1 aviv staff   293 Jan 8 09:18 file.15
(...)
We might need a way to remove only the 10 biggest files. We can get the 10 biggest files fairly easily using a combination of ls and head: ls -S sorts by size, largest first, and head -10 gives us the first 10.
#> ls -S | head -10
file.31
file.29
file.36
file.24
file.1
file.32
file.42
file.13
file.17
file.35
Now that they have been identified, all we need to do is delete those files. Conceptually, what we would like to do is apply rm to each of the files, like so:
rm file.31 file.29 file.36 file.24 file.1 file.32 file.42 file.13 file.17 file.35
This is exactly what xargs can do, but along a pipeline, by passing its input as the arguments to the operation listed on the command line.
ls -S | head -10 | xargs rm
This is a basic, single-line bash script that performs a relatively complex task in a compact space. It is the Unix design philosophy in action.
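Since rm is destructive, it can help to preview what xargs is about to run. The sketch below is one way to do that, under a few assumptions: it works on three made-up files (big, medium, small) in a scratch directory, keeps the two largest rather than ten, and uses head -n 2, the standard spelling of head -2. Putting echo in front of rm makes xargs print the command instead of executing it.

```shell
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1

# Three files of clearly different sizes (the sizes are arbitrary).
printf '%s' 'aaaaaaaaaa' > big       # 10 bytes
printf '%s' 'aaaaa'      > medium    # 5 bytes
printf '%s' 'a'          > small     # 1 byte

# Dry run: echo prints the command line xargs would build.
preview=$(ls -S | head -n 2 | xargs echo rm)
echo "$preview"    # rm big medium

# The real thing: delete the two largest files.
ls -S | head -n 2 | xargs rm
left=$(ls)
echo "$left"       # small

cd / && rm -rf "$demo"
```

One caveat worth knowing: xargs splits its input on whitespace by default, so this pipeline is only safe when file names contain no spaces or newlines, as is the case for the file.N names used here.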
4.2 find -exec
We can perform a similar action per item with find as well, using the -exec expression. Simply put, the -exec expression tells find to perform an operation on each of the found items.
find . -name pattern -type f -exec mv {} dest \;
The {} symbols get replaced with each found item, and the escaped semicolon \; marks the end of the command given to -exec. So the above find command will find all files that match the pattern and move them to some destination folder. Again, in a very compact space you can do quite complex tasks, and this is the power of command line scripting.
For example, consider a situation where we have to identify all the zero-length files and empty directories and move them into a new folder. Here's the directory structure.
#> ls
dir.a/ dir.b/ dir.c/ empty-dirs/ empty-files/
#> find . -type f -empty
./dir.a/file.3
./dir.a/file.43
./dir.a/file.9
./dir.b/file.12
./dir.b/file.33
./dir.b/file.43
./dir.b/file.8
./dir.c/file.25
./dir.c/file.33
./dir.c/file.43
./dir.c/file.55
And now we can move them to the empty-files directory, which might report some errors since some files have the same name.
#> find . -type f -empty -exec mv {} empty-files \;
And we can see that they are all in empty-files.
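The same move can be rehearsed on a tiny tree. This sketch makes two assumptions that differ from the command above: the layout is reduced to two directories with three files, and find is started from dir.a and dir.b rather than from . so that it never walks into empty-files and tries to move a file onto itself. It also relies on -empty, which GNU and BSD find provide.

```shell
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
mkdir dir.a dir.b empty-files
: > dir.a/file.3             # zero-length
: > dir.b/file.8             # zero-length
echo data > dir.a/file.4     # has content, so -empty skips it

# Run mv once per match; {} is replaced by the matched path and
# the escaped semicolon ends the command given to -exec.
find dir.a dir.b -type f -empty -exec mv {} empty-files \;

moved=$(ls empty-files | wc -l)
kept=$(ls dir.a)
echo "$moved $kept"

cd / && rm -rf "$demo"
```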
On your own: Figure out how to do the same for empty directories.
However, there are still many tasks that can't be accomplished with xargs or a -exec expression, such as redirection or executing multiple commands. For those situations, we need more advanced scripting techniques.