IC221: Systems Programming (SP14)



Lecture 04: Filename Matching with Globbing and find


1 Pattern Matching in Unix

A common programming need is to match a pattern to a string. This could be a simple process, such as checking whether a string is exactly what I'm looking for, as in password checking. Alternatively, the match could be more nuanced. For example, we might describe a pattern as all strings that start with the letter "a" and end with the sequence "am". In that case, the string "adam" matches the pattern, as does "a man walks to the dam". If you didn't intend to match so much between the "a" and the "am", then you need to specify your pattern differently, such as all strings that start with the letter "a", followed by at most one other letter, followed by the sequence "am".

The process of describing patterns to match strings is well studied in computer science and underlies much of the theory of computation. In this class we will apply pattern matching in the shell to perform simple tasks, and, in fact, you've seen some of this already when using grep in the last lab. Generally, we will use two pattern matching techniques:

  1. Globbing: A shell-driven pattern matching facility that allows the user to match patterns against the names of files and directories.
  2. Regular Expressions: An expressive pattern matching language that can match the contents and names of files and directories through other command line tools, like sed and grep.

In this lecture, we will focus primarily on globbing, and in the lab, you will have some basic exposure to regular expressions.

2 Globbing

The globbing process is a filename expansion that provides a way for the user to refer to multiple files at the same time. The key idea behind globbing is that you can use a wildcard that can match anything, and the shell then expands the pattern into the list of matching file names. Think of this like a wildcard in poker: if you have a wildcard, then it can become any card you like. In the same way, when you use a wildcard in a pattern, it can stand in for any character you want to match, in some specified way.

There are four primary wildcards used for globbing:

  1. * – match zero or more
  2. ? – match exactly one
  3. [] – match exactly one from the set
  4. [^ ] – match exactly one thing not from the set
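As a quick way to see the four wildcards side by side, here is a minimal sketch that builds a scratch directory and expands one pattern of each kind with echo. The file names (a, ab, abc, b1, b2, c1) are made up for illustration, and the negated set is written as [!b], the POSIX spelling of the same idea, which bash accepts alongside [^b].

```shell
# Work in a throwaway directory so nothing real is touched.
tmp=$(mktemp -d); start=$PWD
cd "$tmp" || exit 1
touch a ab abc b1 b2 c1          # hypothetical file names

star=$(echo a*)        # "a" followed by zero or more characters
one=$(echo a?)         # "a" followed by exactly one character
inset=$(echo [bc]1)    # one character from the set {b, c}, then "1"
neg=$(echo [!b]1)      # one character that is not "b", then "1"

echo "a*    -> $star"     # a ab abc
echo "a?    -> $one"      # ab
echo "[bc]1 -> $inset"    # b1 c1
echo "[!b]1 -> $neg"      # c1

cd "$start" && rm -rf "$tmp"
```

In each case the shell replaces the pattern with the matching names before echo ever runs, so echo only prints what it was handed.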

2.1 Match zero or more with *

As an example, consider the contents of this directory, which holds files named file.0 through file.200:

#> ls
file.0    file.115  file.132  file.15   file.167  file.184  file.200  file.38   file.55   file.72   file.9
file.1    file.116  file.133  file.150  file.168  file.185  file.21   file.39   file.56   file.73   file.90
file.10   file.117  file.134  file.151  file.169  file.186  file.22   file.4    file.57   file.74   file.91
file.100  file.118  file.135  file.152  file.17   file.187  file.23   file.40   file.58   file.75   file.92
file.101  file.119  file.136  file.153  file.170  file.188  file.24   file.41   file.59   file.76   file.93
file.102  file.12   file.137  file.154  file.171  file.189  file.25   file.42   file.6    file.77   file.94
file.103  file.120  file.138  file.155  file.172  file.19   file.26   file.43   file.60   file.78   file.95
file.104  file.121  file.139  file.156  file.173  file.190  file.27   file.44   file.61   file.79   file.96
file.105  file.122  file.14   file.157  file.174  file.191  file.28   file.45   file.62   file.8    file.97
file.106  file.123  file.140  file.158  file.175  file.192  file.29   file.46   file.63   file.80   file.98
file.107  file.124  file.141  file.159  file.176  file.193  file.3    file.47   file.64   file.81   file.99
file.108  file.125  file.142  file.16   file.177  file.194  file.30   file.48   file.65   file.82
file.109  file.126  file.143  file.160  file.178  file.195  file.31   file.49   file.66   file.83
file.11   file.127  file.144  file.161  file.179  file.196  file.32   file.5    file.67   file.84
file.110  file.128  file.145  file.162  file.18   file.197  file.33   file.50   file.68   file.85
file.111  file.129  file.146  file.163  file.180  file.198  file.34   file.51   file.69   file.86
file.112  file.13   file.147  file.164  file.181  file.199  file.35   file.52   file.7    file.87
file.113  file.130  file.148  file.165  file.182  file.2    file.36   file.53   file.70   file.88
file.114  file.131  file.149  file.166  file.183  file.20   file.37   file.54   file.71   file.89

Now suppose I want to match all files that begin with file.4 but can end in anything. The file file.4 would match that pattern, as would file.44 and file.41. To do so I can use a wildcard, the * or asterisk, like this: file.4*.

#>ls file.4*
file.4   file.40  file.41  file.42  file.43  file.44  file.45  file.46  file.47  file.48  file.49 

The * symbol semantically says: "Match zero or more characters." That means file.4 matches, since there are zero characters following the "4", and file.41 matches, since there is one character following the "4", the "1". You can see more clearly that * matches zero or more characters by looking at the glob file.1*:

#>ls file.1*
file.1    file.109  file.119  file.129  file.139  file.149  file.159  file.169  file.179  file.189  file.199
file.10   file.11   file.12   file.13   file.14   file.15   file.16   file.17   file.18   file.19
file.100  file.110  file.120  file.130  file.140  file.150  file.160  file.170  file.180  file.190
file.101  file.111  file.121  file.131  file.141  file.151  file.161  file.171  file.181  file.191
file.102  file.112  file.122  file.132  file.142  file.152  file.162  file.172  file.182  file.192
file.103  file.113  file.123  file.133  file.143  file.153  file.163  file.173  file.183  file.193
file.104  file.114  file.124  file.134  file.144  file.154  file.164  file.174  file.184  file.194
file.105  file.115  file.125  file.135  file.145  file.155  file.165  file.175  file.185  file.195
file.106  file.116  file.126  file.136  file.146  file.156  file.166  file.176  file.186  file.196
file.107  file.117  file.127  file.137  file.147  file.157  file.167  file.177  file.187  file.197
file.108  file.118  file.128  file.138  file.148  file.158  file.168  file.178  file.188  file.198
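If you would like to reproduce these listings, the following is a rough sketch that recreates a directory of empty files named file.0 through file.200 in a temporary location (an assumption; the original directory's contents are not known) and counts how many names each glob expands to.

```shell
tmp=$(mktemp -d); start=$PWD
cd "$tmp" || exit 1

# Create empty files file.0 through file.200.
i=0
while [ "$i" -le 200 ]; do
    : > "file.$i"
    i=$((i + 1))
done

# "set --" stores the expanded names as positional parameters; $# counts them.
set -- file.4*; n4=$#
set -- file.1*; n1=$#

echo "file.4* expands to $n4 names"    # 11: file.4 plus file.40 through file.49
echo "file.1* expands to $n1 names"    # 111: file.1, file.10-19, file.100-199

cd "$start" && rm -rf "$tmp"
```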

2.2 Match exactly 1 with ?

There are situations where you want to match exactly one character. For example, suppose we want to list only the files file.40 through file.49, but not file.4. If the only wildcard we had was the *, we would not be able to write a glob for that, and worse, if the directory also contained names like file.400, excluding them would not be possible either.

Instead, we need a glob wildcard that can do a limited match. For that we use the ? wildcard, which matches exactly one character. So we can now write the condition for file.40 through file.49 as the glob file.4?.

#>ls file.4?
file.40  file.41  file.42  file.43  file.44  file.45  file.46  file.47  file.48  file.49

Notice that file.4 does not match file.4? because nothing follows the "4" in file.4, and the ? wildcard requires exactly one more character. You can also use the ? in the middle of a glob.

#> ls file.1?5
file.105  file.115  file.125  file.135  file.145  file.155  file.165  file.175  file.185  file.195
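To make the difference between * and ? concrete, here is a small sketch using a handful of names. Note that file.400 is a hypothetical extra file added only to show the over-matching problem described earlier; it is not part of the directory listed above.

```shell
tmp=$(mktemp -d); start=$PWD
cd "$tmp" || exit 1
touch file.4 file.40 file.41 file.49 file.400   # file.400 is made up

star=$(echo file.4*)    # anything after the "4", including nothing
one=$(echo file.4?)     # exactly one character after the "4"

echo "file.4* -> $star"   # file.4 file.40 file.400 file.41 file.49
echo "file.4? -> $one"    # file.40 file.41 file.49

cd "$start" && rm -rf "$tmp"
```

Only the ? form leaves out both the too-short name file.4 and the too-long name file.400.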

On your own, what would file.1?? match?

2.3 Match from a set with [] and [^ ]

Finally, to complete the matching capabilities of globs, we need a way to match from a subset of choices. Consider a situation where you want to match all files matching file.13? or file.15?. That is, the second digit in the file name can be either a 3 or a 5, and we can describe that using a [] wildcard, like file.1[35]?.

#> ls file.1[35]?
file.130  file.132  file.134  file.136  file.138  file.150  file.152  file.154  file.156  file.158
file.131  file.133  file.135  file.137  file.139  file.151  file.153  file.155  file.157  file.159

You can also negate a set, which matches any one character that is not listed in the [].

#>ls file.1[^35]?
file.100  file.108  file.116  file.124  file.142  file.160  file.168  file.176  file.184  file.192
file.101  file.109  file.117  file.125  file.143  file.161  file.169  file.177  file.185  file.193
file.102  file.110  file.118  file.126  file.144  file.162  file.170  file.178  file.186  file.194
file.103  file.111  file.119  file.127  file.145  file.163  file.171  file.179  file.187  file.195
file.104  file.112  file.120  file.128  file.146  file.164  file.172  file.180  file.188  file.196
file.105  file.113  file.121  file.129  file.147  file.165  file.173  file.181  file.189  file.197
file.106  file.114  file.122  file.140  file.148  file.166  file.174  file.182  file.190  file.198
file.107  file.115  file.123  file.141  file.149  file.167  file.175  file.183  file.191  file.199

Note that a set glob is like a ? in that it matches exactly one character, so the glob file.[1][1][1] will only match file.111.
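The set examples can be checked with a short sketch like the one below. It assumes a scratch directory holding only file.100 through file.199, and it writes the negated set as [!35], the POSIX spelling that bash treats the same as [^35].

```shell
tmp=$(mktemp -d); start=$PWD
cd "$tmp" || exit 1

# Create empty files file.100 through file.199.
i=100
while [ "$i" -le 199 ]; do
    : > "file.$i"
    i=$((i + 1))
done

set -- file.1[35]?;    in_set=$#     # second digit is 3 or 5
set -- file.1[!35]?;   not_set=$#    # second digit is anything except 3 or 5
set -- file.[1][1][1]; exact=$*      # each set matches exactly one character

echo "file.1[35]?    -> $in_set names"    # 20
echo "file.1[!35]?   -> $not_set names"   # 80
echo "file.[1][1][1] -> $exact"           # file.111

cd "$start" && rm -rf "$tmp"
```

The two counts add up to 100, since every three-digit name starting with 1 falls into exactly one of the two sets.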

2.4 Subdirectory Matching

Globs can be used to match subdirectories. Consider the following directory layout:

#>ls
dir.a/  dir.ab/ dir.ad/ dir.ba/ dir.bc/ dir.be/ dir.ca/ dir.cc/ dir.ce/ dir.da/ dir.dc/ dir.de/ dir.eb/ dir.ed/
dir.aa/ dir.ac/ dir.b/  dir.bb/ dir.bd/ dir.c/  dir.cb/ dir.cd/ dir.d/  dir.db/ dir.dd/ dir.ea/ dir.ec/ dir.ee/

Each directory contains the following files, which you can explore by giving ls the glob *.

#>ls *
dir.a:
file.0   file.10  file.12  file.14  file.16  file.18  file.2   file.3   file.5   file.7   file.9
file.1   file.11  file.13  file.15  file.17  file.19  file.20  file.4   file.6   file.8

dir.aa:
file.0   file.10  file.12  file.14  file.16  file.18  file.2   file.3   file.5   file.7   file.9
file.1   file.11  file.13  file.15  file.17  file.19  file.20  file.4   file.6   file.8

(...)

The file expansion mechanism allows a pattern to match an entire path. Consider an individual file, file.10, and suppose we want to match all instances of that file across the subdirectories. We can use the following pattern:

#> ls */file.10
dir.a/file.10   dir.ad/file.10  dir.bc/file.10  dir.ca/file.10  dir.ce/file.10  dir.dc/file.10  dir.eb/file.10
dir.aa/file.10  dir.b/file.10   dir.bd/file.10  dir.cb/file.10  dir.d/file.10   dir.dd/file.10  dir.ec/file.10
dir.ab/file.10  dir.ba/file.10  dir.be/file.10  dir.cc/file.10  dir.da/file.10  dir.de/file.10  dir.ed/file.10
dir.ac/file.10  dir.bb/file.10  dir.c/file.10   dir.cd/file.10  dir.db/file.10  dir.ea/file.10  dir.ee/file.10

All the directories match the *, and thus the pattern refers to file.10 as it exists in each of the subdirectories. Patterns can be built upon each other in ever more complex ways. For example, here is a pattern to match all files ending with a 5 or a 0, but not file.5 or file.0, and only in directories ending in "a".

#> ls dir.*a/file.?[05]
dir.a/file.10   dir.aa/file.10  dir.ba/file.10  dir.ca/file.10  dir.da/file.10  dir.ea/file.10
dir.a/file.15   dir.aa/file.15  dir.ba/file.15  dir.ca/file.15  dir.da/file.15  dir.ea/file.15
dir.a/file.20   dir.aa/file.20  dir.ba/file.20  dir.ca/file.20  dir.da/file.20  dir.ea/file.20
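Here is a scaled-down sketch of the same path glob. The four directory names and five file names are chosen only for illustration, so the numbers differ from the listing above.

```shell
tmp=$(mktemp -d); start=$PWD
cd "$tmp" || exit 1

# Four directories, each holding file.0, file.5, file.10, file.15, file.20.
for d in dir.a dir.b dir.ba dir.bc; do
    mkdir "$d"
    for n in 0 5 10 15 20; do
        : > "$d/file.$n"
    done
done

# Directories ending in "a", and files with one character before a final 0 or 5.
set -- dir.*a/file.?[05]
count=$#
first=$1
printf '%s\n' "$@"

cd "$start" && rm -rf "$tmp"
```

Only dir.a and dir.ba survive the directory part of the pattern, and only file.10, file.15, and file.20 survive the file part, so the glob expands to six paths.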

3 File name matching with find

While globbing to search for files is incredibly useful, there are times when you need a more powerful facility to identify which files or directories you are interested in. For that, there is a very powerful Unix tool called find. For a detailed description, you should refer to the man page.

The basic find command is a lot like ls, except that it descends into every subdirectory and prints each file and directory it finds, one per line. For example, using the same directory structure as above, we can find all the local files and directories using find .

#> find .
.
./dir.a
./dir.a/file.0
./dir.a/file.1
./dir.a/file.10
./dir.a/file.11
./dir.a/file.12
./dir.a/file.13
./dir.a/file.14
./dir.a/file.15
./dir.a/file.16
./dir.a/file.17
./dir.a/file.18
./dir.a/file.19
./dir.a/file.2
./dir.a/file.20
./dir.a/file.3
./dir.a/file.4
./dir.a/file.5
./dir.a/file.6
./dir.a/file.7
./dir.a/file.8
./dir.a/file.9
./dir.aa
./dir.aa/file.0
./dir.aa/file.1
./dir.aa/file.10
./dir.aa/file.11
(...)

Note that find finds all the files and directories, including the . or current directory. You can also run find on a specific path:

#> find dir.aa
dir.aa
dir.aa/file.0
dir.aa/file.1
dir.aa/file.10
dir.aa/file.11
dir.aa/file.12
dir.aa/file.13
dir.aa/file.14
dir.aa/file.15
dir.aa/file.16
dir.aa/file.17
dir.aa/file.18
dir.aa/file.19
dir.aa/file.2
dir.aa/file.20
dir.aa/file.3
dir.aa/file.4
dir.aa/file.5
dir.aa/file.6
dir.aa/file.7
dir.aa/file.8
dir.aa/file.9

This may seem like just reinventing the ls wheel, but find can do much more, including the matching of files. These additional options are referred to as expressions for find, and we'll look at a useful subset of them. You should refer to the manual for many, many more options.

3.1 Specifying the matching pattern with -name

find provides an expression for matching a file name or directory name, or, more specifically, the last item in the path. So, suppose you want to match all files whose number starts with 1.

#> find . -name "file.1*" 
./dir.a/file.1
./dir.a/file.10
./dir.a/file.11
./dir.a/file.12
./dir.a/file.13
./dir.a/file.14
./dir.a/file.15
./dir.a/file.16
./dir.a/file.17
./dir.a/file.18
./dir.a/file.19
./dir.aa/file.1
./dir.aa/file.10
(...)

The pattern provided to the -name option uses the same syntax as globbing file expansion, but this poses a problem. How do we ensure that bash does not perform file expansion on the pattern before it is passed to find? To ensure that bash does not perform a globbing file expansion, you have to place quotes around the pattern, as in the example above.

The -name option only refers to the last part of the path and not the directories, so to search specific subdirectories you need to list them on the command line.
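To see why the quotes matter, the sketch below uses a hypothetical helper, show_args, that just prints the arguments it receives. It stands in for find so we can inspect what the shell actually passes along; the file layout is also made up for the demonstration.

```shell
tmp=$(mktemp -d); start=$PWD
cd "$tmp" || exit 1
mkdir dir.a
touch file.1 file.10 dir.a/file.1 dir.a/file.11

# Hypothetical helper: print the argument count, then each argument in brackets.
show_args() {
    printf '%s args:' "$#"
    printf ' [%s]' "$@"
    echo
}

unquoted=$(show_args find . -name file.1*)     # bash expands the glob first
quoted=$(show_args find . -name "file.1*")     # the pattern arrives intact

echo "$unquoted"   # 5 args: [find] [.] [-name] [file.1] [file.10]
echo "$quoted"     # 4 args: [find] [.] [-name] [file.1*]

cd "$start" && rm -rf "$tmp"
```

Without quotes, the pattern is replaced by the names that happen to match in the current directory, so find would never see the wildcard at all. With quotes, find receives the pattern and applies it itself in every directory it visits.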

#> find dir.a dir.b -name "file.1*" 
dir.a/file.1
dir.a/file.10
dir.a/file.11
dir.a/file.12
dir.a/file.13
dir.a/file.14
dir.a/file.15
dir.a/file.16
dir.a/file.17
dir.a/file.18
dir.a/file.19
dir.b/file.1
dir.b/file.10
dir.b/file.11
dir.b/file.12
dir.b/file.13
dir.b/file.14
dir.b/file.15
dir.b/file.16
dir.b/file.17
dir.b/file.18
dir.b/file.19

Since find takes multiple subdirectories to search, we can take the next step and use bash's file expansion with a find command to do the same tasks as before. Recall the pattern dir.*a/file.?[05]. Here's the equivalent find command:

#> find  dir.*a -name "file.?[05]"
dir.a/file.10
dir.a/file.15
dir.a/file.20
dir.aa/file.10
dir.aa/file.15
dir.aa/file.20
dir.ba/file.10
dir.ba/file.15
dir.ba/file.20
dir.ca/file.10
dir.ca/file.15
dir.ca/file.20
dir.da/file.10
dir.da/file.15
dir.da/file.20
dir.ea/file.10
dir.ea/file.15
dir.ea/file.20

3.2 Specifying the type with -type

While the behavior of the -name expression can be duplicated with globbing, consider a situation where you need to select only files and not directories, or only directories and not files. These conditions cannot be expressed in file expansion, but they can in find using the -type option. Two commonly used arguments to -type are:

  1. -type f : match only files
  2. -type d : match only directories

For the following example, consider a directory structure that is more complicated with nested subdirectories and files. Each directory and subdirectory is symmetric and looks the same.

#> ls
dir.a/  dir.b/  dir.c/  file.a  file.b
#> ls dir.a/
file.a  file.b  sub.a/  sub.aa/ sub.ab/ sub.b/
#> ls dir.a/sub.a
file.0   file.10  file.15  file.20  file.5   file.a   file.b

Try writing a single glob that matches just the files, or just the directories, whose names end in "a" or "b". It's either not possible or becomes complicated very fast; with find, however, you can do so very simply.

#> find . -type d
.
./dir.a
./dir.a/sub.a
./dir.a/sub.aa
./dir.a/sub.ab
./dir.a/sub.b
./dir.b
./dir.b/sub.a
./dir.b/sub.aa
./dir.b/sub.ab
./dir.b/sub.b
./dir.c
./dir.c/sub.a
./dir.c/sub.aa
./dir.c/sub.ab
./dir.c/sub.b

Now consider identifying all directories whose names end in ".a". A -name pattern on its own doesn't work, because it also matches files:

#> find . -name "*.a"
./dir.a
./dir.a/file.a
./dir.a/sub.a
./dir.a/sub.a/file.a
./dir.a/sub.aa/file.a
./dir.a/sub.ab/file.a
./dir.a/sub.b/file.a
./dir.b/file.a
./dir.b/sub.a
./dir.b/sub.a/file.a
./dir.b/sub.aa/file.a
./dir.b/sub.ab/file.a
./dir.b/sub.b/file.a
./dir.c/file.a
./dir.c/sub.a
./dir.c/sub.a/file.a
./dir.c/sub.aa/file.a
./dir.c/sub.ab/file.a
./dir.c/sub.b/file.a
./file.a

But, if we use a -type argument, we get what we want:

#>find .  -type d -name "*.a"
./dir.a
./dir.a/sub.a
./dir.b/sub.a
./dir.c/sub.a
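The effect of adding -type can be tried on a much smaller tree. This sketch assumes a made-up layout with a few directories and files ending in ".a"; the output is piped through sort only so that the order is predictable.

```shell
tmp=$(mktemp -d); start=$PWD
cd "$tmp" || exit 1
mkdir -p dir.a/sub.a dir.a/sub.b dir.b/sub.a
touch file.a dir.a/file.a dir.a/sub.a/file.a dir.b/sub.a/file.b

any=$(find . -name "*.a" | LC_ALL=C sort)            # files and directories
dirs=$(find . -type d -name "*.a" | LC_ALL=C sort)   # directories only
files=$(find . -type f -name "*.a" | LC_ALL=C sort)  # regular files only

echo "directories:"; echo "$dirs"
echo "files:";       echo "$files"

cd "$start" && rm -rf "$tmp"
```

Here -name alone matches six paths, while the two -type variants split them into three directories and three files.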

3.3 Specifying the depth with -maxdepth and -mindepth

The last expression we'll cover matches a file or directory based on how far down the file system hierarchy the item is. You can do this by specifying a minimum depth, a maximum depth, or both.

For example, in the above directory layout, there exists a file.a at many levels:

#> find .  -type f -name "*.a"
./dir.a/file.a
./dir.a/sub.a/file.a
./dir.a/sub.aa/file.a
./dir.a/sub.ab/file.a
./dir.a/sub.b/file.a
./dir.b/file.a
./dir.b/sub.a/file.a
./dir.b/sub.aa/file.a
./dir.b/sub.ab/file.a
./dir.b/sub.b/file.a
./dir.c/file.a
./dir.c/sub.a/file.a
./dir.c/sub.aa/file.a
./dir.c/sub.ab/file.a
./dir.c/sub.b/file.a
./file.a

Suppose we only want to identify those that are on the bottom level, such as ./dir.a/sub.a/file.a. We can use the -mindepth option to state that an item must be at least that far down the file hierarchy tree.

#> find . -mindepth 3 -type f -name "*.a"
./dir.a/sub.a/file.a
./dir.a/sub.aa/file.a
./dir.a/sub.ab/file.a
./dir.a/sub.b/file.a
./dir.b/sub.a/file.a
./dir.b/sub.aa/file.a
./dir.b/sub.ab/file.a
./dir.b/sub.b/file.a
./dir.c/sub.a/file.a
./dir.c/sub.aa/file.a
./dir.c/sub.ab/file.a
./dir.c/sub.b/file.a

Note that depth is determined by counting the number of levels below the starting point, which itself has depth 0.

0   1     2      3    <-- depth
./dir.c/sub.aa/file.a

So if we specify a -maxdepth of 2, find lists only the file.a's that sit directly in the . directory or directly in the dir.* directories.

#> find . -maxdepth 2 -type f -name "*.a"
./dir.a/file.a
./dir.b/file.a
./dir.c/file.a
./file.a

And if you only want to identify the file.a's under the dir.* directories, you can combine -maxdepth with -mindepth:

#> find . -maxdepth 2 -mindepth 2 -type f -name "*.a"
./dir.a/file.a
./dir.b/file.a
./dir.c/file.a
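As a compact check of how the depth options slice a tree, here is a sketch with one file.a at each of three depths. The layout is a made-up, reduced version of the one above.

```shell
tmp=$(mktemp -d); start=$PWD
cd "$tmp" || exit 1
mkdir -p dir.a/sub.a
touch file.a dir.a/file.a dir.a/sub.a/file.a   # depths 1, 2, and 3

top=$(find . -maxdepth 1 -type f -name "*.a")               # depth 1 only
mid=$(find . -mindepth 2 -maxdepth 2 -type f -name "*.a")   # exactly depth 2
deep=$(find . -mindepth 3 -type f -name "*.a")              # depth 3 and below

echo "top:  $top"    # ./file.a
echo "mid:  $mid"    # ./dir.a/file.a
echo "deep: $deep"   # ./dir.a/sub.a/file.a

cd "$start" && rm -rf "$tmp"
```

Setting -mindepth and -maxdepth to the same value selects a single level of the tree.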

4 Operations per item

Performing an action for each matched file is the first step in command line scripting. In this lab, we will first look at some simple mechanisms for doing so, and we will later expand into writing full-fledged bash scripts.

4.1 xargs

The xargs command line tool is an extremely useful way to apply an operation to the output of another operation. Consider a situation where you have the following files in your directory.

#> ls
file.0   file.13  file.18  file.22  file.27  file.31  file.36  file.40  file.45  file.5   file.9
file.1   file.14  file.19  file.23  file.28  file.32  file.37  file.41  file.46  file.50
file.10  file.15  file.2   file.24  file.29  file.33  file.38  file.42  file.47  file.6
file.11  file.16  file.20  file.25  file.3   file.34  file.39  file.43  file.48  file.7
file.12  file.17  file.21  file.26  file.30  file.35  file.4   file.44  file.49  file.8

And each file is of a different, random size.

#> ls -l
total 1800
-rw-r--r--  1 aviv  staff  15296 Jan  8 09:18 file.0
-rw-r--r--  1 aviv  staff  29476 Jan  8 09:18 file.1
-rw-r--r--  1 aviv  staff   4139 Jan  8 09:18 file.10
-rw-r--r--  1 aviv  staff   4501 Jan  8 09:18 file.11
-rw-r--r--  1 aviv  staff  20022 Jan  8 09:18 file.12
-rw-r--r--  1 aviv  staff  27465 Jan  8 09:18 file.13
-rw-r--r--  1 aviv  staff  16888 Jan  8 09:18 file.14
-rw-r--r--  1 aviv  staff    293 Jan  8 09:18 file.15
(...)

We might need a way to remove only the 10 biggest files. We can get those 10 files fairly easily using a combination of ls and head: ls -S sorts by size, largest first, and head -10 gives us the first 10 lines.

#> ls -S | head -10
file.31
file.29
file.36
file.24
file.1
file.32
file.42
file.13
file.17
file.35

Now that they have been identified, all we need to do is delete those files. Conceptually, what we would like to do is apply rm to each of the files, like so:

rm file.31 file.29 file.36 file.24 file.1 file.32 file.42 file.13 file.17 file.35

This is exactly what xargs can do, but along a pipeline: xargs reads its input and passes it as arguments to the command named on its command line.

ls -S | head -10 | xargs rm 

This is a basic, single-line bash script that performs a relatively complex task in a compact space. It is the Unix design philosophy in action.
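The pipeline can be tried safely on throwaway data. The sketch below assumes five made-up files whose sizes grow with their number, and removes the two largest. It uses head -n 2, the current spelling of the head -2 form.

```shell
tmp=$(mktemp -d); start=$PWD
cd "$tmp" || exit 1

# file.1 is 100 bytes, file.2 is 200 bytes, ... file.5 is 500 bytes.
for n in 1 2 3 4 5; do
    dd if=/dev/zero of="file.$n" bs=100 count="$n" 2>/dev/null
done

biggest=$(ls -S | head -n 2)     # the two largest names, one per line
ls -S | head -n 2 | xargs rm     # xargs runs: rm file.5 file.4
remaining=$(echo file.*)

echo "removed:";   echo "$biggest"
echo "remaining: $remaining"     # file.1 file.2 file.3

cd "$start" && rm -rf "$tmp"
```

One caveat: xargs splits its input on whitespace by default, so this approach assumes file names without spaces or newlines, as is the case here.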

4.2 find -exec

We can perform a similar action per item with find as well, using the -exec expression. Simply put, the -exec expression tells find to run the given command on each of the found items.

find . -name pattern -type f -exec mv {} dest \;

The {} symbols get replaced with each found item, and the \; marks the end of the command given to -exec. So the above find command will find all files that match the pattern and move them to some destination folder. Again, in a very compact space, you can do quite complex tasks, and this is the power of command line scripting.

For example, consider a situation where we have to identify all the zero-length files and empty directories and move them into a new folder. Here's the directory structure.

#> ls
dir.a/       dir.b/       dir.c/       empty-dirs/  empty-files/
#> find . -type f -empty
./dir.a/file.3
./dir.a/file.43
./dir.a/file.9
./dir.b/file.12
./dir.b/file.33
./dir.b/file.43
./dir.b/file.8
./dir.c/file.25
./dir.c/file.33
./dir.c/file.43
./dir.c/file.55

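Now we can move them to the empty-files directory. Because some of the files share a name (file.43 appears in all three directories), files moved later replace earlier ones of the same name, and mv may report some errors along the way.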

#> find . -type f -empty -exec mv {} empty-files \;

And we can see that they are all in empty-files.
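A small, self-contained version of this move can be sketched as follows. The file names are made up and kept unique, and the search starts from dir.a and dir.b rather than from . so that find never walks into empty-files itself; both choices are assumptions made to keep the demonstration free of name clashes.

```shell
tmp=$(mktemp -d); start=$PWD
cd "$tmp" || exit 1
mkdir dir.a dir.b empty-files
: > dir.a/file.3              # empty
: > dir.b/file.8              # empty
echo data > dir.a/file.1      # not empty, should stay where it is

# For each empty regular file found, run: mv <that file> empty-files
find dir.a dir.b -type f -empty -exec mv {} empty-files \;

moved=$(ls empty-files)
left=$(find dir.a dir.b -type f)

echo "moved:"; echo "$moved"   # file.3 and file.8
echo "left:  $left"            # dir.a/file.1

cd "$start" && rm -rf "$tmp"
```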

On your own: Figure out how to do the same for empty directories.

However, there are still many tasks that can't be accomplished with xargs or a -exec expression, such as redirection or executing multiple commands. For those situations, we need more advanced scripting techniques.