IC221: Systems Programming (SP17)


Project 01: wc : The Word Count Program

Project Preliminaries

Project Learning Goals

The goals of this project are:

  1. Write C programs without a framework
  2. Work with command line arguments
  3. Read from files and stdin to complete tasks
  4. Write functions that handle structured data

Project Grading and Due Date

This project is graded out of 100 points and is due on Mon. 20 Feb at 2359. Late submissions will not be allowed.

We will apply the following grading rubric to this project:

  • 60% Complete a wc such that it prints only the total number of words using fscanf() reading from a file specified on the command line.
  • 65% Complete a wc such that it prints the total number of words, characters, and lines using fscanf() reading from a file specified on the command line.
  • 75% Complete a wc such that it prints the total number of words, characters, and lines using fscanf() reading from any number of files specified on the command line, reporting the total across all files at the end.
  • 85% Complete a wc such that it prints the total number of words, characters, and lines using fgetc() reading from any number of files specified on the command line, reporting the total across all files at the end.
  • 95% Complete a wc such that it prints the total number of words, characters, and lines based on command line arguments using fgetc() reading from any number of files specified on the command line, reporting the total across all files at the end.
  • 100% Complete a wc such that it prints the total number of words, characters, and lines based on command line arguments using fgetc() reading from any number of files or stdin on the command line, reporting the total across all files at the end.

Additional Requirements:

  • You must provide a manual-page-like entry for your program. Place your description in man.txt.
  • You must provide a README file for your program describing tasks completed and processes used. This is also the place to provide additional details to your grader.

Project Setup:

  • Run the following command in your terminal:
~aviv/bin/ic221-up
  • Then change into the following directory:
cd ic221/proj/01
  • You will find all the material you need to complete this project in that directory.
  • Throughout this project, we will refer to ic221/proj/01 as the project directory.

Project Submission

To submit this project, place all relevant content into your project directory:

ic221/proj/01

Then run the submission script:

~aviv/bin/ic221-submit

Select the option for proj/01, and confirm. If you see SUCCESS at the end, your submission went through. You may submit multiple times up until the submission deadline. Only your final submission will be considered for grading.

Sample Solution

You can find a working version of the program on the lab machines here:

~aviv/ic221-proj/wc

You can compare your solution to this one.


Project Description

In this project, you will reimplement the command line wc utility, short for "word count." The wc utility, as its name implies, is used to count the number of words in a file, but it can also count the number of characters and lines.

As an example, here is sample output from a 100% solution of wc:

$ ./wc dickens.txt 
dickens.txt 19202 161009 936251
$ ./wc A.txt 
A.txt 1 3960 201961
$ ./wc random.txt 
random.txt 69765 295983 6795742

The first number is the number of lines, the second is the number of words, and the third is the number of characters. If multiple files are specified, the totals in each category are also reported along with individual file results:

$ ./wc dickens.txt A.txt 
dickens.txt 19202 161009 936251
A.txt 1 3960 201961
total 19203 164969 1138212
$ ./wc A.txt random.txt 
A.txt 1 3960 201961
random.txt 69765 295983 6795742
total 69766 299943 6997703

If no files are specified, then wc should read from stdin:

$ cat dickens.txt | ./wc 
-stdin- 19202 161009 936251

Yes, you should indicate "-stdin-" for the filename in this case. A user can also indicate that they wish to read from stdin using the + symbol as a file name, like in the following:

$ cat dickens.txt | ./wc A.txt + A.txt
A.txt 1 3960 201961
-stdin- 19202 161009 936251
A.txt 1 3960 201961
total 19204 168929 1340173
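
One way to handle the + convention is to map each command line name to an input stream and a display label. The sketch below is only an illustration of that idea; the helper names open_input, display_name, and close_input are hypothetical and not part of the project materials.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the stream to read for a command line name.
 * "+" selects stdin; anything else is opened as a regular file.
 * Returns NULL if the file cannot be opened. */
FILE *open_input(const char *name)
{
    if (strcmp(name, "+") == 0)
        return stdin;
    return fopen(name, "r");
}

/* Hypothetical helper: the label printed in front of the counts. */
const char *display_name(const char *name)
{
    return strcmp(name, "+") == 0 ? "-stdin-" : name;
}

/* Close a stream obtained from open_input, but never close stdin. */
void close_input(FILE *fp)
{
    if (fp != NULL && fp != stdin)
        fclose(fp);
}
```

When no file names are given at all, the same code path can be reused by treating the argument list as if it contained a single +.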

Finally, prior to the list of files, optional command line arguments can be provided to limit the output to just reporting the number of lines, number of words, or number of characters:

  • -l print number of lines
  • -w print number of words
  • -c print number of characters

For example:

$ cat dickens.txt | ./wc -l A.txt + A.txt
A.txt 1 
-stdin- 19202 
A.txt 1 
total 19204 
$ cat dickens.txt | ./wc -w A.txt + A.txt
A.txt 3960 
-stdin- 161009 
A.txt 3960 
total 168929 
$ cat dickens.txt | ./wc -c A.txt + A.txt
A.txt 201961 
-stdin- 936251 
A.txt 201961 
total 1340173

Command line arguments can be combined as well, like so:

$ cat dickens.txt | ./wc -c -l A.txt + A.txt
A.txt 1 201961 
-stdin- 19202 936251 
A.txt 1 201961 
total 19204 1340173

However, the output is always reported in line, word, character order, regardless of the order of the command line options.
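
A simple way to get the fixed ordering is to record which options were seen and then print the selected counts in one place, always testing lines first, then words, then characters. The following is a minimal sketch under that assumption; print_report and the show_* parameters are hypothetical names, and the exact spacing of the output is a guess that you should check against the sample solution.

```c
#include <stdio.h>

/* Print "name [lines] [words] [chars]" to out, always in that order.
 * show_l, show_w, show_c are hypothetical flags for -l, -w, -c. */
void print_report(FILE *out, const char *name,
                  long lines, long words, long chars,
                  int show_l, int show_w, int show_c)
{
    fprintf(out, "%s", name);
    if (show_l)
        fprintf(out, " %ld", lines);
    if (show_w)
        fprintf(out, " %ld", words);
    if (show_c)
        fprintf(out, " %ld", chars);
    fputc('\n', out);
}
```

Because the order of the if statements decides the order of the columns, ./wc -c -l and ./wc -l -c produce the same report.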

How to count

There are two methods you can choose from for counting; however, for full credit you must use fgetc(). Let's start with the simpler method of using fscanf().

You can use fscanf() to read a file multiple times with different format specifiers to determine line, word, and character counts. For words, you can use the "%s" format, which recognizes word boundaries, but you will need to specify a buffer large enough to store the resulting word, and this may fail for very long words (yes, we will test with some odd files!). You could then count characters and lines by using the "%c" format to read each character and detect newline symbols.
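
The fscanf() approach might look like the sketch below. It is not the reference solution; the function names and the 256-byte buffer are assumptions made for illustration. The field width in "%255s" keeps fscanf() from writing past the buffer, but it also means a word longer than 255 characters is returned in pieces and counted more than once, which is the weakness mentioned above.

```c
#include <stdio.h>

/* Count words with "%s", which skips whitespace and reads one word.
 * The width 255 is an arbitrary choice for this sketch: it prevents a
 * buffer overrun, but longer words are split and over-counted. */
long count_words_fscanf(FILE *fp)
{
    char buf[256];
    long words = 0;

    while (fscanf(fp, "%255s", buf) == 1)
        words++;
    return words;
}

/* Count characters and lines with "%c", which does not skip whitespace. */
void count_chars_lines_fscanf(FILE *fp, long *chars, long *lines)
{
    char ch;

    *chars = 0;
    *lines = 0;
    while (fscanf(fp, "%c", &ch) == 1) {
        (*chars)++;
        if (ch == '\n')
            (*lines)++;
    }
}
```

Each function reads to the end of the stream, so a regular file has to be repositioned (for example with rewind(fp)) between the two passes. That repositioning is not possible when the input is a pipe.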

A more efficient method is to use fgetc(), which reads from the specified file one character at a time. The challenge with this method is that you need a way to detect word boundaries. To do that, you should use the isspace() function from the ctype.h library. Using this method, you should be able to make a single pass through a file and perform all your counts.
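
As a rough illustration of the single-pass idea, the sketch below tracks whether the previous character was inside a word and counts a new word on each transition from whitespace to non-whitespace. The struct counts type and the count_stream name are hypothetical, chosen for this example only.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical container for the three counts. */
struct counts {
    long lines;
    long words;
    long chars;
};

/* Count lines, words, and characters in a single pass over fp. */
struct counts count_stream(FILE *fp)
{
    struct counts c = {0, 0, 0};
    int in_word = 0;  /* 1 while the previous character was part of a word */
    int ch;           /* int, not char, so EOF can be told apart from data */

    while ((ch = fgetc(fp)) != EOF) {
        c.chars++;
        if (ch == '\n')
            c.lines++;
        if (isspace(ch)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;  /* whitespace-to-word transition: a new word */
            c.words++;
        }
    }
    return c;
}
```

Since nothing is stored per word, this version has no buffer to overflow on very long words, and it works on stdin because the stream is read only once.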

Parsing Command Line Options

A full solution to this project must be able to handle command line options. For this, you could use the getopt.h library, or you can program a simpler parsing routine. This is your choice.

The parsing requirements for command line options are as follows:

  1. Command line options must come before the list of files. If a command line option appears within the list of files, it is treated like a file name.
  2. Command line options must begin with a - (tack/hyphen).
  3. You can report an error on unknown options.
  4. Once you reach the first command line argument without a -, you can assume the list of files has started.
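
The four rules above can be sketched as a small hand-written parser. This is one possible shape, not the required one: struct options and parse_options are hypothetical names, the wording of the error message is illustrative, and the sketch chooses to treat a lone - as the first file name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical flag set; each field is 1 if that count should be shown. */
struct options {
    int lines;
    int words;
    int chars;
};

/* Parse the leading options in argv. On success, returns the index of the
 * first file name (equal to argc when no files were given). On an unknown
 * option, prints an error to stderr and returns -1. */
int parse_options(int argc, char *argv[], struct options *opt)
{
    int i;

    opt->lines = opt->words = opt->chars = 0;

    /* Stop at the first argument that does not start with '-'.
     * A lone "-" is treated as a file name in this sketch. */
    for (i = 1; i < argc && argv[i][0] == '-' && argv[i][1] != '\0'; i++) {
        if (strcmp(argv[i], "-l") == 0) {
            opt->lines = 1;
        } else if (strcmp(argv[i], "-w") == 0) {
            opt->words = 1;
        } else if (strcmp(argv[i], "-c") == 0) {
            opt->chars = 1;
        } else {
            fprintf(stderr, "ERROR: unknown option '%s'\n", argv[i]);
            return -1;
        }
    }

    /* No options given: show all three counts, as in the examples. */
    if (!opt->lines && !opt->words && !opt->chars)
        opt->lines = opt->words = opt->chars = 1;

    return i;
}
```

Because the loop stops at the first non-option argument, an option that appears later, as in ./wc A.txt -l, is left in the file list and handled as a file name.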

Error Conditions

You are required to detect errors in user-provided input. There are two main categories:

  1. Unknown file name: You should report the error, but continue to proceed with processing remaining files.
  2. Unknown command line argument: You should report the error, stop processing, and return.

ALL ERROR REPORTING MUST BE DONE to stderr.
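
For condition (1), the report-and-continue behavior could look like the sketch below. The visit_files function is a hypothetical name, and the counting step is left as a comment, since only the error handling is being illustrated.

```c
#include <stdio.h>

/* Try to open each name in turn. A file that cannot be opened is reported
 * on stderr and skipped; processing continues with the next name.
 * Returns how many files were opened successfully. */
int visit_files(int n, char *names[])
{
    int opened = 0;

    for (int i = 0; i < n; i++) {
        FILE *fp = fopen(names[i], "r");
        if (fp == NULL) {
            /* Errors go to stderr so that 2>/dev/null hides them. */
            fprintf(stderr, "ERROR: file '%s' cannot be opened\n", names[i]);
            continue;
        }
        /* ... count and print the results for fp here ... */
        opened++;
        fclose(fp);
    }
    return opened;
}
```

Writing the message with fprintf(stderr, ...) rather than printf() is what keeps it out of the normal output when stderr is redirected.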

Here are some examples of condition (1) errors:

$ ./wc doesnotexist.txt
ERROR: file 'doesnotexist.txt' cannot be opened
$ ./wc doesnotexist.txt 2> /dev/null # no output, since stderr is redirected to /dev/null
$ ./wc dickens.txt doesnotexist.txt dickens.txt 
dickens.txt 19202 161009 936251
ERROR: file 'doesnotexist.txt' cannot be opened
dickens.txt 19202 161009 936251
total 38404 322018 1872502
$ ./wc dickens.txt doesnotexist.txt dickens.txt 2>/dev/null # error doesn't appear in the output
dickens.txt 19202 161009 936251
dickens.txt 19202 161009 936251
total 38404 322018 1872502

Here are some examples of condition (2) errors:

$ ./wc -p dickens.txt 
ERROR: unkown option '-p'
$ ./wc -p -l dickens.txt 
ERROR: unkown option '-p'
$ ./wc -l -p dickens.txt 
ERROR: unkown option '-p'

Important: a - by itself could be treated like a file name, so you could consider it a condition (1) error.

$ ./wc - dickens.txt 
ERROR: file '-' cannot be opened
dickens.txt 19202 161009 936251
total 19202 161009 936251

However, you could also consider it a condition (2) error and do a hard stop. You should choose what is natural for your program and stick with it.