Perl Lessons

A tutorial on the Perl programming language.

Contents

1. What Is Perl?
2. Course Requisites and Goals
3. Perl References & Resources
4. "Hello world!", and some more
5. Taste of Perl
6. Storing & Running Perl Programs
7. Variables and Data Structures
8. Literals & Operators
 9. Loops and I/O
10. Grade Book Example
11. System Interface
12. Matching
13. Substitutions & Parsing
14. Simple CGI
15. Testing Perl Programs
16. Common Goofs

For sample programs, there are links (named e.g. "this program") to plain text documents containing the programs, to let you download and try them more easily.

1. What Is Perl?

Perl is a programming language which can be used for a large variety of tasks. Both Perl interpreters and Perl documentation are freely available for Windows, Unix/Linux, and Macintosh, e.g. from the Perl.org site.

A typical simple use of Perl would be for extracting information from a text file and printing out a report or for converting a text file into another form. In fact, the name Perl was formed from the expression Practical Extraction and Report Language. But Perl provides a large number of tools for quite complicated problems, including system administration tasks, which can often be programmed rather portably in Perl.

Perl has powerful string-manipulation functions. On the other hand, it eclectically combines features and purposes of the C language and many command or scripting languages. For such reasons, Perl looks rather odd on first sight. But once you have learned Perl, you will be able to write programs in Perl much faster than in most other languages. This makes Perl especially suitable for writing programs which are used once only.

The following simple Perl program reads a text file consisting of people's names like "John Smith", each on a line of its own, and prints them out so that the first and second names are swapped and separated with a comma (e.g. "Smith, John").

while(<>) {
  @_ = split;   # store the fields of the line in @_
  print "$_[1], $_[0]\n"; }

As you can see, Perl is compact - and somewhat cryptic, until you learn some basic notations. (There will be a rather similar example, which is annotated in some detail.)

Perl has become popular for programming handlers for World Wide Web forms and generally as glue and gateway between systems, databases, and users.

Perl is typically implemented as an interpreted (not compiled) language. Thus, the execution of a Perl program tends to use more CPU time than a corresponding C program, for instance. On the other hand, computers tend to get faster and faster, and writing something in Perl instead of C tends to save your time.

Programs written in Perl are often called Perl scripts, especially in the context of CGI programming, whereas the term "the perl program" refers to the system program named perl for executing Perl scripts. (What, confused already?)

If you have used Unix shell scripts or awk or sed or similar utilities for various purposes, you will find that you can normally use Perl for those and many other purposes, and the code tends to be more compact. And if you haven't used such utilities but have started thinking you might have need for them, then perlhaps what you really need to learn is Perl instead of all kinds of futilities.

Perl is relatively stable; the current major version, Perl 5, was released in 1994. Extensions to the language capabilities are mainly added through the construction of modules, written by different people and made available to other Perl programmers. For major general-purpose Perl applications, particularly CGI scripts and client or server applications, see CPAN documentation.

2. Course Requisites and Goals

This course presumes that you have access to a system with Perl 5 installed. You are also assumed to know, or to find out, how to invoke the Perl interpreter on the system you use. (Typically, this means giving, on a command prompt, a command of the form perl filename.)

By completing this course and its homework, you should be able to:

  1. Locate reference materials and other resources related to Perl;
  2. Express fundamental programming constructs such as variables, arrays, loops, subroutines and input/output in Perl;
  3. Understand several concepts relatively idiosyncratic to Perl, such as associative arrays, Perl regular expressions, and system interfaces;
  4. Program in Perl for data manipulation, file maintenance, packaging or interfacing system facilities, and for "Common Gateway Interface" (CGI) Web applications.

To keep this a short course, we won't explain object-oriented concepts and some other facilities appropriate for large projects.

Aiming at brevity, this course is written so that you need to use your intuition at times, guessing what some constructs might mean. The reason is that learning a programming language systematically, rigorously starting from all the basic concepts and constructs in detail, would take a much longer course. Naturally, you need to learn to consult references; actual programming shouldn't be based on guesswork.

Perl, perhaps more than any other computer language, is full of alternative ways to do the same thing; we tend to show only one or two. We will try to stimulate by examples of useful bits of code, results, and questions. Turn to the reference materials for further explanation.

3. Perl References and Resources

Programming Perl
The definitive reference, by Larry Wall, Tom Christiansen and Jon Orwant. Published by O'Reilly; see publisher's information.
The Perl.org site
Information about all aspects of Perl. Includes tutorials, and links for downloading Perl implementations.
Perl Language Reference
Part of the above-mentioned site. Contains a tutorial, a reference manual, and more.
The Perl FAQ (perlfaq)
Answers a large number of Frequently Asked Questions. Check this first whenever you have a problem with Perl! It is divided into sections.
CPAN
The Comprehensive Perl Archive Network. Tens of thousands of modules.
Perl 5 Desktop Reference
An excellent quick reference, by Johan Vromans, for an experienced Perl programmer who occasionally cannot remember the name of a function or its exact syntax.

Homework!

Learn how to search the Perl 5 online manual using your favorite World Wide Web browser! Find an alternative expansion of the word "Perl" as an acronym involving the word "eclectic".

4. "Hello world!", and some more Perl

When studying a new programming language, the crucial point is to learn to write and execute a program which prints Hello world! (or some other fixed string, but that's the traditional one). The reason is that in order to be able to do this, you need to know quite a lot of simple things about the language itself, and language manuals might not present them compactly enough. Moreover, you need to find out how to work with the language in your particular installation, e.g. how to start the language compiler or interpreter.

"Hello world!"

So let us get started:

lk-hp-23 perl 195 % cat >hello
print "Hello world!\n";
lk-hp-23 perl 196 % perl hello
Hello world!
lk-hp-23 perl 197 % 

Explanations:
  1. I used the cat command to create a file named hello and containing a very simple Perl script. Normally one uses one's favorite editor (such as Emacs or jEdit or Notepad) to create Perl scripts, of course. (You might wish to use names ending with .pl for your Perl programs, partly in order to be able to see which of your files are such programs. If you use Emacs, there will be the additional benefit that Emacs automatically enters Perl mode (if installed) when a file with suffix .pl is edited.)
  2. In this case the Perl program consists of one line only. It should be self-explanatory except for the use of the notation \n which stands for newline. In Perl strings, many control characters can be represented in this way, using the backslash character \ and a letter. (This is the same convention which is used in the C programming language.)
  3. I executed the script by giving the perl command, which invokes the Perl interpreter, and I gave the name of the file as a command argument. (In your system, the command name might be different from perl, but usually it isn't.)

In Perl (as in C), a simple statement is usually terminated by a semicolon. See rules for simple statements in the manual.

Using simple variables

The following Perl script illustrates the use of simple (scalar) variables. It also introduces some other basic features of Perl.

The script prints out its input so that each line begins with a running line number:

$line = 1;
while (<>) {
  print $line, " ", $_;
  $line = $line + 1; }

The scalar variable $line is the line counter. It is initialized to 1 at the beginning, and it is incremented by 1 within a loop which processes one input line at a time.

The loop construct is of the form

while (<>) {
  process one line of input }
and although it looks cryptic at first sight, it is really very convenient to use. You need not worry about actual input operations; just use the construct shown above, and use the predefined variable $_ to refer to the input line.

The print statement in our example contains three arguments, one for getting the line number printed, one for getting a blank printed, and one for getting the input line printed. We do not have an argument for getting a newline printed, since the value of $_ is the entire input line, including the trailing newline.

In fact, you could make your code even shorter: you could write the script as

$line = 1;
while (<>) {
  print $line++, " ", $_; }
Here the statement contains $line++ instead of just $line, since in Perl (as in C) you can increment a variable (after its old value has been used) by appending the operator ++ to it.

You might wish to have the line numbers right-adjusted, e.g. each in a fixed field of five characters, padded with blanks on the left. This would be pretty easy, if you know the C language output formatting tools. You could just replace the print statement with

  printf "%5d %s", $line++, $_;
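To see what the %5d format does without preparing an input file, here is a small sketch that numbers a made-up list of lines held in memory. The names @sample and $report are chosen here for illustration, and sprintf is used as the variant of printf that returns the formatted string instead of printing it.

```perl
# Two made-up "input lines", each ending with a newline like real input.
@sample = ("first line\n", "second line\n");
$line = 1;
$report = '';
foreach (@sample) {
    # %5d right-adjusts the number in a field of five characters.
    $report .= sprintf("%5d %s", $line++, $_);
}
print $report;
```

Each line number should come out preceded by four blanks, since a one-digit number fills only one of the five positions.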

Input to Perl scripts

Where does a Perl script read its input from? By default, i.e. in the absence of any specification of input source, the input comes from the so-called standard input stream (often called stdin). By default this means the user's keyboard.

Normally you want your script to read input from a file. Simply write the name of the file as a command-line argument, i.e. append the name to the command you use to start the script. Thus, for example, if you had written our simple script (the shorter version of it) into a file named lines, you could test it by using it as its own test data (do you find this confusing?) as follows:

lk-hp-23 perl 251 % perl lines lines
1 $line = 1;
2 while (<>) {
3   print $line++, " ", $_; }
lk-hp-23 perl 252 % 

You can also write several file names as command-line arguments, e.g.

perl lines foo bar zap
which would mean that the script lines takes as input the contents of files foo, bar, and zap as if you had concatenated the contents into a single file and given its name as argument.

5. Taste of Perl

Quite useful Perl programs can be short. Suppose we want to change the same text in many files. Instead of editing each possible file or constructing some cryptic find, awk, or sed commands, you could issue a single command on Unix:

Example: Amazing Perl One-Liner That Substitutes Text In Multiple Files

      perl  -e 's/gopher/World Wide Web/gi'  -p  -i.bak  *.html

This command, issued at the Unix prompt, executes the short Perl program specified in single quotes. This program consists of one Perl operation: it replaces the word "gopher" with the phrase "World Wide Web" (globally, ignoring case). The command-line options make Perl run the program on each file ending in .html in the current directory: -p loops over the input lines and prints each one after the program has processed it, and -i.bak edits each file in place, first saving the original of each file blah.html as blah.html.bak. The book Programming Perl lists additional handy one-liners.
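The options -p and -i.bak can also be spelled out inside a script, which may make the one-liner look less magical. The following is a sketch, not the exact code Perl generates: it sets the special variable $^I (the in-place edit extension) and loops with while (<>). The scratch file name and its two lines of text are invented here so that the sketch can be run anywhere.

```perl
# Create a small scratch file to work on (hypothetical name and contents).
$demo = "demo_gopher_$$.html";
open(DEMO, ">$demo") or die "Can't create $demo: $!";
print DEMO "Gopher is old.\nUse gopher less.\n";
close(DEMO);

{
    local @ARGV = ($demo);   # the files that would be named on the command line
    local $^I   = '.bak';    # what -i.bak sets: edit in place, keep a backup
    while (<>) {             # what -p supplies: read every line ...
        s/gopher/World Wide Web/gi;
        print;               # ... and print it, here into the rewritten file
    }
}

# Read the file back to see the effect, then remove the scratch files.
open(DEMO, "<$demo") or die "Can't read $demo: $!";
@result = <DEMO>;
close(DEMO);
print @result;
unlink($demo, "$demo.bak");
```

Both lines of the scratch file should come back with "World Wide Web" in place of "gopher", whatever the original capitalization.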

The amazing one-liner relies on the behavior of Unix command language processors (shells), which expand a wildcard notation like *.html into a list of file names, before invoking the script. Thus, on other systems, you need to handle such expansion in the Perl script, as in the following example.

A Multi-File Text Replacement Script

The following script is a more universal variant of the amazing one-liner discussed above. It works fine in Windows environments, too, because it internally loops through all files with names ending with .html. (As a minor modification, this script uses the name blah.bak rather than blah.html.bak for the backup files.)

while (<*.html>) {
    $oldname = $_;
    open(OLD, "<$oldname")
        or die "Can't open input file $oldname: $!";
    s/\.html$/.new/;
    $newname = $_;
    open(NEW, ">$newname")
        or die "Can't open output file $newname: $!";
    while (<OLD>) {
        s/gopher/World Wide Web/gi;
        print NEW; }
    close(OLD);
    close(NEW);
    $backupname = $oldname;
    $backupname =~ s/\.html$/.bak/;
    # Delete an old backup file only if there is one.
    if (-e $backupname) {
        unlink($backupname)
            or die "Can't delete old backup file $backupname: $!";
    }
    # "or" binds more loosely than "||", so die really tests the call.
    rename($oldname, $backupname)
        or die "Can't rename $oldname to $backupname: $!";
    rename($newname, $oldname)
        or die "Can't rename $newname to $oldname: $!";
}

Example: Global Substitution, The Scenic Route

The following script is yet another variation of our theme. It may look structurally more familiar to those accustomed to "classic" procedural programming (in e.g. C or Pascal). Notice that this, as well as the original one-liner, does not handle any wildcard expansion in the script itself.

Note: Anything following a number sign (#) on a line in a Perl program is a comment: ignored by a Perl interpreter, hopefully useful to a human reader of the code.

    # File:  go2www
    # This Perl program in classic programming style changes
    # the string "gopher" to "World Wide Web" in all files
    # specified on the command line.
    $original='gopher';
    $replacement="World Wide Web";
    $nchanges = 0;
    # The input record separator is defined by Perl global
    # variable $/.  It can be anything, including multiple
    # characters.  Normally it is "\n", newline.  Here, we
    # say there is no record separator, so the whole file
    # is read as one long record, newlines included.
    undef $/;

    # Suppose this program was invoked with the command
    #     go2www ax.html  big.basket.html  candle.html
    # Then builtin list @ARGV would contain three elements
    # ('ax.html', 'big.basket.html', 'candle.html')
    # These could be accessed as $ARGV[0] $ARGV[1] $ARGV[2]

    foreach $file (@ARGV) {
        if (! open(INPUT,"<$file") ) {
            print STDERR "Can't open input file $file\n";
            next;
        }

        # Read input file as one long record.
        $data=<INPUT>;
        close INPUT;

        if ($data =~ s/$original/$replacement/gi) {
            $bakfile = "$file.bak";
            # Delete old backup file if existent
            unlink $bakfile;
            # Abort if can't backup original or output.
            if (! rename($file,$bakfile)) {
                die "Can't rename $file $!";
            }
            if (! open(OUTPUT,">$file") ) {
                die "Can't open output file $file\n";
            }
            print OUTPUT $data;
            close OUTPUT;
            print STDERR "$file changed\n";
            $nchanges++;
        }

        else {  print STDERR "$file not changed\n"; }
    }
    print STDERR "$nchanges files changed.\n";
    exit(0);

Questions

  1. What do you guess that the exclamation mark ! means, as in:
                if (! open(OUTPUT,">$file") ) {
                    die "Can't open output file $file\n";
                }
  2. What does the greater than sign > probably mean here? Compare with open(INPUT ...).
  3. What does die do?
  4. Some languages use "IF ... THEN DO ... END; ELSE IF ... THEN DO ... END". How is this notated in Perl?
  5. What does $nchanges++ do?

The Perl Creed is, "There is more than one way!" This noble freedom of expression however results in the first of the four Perl Paradoxes: Perl programs are easy to write but not always easy to read. For example, the following lines are equivalent!

    if ($x == 0) {$y = 10;}  else {$y = 20;}
    $y = $x==0 ? 10 : 20;
    $y = 20;  $y = 10 if $x==0;
    unless ($x == 0) {$y=20} else {$y=10}
    if ($x)  {$y=20} else {$y=10}
    $y = (10,20)[$x != 0];
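A quick way to convince yourself is to run all six forms for a zero and a nonzero $x and collect the results. This is only a sketch; the array @results, the hash %summary, and the test values 0 and 7 are chosen here for illustration.

```perl
foreach $x (0, 7) {
    @results = ();
    # Each form sets $y; push appends the value to @results.
    if ($x == 0) {$y = 10;}  else {$y = 20;}
    push(@results, $y);
    $y = $x==0 ? 10 : 20;
    push(@results, $y);
    $y = 20;  $y = 10 if $x==0;
    push(@results, $y);
    unless ($x == 0) {$y=20} else {$y=10}
    push(@results, $y);
    if ($x)  {$y=20} else {$y=10}
    push(@results, $y);
    $y = (10,20)[$x != 0];
    push(@results, $y);
    $summary{$x} = join(' ', @results);
    print "x=$x: $summary{$x}\n";
}
```

The first output line should show six 10s and the second six 20s.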

6. Storing and Running Perl Programs

Homework: Hello

To test your ability to store and run a Perl program, enter and execute something like this classic code:
    if ($#ARGV >= 0) { $who = join(' ', @ARGV); }
    else { $who = 'World'; }
    print "Hello, $who!\n";

Here's How:

First you need to make sure that the directory containing the Perl interpreter is in your search path as set up at the command level. Please consult operating system specific information and local documents for details on this.

Let us assume that the above lines are stored in a Unix file ~/bin/hello. (That's in your home directory, subdirectory bin, file hello.) You can then run the program by entering a command like one of the following:

    perl ~/bin/hello
    perl ~/bin/hello Citizens of Earth
    perl hello

(The last one works only if you're in the ~/bin directory.)

Making it command-like on Unix

If you expect to use this program a lot and want to execute it as a command on Unix, then you need to do five things.

  1. Insert a line at the very beginning of the program file, beginning with #! followed by the full pathname of the Perl interpreter (the perl command), typically something like #!/usr/local/bin/perl or #!/usr/bin/perl. You may append command-line options like -w (warn about possible inconsistencies), to that line.
  2. Set the execute permissions of the program file. To make the file executable (and readable and writable) by only yourself, use a Unix command like:
        chmod 700 ~/bin/hello
    To make it executable and readable by all, enter a Unix command like the following:
        chmod a+rx ~/bin/hello
    You may also need to use chmod a+x on the directories ~ and ~/bin. See man chmod for details and the security implications.
  3. Edit your file ~/.cshrc or ~/.login to make directory ~/bin part of the path Unix searches for executables, with a line like this:
        set path = ($path ~/bin)
  4. This takes effect the next time you start a default tcsh or csh shell (.cshrc file) or login (.login file). If you want it to take effect immediately, enter the above set path command at the Unix prompt or execute your .cshrc or .login file with the source command. If you are using sh, bash, ksh or some other shell, alter ~/.profile or some other file to set the path at login.
  5. If a program you want to execute has just been newly created, then issue the csh/tcsh command rehash to rescan the path.

If you perform (1)-(5), then you can execute your program via a command like this:

    hello

7. Variables and Data Structures

The rest of these notes will refer to the Perl manual, highlighting and expanding on important points. In this section, the relevant part of the manual is section perldata.

Variables in Perl

The value of a variable in Perl can be a number or a string, among other things. Variables are not typed. You can, for example, assign a string to a variable and later a number to the same variable. Variables are not declared at all. You can simply start using a variable.

An attempt to use an uninitialized variable causes a zero or an empty string or the truth value false (depending on the context) to be used. However, using the command-line switch -w you can ask the Perl interpreter to issue warnings, such as reporting uses of undefined values.
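The three contexts can be tried out in a few lines. In this sketch the variable $never_set is a made-up name that is deliberately given no value anywhere.

```perl
# $never_set is never assigned, so its value is undefined.
$count = $never_set + 1;                  # used as a number: counts as 0
$text  = "[" . $never_set . "]";          # used as a string: counts as ""
$truth = $never_set ? "true" : "false";   # used as a condition: false
print "$count $text $truth\n";            # prints: 1 [] false
```

Running the same lines with perl -w should additionally produce warnings about the use of an uninitialized value.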

Perl's Data Structures

In addition to simple (scalar) variables and constants Perl has two kinds of data structures: arrays (lists), and associative arrays ("hashes"). Scalar variable names always begin with the dollar sign, e.g. $myvar. (The simple program listing a file with line numbers used a scalar variable for keeping track of the line number.) Names for arrays (and array slices) always begin with the commercial at sign, e.g. @myarray. Names for hashes always begin with the percent sign, e.g. %myhash.

Perl allows combinations of these, such as lists of lists and associative arrays of lists. (See The Perl Data Structures Cookbook for illustrations of such advanced topics.)

Scalars

Scalars can be numeric or character as determined by context:

    123  12.4  5E-10  0xff (hex)  0377 (octal)

    'What you $see is (almost) what \n you get'   'Don\'t Walk'

    "How are you?"  "Substitute values of $x and \n in \" quotes."

    `date`   `uptime -u`   `du -sk $filespec | sort -n`

    $x    $list_of_things[5]     $lookup{'key'}

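Different delimiters around a string have different effects: single quotes take the text literally (only \' and \\ are special), double quotes substitute the values of variables and interpret backslash notations like \n, and backquotes run the text as a system command and yield its output.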
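The difference between single and double quotes is easy to see by writing the same text both ways. This is a small sketch; the variable names $single and $double are chosen here for illustration, and backquotes are left out because their result depends on the system.

```perl
$x = 7;
$single = 'x is $x\n';    # taken literally: dollar sign, backslash and all
$double = "x is $x\n";    # $x becomes 7, and \n becomes a newline
print $single, "\n";      # prints: x is $x\n
print $double;            # prints: x is 7
print length($single), " ", length($double), "\n";   # prints: 9 7
```

The lengths differ because the single-quoted string still contains the two characters $x and the two characters \n, whereas the double-quoted one contains one digit and one newline.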

Arrays

Arrays (also called lists) consist of sequentially-arranged scalars:

    ('Sunday', 'Monday', 'Tuesday', 'Wednesday',
        'Thursday', 'Friday', 'Saturday')

    (13,14,15,16,17,18,19)   equivalent to  (13..19)

    (13,14,15,16,17,18,19)[2..4]  equivalent to  (15,16,17)

    @whole_list

A notation like (2, 3, 5, 7) can be called an array constructor. It can be assigned to an array variable in order to initialize it:

@primes = (2, 3, 5, 7);

Associative arrays

Associative arrays (also called hashes) resemble arrays but can be indexed with strings:

    $DaysInMonth{'January'} = 31;   $enrolled{'Joe College'} = 1;

    $StudentName{654321} = 'Joe College';

    $score{$studentno,$examno} = 89;
        
    %whole_hash

Indexing

If you have an array, e.g. @myarr, you can form indexed variables by appending an index in brackets (as in the C language) but changing @ into $. The reason for the change is that the indexed variable is a scalar. Indexes are integer numbers, starting with 0 (as in C, but unlike in many other languages). For example, $myarr[5] denotes the 6th element of the array @myarr. And if you have assigned

@wday = ('Sun','Mon','Tue','Wed','Thu','Fri','Sat');

then $wday[1] equals 'Mon'. (Negative subscripts count from the end so that e.g. $wday[-1] would be 'Sat'.)

Associative arrays are indexed with curly braces enclosing a string. $whatever, @whatever, and %whatever are three different variables.

You can also form array slices, for example @myvar[5..10], which is an array (therefore, denoted using @) consisting of those components of @myvar which have an index between 5 and 10, inclusively.

Hashes, on the other hand, can be indexed e.g. with strings, since indexing method is different. Conceptually, a Perl interpreter performs a search from a hash, using the index as the search key. For hashes, the index is in braces, e.g. $myhash{'foobar'}. Notice that in this case, too, the indexed variable is a scalar and therefore begins with $.

For example, the predefined hash name %ENV denotes the collection of so-called environment variables. Thus, you could refer e.g. to the value of the environment variable HOST with the expression $ENV{"HOST"}.
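The indexing rules above can be tried together in one short script. It is a sketch only; @workdays, $second and $last are names made up for the illustration, and the hash %DaysInMonth is borrowed from the earlier examples.

```perl
@wday = ('Sun','Mon','Tue','Wed','Thu','Fri','Sat');
$second   = $wday[1];       # 'Mon': indexes start at 0
$last     = $wday[-1];      # 'Sat': negative indexes count from the end
@workdays = @wday[1..5];    # a slice is an array, hence the @

$DaysInMonth{'January'} = 31;   # a hash element is a scalar, hence the $

print "$second $last\n";        # prints: Mon Sat
print "@workdays\n";            # prints: Mon Tue Wed Thu Fri
print "January has $DaysInMonth{'January'} days\n";
```

Notice that an array inside a double-quoted string is printed with its elements separated by blanks.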

Example: using an array for splitting lines into words

In Perl, you can easily split data into fields without coding the details. You simply specify what you want, using the built-in split function, optionally with some arguments.

For instance, the statement

@_ = split;

first splits the current input line into blank-separated fields and then assigns the fields to components of the predefined array variable @_. (Older versions of Perl made this assignment implicitly when split was written alone as a statement; Perl 5.12 and later do not, so the assignment is written out here.) You can then access the fields using indexed variables. The special variable $#_ contains information about the number of fields: the value of that variable is the number of fields minus one. (More generally, for any array variable @a, the variable $#a contains the last index of the array.)

Assume, for example, that you have some data where each line consists of blank-separated items (which might be strings or numbers) and you want to write a Perl script which picks up the second item from each line. (Such filtering is often needed to extract useful information from a large data file.) This is simple:

while (<>) {
  @_ = split;   # store the fields of the line in @_
  print $_[1], "\n"; }

Notice that you must use an index value of 1 to get the 2nd field, since array indexing begins at 0 in Perl.
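The same filtering can be tried without an input file by giving split its arguments explicitly. In this sketch the two records in @data are made up, the fields go into an array called @fields (a name chosen here), and the pattern ' ' asks for splitting at runs of blanks.

```perl
# Two made-up records; the second has two blanks between its first fields.
@data = ("alpha 10 x\n", "beta  20 y\n");
foreach $record (@data) {
    @fields = split(' ', $record);      # split $record at runs of blanks
    push(@second, $fields[1]);          # collect the 2nd field of each record
    print "$fields[1] (last index $#fields)\n";
}
```

This should print 10 and 20, each followed by the last index 2, since every record has three fields and the trailing newline does not count as a field.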

Name Conventions

$identifier    simple (scalar) variable
@identifier    list (normal array)
%identifier    associative array
&identifier    subroutine or function
IDENTIFIER     filehandle

Every variable kind (scalar, array, hash) has its own namespace. This means that $foo and @foo are two different variables. It also means that $foo[1] is a part of @foo, not a part of $foo. This may seem a bit weird, but that's okay, because it is weird. Notice, in particular, that there are two important predefined variables $_ and @_ which are quite distinct from each other, and e.g. $_[2] is a component of @_.

The case of letters is significant in variable names (as in Unix commands and in the C language), e.g. $foo and $Foo are distinct variables.

More examples of arrays and hashes

@days = (31,28,31,30,31,30,31,31,30,31,30,31);
               # A list with 12 elements.

$#days         # Last index of @days; 11 for above list

$#days = 7;    # shortens or lengthens list @days to 8 elements

@days          # ($days[0], $days[1],... )

@days[3,4,5]   # = (30,31,30)

@days{'a','c'} # same as ($days{'a'},$days{'c'})

%days          # (key1, value1, key2, value2, ...)
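Some of these notations can be checked by printing the results. The sketch below relies on the fact that assigning an array to a scalar yields the number of its elements; the names $before, $after, @spring, %count and @picked are made up for the illustration.

```perl
@days = (31,28,31,30,31,30,31,31,30,31,30,31);
$before = @days;            # number of elements: 12
$#days = 7;                 # keep only the elements with indexes 0..7
$after = @days;             # number of elements is now 8

@spring = @days[3,4,5];     # array slice: (30,31,30)

%count = ('a', 1, 'b', 2, 'c', 3);
@picked = @count{'a','c'};  # hash slice: same as ($count{'a'},$count{'c'})

print "$before $after\n";   # prints: 12 8
print "@spring\n";          # prints: 30 31 30
print "@picked\n";          # prints: 1 3
```

A slice of a hash begins with @ and not with %, because its value is a list of scalars.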

Special (predefined) variables

If a letter or underscore is the first character after the $, @, or %, the rest of the name may contain letters, digits and underscores. If the first character is a digit, the rest must be digits, too. Perl has several dozen special (predefined) variables, most of them recognized from their second character being non-alphanumeric. For example, $/ is the input record separator, newline "\n" by default. See section Special Variables in Perl 5 Reference Guide for a handy list.

The variable $_ is presumed (defaulted) by Perl in many contexts when needed variables are not specified. Thus:

while (<STDIN>) {     # each pass assigns a record from filehandle
                      # STDIN to $_ (only as a while condition;
                      # elsewhere write $_ = <STDIN>;)
    print;            # prints the current value of $_
    chomp;            # removes the trailing newline from $_
    @things = split;  # parses $_ into white-space delimited
                      # words, which become successive
                      # elements of list @things.
}

$_, $/, $1, and other implicit variables contribute to Perl Paradox Number Two: What you don't see can help you or hurt you.

Subroutines and functions

The words "subroutine" and "function" are used interchangeably when discussing Perl. There really is no difference, but often a subprogram is called "function" if it returns a value and "subroutine" if it does not. On the other hand, quite often "function" means a builtin subprogram whereas "subroutine" means a subprogram which is defined in a Perl program.

Subroutines/functions are referenced with names containing an initial &, which is optional if reference is obviously a subroutine/function such as following the sub, do, and sort directives. An example of a simple function (which returns the square of its argument), and a sample invocation:

    sub square { return $_[0] ** 2; }
    print "5 squared is ", &square(5);

Inside a function, the special variable @_ contains the list of actual arguments, so $_[0] refers to the first argument (which is the only one in the example above).
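A function can take any number of arguments by looping over @_. The following sketch computes an average; the function name average and the variables $sum and $value are made up here, and like the other variables in these notes they are ordinary global variables.

```perl
sub average {
    $sum = 0;
    foreach $value (@_) {           # visit every actual argument
        $sum = $sum + $value;
    }
    return $sum / ($#_ + 1);        # $#_ is the last index, so add 1
}
$result = &average(2, 4, 9);
print "The average is $result\n";   # prints: The average is 5
```

The expression $#_ + 1 is the number of arguments, so the function would fail with a division by zero if called with no arguments at all.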

Filehandles

Filehandles don't start with a special character, and so as to not conflict with reserved words are most reliably specified as uppercase names: INPUT, OUTPUT, STDIN, STDOUT, STDERR, etc.

8. Literals and Operators

Example: Numbers and Characters

print '007',' has been portrayed by at least ', 004, ' actors. ';
print 7+3, ' ', 7*3, ' ', 7/3, ' ', 7%3, ' ', 7**3, ' ';
$x = 7;
print $x;
print '   Doesn\'t resolve variables like $x and backslashes \n. ';
print "Does resolve $x and backslash\n";
$y = "A line containing $x and ending with line feed.\n";
print $y;
$y = "Con" . "cat" . "enation!\n";
print $y;

This produces:

007 has been portrayed by at least 4 actors. 10 21
2.33333333333333 1 343 7   Doesn't resolve variables like $x and
backslashes \n. Does resolve 7 and backslash
A line containing 7 and ending with line feed.
Concatenation!

In fact, most of the output "runs together", into one line. (The long line has been split above to keep the width of this document reasonable.) Can you guess why?

Example: Comparisons

The following example illustrates, in addition to comparisons, the << mechanism which is very useful when a program has to write out a multi-line string (e.g. in conjunction with CGI scripts).

    $x = 'operator';
    print <<THATSALL;
    A common mistake:  Confusing the assignment $x =
    and the numeric comparison $x ==, and the character
    comparison $x eq.
    THATSALL
    $x = 7;
    if ($x == 7) { print "x is $x\n"; }
    if ($x = 5)  {
        print "x is now $x, ",
        "the assignment is successful.\n";
    }
    $x = 'stuff';
    if ($x eq 'stuff') {
        print "Use eq, ne, lt, gt, etc for strings.\n";
    }

This produces:

    A common mistake:  Confusing the assignment operator =
    and the numeric comparison operator ==, and the character
    comparison operator eq.
    x is 7
    x is now 5, the assignment is successful.
    Use eq, ne, lt, gt, etc for strings.
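The choice between numeric and string operators matters because the same two values can compare differently. The sketch below shows this with made-up values; the variable names are chosen here only for the illustration.

```perl
$n1 = "10";
$n2 = "10.0";
$num_equal = ($n1 == $n2)   ? "yes" : "no";   # 10 equals 10.0 as numbers
$str_equal = ($n1 eq $n2)   ? "yes" : "no";   # "10" and "10.0" differ as text
$num_less  = (9 < 10)       ? "yes" : "no";   # 9 is the smaller number
$str_less  = ("9" lt "10")  ? "yes" : "no";   # but "9" sorts after "10" as text
print "$num_equal $str_equal $num_less $str_less\n";   # prints: yes no yes no
```

String comparison proceeds character by character, so "9" comes after "10" simply because the character 9 comes after the character 1.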

Example: Ordinary Arrays

    @stuff = ('This', 'is', 'a', 'list.');
    print "Lists and strings are indexed from 0.\n";
    print "So \$stuff[1] = $stuff[1], ",
         "and \$#stuff = $#stuff.\n";
    print @stuff,"\n";
    print join('...',@stuff),"\n";
    splice(@stuff, 3, 0, ('fine', 'little'));
    print join('...',@stuff),"\n";

This produces:

    Lists and strings are indexed from 0.
    So $stuff[1] = is, and $#stuff = 3.
    Thisisalist.
    This...is...a...list.
    This...is...a...fine...little...list.

Homework: Validate a date.

The following program prompts for a date in a numeric (ISO 8601) format and reads and parses it.

    print "Enter a date numerically:  year-month-day\n";
    $_ = <STDIN>;
    chomp;
    ($year,$month,$day) = split('-');

Complete this program so that it checks whether the date is valid. Print an error message if the month is not valid. Print an error message if the day is not valid for the given month (31 is ok for January but not for February). See if you can avoid using conditional constructs (if, unless, ?:, ...) and use data structures instead.

Approach this incrementally. On the first draft, assume that the user enters three numbers separated by hyphens and that February has 28 days. Subsequent refinements should account for bad input and leap year. Finally, find a Perl builtin function that converts a date to system time, and see how to use that to validate time data generally.

Homework: Play with associative arrays.

Start with a few assignments like:

    $name{12345} = 'John Doe';
    $name{24680} = 'Jane Smith';
Print these scalars. What is the value of an associative array element that has never been assigned? What happens if you assign an associative array to a scalar? What happens if you assign an associative array to a normal array?
    $blunk = %name;
    @persons = %name;
    print '$blunk=',$blunk,', @persons=',
        join(', ',@persons),"\n";
What happens if you assign a normal array to an associative array?

9. Loops and I/O

Control structures

Perl has a rich set of control structures. See section perlsyn in the manual for the full list. Theoretically, and very often practically too, you can use just if statements for branching and while statements for looping.

Within control structures you specify the actions to be conditionally or repeatedly executed as blocks. A block is simply a sequence of statements surrounded by braces. Notice that braces {} are always required (unlike in C).

The simplest if statement is of the form

if(expression)block

which means that the expression is evaluated, and if the result is true, the block is executed.

For example, the statement if($i < 10) {$j = 100;} sets the value of $j to 100 if the value of $i is less than 10. As mentioned above, braces are required (even if there is a single statement within them), and the parentheses around the condition expression are obligatory, too.

A two-branch if statement is of the form

if(expression)block1 else block2

which means that the expression is evaluated, and if the result is true, block1 is executed, otherwise block2 is executed.
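
As a minimal sketch of these forms, the following classifies a number. It also uses the elsif keyword, which chains a further test onto an if statement; the subroutine name classify and the limits 10 and 100 are made up for illustration.

```perl
use strict;
use warnings;

# Classify a number with if / elsif / else.
# Braces are required around every block, even a one-statement block.
sub classify {
    my ($i) = @_;
    my $label;
    if ($i < 10) {
        $label = 'small';
    } elsif ($i < 100) {
        $label = 'medium';
    } else {
        $label = 'large';
    }
    return $label;
}

foreach my $n (5, 50, 500) {
    print "$n is ", classify($n), "\n";
}
```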

The while statement is of the form

while(expression)block

which means that the expression is evaluated, and if the result is true, the block is executed, then the expression is re-evaluated and the process is repeated until the expression evaluates to false.

A simple example of using the while statement is the following script, which splits input lines into fields (in a manner described above) and prints out the fields in reverse order.

while (<>) {
  @_ = split;  # newer Perl versions no longer fill @_ implicitly
  $i = $#_;
  while($i >= 0) {
    print $_[$i--], " "; }
  print "\n";
}
The control in the (inner) while loop is based on an auxiliary variable $i, which is initialized to the index of the last field and decremented (using the C-style embedded decrement operator --) within the loop until it drops below zero, i.e. until all fields, including the one at index zero, have been processed. The operator >= has the obvious meaning 'is greater than or equal to'.
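
The same loop can be tried without typing input if it is wrapped in a subroutine. This is only a sketch: the name reverse_fields and the named array @fields are assumptions made here, used in place of the default array of the script above.

```perl
use strict;
use warnings;

# Return the whitespace-separated fields of a line in reverse order.
sub reverse_fields {
    my ($line) = @_;
    my @fields = split ' ', $line;   # same splitting rule as a bare split
    my $i = $#fields;                # index of the last field
    my @out;
    while ($i >= 0) {
        push @out, $fields[$i--];    # use $i, then decrement it
    }
    return join ' ', @out;
}

print reverse_fields("John Smith Jr"), "\n";   # prints "Jr Smith John"
```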

The for statement is of the form

for(initialization;condition;updating)block

If you are familiar with the for statement in C, you probably want to use for in Perl too, but you might as well use just while as the loop construct. However, in Perl there is also a foreach statement, which will be illustrated by the next example (and was already used in a previous example).

Example: Command Line Values and Iterative Loops

    print "$#ARGV is the subscript of the ",
        "last command argument.\n";
    # Iterate on numeric subscript 0 to $#ARGV:
    for ($i=0; $i <= $#ARGV; $i++) {
        print "Argument $i is $ARGV[$i].\n";
    }
    # A variation on the preceding loop:
    foreach $item (@ARGV) {
        print "The word is:  $item.\n";
    }
    # A similar variation, using the
    # "Default Scalar Variable" $_ :
    foreach (@ARGV) {
        print "Say:  $_.\n";
    }
Demonstration run of this program
    > perl example5.pl Gooood morning, Columbia!
    2 is the subscript of the last command argument.
    Argument 0 is Gooood.
    Argument 1 is morning,.
    Argument 2 is Columbia!.
    The word is:  Gooood.
    The word is:  morning,.
    The word is:  Columbia!.
    Say:  Gooood.
    Say:  morning,.
    Say:  Columbia!.

Example: Standard I/O

The following program illustrates simple interaction with user.

print STDOUT "Tell me something: ";
while ($input = <STDIN>) {
   print STDOUT "You said, quote: $input endquote\n";
   chomp $input;
   print STDOUT "Without the newline:  $input endquote\n";
   if ($input eq '') { print STDERR "Null input!\n"; }
   print STDOUT "Tell me more:\n";
}
print STDOUT "That's all!\n";

Note 1: The while statement's condition is an assignment statement: assign the next record from standard input to the variable $input. On end of file, this will assign not a null value but an "undefined" value. On keyboard input, end of file can be simulated in different ways on different systems; for example, on Unix the method is traditionally control-D, while on DOS it is control-Z followed by a newline (enter). An undefined value in the context of a condition evaluates to "false". So the while ($input = <STDIN>) does three things: gets a record, assigns it to $input, and tests whether $input is undefined. In other contexts, Perl treats an undefined variable as null or zero. Thus, if $i is not initialized, $i++ sets $i to 1. Perl Paradox Number Three: Side effects can yield an elegant face or a pain in the rear.

Note 2: Data records are by default terminated by a newline character "\n", which in the above example is included as the last character of the variable $input. The chomp function removes the trailing record separator (if present); the separator is defined in the special variable $/ and is a newline by default. (The chomp function was introduced in Perl 5. Old programs often use the less safe function chop, which simply removes the last character, whatever it is.)
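
The difference between chomp and chop shows up when the data has no trailing newline. A small sketch (the variable names are made up for illustration):

```perl
use strict;
use warnings;

my $with_newline    = "water\n";
my $without_newline = "water";

# chomp removes a trailing record separator only if one is present.
my $a1 = $with_newline;    chomp $a1;   # "water"
my $a2 = $without_newline; chomp $a2;   # "water" (unchanged)

# chop removes the last character unconditionally.
my $b1 = $with_newline;    chop $b1;    # "water"
my $b2 = $without_newline; chop $b2;    # "wate" (a real character is lost)

print "chomp: [$a1] [$a2]\n";
print "chop:  [$b1] [$b2]\n";
```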

Demonstration:

    > perl example6.pl
    Tell me something: I'm warm.
    You said, quote: I'm warm.
     endquote
    Without the newline:  I'm warm. endquote
    Tell me more:
    Can I have some water?
    You said, quote: Can I have some water?
     endquote
    Without the newline:  Can I have some water? endquote
    Tell me more:

    You said, quote:
     endquote
    Without the newline:   endquote
    Null input!
    Tell me more:
    ^D
    That's all!

Example: Perls - A Perl Shell, Calculator, & Learning Tool

    for (;;) {
        print '(',join(', ',@result),")\n? ";
        last unless $input = <STDIN>;
        $? = ''; $@ = ''; $! = '';
        @result = eval $input;
        if ($?) { print 'status=',$?,' ' }
        if ($@) { print 'errmsg=',$@,' ' }
        if ($!) { print 'errno=',$!+0,': ',$!,' ' }
    }

This reads a line from the terminal and executes it as a Perl program. The for (;;) {...} construct makes an endless loop. The last unless line might be equivalently specified:

$input = <STDIN>;               # Get line from standard input.
if (! defined($input)) {last;}  # If no line, leave the loop.

The eval function in Perl evaluates a string as a Perl program. The special variable $@ contains the Perl error message from the last eval or do.
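
The interplay of eval and $@ can be sketched in isolation as follows. The helper name try_eval is an assumption made for this illustration; it is not part of the program above.

```perl
use strict;
use warnings;

# Evaluate a string as Perl code; return the result or an error note.
sub try_eval {
    my ($code) = @_;
    my @result = eval $code;        # compile and run the string
    if ($@) {                       # non-empty $@ means the eval failed
        return "error: $@";
    }
    return join ', ', @result;
}

print try_eval('2+5'), "\n";        # prints "7"
print try_eval('1/0');              # prints an "Illegal division by zero" message
```

Since the error is trapped by eval, the surrounding program keeps running after the failed division.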

Demonstration: (note that the statements system 'date' and $x=`date` invoke a system command named date and are therefore system-dependent and work (in a useful way) mostly on Unix):

perl perls.pl
()
? Howdy
(Howdy)
? 2+5
(7)
? sqrt(2)
(1.4142135623731)
? $x=sqrt(19)
(4.35889894354067)
? $x+5
(9.35889894354067)
? 1/0
errmsg=Illegal division by zero at (eval 6) line 3, <STDIN> chunk 6.
 ()
? system 'date'
Fri Feb  5 15:33:47 EET 1999
(0)
? $x=`date`
(Fri Feb  5 15:34:06 EET 1999
)
? chomp $x
(1)
? @y=split(' ',$x)
(Fri, Feb, 5, 15:34:06, EET, 1999)
? @y[1,2,5]
(Feb, 5, 1999)
? localtime()
(39, 38, 15, 5, 1, 99, 5, 35, 0)
? foreach (1..3) {print sqrt(),' ';}
1 1.4142135623731 1.73205080756888 ()
? exit

Example: File I/O

The following program illustrates reading from a file and writing to a file. It also reads a character from standard input, in order to let the user control what happens. Moreover, it illustrates how the "short circuit" operator || can be used so that error processing can be written more conveniently. An expression of the form A||B is evaluated so that A is always evaluated first, and if the result is "true", the expression B is not evaluated at all.

    # Function:  Reverse each line of a file

    # 1: Get command line values:
    if ($#ARGV !=1) {
        die "Usage:  $0 inputfile outputfile\n";
    }
    ($infile,$outfile) = @ARGV;
    if (! -r $infile) {
        die "Can't read input $infile\n";
    }
    if (! -f $infile) {
        die "Input $infile is not a plain file\n";
    }

    # 2: Validate files
    open(INPUT,"<$infile") ||
        die "Can't input $infile $!";
    if ( -e $outfile) {
        print STDERR "Output file $outfile exists!\n";
        until ($ans eq 'r' || $ans eq 'a' || $ans eq 'e' ) {
            print STDERR "replace, append, or exit? ";
            $ans = getc(STDIN);
        }
        if ($ans eq 'e') {exit}
    }
    if ($ans eq 'a') {$mode='>>'}
    else {$mode='>'}
    open(OUTPUT,"$mode$outfile") ||
        die "Can't output $outfile $!";

    # 3: Read input, reverse each line, output it.
    while (<INPUT>) {
        chomp $_;
        $_ = reverse $_;
        print OUTPUT $_,"\n";
    }

    # 4: Done!
    close INPUT;     # close takes a single file handle,
    close OUTPUT;    # so each file is closed separately
    exit;
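
The core of the program above can also be written in a style that is common in newer Perl code: the three-argument form of open, file handles kept in ordinary variables, and the low-precedence operator or in place of ||. The following is a sketch under those assumptions, not a replacement for the program. The subroutine name reverse_lines and the scratch file names are made up, and the File::Temp module is used only to make the demonstration self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Copy $infile to $outfile, reversing each line.
sub reverse_lines {
    my ($infile, $outfile) = @_;
    # "or" binds more loosely than "||", so it works without
    # parentheses around the arguments of open.
    open my $in,  '<', $infile  or die "Can't read $infile: $!";
    open my $out, '>', $outfile or die "Can't write $outfile: $!";
    while (my $line = <$in>) {
        chomp $line;
        print $out scalar reverse($line), "\n";
    }
    close $in;
    close $out or die "Can't close $outfile: $!";
    return;
}

# Demonstration on scratch files in a temporary directory.
my $dir = tempdir(CLEANUP => 1);
my ($src, $dst) = ("$dir/in.txt", "$dir/out.txt");
open my $fh, '>', $src or die "Can't create $src: $!";
print $fh "abc\nHello world\n";
close $fh;

reverse_lines($src, $dst);
open my $check, '<', $dst or die "Can't read $dst: $!";
print <$check>;     # prints "cba" and "dlrow olleH" on separate lines
close $check;
```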

10. Data Processing: Grade Book Example

This example produces a score summary report by combining data from a simple file of student info and a file of their scores.

Input file stufile is delimited with colons. Fields are Student ID, Name, Year:

123456:Washington,George:SR
246802:Lincoln,Abraham "Abe":SO
357913:Jefferson,Thomas:JR
212121:Roosevelt,Theodore "Teddy":SO

Input file scorefile is delimited with blanks. Fields are Student ID, Exam number, Score on exam. Note that Abe is missing exam 2:

123456 1 98
212121 1 86
246802 1 89
357913 1 90
123456 2 96
212121 2 88
357913 2 92
123456 3 97
212121 3 96
246802 3 95
357913 3 94

The desired report:

Stu-ID Name...                       1   2   3   Totals:

357913 Jefferson,Thomas             90  92  94       276
246802 Lincoln,Abraham "Abe"        89      95       184
212121 Roosevelt,Theodore "Teddy"   86  88  96       270
123456 Washington,George            98  96  97       291

                          Totals:  363 276 382

The program that made this report:

# Gradebook - demonstrates I/O, associative
# arrays, sorting, and report formatting.
# This accommodates any number of exams and students
# and missing data.  Input files are:
$stufile='stufile';
$scorefile='scorefile';

open (NAMES,"<$stufile")
    || die "Can't open $stufile $!";
open (SCORES,"<$scorefile")
    || die "Can't open $scorefile $!";

# Build an associative array (%name) of student info
# keyed by student number
 while (<NAMES>) {
        ($stuid,$name,$year) = split(':',$_);
        $name{$stuid}=$name;
        if (length($name)>$maxnamelength) {
                $maxnamelength=length($name);
        }
}
close NAMES;

# Build an assoc. array (%score) from the test scores:
while (<SCORES>) {
   ($stuid,$examno,$score) = split;
   $score{$stuid,$examno} = $score;
   if ($examno > $maxexamno) {
      $maxexamno = $examno;
   }
}
close SCORES;

# Print the report from accumulated data!
printf "%6s %-${maxnamelength}s ",
        'Stu-ID','Name...';
foreach $examno (1..$maxexamno) {
        printf "%4d",$examno;
}
printf "%10s\n\n",'Totals:';

# Subroutine "byname" is used to sort the %name array.
# The "sort" function gives variables $a and $b to
# subroutines it calls.
# "x cmp y" expression returns -1 if x lt y, 0 if x eq y,
# +1 if x gt y.  See the Perl documentation for details.

sub byname { $name{$a} cmp $name{$b} }

# Order student IDs so the names appear alphabetically:
foreach $stuid ( sort byname keys(%name) ) {
   # Print scores for a student, and a total:
   printf "%6d %-${maxnamelength}s ",
           $stuid,$name{$stuid};
   $total = 0;
      foreach $examno (1..$maxexamno) {
         printf "%4s",$score{$stuid,$examno};
         $total += $score{$stuid,$examno};
         $examtot{$examno} += $score{$stuid,$examno};
      }
      printf "%10d\n",$total;
}

printf "\n%6s %${maxnamelength}s ",'',"Totals: ";
foreach $examno (1..$maxexamno) {
   printf "%4d",$examtot{$examno};
}
print "\n";
exit(0);

The foreach $stuid ... loop first calls the predefined function sort, passing the name of the ordering subroutine byname as the first parameter. That function returns a list, and the loop iterates over that list so that $stuid gets each of its values in succession.
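
The sorting step can be studied separately from the report. The sketch below uses a small made-up %name array in the style of stufile and shows three orderings: the named subroutine as above, the same comparison written as an inline block, and a numeric ordering by student ID.

```perl
use strict;
use warnings;

# Hypothetical sample data in the style of the stufile above.
my %name = (
    123456 => 'Washington,George',
    246802 => 'Lincoln,Abraham',
    357913 => 'Jefferson,Thomas',
);

# Named comparison subroutine, as in the grade book program.
sub byname { $name{$a} cmp $name{$b} }
my @by_name = sort byname keys %name;

# The same ordering with an inline block instead of a named subroutine.
my @inline = sort { $name{$a} cmp $name{$b} } keys %name;

# Numeric order needs <=>, since cmp compares as strings.
my @by_id = sort { $a <=> $b } keys %name;

print "by name: @by_name\n";   # 357913 246802 123456
print "by id:   @by_id\n";     # 123456 246802 357913
```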

More advanced applications could be written using the feature that Perl allows an associative array to be "tied" (using the tie function) to a genuine database, such that expressions like $record = $student{$key} use the database.

11. System Interface

In this section, the first example illustrates how Perl can be used in a system-independent way for system-related tasks like renaming files. For more information, see especially section Functions for filehandles, files, or directories of perlfunc in the manual. The second example illustrates a job which is more deeply system-related and therefore needs system-specific (Unix-specific, in the example) methods.

Example: renaming files

Unix users often get frustrated when they need to rename files e.g. so that all file names ending with some suffix (like .for) are renamed by changing the suffix (e.g. to .f). In some operating systems this is easy, but in normal Unix command interpreters there is no direct way to do it. (A novice user often tries mv *.for *.f but it does not work at all in the way you would like.)

No problem, it's easily done in Perl, for example as follows:

while(<*.for>) {
  $oldname = $_;
  s/\.for$/\.f/;
  rename $oldname, $_; }

This works on any system with reasonably normal file naming conventions, not just Unix.

The while statement is different from what we have seen before. It means that all file names matching the pattern within the angle brackets (here *.for) are processed and assigned, each in turn, to the variable $_. In fact, the meaning of $_ is not simply 'the current input line' as stated before but more generally 'the current data being processed', and the context defines in each case what this exactly means.

Within the loop, the file name is copied to variable $oldname and then modified using a construct which performs a substitution.

One might try to use simply s/.for/.f/; instead of s/\.for$/\.f/;. Although the simpler version works in most cases, it is buggy, because the symbol . stands for any character, not just the period, and because there is no requirement that the string .for must appear at the end of the file name only. Thus, the code would rename e.g. zapfor.for to za.f.for. To refer to the period character, one must use "escape" notation by prefixing it with a backslash. Moreover, if the trailing $ (denoting end of line) is omitted, the code would apply to the first appearance of .for in the filename.
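
Both substitutions can be tried safely on plain strings before any file is renamed. In this sketch the subroutine names and the sample file names are made up for illustration.

```perl
use strict;
use warnings;

# The tempting but buggy version: "." matches any character, no end anchor.
sub buggy_rename {
    my ($n) = @_;
    $n =~ s/.for/.f/;
    return $n;
}

# The correct version: literal period, and only at the end of the name.
sub correct_rename {
    my ($n) = @_;
    $n =~ s/\.for$/.f/;
    return $n;
}

foreach my $file ('prog.for', 'zapfor.for', 'notes.formal') {
    printf "%-14s buggy: %-12s correct: %s\n",
        $file, buggy_rename($file), correct_rename($file);
}
```

The third name shows the effect of the missing $: the buggy version turns notes.formal into notes.fmal, whereas the correct version leaves it alone.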

Finally, the rename operation is performed using a Perl built-in function, rename, which takes two file names as arguments. Alternatively, we could also use the following:

  system "mv $oldname $_";
which does the same operation (less efficiently, and in a system-dependent manner) by asking the Unix system to execute its mv command.

Pipe I/O and System Calls

This program works under Unix only.

The following program reports disk usage by the files specified as arguments. The Unix command du -sk ... (on BSD Unix, du -s ...) produces a series of lines like

1942    bin
2981    etc
listing the Kbytes used by each file or directory. It doesn't show other information, such as the modification date or owner. This program gets du's Kbytes and filename, and merges this info with other useful information for each file.
$files = join(' ',@ARGV);

# The trailing pipe "|" directs command output
# into our program:

if (! open (DUPIPE,"du -sk $files | sort -nr |"))  {
    die "Can't run du! $!\n";
}

printf "%8s %-8s %-16s %10s %s\n",
    'K-bytes','Login','Name','Modified  ','File';
while (<DUPIPE>) {
    # parse the du info:
    ($kbytes, $filename) = split;

    # Call system to look up file info like "ls" does:
    ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,
        $size,$atime,$mtime,$ctime)
        = stat($filename);

    # Call system to associate login & name with uid:
    if ($uid != $previous_uid) {
        ($login,$passwd,$uid,$gid,$quota,$comment,
            $realname,$dir,$shell) = getpwuid($uid);
        ($realname) = split(',',substr($realname,0,20));
        $previous_uid = $uid;
    }

    # Convert the modification-time to readable form:
    ($sec,$min,$hour,$mday,$mon,$myear) = localtime($mtime);
    $mmonth = $mon+1;
    $myear = 1900 + $myear; 

    printf "%8s %-8s %-16s %4d-%02d-%02d %s\n",
        $kbytes, $login, $realname,
        $myear, $mmonth, $mday, $filename;
}

Demonstration output:

    K-bytes Login    Name             Modified   File
      40788 c527100  Fred Flintstone  1995-10-05 c527100
      32685 c565060  Peter Parker     1995-10-05 c565060
      24932 c579818  Clark Kent       1995-10-06 c579818
      15388 c576657  Lois Lane        1995-10-06 c576657
       9462 c572038  Bruce Wayne      1995-10-06 c572038
       8381 c517401  Eric McGregor    1995-10-05 c517401
       7022 c594912  Asterisk de Gaul 1995-10-05 c594912

12. Matching

The pattern matching and substitution operators are described in detail in section Regexp Quote-Like Operators of perlop in the manual. See also Regular expressions in Perl for a tabular summary with examples.

A simple example: tr

Perl has powerful tools for string manipulation. But let us first take a simple example. One often wants to convert letters in input data to lower case. That's easy:

tr /A-Z/a-z/;
This can be read as follows: "translate all characters in the range from A to Z to the corresponding characters in the range from a to z".

The operation is applied to the value of $_, typically the current input line. If you would like it to be applied to the value of a variable $foo, you should write

$foo =~ tr /A-Z/a-z/;
Thus, the syntax is odd-looking, but once you get accustomed to it, the Perl string manipulation tools are easy to use.
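
A short sketch of tr in use, with made-up variable names. It also shows that tr returns the number of characters it processed, and that the built-in function lc performs the same lowercasing.

```perl
use strict;
use warnings;

my $foo = 'Hello, World!';

# Copy $foo, then translate the copy to lower case.
(my $lower = $foo) =~ tr/A-Z/a-z/;

# With an empty replacement list, tr only counts; $foo stays unchanged.
my $count = ($foo =~ tr/A-Z//);

print "$lower\n";                      # prints "hello, world!"
print "uppercase letters: $count\n";   # prints "uppercase letters: 2"

# The built-in lc function does the same lowercasing.
print lc($foo), "\n";                  # prints "hello, world!"
```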

The matching operator =~

The =~ operator performs pattern matching. For example, if:

    $s = 'One if by land and two if by sea';
then:
    if ($s =~ /if by la/) {print "YES"}
    else {print "NO"}
prints YES, because the value of $s matches the simple constant pattern "if by la".
    if ($s =~ /one/) {print "YES"}
    else {print "NO"}
prints NO, because the string does not match the pattern. However, by adding the i option to ignore case of letters, we would get a YES from the following:
    if ($s =~ /one/i) {print "YES"}
    else {print "NO"}

Regular expressions

Matching involves use of patterns called regular expressions. This, as you will see, leads to Perl Paradox Number Four: Regular expressions aren't. See section perlre in the manual.

Patterns can contain a mind-boggling variety of special directions that facilitate very general matching. For example, a period matches any character (except the "newline" \n character).

    if ($x =~ /l.mp/) {print "YES"}
would print YES for $x = "lamp" or "lump" or "slumped", but not for $x = "lmp" or "less amperes".

Parentheses () group pattern elements. An asterisk * means that the preceding character, element, or group of elements may occur zero times, one time, or many times. Similarly, a plus + means that the preceding element or group of elements must occur at least once. A question mark ? matches zero or one times. So:

    /fr.*nd/  matches "frnd", "friend", "front and back"
    /fr.+nd/  matches "frond", "friend", "front and back"
                but not "frnd".
    /10*1/    matches "11", "101", "1001", "100000001".
    /b(an)*a/ matches "ba", "bana", "banana", "banananana"
    /flo?at/  matches "flat" and "float"
                but not "flooat"
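
Claims like these are easy to check with a small test loop. The following sketch pairs some of the patterns above with candidate strings; the helper name matches is made up, and the string "12" is an extra non-matching case added here.

```perl
use strict;
use warnings;

# Does string $s match the pattern given as text in $pattern?
sub matches {
    my ($pattern, $s) = @_;
    return $s =~ /$pattern/ ? 1 : 0;
}

# Each pattern is followed by the strings to try against it.
my @cases = (
    [ 'fr.*nd', 'frnd', 'friend', 'front and back' ],
    [ 'fr.+nd', 'frnd', 'frond',  'friend' ],
    [ '10*1',   '11',   '101',    '12' ],
    [ 'flo?at', 'flat', 'float',  'flooat' ],
);

foreach my $case (@cases) {
    my ($pattern, @strings) = @$case;
    foreach my $s (@strings) {
        my $verdict = matches($pattern, $s) ? 'matches' : 'does not match';
        print "/$pattern/ $verdict \"$s\"\n";
    }
}
```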

Square brackets [] match a class of single characters.

    [0123456789] matches any single digit
    [0-9]        matches any single digit
    [0-9]+       matches any sequence of one or more digits
    [a-z]+       matches any lowercase word
    [A-Z]+       matches any uppercase word
    [ab n]*      matches the null string "", "b",
                    any number of blanks, "nab a banana"

[^class] matches those characters which do not match [class] (i.e., ^ denotes negation here - but something quite different outside brackets, see below):

    [^0-9]       matches any non-digit character.

Curly braces {} allow more precise specification of repeated fields. For example [0-9]{6} matches any sequence of 6 digits, and [0-9]{6,10} matches any sequence of 6 to 10 digits.

Patterns float, unless anchored. The circumflex ^ (outside []) anchors a pattern to the beginning, and dollar sign $ anchors a pattern at the end, so:

    /at/         matches "at", "attention", "flat", & "flatter"
    /^at/        matches "at" & "attention" but not "flat"
    /at$/        matches "at" & "flat", but not "attention"
    /^at$/       matches "at" and nothing else.
    /^at$/i      matches "at", "At", "aT", and "AT".
    /^[ \t]*$/   matches a "blank line", one that contains nothing
                          or any combination of blanks and tabs.
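
Anchors matter especially together with repeat counts, since an unanchored [0-9]{6} also succeeds inside a longer run of digits. A sketch, using six-digit numbers like the student IDs of the grade book example (the subroutine name is_student_id is made up):

```perl
use strict;
use warnings;

# Is the whole string exactly six digits?
sub is_student_id {
    my ($s) = @_;
    return $s =~ /^[0-9]{6}$/ ? 1 : 0;
}

foreach my $s ('123456', '12345', '1234567', 'id 123456') {
    my $floating = $s =~ /[0-9]{6}/ ? 'yes' : 'no';
    my $anchored = is_student_id($s) ? 'yes' : 'no';
    print "'$s': unanchored $floating, anchored $anchored\n";
}
```

Only the first string passes the anchored test; the last two pass the unanchored one as well.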

Other characters simply match themselves, but the characters
+?.*^$()[]{}|\
and usually / must be escaped with a backslash \ to be taken literally.

    /10.2/       matches "10Q2", "1052", and "10.2"
    /10\.2/      matches "10.2" but not "10Q2" or "1052"
    /\*+/        matches one or more asterisks
    /A:\\DIR/    matches "A:\DIR"
    /\/usr\/bin/ matches "/usr/bin"
If a backslash precedes an alphanumeric character, this sequence takes a special meaning, typically a short form of a [] character class. For example, \d is the same as the [0-9] digits character class.
    /[-+]?\d*\.?\d*/      is the same as
    /[-+]?[0-9]*\.?\d*/
Either of the above matches decimal numbers: "-150", "-4.13", "3.1415", "+0000.00", etc.
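
Since every element of that pattern is optional, it also matches the empty string, and without anchors it therefore succeeds on any input at all. For checking that a whole string is a decimal number, a stricter pattern is needed. The following is one possible sketch, not the only way to write it; the variable names are made up, and (?: ) is Perl 5 notation for grouping without capturing.

```perl
use strict;
use warnings;

# Loose pattern from the text: every part is optional.
my $loose = qr/[-+]?\d*\.?\d*/;
# Stricter sketch: anchored, and at least one digit is required.
my $tight = qr/^[-+]?(?:\d+\.?\d*|\.\d+)$/;

foreach my $s ('-150', '-4.13', '+0000.00', '.5', '', 'abc', '1.2.3') {
    my $l = $s =~ $loose ? 'yes' : 'no';
    my $t = $s =~ $tight ? 'yes' : 'no';
    print "'$s': loose $l, strict $t\n";
}
```

The loose pattern reports a match for all seven strings; the strict one accepts only the first four.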

A simple \s specifies "white space", the same as the character class [ \t\n\r\f] (blank, tab, newline, carriage return, formfeed). A character may be specified in hexadecimal as a \x followed by two hexadecimal digits which specify the ASCII code of the character; for example, \x1b is the ESC character.

A vertical bar | means "or".

    if ($answer =~ /^yes|^yeah/i ) {
         print "Affirmative!";
    }
prints Affirmative! for $answer equal to "yes" or "yeah" (or "YeS", or "yessireebob, that's right", but not "yep").
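
Note that | separates whole alternatives, so an anchor belongs only to the alternative it is written in. To accept only complete answers, the alternatives can be grouped with parentheses and anchored at both ends. A sketch with made-up subroutine names:

```perl
use strict;
use warnings;

# As in the text: the answer merely has to begin with "yes" or "yeah".
sub starts_affirmative {
    my ($answer) = @_;
    return $answer =~ /^yes|^yeah/i ? 1 : 0;
}

# Grouped and anchored: the whole answer must be "yes" or "yeah".
sub is_affirmative {
    my ($answer) = @_;
    return $answer =~ /^(yes|yeah)$/i ? 1 : 0;
}

foreach my $answer ('yes', 'Yeah', 'yessireebob', 'yep') {
    print "'$answer': begins ", starts_affirmative($answer),
          ", whole ", is_affirmative($answer), "\n";
}
```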

13. Substitutions & Parsing

Substitutions

The =~ operator can be used for making substitutions in strings. An expression of the form
$variable =~ /pattern/
tests whether the value of variable matches pattern. Normally such an expression is used as a condition (test) in an if statement or other control structure. But an expression of the form
$variable =~ s/pattern/pattern2/
first tests for a match, and if there is a match, replaces, within the value of variable, the string that matched pattern by pattern2 or, if it contains special notations, by a string generated from pattern2.

If you wish to modify the value of the predefined variable $_, you can write simply
s/pattern/pattern2/

When you include parentheses () in a pattern, the text matched by each parenthesized part may subsequently be referenced via the variables $1, $2, $3, ..., numbered by the order of the left parentheses. These matches can also be assigned as sequential values of an array.

Example. Assume that we have a text file containing notations like U+0123 which we wish to modify by slapping the strings <code> and </code> around them. The exact format of those notations is U+ followed by one or more hexadecimal characters. Thus, the following program would do the job:

while(<>) {
  s?U\+([0-9a-fA-F]+)?<code>U\+$1</code>?g;
  print;}

Note: Although we normally use the slash character / when specifying substitutions, here it cannot be used, since the slash occurs in the patterns. We can then pick almost any character which does not occur in the patterns and use it as a separator; here we use the question mark ?. (Alternatively, we could "escape" the / as \/ in the patterns.) Notice that the plus sign + must be "escaped" (as \+) when it needs to stand for itself and not act as a special character in a regular expression.

Parsing

The following program parses dates in a strange old format.

    $s = 'There is 1 date 10/25/95 in here somewhere.';
    print "\$s=$s\n";
    $s =~ /(\d{1,2})\/(\d{1,2})\/(\d{2,4})/;
    print "Trick 1: \$1=$1, \$2=$2, \$3=$3,\n",
          "         \$\`=",$`," \$\'=",$',"\n";

    ($mo, $day, $year) =
        ( $s =~ /(\d{1,2})\/(\d{1,2})\/(\d{2,4})/ );
    print "Trick 2: \$mo=$mo, \$day=$day, \$year=$year.\n";

    ($wholedate,$mo, $day, $year) =
        ( $s =~ /((\d{1,2})\/(\d{1,2})\/(\d{2,4}))/ );
    print "Trick 3: \$wholedate=$wholedate, \$mo=$mo, ",
        "\$day=$day, \$year=$year.\n";

Results of above:

    $s=There is 1 date 10/25/95 in here somewhere.
    Trick 1: $1=10, $2=25, $3=95,
             $`=There is 1 date  $'= in here somewhere.
    Trick 2: $mo=10, $day=25, $year=95.
    Trick 3: $wholedate=10/25/95, $mo=10, $day=25, $year=95.
Note that in Perl 5 the special variables $1, $2, ..., and $`, $', and $& are set by every successful match, including matches in an array (list) context as in Tricks 2 and 3. (In Perl 4 they were not set in that context.)

Using a combination of Tricks 1 and 3, we can write the following program which processes its input by replacing notations like 10/25/95 (where the month appears first) by ISO 8601 conformant notations like 1995-10-25. The program uses a conditional operator to add 1900 to the year if it was less than 100.

while(<>) {
  while( m/((\d{1,2})\/(\d{1,2})\/(\d{2,4}))/ ) {
    $year = $4 < 100 ? 1900+$4 : $4;
    $newdate = sprintf "%04d-%02d-%02d", $year, $2, $3;
    s/$1/$newdate/; }
  print; }
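
The same conversion can also be done in a single pass with the e option of the substitution operator, which makes Perl evaluate the replacement part as code. This is an alternative sketch, not the program above: the subroutine name iso_dates is made up, and braces are used as separators so that the slashes need no escaping.

```perl
use strict;
use warnings;

# Rewrite month/day/year dates in a line as year-month-day.
sub iso_dates {
    my ($line) = @_;
    $line =~ s{(\d{1,2})/(\d{1,2})/(\d{2,4})}{
        my $year = $3 < 100 ? 1900 + $3 : $3;      # same two-digit rule as above
        sprintf "%04d-%02d-%02d", $year, $1, $2;
    }ge;
    return $line;
}

print iso_dates("Dates: 10/25/95 and 1/2/2001.\n");
# prints "Dates: 1995-10-25 and 2001-01-02."
```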

Matching is greedy

Consider the simple regular expression k.* which means 'the letter k followed by a sequence of any characters'. For a string containing k, there would be several possible matches if the language definition did not say how the matching is performed. In Perl, the definition is that the longest possible string is taken; in our example, that means that the expression matches the substring which extends from the first occurrence of k to the end of the string.

Regular expressions are greedy, seeking the longest possible match, not the shortest one.

This rule applies to matches involving the repetition specifier * or +. It does not apply to selecting between alternatives separated with | in a regular expression. If we have, say, k.*|zap then the first substring that matches k.* or zap is taken, so if zap occurs first, it does not matter that a match to k.* would give a longer match. But if a k is found before any zap is found, then matching to k.* is done the normal way, taking the largest among possible matches (i.e. taking the rest of the string).

In the following example we try to match whatever is between "<" and ">":

    $s = 'Beware of <STRONG>greedy</strong> regular expressions.';
    print "\$s=$s\n";
    ($m) = ( $s =~ /<(.*)>/ );
    print "Try 1: \$m=$m\n";
    ($m) = ( $s =~ /<([^>]*)>/ );
    print "Try 2: \$m=$m\n";
This results in:
    $s=Beware of <STRONG>greedy</strong> regular expressions.
    Try 1: $m=STRONG>greedy</strong
    Try 2: $m=STRONG

Thus, by using a more specific match (which says that the string between "<" and ">" must not contain ">") we get the result we want. In Perl 5, it would also be possible to use *? instead of * to request the shortest match to be made. That would mean using /<(.*?)>/ in our case. (This special meaning for the question mark in a specific context is rather strange, but useful.)
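
The following sketch puts the greedy and the shortest-match forms side by side on the same string. The g option in the last match, which collects every tag in turn, is an addition made here for illustration.

```perl
use strict;
use warnings;

my $s = 'Beware of <STRONG>greedy</strong> regular expressions.';

my ($greedy) = $s =~ /<(.*)>/;     # longest: up to the last ">"
my ($lazy)   = $s =~ /<(.*?)>/;    # shortest: up to the first ">"
my @all_tags = $s =~ /<(.*?)>/g;   # with /g, every tag in turn

print "greedy: $greedy\n";         # STRONG>greedy</strong
print "lazy:   $lazy\n";           # STRONG
print "tags:   @all_tags\n";       # STRONG /strong
```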

Homework: Parsing and Reporting

1. See preceding "Grade Book" example. Using the same stufile input, print a list of students ordered by family name, with any quoted nickname listed in place of the given name, and family name last. Produce output like this:

    Student-ID  Year  Name
    357913      JR    Thomas Jefferson
    246802      SO    Abe Lincoln
    212121      SO    Teddy Roosevelt
    123456      SR    George Washington

14. Simple CGI

To avoid wasting your time, please check - from applicable local documents or by contacting local webmaster - whether you can install and run CGI scripts written in Perl on a Web server. At the same time, please check how to do that in detail - specifically, where you need to put your CGI scripts.

Depending on the Web server where your pages are, you might, or might not, be able to install and use Perl programs as CGI scripts. This means that you can then e.g. set up HTML forms so that the submitted data is passed as input to the script (that is, your Perl program), which is executed so that its standard output is sent back to the browser from which the form was submitted.

If you know C, you may wish to take a look at Getting Started with CGI Programming in C for comparison, before or after studying how to write CGI scripts in Perl.

This document used to contain an example that was based on the old cgi-lib.pl module. Despite its simplicity, it can hardly be recommended even to novices. There is a more modern module for the purpose: CGI.pm, available from CPAN.

I now have a separate document that introduces simple CGI in Perl: The Fool's Guide to CGI.pm.

15. Testing Perl Programs

Use the command-line option -w to make the compiler warn about identifiers that are referenced only once, uninitialized scalars, redefined subroutines, undefined file handles, probable confusion of == and eq, and other things. On Unix, this can be coded in the first line:

    #!/usr/local/bin/perl -w

where you need to replace the path by one that is applicable in your environment. (Cf. section Making it command-like on Unix.) Section perlrun in the manual explains how to simulate #! on non-Unix systems.

As you write your program, put in print statements to display variables as you proceed. Comment them out using # when you feel you don't need to see their output.
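
In newer Perl versions (5.6 and later), warnings can also be turned on from within the program using the warnings pragma, and the strict pragma catches misspelled variable names at compile time. The sketch below combines these with print scaffolding; the $debug flag is a made-up alternative to commenting the print statements out, and the scores are sample data.

```perl
use strict;      # undeclared variables become compile-time errors
use warnings;    # like -w, but limited to this file

my $debug = 1;   # set to 0 to silence the scaffolding

my @scores = (98, 86, 89);
my $total  = 0;
foreach my $score (@scores) {
    $total += $score;
    # Scaffolding goes to STDERR so that it does not mix with real output.
    print STDERR "debug: score=$score total=$total\n" if $debug;
}
print "Total: $total\n";    # prints "Total: 273"
```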

CGI scripts require some special attention in testing. In addition to checking server-dependent things, make sure you know where the problem is; you probably need to use a simple "echoing" script to see whether the problem is in the HTML document containing the form, or in a browser, or in your script.

16. Common Goofs for Novices

Adapted from Programming Perl, page 361. For more traps, see section perltrap in the manual.

  1. Testing "all-at-once" instead of incrementally, either bottom-up or top-down.
  2. Optimistically skipping print scaffolding to dump values and show progress.
  3. Not running the interpreter with the -w switch.
  4. Leaving off $ or @ or % from the front of a variable.
  5. Forgetting the trailing semicolon.
  6. Forgetting curly braces around a block.
  7. Unbalanced () or {} or [] or "" or '' or `` or <>.
  8. Mixing apostrophes '' with quotation marks "" or slash / with backslash \.
  9. Using == instead of eq, != instead of ne, = instead of ==, etc. ('White' == 'Black') and ($x = 5) evaluate as (0 == 0) and (5) and thus are true!
  10. Using else if instead of elsif.
  11. Not chomping the output of something in "backquotes" (like `date`) or not chomping input:
        print "Enter y to proceed: ";
        $ans = <STDIN>;
        chomp $ans;
        if ($ans eq 'y') { print "You said y\n";}
        else { print "You did not say 'y'\n";}
    
  12. Putting a comma after the file handle in a print statement.
  13. Forgetting that in Perl indexes start at 0, not 1.
  14. Using $_, $1, or other side-effect variables, then modifying the code in a way that unknowingly affects or is affected by these.
  15. Forgetting that regular expressions are greedy.
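
Goof 9 is easy to reproduce. In this sketch (with made-up variable names) the numeric comparison is wrapped in a block that switches off the "isn't numeric" warning, which Perl would otherwise issue when warnings are enabled:

```perl
use strict;
use warnings;

my ($x, $y) = ('White', 'Black');

# Numeric comparison: both strings count as 0, so this is true.
my $numeric_equal = do { no warnings 'numeric'; $x == $y ? 1 : 0 };

# String comparison gives the intended answer.
my $string_equal = $x eq $y ? 1 : 0;

print "==: $numeric_equal, eq: $string_equal\n";   # prints "==: 1, eq: 0"
```

This is one of the cases that the -w switch (goof 3) catches.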

This document is largely based on Introduction to Perl by Greg Johnson of MU Campus Computing, a document that now seems to have disappeared from the Web. Changes were made to make the presentation less Unix-specific. Some formulations were modified, links fixed, and so on. Most of the material from my Introduction to Perl was added. This happened in 2001 and 2002. As this document is still used by people, I made a basic cleanup (removing or fixing links that didn't work etc.) in 2011, without adding content.

Date of last update: 2014-03-16.

Jukka Korpela.