This is an old document. Its content has been incorporated into the newer Perl lessons.

Getting Started With Perl

Preface

This is an introduction to the Perl programming language. There is quite a lot of information about Perl, both in printed form and accessible via the Internet, but I had not found a suitable introductory text. Thus, I wrote one while teaching myself Perl.

Later I found Introduction to Perl by Greg Johnson. It was so good that I created Perl Lessons, which is essentially that introduction and this introduction of mine merged together, modified, and updated; there is also an experimental version with frames. (I still keep this document available, since some people may prefer a short introduction.)

Elementary acquaintance with using a Unix system is assumed. Knowing the C programming language is definitely an advantage, but I have tried to avoid assuming such knowledge.

This text is mostly based on an Internet-accessible hypertext documentation of Perl version 5. My contribution is an attempt to pick up the most essential things and to provide simple example Perl scripts.

What is Perl?

Perl is a programming language which can be used for a large variety of tasks. A typical simple use of Perl would be for extracting information from a text file and printing out a report or for converting a text file into another form. But Perl provides a large number of tools for quite complicated problems, including systems programming.

Programs written in Perl are called Perl scripts, whereas the term the perl program refers to the system program named perl for executing Perl scripts. (What, confused already?)

If you have used shell scripts or awk or sed or similar utilities for various purposes, you will find that you can normally use Perl for those and many other purposes, and the code tends to be more compact. And if you haven't used such utilities but have started thinking you might have need for them, then perlhaps what you really need to learn is Perl instead of all kinds of futilities.

Perl is implemented as an interpreted (not compiled) language. Thus, the execution of a Perl script tends to use more CPU time than a corresponding C program, for instance. On the other hand, computers tend to get faster and faster, and writing something in Perl instead of C tends to save your time.

Hello world!

When studying a new programming language, the crucial point is to learn to write and execute a program which prints Hello world! The reason is that in order to be able to do this, you need to know quite a lot of simple things, and language manuals often omit such details.

So let us get started:

lk-hp-23 perl 195 % cat >hello
#!/usr/bin/perl
print "Hello world!\n";
lk-hp-23 perl 196 % chmod a+x hello
lk-hp-23 perl 197 % ./hello
Hello world!
lk-hp-23 perl 198 %

Explanations:

I used the cat command to create a file named hello and containing a very simple Perl script. Normally one uses one's favourite editor (such as Emacs) to create Perl scripts, of course.
The first line of the script, #!/usr/bin/perl, specifies that the script is to be executed by the perl program (and not by a shell, for example). Consider that line as an obligatory prelude. The part /usr/bin/perl is the full name (path name) of the perl program. In different installations, different names may be in use, but this name is a typical one.
After the prelude one writes the Perl script proper, and in this case it consists of one line only. It should be self-explanatory except for the use of the notation \n which stands for newline. In Perl strings, control characters can be represented in this way, using the backslash character \ and a letter. This is the same convention which is used in the C programming language.
Having written the script, I used the chmod command to give execute access to the file containing the script. In Unix, files are usually created without execute access, so that access must be granted separately. In this case, all users get execute access to the file.
Finally I execute the script by simply giving the name of the file as a command. Having the initial ./ means that I refer to the named file in my current working directory. (Possibly just a simple name like hello could work, but it might also cause problems, if the name happens to be the same as the name of a system command.)

Notice that in Perl (as in C), a simple statement is terminated by a semicolon.

Data structures and variables

The value of a variable in Perl can be a number or a string, among other things. Variables are not typed. You can, for example, assign a string to a variable and later a number to the same variable.

Variables need not be declared at all. You can simply start using a variable.

An attempt to use an uninitialized variable causes a zero or an empty string or the truth value false (depending on the context) to be used. However, using the command-line switch -w you can ask the Perl interpreter to issue warnings, such as reporting uses of undefined values.
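
The following is a small sketch of the points above, not a recommended programming style. The variable names are made up for the illustration. If you run it with the -w switch, the interpreter should additionally complain about the use of $count.

```perl
#!/usr/bin/perl
# Run as "perl -w scriptname" to get warnings about $count below.

$value = "forty-two";   # a string ...
$value = 42;            # ... and later a number, in the same variable

# $count has never been assigned a value, so it is undefined.
# In numeric context it is treated as 0, in string context as "".
$total = $count + 5;
$text  = "count is [" . $count . "]";

print "$value\n";   # prints: 42
print "$total\n";   # prints: 5
print "$text\n";    # prints: count is []
```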

Perl has three data structures: scalars, arrays of scalars, and associative arrays of scalars, known as "hashes".

Scalar variable names always begin with the dollar sign, eg $myvar.

Names for arrays and array slices always begin with the commercial at sign, eg @myarray.

Names for hashes always begin with the percent sign, eg %myhash.

Let us also mention that subroutine names begin with the ampersand sign, eg &mysub, although this sign can often be omitted.

Intuitively, the special characters mentioned above correspond to English words as follows: you can read
$ as the,
@ as these or those,
% as these or those, and
& as do.

The case of letters is significant in variable names (as in Unix commands and in the C language), eg $foo and $Foo are distinct variables.

If you have an array, eg @myarr, you can form indexed variables by appending an index in brackets (as in the C language) but changing @ into $, eg $myarr[5]. The reason for the change is that the indexed variable is a scalar.

You can also form array slices, for example @myvar[5..10], which is an array (therefore, denoted using @) consisting of those components of @myvar which have an index between 5 and 10, inclusive.

Array indexes are integer numbers, starting with 0 (as in C, but unlike in many other languages).

Hashes, on the other hand, can be indexed eg with strings, since the indexing method is different. For hashes, the index is in braces, eg $myhash{'foobar'}. Notice that in this case, too, the indexed variable is a scalar and therefore begins with $.

Every variable kind (scalar, array, hash) has its own namespace. This means that $foo and @foo are two different variables. It also means that $foo[1] is a part of @foo, not a part of $foo. This may seem a bit weird, but that's okay, because it is weird. Notice, in particular, that there are two important predefined variables $_ and @_ and that you should realize that eg $_[2] is a component of @_.
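
To make the separate namespaces concrete, here is a minimal sketch in which a scalar, an array and a hash all share the name foo. The values and the name @part are arbitrary choices of mine.

```perl
#!/usr/bin/perl
$foo = "a scalar";                              # the scalar $foo
@foo = (10, 20, 30, 40, 50);                    # the array @foo, a different variable
%foo = ('apple' => 'red', 'pear' => 'green');   # the hash %foo, different again

print $foo, "\n";           # prints: a scalar
print $foo[1], "\n";        # component 1 of @foo, prints: 20
print $foo{'pear'}, "\n";   # value stored under the key pear in %foo, prints: green

@part = @foo[1..3];         # a slice: the components with indexes 1, 2 and 3
print "@part\n";            # prints: 20 30 40
```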

The value of an array variable is effectively an ordered list of values. In Perl, you can also construct lists as data objects using a constructor in which the values are listed within parentheses, separated by commas, eg

(2, 3, 7, 42)

A list can, in particular, be assigned to an array variable, eg

@foo = (2, 3, 7, 42);

Lists are important in Perl, since many operations yield lists as their result.
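
As a sketch of how lists turn up in practice, the following assigns a list to an array, assigns an array to a list of scalars, and uses the built-in function reverse, which yields a list. The variable names are only illustrative.

```perl
#!/usr/bin/perl
@foo = (2, 3, 7, 42);
($first, $second) = @foo;    # list assignment: $first is 2, $second is 3
@backwards = reverse @foo;   # reverse yields a list, stored here in an array
$count = @foo;               # an array used where a scalar is expected gives its length

print "@backwards\n";              # prints: 42 7 3 2
print "$first $second $count\n";   # prints: 2 3 4
```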

Example: displaying lines with line numbers

The following Perl script illustrates the use of scalar variables. It also introduces some very basic features of Perl.

The script prints out its input so that each line begins with a running line number:

#!/usr/bin/perl
$line = 1;
while (<>) {
  print $line, " ", $_;
  $line = $line + 1; }

The scalar variable $line is of course the line counter. It is initialized to 1 in the beginning, and it is incremented by 1 within a loop which processes one input line at a time.

The loop construct is of the form
while (<>) {
process one line of input }
and although it looks cryptic at first sight, it is really very convenient to use. You need not worry about actual input operations; just use the construct shown above, and use the predefined variable $_ to refer to the input line.

The print statement in our example contains three arguments, one for getting the line number printed, one for getting a blank printed, and one for getting the input line printed. We do not have an argument for getting a newline printed, since the value of the special variable $_ contains a trailing newline.

In fact, you could make your code even shorter: you could write the script as

#!/usr/bin/perl
$line = 1;
while (<>) {
  print $line++, " ", $_; }

Here the statement contains $line++ instead of just $line, since in Perl (as in C) you can increment a variable (after its old value has been used) by appending the operator ++ to it.

You might wish to have the line numbers right-adjusted, eg each in a fixed field of five characters, padded with blanks on the left. This is pretty easy if you know the C language output formatting tools. You could just replace the print statement with

  printf "%5d %s", $line++, $_;
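
To see what the format does without reading any input, one can try it on fixed strings. The sketch below uses sprintf, the built-in companion of printf that returns the formatted string instead of printing it; the sample texts and variable names are made up.

```perl
#!/usr/bin/perl
$line = 9;
$first  = sprintf "%5d %s", $line++, "some text\n";   # uses 9, then $line becomes 10
$second = sprintf "%5d %s", $line++, "more text\n";   # uses 10, then $line becomes 11
print $first, $second;
# Output (numbers right-adjusted in a field of five characters):
#     9 some text
#    10 more text
```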

Input to Perl scripts

Where does a Perl script read its input from? By default, ie in the absence of any specification of input, the input comes from the so-called standard input stream in Unix. By default this means the user's keyboard.

Normally you want your script to read input from a file. Simply write the name of the file as a command-line argument, ie when giving the script name as a command. Thus, for example, if you had written our simple script (the simpler version of it) into a file named lines, you could test it by using it as its own test data (confusing?) as follows:

lk-hp-23 perl 251 % ./lines lines
1 #!/usr/bin/perl
2 $line = 1;
3 while (<>) {
4   print $line++, " ", $_; }
lk-hp-23 perl 252 %

You can also write several file names as command-line arguments, eg

lines foo bar zap

which would mean that the script lines takes as input the contents of files foo, bar, and zap as if you had concatenated the contents into a single file and given its name as argument.

Example: splitting input lines

In Perl, you can easily split data into fields without coding the details. You simply specify what you want.

For instance, the statement

split;

splits the current input line into blank-separated fields. In the versions of Perl 5 for which this text was originally written, it also assigned the fields to components of the predefined array variable @_; later versions no longer do this, so the assignment is best written explicitly as @_ = split; instead. You can then access the fields using indexed variables such as $_[0]. The special variable $#_ contains information about the number of fields: the value of that variable is the number of fields minus one. (More generally, for any array variable @a, the variable $#a contains the last index of the array.)

Assume, for example, that you have some data where each line consists of blank-separated items (which might be strings or numbers) and you want to write a Perl script which picks up the second item from each line. (Such filtering is often needed to extract useful information from a large data file.) This is simple:

#!/usr/bin/perl
while (<>) {
  @_ = split;   # explicit assignment; a bare split; no longer fills @_ in current Perl
  print $_[1], "\n"; }

Notice that you must use an index value of 1 to get the 2nd field, since array indexing begins at 0 in Perl.
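
You are not restricted to @_ or to blanks as separators. As a sketch, the following splits a fixed sample line into an array of my own choosing, @fields, and then splits another string at colons by giving split a pattern and a string as arguments. The sample data is made up.

```perl
#!/usr/bin/perl
$_ = "alpha beta gamma\n";    # stands in for a line of input
@fields = split;              # split $_ into blank-separated fields
print $fields[1], "\n";       # the 2nd field, prints: beta
print $#fields + 1, "\n";     # the number of fields, prints: 3

@parts = split /:/, "root:x:0:0";   # split the given string at each colon
print $parts[2], "\n";              # the 3rd field, prints: 0
```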

Control structures

Perl has a rich set of control structures. Theoretically, and very often practically too, you can use just if statements for branching and while statements for looping.

Within control structures you specify the actions to be conditionally or repeatedly executed as blocks. A block is simply a sequence of statements surrounded by braces. Notice that braces are always required (unlike in C).

The simplest if statement is of the form

if(expression)block

which means that the expression is evaluated, and if the result is true, the block is executed.

For example, the statement if($i < 10) {$j = 100;} sets the value of $j to 100 if the value of $i is less than 10. As mentioned above, braces are required (even if there is a single statement within them), and the parentheses around the condition expression are obligatory, too.

A two-branch if statement is of the form

if(expression)block1 else block2

which means that the expression is evaluated, and if the result is true, block1 is executed, otherwise block2 is executed.

The while statement is of the form

while(expression)block

which means that the expression is evaluated, and if the result is true, the block is executed, then the expression is re-evaluated and the process is repeated until the expression evaluates to false.

A simple example of using the while statement is the following script, which splits input lines into fields (in a manner described above) and prints out the fields in reverse order.

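#!/usr/bin/perl
while (<>) {
  @_ = split;   # explicit assignment; a bare split; no longer fills @_ in current Perl
  $i = $#_;     # index of the last field
  while($i >= 0) {
    print $_[$i--], " "; }
  print "\n";
}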

The control in the (inner) while loop is based on using an auxiliary variable $i, which is initialized to the index of the last field and decremented (using the C-style embedded decrement operator --) within the loop until it becomes negative, ie all fields, including the one with index zero, have been processed. The operator >= has the obvious meaning 'is greater than or equal to'.
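
The two-branch if statement can be combined with a while loop in the same way. Here is a minimal sketch which counts short and long words in a fixed list; the word list, the limit of four characters and the variable names are arbitrary, and length is a built-in function giving the number of characters in a string.

```perl
#!/usr/bin/perl
@words = ("a", "list", "of", "several", "words");
$i = 0;
$short = 0;
$long = 0;
while ($i <= $#words) {            # $#words is the last index of @words
  if (length($words[$i]) < 4) {
    $short = $short + 1;           # fewer than four characters
  } else {
    $long = $long + 1;             # four characters or more
  }
  $i++;
}
print "$short short, $long long\n";   # prints: 2 short, 3 long
```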

String manipulation

Perl has powerful tools for string manipulation. For example, one often wants to convert letters in input data to lower case. That's easy:

tr /A-Z/a-z/;

This can be read as follows: "translate all characters in the range from A to Z to the corresponding characters in the range from a to z".

The operation is applied to the value of $_, ie the current input line. If you would like it to be applied to the value of a variable $foo, you should write

$foo =~ tr /A-Z/a-z/;

Thus, the syntax is odd-looking, but once you get accustomed to it, the Perl string manipulation tools are easy to use.
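
As a small sketch, the following applies the translation to a fixed string. It also shows that the tr construct has a value of its own, namely the number of characters it translated. The names $foo and $changed are just examples.

```perl
#!/usr/bin/perl
$foo = "Hello World!";
$changed = ($foo =~ tr/A-Z/a-z/);   # translate in place and keep the count
print "$foo\n";       # prints: hello world!
print "$changed\n";   # prints: 2
```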

Example: renaming files

Unix users often get frustrated when they need to rename files eg so that all file names ending with some suffix (like .for) are renamed so that the suffix is changed (eg to .f). In some operating systems this is easy, but in normal Unix command interpreters there is no direct way to do it. (A naive user might try mv *.for *.f but it does not work at all in the way you would like.)

No problem, it's easily done in Perl, for example as follows:

#!/usr/bin/perl
while(<*.for>) {
  $oldname = $_;
  s/\.for$/\.f/;
  rename $oldname, $_;
}

A previous version of this document had in this example s/.for/.f/; instead of s/\.for$/\.f/;. Although the simpler version works in most cases, it is buggy, because the symbol . stands for any character, not just the period, and because there is no requirement that the string .for must appear at the end of the file name only. Thus, the code would rename eg zapfor.for to za.f.for. To refer to the period character, one must use "escape" notation by prefixing it with a backslash. Moreover, if the trailing $ (denoting end of line) is omitted, the code would apply to the first appearance of .for in the filename.

The while statement is different from what we have seen before. It means that all file names matching the pattern within the angle brackets (here *.for) are processed and assigned, each in turn, to the variable $_. In fact, the meaning of $_ is not simply 'the current input line' as told before but more generally 'the current data being processed', and the context defines in each case what this exactly means.

Within the loop, the file name is copied to variable $oldname and then modified using a construct which performs a substitution and which resembles the tr construct used in the preceding example.

Finally, the rename operation is performed using a Perl built-in function, rename, which takes two file names as arguments. Alternatively, we could also use the following:

  system "mv $oldname $_";

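which does the same operation (less efficiently) by asking the Unix system to execute a system command. Notice, however, that this variant breaks if a file name contains blanks or other characters that are special to the shell.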
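
Before letting a script like this loose on real files, it may be wise to check what the substitution does to some sample names. The sketch below renames nothing; it only applies the substitution to a made-up list of names, and it also shows the effect of the buggy pattern on one of them.

```perl
#!/usr/bin/perl
# Made-up file names; nothing is renamed on disk.
@names = ("prog.for", "zapfor.for", "notes.format");
$i = 0;
while ($i <= $#names) {
  $_ = $names[$i];
  s/\.for$/.f/;      # only a literal .for at the very end is replaced
  $new[$i] = $_;
  $i++;
}
print "@new\n";      # prints: prog.f zapfor.f notes.format

$bad = "zapfor.for";
$bad =~ s/.for/.f/;  # the buggy pattern: any character followed by for, anywhere
print "$bad\n";      # prints: za.f.for
```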