Carla Climate is studying climate change in the Northern and Southern hemispheres. As part of her work, she wants to see whether the gap between annual temperatures in Canada and Australia increased during the Twentieth Century. The raw data she needs is available online; her goal is to get it, do her calculations, and then post her results so that other scientists can use them.
This chapter is about how she can do that. More specifically, it's about how to fetch data from the web, and how to create web pages that are useful to both human beings and computers. What we will not cover is how to build interactive web applications: our experience has shown that all we can do in the time we have is show you how to create security holes, which we're reluctant to do. However, everything in this chapter is a prerequisite for doing that, and there are lots of other good tutorials available if you decide that's what you really need.
To start, let's have another look at the hearing tests from our chapter on Python programming. Most people would probably store these results in a plain text file with one row for each test:
Date        Experimenter  Subject        Test   Score
----------  ------------  -------        -----  -----
2011-05-02  A. Binet      H. Ebbinghaus  DL-11  88%
2011-05-07  A. Binet      H. Ebbinghaus  DL-12  71%
2011-05-02  A. Binet      W. Wundt       DL-11  29%
2011-05-02  C. S. Pierce  W. Wundt       DL-11  45%
This is pretty much what a conscientious researcher would write in a lab notebook, and is easy for a human being to read. It's a lot harder for a computer to understand, though. Any program that wanted to load this data would have to know that the first line of the file contains column titles, that the second can be ignored, that the first field of each row thereafter should be translated from text into a date, that the fields after that start in particular columns (since the number of spaces between them is variable, and the number of spaces inside names can also vary—compare "A. Binet" with "C. S. Pierce"), and so on. Such a program would not be hard to write, but having to write, debug, and maintain a separate program for each data set would be tedious.
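To make that burden concrete, here is a minimal sketch of such a one-off loader. It assumes (and this is only an assumption about the layout) that fields are always separated by at least two spaces while names contain single spaces; the function name parse_hearing_tests is ours, not part of any library.

```python
import re
from datetime import date

# Sample data; the exact spacing is an assumption made for this sketch.
RAW = """\
Date        Experimenter  Subject        Test   Score
----------  ------------  -------        -----  -----
2011-05-02  A. Binet      H. Ebbinghaus  DL-11  88%
2011-05-07  A. Binet      H. Ebbinghaus  DL-12  71%
2011-05-02  A. Binet      W. Wundt       DL-11  29%
2011-05-02  C. S. Pierce  W. Wundt       DL-11  45%
"""

def parse_hearing_tests(text):
    """Turn the plain-text table into a list of dictionaries."""
    lines = text.splitlines()
    # The first line holds the column titles.
    titles = re.split(r"\s{2,}", lines[0].strip())
    records = []
    # The second line (dashes) carries no data, so start at the third.
    for line in lines[2:]:
        if not line.strip():
            continue
        # Split on runs of two or more spaces so "C. S. Pierce" stays whole.
        fields = re.split(r"\s{2,}", line.strip())
        record = dict(zip(titles, fields))
        # Translate text into richer types.
        record["Date"] = date.fromisoformat(record["Date"])
        record["Score"] = int(record["Score"].rstrip("%"))
        records.append(record)
    return records

records = parse_hearing_tests(RAW)
print(records[3]["Experimenter"], records[3]["Score"])
```

Every rule in this function (skip line two, split on double spaces, strip the percent sign) is knowledge about this one file that the file itself never states.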
Now consider something like this quotation from Richard Feynman's 1965 Nobel Prize acceptance speech:
As a by-product of this same view, I received a telephone call one day at the graduate college at Princeton from Professor Wheeler, in which he said, "Feynman, I know why all electrons have the same charge and the same mass." "Why?" "Because, they are all the same electron!"
A lot of information is implicit in these four sentences, like the fact that "Wheeler" and "Feynman" are particular people, that "Princeton" is a place, that the speakers are alternating (with Wheeler speaking first), and so on. None of that is "visible" to a computer program, so if we had a database containing millions of documents and wanted to see which ones mentioned both John Wheeler (the physicist, not the geologist) and Princeton (the university, not the glacier), we might have to wade through a lot of false matches. What we need is some way to explicitly tell a computer all the things that human beings are able to infer.
An early effort to tackle this problem dates back to 1969, when Charles Goldfarb and others at IBM created the Generalized Markup Language (GML), the forerunner of the Standard Generalized Markup Language, or SGML. It was designed as a way of adding extra data to medical and legal documents so that programs could search them more accurately. SGML was very complex (the specification is over 500 pages long), and unless you were a specialist, you probably didn't even know it existed: all you saw were the programs that used it.
But in 1989 Tim Berners-Lee borrowed the syntax of SGML to create the HyperText Markup Language, or HTML, for his new "World Wide Web". HTML looked superficially the same as SGML, but it was much (much) simpler: almost anyone could write it, so almost everyone did.
However, HTML only had a small vocabulary, which users could not change or extend. They could say, "This is a paragraph," or, "This is a table," but not, "This is a chemical formula," or, "This is a person's name." Instead of adding thousands of new terms for different application domains, a new standard for defining terms was created in 1998. This standard was called the Extensible Markup Language (XML); it was much more complex than HTML, but hundreds of specialized vocabularies have now been defined in terms of it, such as the Chemical Markup Language for describing chemical compounds and related concepts.
More recently, a new version of HTML called HTML5 has been created. Web programmers are very excited about it, primarily because its new features allow them to create sophisticated user interfaces that run on smart phones and tablets as well as conventional computers. In what follows, though, we'll focus on some basics that haven't changed (much) in 20 years.
A basic HTML document contains text and elements. (The full specification allows for many other things with names like "external entity references" and "processing instructions", but we'll ignore them.) The text in a document is just characters, and as far as HTML is concerned, it has no intrinsic meaning: "Feynman" is just seven characters, not a person.
Elements are metadata that describe the meaning of the document's content. For example, one element might signal a heading, while another might indicate that something is a cross-reference.
Elements are written using tags, which must be enclosed in angle brackets <…>. For example, <cite> is used to mark the start of a citation, and </cite> is used to mark its end. Elements must be properly nested: if an element called inner begins inside an element called outer, inner must end before outer ends. This means that <outer>…<inner>…</inner></outer> is legal HTML, but <outer>…<inner>…</outer></inner> is not.
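The nesting rule can be checked mechanically with a stack. The sketch below uses the HTMLParser class from Python's standard library; the class NestingChecker and the function is_properly_nested are names we made up for illustration. It deliberately ignores the fact that real HTML lets some elements (such as br) go unclosed.

```python
from html.parser import HTMLParser

class NestingChecker(HTMLParser):
    """Track open tags on a stack and note any tag closed out of order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags opened but not yet closed
        self.ok = True    # becomes False on the first mismatch

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # A closing tag must match the most recently opened tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.ok = False

def is_properly_nested(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    # Everything must match, and nothing may be left open.
    return checker.ok and not checker.stack

print(is_properly_nested("<outer>a<inner>b</inner></outer>"))
print(is_properly_nested("<outer>a<inner>b</outer></inner>"))
```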
Here are some commonly-used HTML tags:
Tag | Usage |
---|---|
html | Root element of entire HTML document. |
body | Body of page (i.e., visible content). |
h1 | Top-level heading. Use h2, h3, etc. for second- and third-level headings. |
p | Paragraph. |
em | Emphasized text. |
Finally, every well-formed document starts with a DOCTYPE declaration, which looks like:
<!DOCTYPE html>
This tells programs what kind of elements are allowed to appear in the document: 'html' (by far the most common case), 'math' for MathML, and so on. Here is a simple HTML document that uses everything we've seen so far:
<!DOCTYPE html><html><body><h1>Dimorphism</h1><p>Occurring or existing in two different <em>forms</em>.</p></body></html>
A web browser like Firefox might present this document as shown in Figure XXX. Other devices will display it differently. A phone, for example, might use a different background color for the heading, while a screen reader for people with visual disabilities would read the text aloud.
These different presentations are possible because HTML separates content from presentation, or in computer science jargon, separates models from views. The model is the data itself; the view is how that data is displayed, such as a particular pattern of pixels on our screen or a particular sequence of sounds on our headphones. A given model may be viewed in many different ways, just as what files are on your hard drive can be viewed as a list, as snapshots, or as a hierarchical tree (Figure XXX).
People can construct models from views almost effortlessly—if you are able to read, it's almost impossible not to see the letters "HTML" in the following block of text:
*   *  *****  *   *  *
*   *    *    ** **  *
*****    *    * * *  *
*   *    *    *   *  *
*   *    *    *   *  ****
Computers, on the other hand, are very bad at reconstructing models from views. In fact, many of the things we do without apparent effort, like understanding sentences, are still open research problems in computer science. That's why markup languages were invented: they are how we explicitly specify the "what" that we infer so easily for computers' benefit.
There are a couple of other formatting rules we need to know in order to create and understand documents. If we are writing HTML by hand instead of using a WYSIWYG editor like LibreOffice or Microsoft Word, we might lay it out like this to make it easier to read:
<!DOCTYPE html>
<html>
  <body>
    <h1>Dimorphism</h1>
    <p>Occurring or existing in two different <em>forms</em>.</p>
  </body>
</html>
Doing this doesn't change how most browsers render the document, since they usually ignore "extra" whitespace (highlighted above). As we'll see when we start writing programs of our own, though, that whitespace doesn't magically disappear when a program reads the document.
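As a small preview, the sketch below reads an indented page with Python's standard xml.etree.ElementTree module and prints the text it finds between tags. We have left out the DOCTYPE line to keep the example minimal, and the variable names are ours.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
  <body>
    <h1>Dimorphism</h1>
    <p>Occurring or existing in two different <em>forms</em>.</p>
  </body>
</html>"""

root = ET.fromstring(PAGE)
body = root.find("body")
heading = body.find("h1")

# .text is whatever comes between an opening tag and its first child;
# .tail is whatever comes after an element's closing tag.
print(repr(root.text))     # the newline and indentation before <body>
print(repr(body.text))     # the newline and indentation before <h1>
print(repr(heading.tail))  # the newline and indentation after </h1>
```

The browser would show none of this whitespace, but the program sees all of it and has to decide what to do with it.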
Second, we must use escape sequences to represent the special characters < and > for the same reason that we have to use \" inside a double-quoted string in a program. In HTML and XML, an escape sequence is an ampersand '&' followed by the abbreviated name of the character (such as 'amp' for "ampersand") and a semi-colon.
The four most common escape sequences are:
Sequence | Character |
---|---|
&lt; | < |
&gt; | > |
&quot; | " |
&amp; | & |
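We rarely need to type these sequences by hand in a program. As a brief illustration, Python's standard html module provides escape and unescape functions; the sample string here is made up.

```python
import html

raw = 'if x < 3 & y > "4"'

# Replace each special character with its escape sequence.
escaped = html.escape(raw)
print(escaped)

# Convert escape sequences back into the characters they stand for.
restored = html.unescape(escaped)
print(restored == raw)
```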
One final formatting rule is that every document must have a single root element, i.e., a single element must enclose everything else. When combined with the rule that elements must be properly nested, this means that every document can be thought of as a tree. For example, we could draw the logical structure of our little document as shown in Figure XXX.
A document like this, on the other hand, is not strictly legal:
<h1>Dimorphism</h1>
<p>Occurring or existing in two different <em>forms</em>.</p>
because it has two top-level elements (the h1 and the p). Most browsers will render it correctly, since they're designed to accommodate improperly-formatted HTML, but most programs won't, because they're not.
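A strict parser shows the difference. In this sketch, Python's standard xml.etree.ElementTree module stands in for "most programs"; the helper name parses_strictly is our own.

```python
import xml.etree.ElementTree as ET

def parses_strictly(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

two_roots = "<h1>Dimorphism</h1> <p>Occurring in two <em>forms</em>.</p>"
one_root = "<body>" + two_roots + "</body>"

print(parses_strictly(two_roots))  # rejected: two top-level elements
print(parses_strictly(one_root))   # accepted: a single root encloses everything
```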
There are a lot of incorrectly-formatted HTML pages out there. To deal with them, people have written libraries like Beautiful Soup, which does its best to turn real-world HTML into something that a run-of-the-mill program can handle. It almost always gets things right, but sticking to the standard makes life a lot easier for everyone.
Elements can be customized by giving them attributes. These are name/value pairs enclosed in the opening tag like this:
<h1 align="center">A Centered Heading</h1>
or:
<p class="disclaimer">This planet provided as-is.</p>
Any particular attribute name may appear at most once in any element, just as keys may be present at most once in a dictionary, so <p align="left" align="right">…</p> is illegal. Attributes' values must be in quotes in XML and older dialects of HTML; HTML5 allows single-word values to be unquoted, but quoting is still recommended.
Another similarity between attributes and dictionaries is that attributes are unordered. They have to be written in some order, just as the keys and values in a dictionary have to be displayed in some order when they are printed, but as far as the rules of HTML are concerned, the elements:
<p align="center" class="disclaimer">This web page is made from 100% recycled pixels.</p>
and:
<p class="disclaimer" align="center">This web page is made from 100% recycled pixels.</p>
mean the same thing.
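The dictionary analogy is quite literal in many libraries. As a sketch, Python's standard xml.etree.ElementTree module exposes an element's attributes as a dictionary, so the two orderings compare equal, and a repeated attribute name is rejected outright.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring('<p align="center" class="disclaimer">text</p>')
second = ET.fromstring('<p class="disclaimer" align="center">text</p>')

# Attributes are stored as a dictionary, so written order does not matter.
print(first.attrib == second.attrib)

# A repeated attribute name is an error for a strict parser.
try:
    ET.fromstring('<p align="left" align="right">text</p>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)
```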
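When should we use attributes, and when should we nest elements? As a general rule, we should use attributes when each value can occur at most once for any element, when the order of the values doesn't matter, and when the values have no internal structure (i.e., we will never need to parse an attribute's value in order to understand it).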
In all other cases, we should use nested elements. However, many widely-used XML formats break these rules in order to make it easier for people to write XML by hand. For example, in the Scalable Vector Graphics (SVG) format used to describe images as XML, we would define a rectangle as follows:
<rect width="300" height="100" style="fill:rgb(0,0,255); stroke-width:1; stroke:rgb(0,0,0)"/>
In order to understand the style attribute, a program has to somehow know to split it on semicolons, and then to split each piece on colons. This means that a generic program for reading XML can't extract all the information that's in SVG, which partly defeats the purpose of using XML in the first place.
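Here is a rough sketch of that extra work. A generic XML reader (Python's standard xml.etree.ElementTree) hands back the style attribute as one opaque string; the helper parse_style, which is our own invention, has to know the semicolon-and-colon convention to take it apart.

```python
import xml.etree.ElementTree as ET

SVG_RECT = ('<rect width="300" height="100" '
            'style="fill:rgb(0,0,255); stroke-width:1; stroke:rgb(0,0,0)"/>')

def parse_style(style):
    """Split 'name:value; name:value' text into a dictionary."""
    properties = {}
    for piece in style.split(";"):
        if piece.strip():
            # Split on the first colon only, in case a value contains one.
            name, value = piece.split(":", 1)
            properties[name.strip()] = value.strip()
    return properties

rect = ET.fromstring(SVG_RECT)
print(rect.get("width"))       # plain attributes need no extra parsing
style = parse_style(rect.get("style"))
print(style["stroke-width"])   # values buried in style do
```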
As anyone who has surfed the web has seen, web pages can contain a lot more than just headings and paragraphs. To start with, HTML provides two kinds of lists: ul to mark an unordered (bulleted) list, and ol for an ordered (numbered) one (Figure XXX). Items inside either kind of list must be wrapped in li elements:
<!DOCTYPE html>
<html>
  <body>
    <ul>
      <li>A. Binet
        <ol>
          <li>H. Ebbinghaus</li>
          <li>W. Wundt</li>
        </ol>
      </li>
      <li>C. S. Pierce
        <ol>
          <li>W. Wundt</li>
        </ol>
      </li>
    </ul>
  </body>
</html>
Note how elements are nested: since the ordered lists "belong" to the unordered list items above them, they are inside those items' <li>…</li> tags. And remember, the indentation used to make this list easier for people to read means nothing to the computer: we could put the whole thing on one line, or write it as:
<!DOCTYPE html>
<html>
<body>
      <ul>
<li>A. Binet
    <ol>
  <li>H. Ebbinghaus</li>
        <li>W. Wundt</li>
 </ol>
      </li>
  <li>C. S. Pierce
<ol>
      <li>W. Wundt</li>
  </ol>
</li>
    </ul>
</body>
  </html>
and the computer would interpret and display it the same way. A human being, on the other hand, would find the inconsistent indentation of the second layout much harder to follow.
HTML also provides tables, but they are awkward to use: tables are naturally two-dimensional, but text is one-dimensional. This is exactly like the problem of representing a two-dimensional array in memory, which we saw in the NumPy and development lessons. We solve it in the same way: by writing down the rows, and the columns within each row, in a fixed order.
The table element marks the table itself; within that, each row is wrapped in tr (for "table row"), and within those, column items are wrapped in th (for "table heading") or td (for "table data"):
<!DOCTYPE html>
<html>
  <body>
    <table>
      <tr>
        <th></th>
        <th>A. Binet</th>
        <th>C. S. Pierce</th>
      </tr>
      <tr>
        <th>H. Ebbinghaus</th>
        <td>88%</td>
        <td>NA</td>
      </tr>
      <tr>
        <th>W. Wundt</th>
        <td>29%</td>
        <td>45%</td>
      </tr>
    </table>
  </body>
</html>
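Because rows and cells appear in a fixed order, a program can rebuild the two-dimensional structure just by walking the elements. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; table_to_rows is a name we chose, and the sketch assumes every cell holds only plain text.

```python
import xml.etree.ElementTree as ET

TABLE = """<table>
  <tr> <th></th> <th>A. Binet</th> <th>C. S. Pierce</th> </tr>
  <tr> <th>H. Ebbinghaus</th> <td>88%</td> <td>NA</td> </tr>
  <tr> <th>W. Wundt</th> <td>29%</td> <td>45%</td> </tr>
</table>"""

def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for row in table.iter("tr"):
        # Each child of a tr is a th or td; an empty cell has no text.
        rows.append([cell.text or "" for cell in row])
    return rows

rows = table_to_rows(ET.fromstring(TABLE))
for row in rows:
    print(row)
```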
Tables are sometimes used to do multi-column layout, as well as for tabular data, but this is a bad idea. To understand why, consider two other HTML tags: i, meaning "italics", and em, meaning "emphasis". The former directly controls how text is displayed, but by doing so, it breaks the separation between model and view that is the heart of markup's usefulness. Without understanding the text that has been italicized, a program cannot understand whether it is meant to indicate someone shouting, the definition of a new term, or the title of a book. The em tag, on the other hand, has exactly one meaning, and that meaning is different from the meaning of dfn (a definition) or cite (a citation).
Conscientious authors use Cascading Style Sheets (or CSS) to describe how they want pages to appear, and only use table elements for actual tables. CSS is beyond the scope of this lesson, but is described briefly in the appendix.
HTML pages can also contain images. (In fact, the World Wide Web didn't really take off until the Mosaic browser allowed people to mix images with text.) The word "contain" is misleading, though: HTML documents can only contain text, so we cannot store an image "in" a page. Instead, we must put it in some other file, and insert a reference to that file in the HTML using the img tag. Its src attribute specifies where to find the image file; this can be a path to a file on the same host as the web page, or a URL for something stored elsewhere. For example, when a browser displays this:
<!DOCTYPE html>
<html>
  <body>
    <p>My daughter's first online chat:</p>
    <img src="madeleine.jpg"/>
    <p>but probably not her last.</p>
  </body>
</html>
it looks for the file madeleine.jpg in the same directory as the HTML file.
Notice, by the way, that the img element is written as <img…/>, i.e., with a trailing slash inside the <> rather than with a separate closing tag. This makes sense because the element doesn't contain any text: the content is referred to by its src attribute. Any element that doesn't contain anything can be written using this short form.
Images don't have to be in the same directory as the pages that refer to them. When the browser displays this:
<!DOCTYPE html>
<html>
  <body>
    <p>Yes, she knows she's cute:</p>
    <img src="img/cute-smile.jpg"/>
  </body>
</html>
it looks in the directory containing the page for a sub-directory called img, and loads the image file from there, while if it's given:
<!DOCTYPE html>
<html>
  <body>
    <img src="http://software-carpentry.org/img/software-carpentry-logo.png"/>
  </body>
</html>
it downloads the image from the URL http://software-carpentry.org/img/software-carpentry-logo.png and displays that.
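The rule a browser follows here can be tried out with the urljoin function from Python's standard urllib.parse module. In this sketch, PAGE_URL is a hypothetical address for the page containing the img tags; only the src values come from the examples above.

```python
from urllib.parse import urljoin

# Hypothetical location of the page that contains the img elements.
PAGE_URL = "http://software-carpentry.org/lessons/web.html"

# A bare filename is looked up beside the page.
same_dir = urljoin(PAGE_URL, "madeleine.jpg")

# A relative path is looked up in a sub-directory beside the page.
sub_dir = urljoin(PAGE_URL, "img/cute-smile.jpg")

# A complete URL is used as-is, wherever the page lives.
absolute = urljoin(PAGE_URL, "http://software-carpentry.org/img/software-carpentry-logo.png")

for url in (same_dir, sub_dir, absolute):
    print(url)
```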
Whenever we refer to an image, we should use the img tag's alt attribute to provide a title or description of the image. This is what screen readers for people with visual handicaps will say aloud to "display" the image; it's also what search engines rely on, since they can't "see" the image either. Adding this to our previous example gives:
<!DOCTYPE html>
<html>
<body>
<p>My daughter's first online chat:</p>
<img src="madeleine.jpg" alt="Madeleine's first online chat"/>
<p>but probably not her last.</p>
</body>
</html>
We can use URLs for images, but their most important use is to create the links within and between pages that make HTML "hypertext". This is done using the a element. Whatever is inside the element is displayed and highlighted for clicking; this is usually a few words of text, but it can be an entire paragraph or an image. The a element's href attribute specifies what the link is pointing at; as with images, this can be either a local filename or a URL. For example, we can create a listing of the examples we've written so far like this (Figure XXX):
<!DOCTYPE html>
<html>
  <body>
    <p>
      Simple HTML examples for
      <a href="http://software-carpentry.org">Software Carpentry</a>.
    </p>
    <ol>
      <li><a href="very-simple.html">a very simple page</a></li>
      <li><a href="hide-paragraph.html">hiding paragraphs</a></li>
      <li><a href="nested-lists.html">nested lists</a></li>
      <li><a href="simple-table.html">a simple table</a></li>
      <li><a href="simple-image.html">a simple image</a></li>
    </ol>
  </body>
</html>
The hyperlink element is called a because it can also be used to create anchors in documents by giving them a name attribute instead of an href. An anchor is simply a location in a document that can be linked to. For example, suppose we formatted the Feynman quotation given earlier like this:
<blockquote>
As a by-product of this same view, I received a telephone call one day
at the graduate college at <a name="pu">Princeton</a>
from Professor Wheeler, in which he said,
"Feynman, I know why all electrons have the same charge and the same mass."
"Why?"
"Because, they are all the same electron!"
</blockquote>
If this quotation was in a file called quote.html, we could then create a hyperlink directly to the mention of Princeton using <a href="quote.html#pu">. The # in the href's value separates the path to the document from the anchor we're linking to. Inside quote.html itself, we could link to that same location simply using <a href="#pu">.
Using the a element for both links and targets was poor design (programs are simpler to write if each element has one purpose, and one alone), but we're stuck with it now. A better way to create anchors is to add an id attribute to some other element. For example, if we wanted to be able to link to the quotation itself, we could write:
<blockquote id="wheeler-electron-quote">
As a by-product of this same view, I received a telephone call one day
at the graduate college at <a name="pu">Princeton</a>
from Professor Wheeler, in which he said,
"Feynman, I know why all electrons have the same charge and the same mass."
"Why?"
"Because, they are all the same electron!"
</blockquote>
and then refer to quote.html#wheeler-electron-quote.
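Programs split such references at the # in the same way. This short sketch uses urldefrag and urljoin from Python's standard urllib.parse module; the base address example.org/docs is a placeholder of ours.

```python
from urllib.parse import urldefrag, urljoin

# Separate the path to the document from the anchor inside it.
document, anchor = urldefrag("quote.html#wheeler-electron-quote")
print(document, anchor)

# A reference that is only "#pu" has an empty document part:
# it points into whatever page it appears in.
local_document, local_anchor = urldefrag("#pu")
print(repr(local_document), local_anchor)

# Resolving "#pu" against a (hypothetical) page address keeps the page
# and changes only the anchor.
resolved = urljoin("http://example.org/docs/quote.html", "#pu")
print(resolved)
```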
Finally, well-written HTML pages have a head element as well as a body. The head isn't displayed; instead, it's used to store metadata about the page as a whole. The most common element inside head is title, which, as its name suggests, gives the page's title. (This is usually displayed in the browser's title bar.) Another common item in the head is meta, whose two attributes name and content let authors add arbitrary information to their pages. If we add these to the web page we wrote earlier, we might have:
<!DOCTYPE html>
<html>
  <head>
    <title>Dimorphism Defined</title>
    <meta name="author" content="Alan Turing"/>
    <meta name="institution" content="Euphoric State University"/>
  </head>
  <body>
    <h1>Dimorphism</h1>
    <p>Occurring or existing in two different <em>forms</em>.</p>
  </body>
</html>
Well-written pages also use comments (just like code), which start with <!-- and end with -->.
Commenting out part of a page does not hide the content from people who really want to see it: while a browser won't display what's inside a comment, it's still in the page, and anyone who uses "View Source" can read it. For example, if you are looking at this page in a web browser right now, try viewing the source and searching for the word "Surprise".
If you really don't want people to be able to read something, the only safe thing to do is to keep it off the web.
Store metadata in meta elements in a page's head element.
Use ul for unordered lists and ol for ordered lists.
Add comments to pages using <!-- and -->.
Use table for tables, with tr for rows and td for values.
Use img for images.
Use a to create hyperlinks.
Give an element an id attribute to link to it.
Turning a Python list into an HTML ol or ul list seems like a natural thing to do, so you might expect that programmers would have created libraries to do it. In fact, they have gone one step further and created systems that allow people to put bits of code directly into HTML files. Such a file is usually called a template, since it is the general pattern for any number of potential pages.
Here's a simple example. Suppose we want to create a set of web pages to display point-form biographies of famous scientists. We want each page to look like this:
<html>
  <head>
    <title>Biography of Beatrice Tinsley</title>
  </head>
  <body>
    <h1>Beatrice Tinsley</h1>
    <ol>
      <li>Born 1941</li>
      <li>Died 1981</li>
      <li>Studied stellar aging</li>
    </ol>
  </body>
</html>
but since we expect to have hundreds of such pages, we don't want to write each one by hand. (We certainly don't want to have to revise each one by hand when the university decides it wants them in a slightly different format...) To make things easier on ourselves, let's create a single template page called biography.html that contains:
<html>
  <head>
    <title>Biography of {{name}}</title>
  </head>
  <body>
    <h1>{{name}}</h1>
    <ol>
      {% for f in facts %}
      <li>{{f}}</li>
      {% endfor %}
    </ol>
  </body>
</html>
This has the same general structure as an actual biography page, but there are a few changes: it uses {{name}} instead of the scientist's name, and rather than listing each biographical detail, it has something that looks a lot like a for loop that iterates over something called facts.
What we need next is a program that can expand this template using particular values for name and facts. We will use a Python template library called Jinja2 to do this; there are many others but they all work in more or less the same way (which means, "They each have their own slightly different rules for what can go in a page and how it's expanded.").
First, let's put all the values we want to customize the page with into variables:
who = 'Beatrice Tinsley'
what = ['Born 1941', 'Died 1981', 'Studied stellar aging']
Next, we have to import the Jinja2 library and do a bit of magic to load the template for our page:
import jinja2

loader = jinja2.FileSystemLoader(['.'])
environment = jinja2.Environment(loader=loader)
template = environment.get_template('biography.html')
We start by importing the jinja2 library, and then create an object called a "loader". Its job is to find template files and load them into memory; its argument is a list of the directories we want it to search (in order). For now, we are only looking in the current directory, so the list is just ['.'] (i.e., the current directory). Once we have that loader, we use it to create a Jinja2 "environment". Honestly, we don't need two separate objects for what we're doing, but more complicated applications might need several loaders, or might be expanding different sets of templates in different ways, and the Environment object is where all that is handled.
What we really want is the last line, which asks the environment to load the template file 'biography.html' and give us an object that knows how to expand itself.
We're now ready to do the actual expansion:
result = template.render(name=who, facts=what)
print(result)
When we call template.render, we pass it any number of name-value pairs. (Remember, the odd-looking expression name=who in the function call means, "Assign the value of the variable who in the calling code to the parameter called name inside the function.") Those names are turned into variables, and can be used inside the template, so that {{name}} is given the string 'Beatrice Tinsley' and facts is given our list of facts about her.
The method call template.render "runs" the template as if it were a program, and returns the string that's created. When we print it out, we get:
<html>
  <head>
    <title>Biography of Beatrice Tinsley</title>
  </head>
  <body>
    <h1>Beatrice Tinsley</h1>
    <ol>
      <li>Born 1941</li>
      <li>Died 1981</li>
      <li>Studied stellar aging</li>
    </ol>
  </body>
</html>
Why go to all of this trouble? Because if we want to create another page with exactly the same format, all we have to do is call:
result = template.render(name='Helen Sawyer Hogg',
                         facts=['Born 1905',
                                'Died 1993',
                                'Studied globular clusters',
                                'Wrote a popular astronomy column for 30 years'])
and we will get:
<html>
  <head>
    <title>Biography of Helen Sawyer Hogg</title>
  </head>
  <body>
    <h1>Helen Sawyer Hogg</h1>
    <ol>
      <li>Born 1905</li>
      <li>Died 1993</li>
      <li>Studied globular clusters</li>
      <li>Wrote a popular astronomy column for 30 years</li>
    </ol>
  </body>
</html>
Putting code in HTML templates and then expanding that to create actual pages has advantages and disadvantages. The main advantage is that simple things are simple to do: the biography template shown above is a lot easier to understand than either a bunch of print statements, or a set of functions that construct a document in memory and then turn the result into a string.
The other big advantage of templating is that all of the generated pages are guaranteed to have the same format. If subsections are marked with an h2 heading in one, they'll be marked with an h2 in all the others. This makes it easier for programs to read and process those pages.
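For comparison, here is roughly what the "construct a document in memory" alternative might look like for the biography page. This is only a sketch using Python's standard xml.etree.ElementTree module; build_biography is a name we made up, and the output comes out on a single line rather than neatly indented.

```python
import xml.etree.ElementTree as ET

def build_biography(name, facts):
    """Build the biography page element by element, then turn it into text."""
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = "Biography of " + name
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = name
    listing = ET.SubElement(body, "ol")
    for fact in facts:
        ET.SubElement(listing, "li").text = fact
    # encoding="unicode" asks for a str rather than bytes.
    return ET.tostring(html, encoding="unicode")

page = build_biography("Beatrice Tinsley",
                       ["Born 1941", "Died 1981", "Studied stellar aging"])
print(page)
```

The page's shape is now spread across a dozen function calls, which is exactly why the template version is easier to read.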
The biggest drawback of templating is the lack of support for debugging. It's very common for template expansion to do what you said, rather than what you meant, and working backward from a page that has the wrong content to the bits of template that weren't quite right can be complicated. One way to keep it manageable is to keep the templates as simple as possible. Any calculations more complicated than simple addition should be done in the program, and the result passed in as a variable. Similarly, while deeply-nested conditional statements in programs are hard to understand, their equivalents in templates are even harder, and so should be avoided.
Jinja2 templates support many of the basic control structures found in Python, such as conditionals and loops. For example, we can modify our template file to say:
<html>
  <head>
    <title>Biography of {{name}}</title>
  </head>
  <body>
    <h1>{{name}}</h1>
    {% if facts %}
    <ol>
      {% for f in facts %}
      <li>{{f}}</li>
      {% endfor %}
    </ol>
    {% else %}
    <p>No facts available.</p>
    {% endif %}
  </body>
</html>
so that if the list facts is empty, the page displays a paragraph saying that, rather than an empty ordered list. We can also tell Jinja2 to include one template in another, so that if we want every page to have the same logo and license statement, we can use:
{% include "logo.html" %}
at the top, and:
{% include "license.html" %}
at the bottom.
Now that we know how to read and write the web's most common data format, it's time to look at how data is moved around on the web. Broadly speaking, web applications are built in one of two ways. In a client/server architecture many clients communicate with a central server (Figure XXX). This model is asymmetric: clients ask for things, and servers provide them. Web browsers and web servers like Firefox and Apache are the best-known example of this model, but many database management systems also use a client/server architecture.
In contrast, a peer-to-peer architecture is one in which all processes exchange information equally (Figure XXX). This is symmetric: every participant both provides and receives data. The most widely used example today is probably BitTorrent, but again, there are many others. Peer-to-peer systems are generally harder to design than client-server systems, but they are also more resilient: if a centralized web server fails, the whole system goes down, while if one node in a filesharing network goes down, the rest can (usually) carry on.
Under the hood, both kinds of systems (and pretty much every other program that uses the network) run on a family of communication standards called Internet Protocol (IP). IP breaks messages down into small packets, each of which is forwarded from one machine to another along any available route to its destination, where the whole message is reassembled (Figure XXX).
The only part of IP that concerns us is the Transmission Control Protocol (TCP/IP). It guarantees that every packet we send is received, and that packets are received in the right order. Putting it another way, it turns an unreliable stream of disordered packets into a reliable, ordered stream of data, so that communication between computers looks as much as possible like reading and writing files (Figure XXX).
Programs using IP communicate through sockets. Each socket is one end of a point-to-point communication channel, just like a phone is one end of a phone call. A socket is identified by two numbers. The first is its host address or IP address, which identifies a particular machine on the network. In IPv4, this address consists of four 8-bit numbers, such as 208.113.154.118. The Domain Name System (DNS) matches these numbers to symbolic names like software-carpentry.org that are easier for human beings to remember. We can use tools like nslookup to query DNS directly:
$ nslookup software-carpentry.org
Server:  admin1.private.tor1.mozilla.com
Address:  10.242.75.5

Non-authoritative answer:
Name:    software-carpentry.org
Address:  173.236.199.157
A socket's port number is just a number in the range 0-65535 that uniquely identifies the socket on the host machine. (If the IP address is like a university's phone number, then the port number is the extension.) Ports 0-1023 are reserved for the operating system's use; anyone else can use the remaining ports (Figure XXX).
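To see a host address and port number at work, the sketch below opens both ends of a connection on one machine using Python's standard socket module. It assumes the loopback address 127.0.0.1 is available, and it asks the operating system for any free port by requesting port 0, so the actual number will differ from run to run.

```python
import socket

# The server's socket: bind to the loopback address and let the
# operating system choose a free port (that is what port 0 means).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()
print("listening on", host, port)

# The client's socket: connect to that (address, port) pair.
client = socket.create_connection((host, port))
connection, client_address = server.accept()

# Data written at one end can be read at the other, much like a file.
client.sendall(b"hello")
received = connection.recv(1024)
print(received)

for s in (client, connection, server):
    s.close()
```

Note that the client end gets a port number too (it is the second item in client_address): a connection is a pair of sockets, each with its own address and port.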
The Hypertext Transfer Protocol (HTTP) sits on top of TCP/IP. It describes one way that programs can exchange web pages and other data, such as image files. The communicating parties were originally web browsers and web servers, but HTTP is now used by many other kinds of applications as well.
In principle, HTTP is simple: the client sends a request specifying what it wants over a socket connection, and the server sends some data in response. The data may be HTML copied from a file on disk, a similar page generated dynamically by a program, an image, or just about anything else (Figure XXX).
A lot of people use the terms "Internet" and "World Wide Web" synonymously, but they're actually very different things. The Internet is what lets (almost) any computer communicate with (almost) any other. That communication can be email, File Transfer Protocol (FTP), streaming video, or any of a hundred other things. The World Wide Web, on the other hand, is just one particular way to share data on top of the network that the Internet provides.
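An HTTP request has three parts:
a first line giving the method, the URL, and the HTTP version;
zero or more headers;
and an optional body
(Figure XXX).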
The HTTP method is almost always either
"GET"
(to fetch information)
or
"POST"
(to submit form data or upload files).
The URL specifies what the client wants;
it may be a path to a file on disk,
such as /research/experiments.html
,
but it's entirely up to the server to decide what to send back.
The HTTP version is usually "HTTP/1.0"
or "HTTP/1.1"
;
the differences between the two don't matter to us.
An HTTP header is a key/value pair, such as the three shown below:
Accept: text/html
Accept-Language: en, fr
If-Modified-Since: 16-May-2005
A key may appear any number of times, so that (for example) a request can specify that it's willing to accept several types of content.
The body is any extra data associated with the request. This is used when submitting data via web forms, when uploading files, and so on. There must be a blank line between the last header and the start of the body to signal the end of the headers; forgetting it is a common mistake.
One header,
called Content-Length
,
tells the server how many bytes to expect to read in the body of the request.
There's no magic in any of this:
an HTTP request is just text,
and any program that wants to can create one or parse one.
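As a rough illustration of "just text", the sketch below assembles a request by hand. The function name `build_request` is hypothetical, headers are passed as a list of pairs because a key may appear more than once, and we assume the body is plain ASCII so that its length in characters equals its length in bytes.

```python
def build_request(method, path, headers, body=''):
    '''Assemble the text of an HTTP/1.0 request.'''
    lines = ['%s %s HTTP/1.0' % (method, path)]
    for (key, value) in headers:
        lines.append('%s: %s' % (key, value))
    if body:
        # Tell the server how many bytes of body to expect.
        lines.append('Content-Length: %d' % len(body))
    # The blank line (two consecutive line endings) marks the end of the headers.
    return '\r\n'.join(lines) + '\r\n\r\n' + body

request = build_request('GET', '/research/experiments.html',
                        [('Accept', 'text/html'), ('Accept-Language', 'en, fr')])
print(request)
```

A real client would send this text over a socket; in practice we let a library do that for us.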
HTTP responses are formatted like HTTP requests (Figure XXX). The version, headers, and body have the same form and mean the same thing. The status code is a number indicating what happened when the request was processed by the server. 200 means "everything worked", 404 means "not found", and other codes have other meanings (Figure XXX). The status phrase repeats that information in a human-readable phrase like "OK" or "not found".
Code | Name | Meaning |
---|---|---|
100 | Continue | Client should continue sending data |
200 | OK | The request has succeeded |
204 | No Content | The server has completed the request, but doesn't need to return any data |
301 | Moved Permanently | The requested resource has moved to a new permanent location |
307 | Temporary Redirect | The requested resource is temporarily at a different location |
400 | Bad Request | The request is badly formatted |
401 | Unauthorized | The request requires authentication |
404 | Not Found | The requested resource could not be found |
408 | Request Timeout | The server gave up waiting for the client |
418 | I'm a teapot | No, really |
500 | Internal Server Error | An error occurred in the server that prevented it fulfilling the request |
601 | Connection Timed Out | The server did not respond before the connection timed out |
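Since a response starts with a line of plain text, pulling out the version, status code, and status phrase takes only a few lines. This is a sketch with hypothetical function names (`parse_status_line` and `succeeded`); it assumes a well-formed status line and does no error handling.

```python
def parse_status_line(line):
    '''Split a status line such as "HTTP/1.1 404 Not Found" into its parts.'''
    # Split at most twice, since the status phrase may itself contain spaces.
    version, code, phrase = line.strip().split(' ', 2)
    return version, int(code), phrase

def succeeded(code):
    '''Codes from 200 to 299 report success.'''
    return 200 <= code < 300

version, code, phrase = parse_status_line('HTTP/1.1 404 Not Found\r\n')
print(code)
print(succeeded(code))
```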
The one other thing that we need to know about HTTP is that it is stateless: each request is handled on its own, and the server doesn't remember anything between one request and the next. If an application wants to keep track of something like a user's identity, it must do so itself. The usual way to do this is with a cookie, which is just a short character string that the server sends to the client, and the client later returns to the server (Figure XXX). When a user signs in, the server creates a new cookie, stores it in a database, and sends it to their browser. Each time the browser sends the cookie back, the server uses it to look up information about what the user is doing (e.g., what wiki page they are editing).
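The bookkeeping behind cookies can be sketched in a few lines. Everything here is an assumption made for illustration: the dictionary `sessions` stands in for the server's database, and `sign_in` and `lookup` are hypothetical names rather than part of any web framework.

```python
import uuid

sessions = {}  # stands in for the server's database: cookie -> information about the user

def sign_in(username):
    '''Create a new cookie for a user and remember who it belongs to.'''
    cookie = uuid.uuid4().hex  # a random string that is hard to guess
    sessions[cookie] = {'user': username, 'editing': None}
    return cookie

def lookup(cookie):
    '''Find what we know about the user who sent back this cookie (None if unknown).'''
    return sessions.get(cookie)

cookie = sign_in('carla')
print(lookup(cookie)['user'])
```

The server never trusts the cookie to contain information; it is only a key for looking information up.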
FIXME
Opening sockets, constructing HTTP requests, and parsing responses is tedious,
so most people use libraries to do most of the work.
Python comes with such a library called urllib2
(because it's a replacement for an earlier library called urllib
),
but it exposes a lot of plumbing that most people never want to care about.
Instead,
we recommend using the Requests library.
Here's an example that uses it to download a page from our web site:
import requests
response = requests.get("http://guide.software-carpentry.org/web/testpage.html")
print 'status code:', response.status_code
print 'content length:', response.headers['content-length']
print response.text
status code: 200
content length: 126
<!DOCTYPE html>
<html>
<head>
<title>Software Carpentry Test Page</title>
</head>
<body>
<p>Use this page to test requests.</p>
</body>
</html>
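requests.get
does an HTTP GET on a URL
and returns an object containing the response.
That object's status_code
member is the response's status code;
its headers
member is a dictionary of the response's headers,
so headers['content-length']
is the number of bytes in the response data,
and text
is the actual data
(in this case, an HTML page).
Note that only the page itself is fetched:
any images or other files that it refers to are not downloaded.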
Sometimes a URL isn't enough on its own:
for example,
we have to specify what our search terms are
if we are using a search engine.
We could add these to the path in the URL,
but that would be misleading
(since most people think of paths as identifying files and directories),
and we'd have to decide whether /software/carpentry
and /carpentry/software
were the same search or not.
What we should do instead is
add parameters to the URL
by adding a '?' to the URL
followed by 'key=value' pairs separated by '&'.
For example,
the URL http://www.google.ca?q=Python
asks Google to search for pages related to Python
(the key is the letter 'q',
and the value is 'Python'),
while
the longer query
http://www.google.ca/search?q=Python&client=Firefox
tells Google that we're using Firefox.
We can pass whatever parameters we want,
but it's up to the application running on the web site to decide
which ones to pay attention to,
and how to interpret them.
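To see what the application on the other end has to work with, here is a small sketch that takes the query parameters back out of a URL using the standard library. The two-way import is only there so that the example runs under Python 3 as well as the Python 2 used in the rest of this chapter.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2

url = 'http://www.google.ca/search?q=Python&client=Firefox'
query = parse_qs(urlparse(url).query)

# Each key maps to a list, because a key may appear more than once in a URL.
search_terms = query.get('q', [])
print(search_terms)
```

An application that only cares about search terms would look at `q` and simply ignore `client`.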
Yes, this means that we could write a program that tells websites it is Firefox, Internet Explorer, or pretty much anything else. We'll return to this and other security issues later.
Of course,
if '?' and '&' are special characters,
there must be a way to escape them.
The URL encoding standard
represents special characters using "%"
followed by a 2-digit code,
and replaces spaces with the '+' character
(Figure XXX).
Thus,
to search Google for "grade = A+" (with the spaces),
we would use the URL http://www.google.ca/search?q=grade+%3D+A%2B
.
Character | Encoding |
---|---|
"#" | %23 |
"$" | %24 |
"%" | %25 |
"&" | %26 |
"+" | %2B |
"," | %2C |
"/" | %2F |
":" | %3A |
";" | %3B |
"=" | %3D |
"?" | %3F |
"@" | %40 |
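We can check the encoding of a single value with the standard library's `quote_plus` function. The two-way import below is only there so that the sketch runs under both Python 2 and Python 3.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus        # Python 2

# Spaces become '+', while '=' and '+' become their '%' codes.
encoded = quote_plus('grade = A+')
print(encoded)
```

This prints `grade+%3D+A%2B`, which is the query value in the Google URL shown above.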
Encoding things by hand is very error-prone,
so the Requests library lets us use
a dictionary of key-value pairs instead
via the keyword argument params
:
import requests
parameters = {'q' : 'Python', 'client' : 'Firefox'}
response = requests.get('http://www.google.com/search', params=parameters)
print 'actual URL:', response.url
actual URL: http://www.google.com/search?q=Python&client=Firefox
You should always let the library build the URL for you, rather than doing it yourself: there are subtleties we haven't covered, and even if there weren't, there's no point duplicating code that's already been written and tested.
Suppose we want to write a script that actually does search Google.
Constructing a URL is easy.
Sending it and reading the response is easy too,
but parsing the response is hard,
since there's a lot of stuff in the page that Google sends back.
Many first-generation web applications relied on
screen scraping
to get data,
i.e.,
they would search for substrings in the HTML
using something like Beautiful Soup.
They had to do this because a lot of hand-written HTML was improperly formatted:
for example,
it was quite common to use <br>
on its own to break a line.
Screen scraping is always hard to get right if the page layout is complex. It is also fragile: whenever the layout of the pages changes, the application will most likely break because data is no longer where it was.
Most modern web applications try to sidestep this problem by providing some sort of web services interface, which is a lot simpler than it sounds. When a client sends a request, it indicates that it wants machine-oriented data rather than human-readable HTML by using a slightly different URL (Figure XXX). When asked for data, the server sends back JSON, XML, or something else that is easy for a program to handle. If the client asks for HTML, on the other hand, the application turns that data into HTML pages with italics and colored highlights and the like to make it easy for human beings to read.
Using "live" data from a web service is a powerful way to get a lot of science done in a hurry, but only when it works. As a case in point, we wanted to use bird-watching data from ebird.org in this example, but their server was locked down for security reasons when it came time for us to write our examples. (This is another way in which software is like other experimental apparatus: odds are that when you need it most, it will be broken or someone will have borrowed it.) We therefore chose to use climate data from the World Bank instead. According to the documentation, data for a particular country can be found at:
http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/VARIABLE/year/ISO.FORMAT
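where VARIABLE names the quantity we want (tas, used below, is temperature), ISO is the three-letter code for a country (such as FRA for France or CAN for Canada), and FORMAT is the format we want the data in (such as JSON).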
Let's try getting temperature data for France:
>>> import requests
>>> url = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/FRA.JSON'
>>> response = requests.get(url)
>>> print response.text
[{"year":1901, "data":9.748865},
 {"year":1902, "data":9.864603},
 {"year":1903, "data":10.130159},
 ...
 {"year":2009,"data":11.709985}]
This is straightforward to interpret:
the outer list contains one dictionary for each year,
each of which contains "year"
and "data"
entries.
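As a quick check on that structure, the sketch below parses the first three entries of the sample response (copied from the output above) and re-indexes them by year. The variable name `by_year` is our own.

```python
import json

# The first three entries of the response shown above.
text = '[{"year":1901, "data":9.748865}, {"year":1902, "data":9.864603}, {"year":1903, "data":10.130159}]'

entries = json.loads(text)  # a list of dictionaries
by_year = dict((e['year'], e['data']) for e in entries)  # year -> temperature

print(sorted(by_year))
print(by_year[1902])
```

A dictionary keyed by year is often more convenient than the raw list when we want to look up one particular year.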
Let's use this to write a program
that compares the data for two countries
(which is the problem Carla wanted to solve at the start of this chapter).
We need to know which countries to compare:
import sys

def main(args):
    first_country = 'AUS'
    second_country = 'CAN'
    if len(args) > 0:
        first_country = args[0]
    if len(args) > 1:
        second_country = args[1]

    result = ratios(first_country, second_country)
    display(result)

def ratios(first, second):
    '''Calculate ratio of average temperatures for two countries over time.'''
    return {} # FIXME: fill in

def display(values):
    '''Show dictionary entries in sorted order.'''
    keys = values.keys()
    keys.sort()
    for k in keys:
        print k, values[k]

if __name__ == '__main__':
    main(sys.argv[1:])
The pattern here should be familiar:
we solve the top-level problem as if we already have the functions we need,
then come back and fill them in.
In this case,
the function to be filled in is ratios
,
which fetches data and calculates our result:
def ratios(first, second):
    '''Calculate ratio of average temperatures for two countries over time.'''
    first = get_temps(first)
    second = get_temps(second)
    assert len(first) == len(second), 'Length mis-match in results'
    result = {}
    for (i, first_entry) in enumerate(first):
        year = first_entry['year']
        second_entry = second[i]
        assert second_entry['year'] == year, 'Year mis-match'
        result[year] = first_entry['data'] / second_entry['data']
    return result
It depends in turn on get_temps
:
import json
import requests

URL = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/%s.JSON'

...all the code written so far...

def get_temps(country_code):
    '''Get annual temperatures for a country.'''
    response = requests.get(URL % country_code)
    assert response.status_code == 200, \
        'Failed to get data for %s' % country_code
    return json.loads(response.text)
But wait a second:
judging from the sample response shown earlier,
temperatures are being reported in Celsius.
We should probably convert them to Kelvin
to make the ratios more meaningful
(and to avoid the risk of dividing by zero).
Let's modify get_temps
:
def get_temps(country_code):
    '''Get annual temperatures for a country.'''
    response = requests.get(URL % country_code)
    assert response.status_code == 200, \
        'Failed to get data for %s' % country_code
    result = json.loads(response.text)
    for entry in result:
        # Convert each year's entry in place.
        entry['data'] = kelvin(entry['data'])
    return result
and add the required conversion function:
def kelvin(celsius):
    '''Convert degrees C to degrees K.'''
    return celsius + 273.15
Let's try running this program with no arguments to compare Australia to Canada:
$ python temperatures.py
1901 1.10934799048
1902 1.11023963325
1903 1.10876094164
... ...
2007 1.10725265753
2008 1.10793365185
2009 1.10865537105
and then with arguments to compare Malaysia to Norway:
$ python temperatures.py MYS NOR
1901 1.08900632708
1902 1.09536126502
1903 1.08935268463
... ...
2007 1.08564675748
2008 1.08481881663
2009 1.08720464013
Only six lines in this program do anything webbish
(i.e., format the actual URL and get the data).
The remaining 47 lines are the user interface
(handling command-line arguments and printing output),
data manipulation
(converting temperatures and calculating ratios),
import
statements,
and docstrings.
It really is that simple.
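Because so little of the program touches the web, most of it can be tried out without a network connection. The sketch below is a variant of `ratios` that takes data directly instead of calling `get_temps`; the name `ratios_from_data` is hypothetical, and the inputs are the Kelvin temperatures for Australia and Canada in 1901 and 1902 that appear in the generated page later in this chapter.

```python
def ratios_from_data(first, second):
    '''Calculate year-by-year ratios from two lists of {'year', 'data'} entries.'''
    assert len(first) == len(second), 'Length mis-match in results'
    result = {}
    for (first_entry, second_entry) in zip(first, second):
        assert first_entry['year'] == second_entry['year'], 'Year mis-match'
        result[first_entry['year']] = first_entry['data'] / second_entry['data']
    return result

# Temperatures in Kelvin, in the same form that get_temps returns.
aus = [{'year': 1901, 'data': 294.507021}, {'year': 1902, 'data': 294.532462}]
can = [{'year': 1901, 'data': 265.477581}, {'year': 1902, 'data': 265.2872886}]

result = ratios_from_data(aus, can)
for year in sorted(result):
    print(year)
    print(result[year])
```

Separating the calculation from the download this way also makes it easier to test when the remote server is unavailable.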
FIXME
The next logical step is to provide data to others by writing some kind of server application. The basic idea is simple (Figure XXX):
As simple as this is,
we're not going to show you how to do it,
because experience has shown that
all we can actually do in a short lecture
is show you how to create security problems.
Here's just one example.
Suppose you want to write a web application that accepts URLs of the form
http://my.site/data?species=homo.sapiens
and fetches a database record
containing information about that species.
One way to do it in Python might look like this:
def get_species(url):
    '''Get data for a species given a URL with the species name as a query parameter.'''
    params = url.split('?')[1]                            # Get everything after the '?'.
    pairs = params.split('&')                             # Get the name1=value1&name2=value2 pairs.
    pairs = [p.split('=') for p in pairs]                 # Split the name=value pairs.
    pairs = dict(pairs)                                   # Convert to a {name : value} dictionary.
    species = pairs['species']                            # Get the species we want to look up.
    sql = '''SELECT * FROM Species WHERE Name = "%s";'''  # Template for SQL query.
    sql = sql % species                                   # Insert the species name.
    cursor.execute(sql)                                   # Send query to database.
    results = cursor.fetchall()                           # Get all the results.
    return results[0]
We've taken out all the error-checking—for example, this code will fail if there aren't actually any query parameters, or if the species' name isn't in the database—but that's not the problem. The problem is what happens if someone sends us this URL:
http://my.site/data?species=homo.sapiens";DROP TABLE Species;--
Why? Because the dictionary of query parameters produced by the first five lines of the function will be:
{'species' : 'homo.sapiens";DROP TABLE Species;--'}
which means that the SQL query will be:
SELECT * FROM Species WHERE Name = "homo.sapiens";DROP TABLE Species;--";
which is the same as:
SELECT * FROM Species WHERE Name = "homo.sapiens"; DROP TABLE Species;
In other words,
this query selects something from the database,
then deletes the entire Species
table.
This is called an SQL injection attack, because the user is injecting SQL into our database query. It's just one of hundreds of different ways that evil-doers can try to compromise a web application. Built properly, web sites can withstand such attacks, but learning what "properly" is and how to implement it takes more time than we have.
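To give a flavor of what one such defense looks like, the sketch below uses a parameterized query, in which the database library, rather than string formatting, puts the value into the SQL. It is an illustration under our own assumptions, not a complete answer: the in-memory SQLite database, the two-column `Species` table, its single record, and the function name `get_species_safely` are all made up for this example.

```python
import sqlite3

# Build a throwaway in-memory database with a made-up Species table.
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('CREATE TABLE Species (Name TEXT, Notes TEXT);')
cursor.execute('INSERT INTO Species VALUES (?, ?);', ('homo.sapiens', 'example record'))

def get_species_safely(cursor, species):
    '''Look up a species, letting the library insert the value into the query.'''
    # The '?' is a placeholder: whatever is in 'species' is treated as data, never as SQL.
    cursor.execute('SELECT * FROM Species WHERE Name = ?;', (species,))
    return cursor.fetchall()

attack = 'homo.sapiens";DROP TABLE Species;--'
print(get_species_safely(cursor, 'homo.sapiens'))
print(get_species_safely(cursor, attack))
```

With the placeholder, the malicious string is simply a species name that does not match anything, so the second lookup finds no records and the table is left intact.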
Instead,
we will look at how to write programs that create static HTML pages
that can then be given to clients by a standard web server.
Using the ratios of average annual temperatures as our example,
we'll create pages whose names look like
http://my.site/tempratio/AUS-CAN.html
,
and which contain data formatted like this:
<html>
  <head>
    <meta name="revised" content="2013-09-15" />
  </head>
  <body>
    <h1>Ratio of Average Annual Temperatures for AUS and CAN</h1>
    <table class="data">
      <tr> <td class="year">1901</td> <td class="data">1.10934799048</td> </tr>
      <tr> <td class="year">1902</td> <td class="data">1.11023963325</td> </tr>
      <tr> <td class="year">1903</td> <td class="data">1.10876094164</td> </tr>
      ...
      <tr> <td class="year">2007</td> <td class="data">1.10725265753</td> </tr>
      <tr> <td class="year">2008</td> <td class="data">1.10793365185</td> </tr>
      <tr> <td class="year">2009</td> <td class="data">1.10865537105</td> </tr>
    </table>
  </body>
</html>
The first step is to calculate ratios, which we did in the previous section. The main function of our program is:
def main(args):
    '''Create web page showing temperature ratios for two countries.'''

    assert len(args) == 4, \
        'Usage: make_data_page template_filename output_filename country_1 country_2'
    template_filename = args[0]
    output_filename = args[1]
    country_1 = args[2]
    country_2 = args[3]

    page = make_page(template_filename, country_1, country_2)

    writer = open(output_filename, 'w')
    writer.write(page)
    writer.close()

if __name__ == '__main__':
    main(sys.argv[1:])
Most of the work is done by make_page
,
which gets temperature data for two countries,
calculates ratios,
and fills in a Jinja2 template.
Using the get_temps
function we wrote earlier,
it is:
def make_page(template_filename, country_1, country_2):
    '''Create page showing temperature ratios.'''

    # get_temps returns a list of {'year', 'data'} entries, so index them by year.
    data_1 = dict((e['year'], e['data']) for e in get_temps(country_1))
    data_2 = dict((e['year'], e['data']) for e in get_temps(country_2))
    years = data_1.keys()
    years.sort()

    the_date = date.isoformat(date.today()) # Format today's date

    loader = jinja2.FileSystemLoader(['.'])
    environment = jinja2.Environment(loader=loader)
    template = environment.get_template(template_filename)

    result = template.render(country_1=country_1, data_1=data_1,
                             country_2=country_2, data_2=data_2,
                             years=years, the_date=the_date)

    return result
The only new thing here is the use of
date.isoformat
and date.today
to format today's date as something like "2013-09-15".
To finish, we need a Jinja2 template for the pages we want to create:
<!DOCTYPE html>
<html>
  <head>
    <title>Temperature Ratios of {{country_1}} and {{country_2}} as of {{the_date}}</title>
  </head>
  <body>
    <h1>Temperature Ratios of {{country_1}} and {{country_2}}</h1>
    <h2>Calculated {{the_date}}</h2>
    <table>
      <tr> <td>Year</td> <td>{{country_1}}</td> <td>{{country_2}}</td> <td>Ratio</td> </tr>
      {% for year in years %}
      <tr> <td>{{year}}</td> <td>{{data_1[year]}}</td> <td>{{data_2[year]}}</td> <td>{{data_1[year] / data_2[year]}}</td> </tr>
      {% endfor %}
    </table>
  </body>
</html>
Let's run it for Australia and Canada:
$ python make_data_page.py temp_ratio.html /tmp/aus-can.html AUS CAN
Sure enough,
the file /tmp/aus-can.html
contains:
<!DOCTYPE html>
<html>
  <head>
    <title>Temperature Ratios of AUS and CAN as of 2013-02-10</title>
  </head>
  <body>
    <h1>Temperature Ratios of AUS and CAN</h1>
    <h2>Calculated 2013-02-10</h2>
    <table>
      <tr> <td>Year</td> <td>AUS</td> <td>CAN</td> <td>Ratio</td> </tr>
      <tr> <td>1901</td> <td>294.507021</td> <td>265.477581</td> <td>1.10934799048</td> </tr>
      <tr> <td>1902</td> <td>294.532462</td> <td>265.2872886</td> <td>1.11023963325</td> </tr>
      ...
      <tr> <td>2009</td> <td>295.07194</td> <td>266.1529883</td> <td>1.10865537105</td> </tr>
    </table>
  </body>
</html>
This looks right, but most experienced programmers would ask us to make one improvement. Our program doesn't actually calculate temperature ratios; that's done by this line in the template:
<td>{{data_1[year] / data_2[year]}}</td>
Experience shows that the more calculations we do in our views (i.e., our information displays), the harder they are to maintain. What we should do is calculate the ratios in the Python program and pass them into the template, so that the template only has to display the values it is given.
Splitting things this way is extra work in this small case, but it's the best way to manage information as our displays become more complex.
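One way to make that split is to have the Python program build one complete row per year, so that the template has nothing left to compute. This is a sketch only: the function name `make_rows` is hypothetical, and the two small dictionaries hold the Kelvin temperatures for 1901 and 1902 from the page shown above.

```python
def make_rows(data_1, data_2):
    '''Build one (year, value_1, value_2, ratio) tuple per year, in sorted order.'''
    rows = []
    for year in sorted(data_1):
        ratio = data_1[year] / data_2[year]  # the calculation now lives in Python
        rows.append((year, data_1[year], data_2[year], ratio))
    return rows

# Year -> temperature in Kelvin for the two countries.
aus = {1902: 294.532462, 1901: 294.507021}
can = {1901: 265.477581, 1902: 265.2872886}

rows = make_rows(aus, can)
for row in rows:
    print(row)
```

The program would then pass `rows` to `template.render`, and the template's loop would become `{% for row in rows %}` with each cell displaying one element of `row`.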
The HTTP servers that come in the standard Python library are useful for practicing these things in class. To start serving files, we go into the directory that contains them and run:
$ python -m SimpleHTTPServer 8080
-m SimpleHTTPServer
tells Python
to find the SimpleHTTPServer
library
and run it as a program;
the parameter 8080
tells it what port to use.
(It's normal to run HTTP servers on port 80,
but your system may forbid you from doing that
if you don't have administrator privileges.)
To get files,
we use localhost
as the site,
and include the appropriate port number,
so the URL is http://localhost:8080/index.html
,
or more simply,
http://localhost:8080/
.
FIXME
If Carla is calculating temperature ratios for many different countries, how will other scientists know which ones she has done? In other words, how can she make her data findable?
The standard answer for hundreds of years has been,
"Create an index."
On the web,
we can do this by creating a file called index.html
and putting it in the directory that holds our data files.
We don't have to call our index file index.html
,
but it's best to do so.
By default,
most web servers will give clients that file
when they're asked for the directory itself.
In other words,
if someone points a browser (or any other program)
at http://my.site/tempratio/
,
the web server will look for /tempratio
.
When it realizes that path is a directory rather than a file,
it will look inside that directory for a file called index.html
and return that.
This is not guaranteed—system administrators
can and do set up other default behaviors—but it is a common convention,
and we can always tell our colleagues to fetch
http://my.site/tempratio/
if they want the current index anyway.
What should be in index.html
?
The answer is simple:
a table of some kind showing what files are available,
when they were created,
and where they are.
The first piece of information is the most important;
the second allows users to determine
what has been added since they last looked at our site
without having to download actual data files,
while the third tells them how to get what they want.
Our index.html
will therefore be something like this:
<html>
  <head>
    <title>Index of Average Annual Temperature Ratios</title>
    <meta name="revised" content="2013-09-15" />
  </head>
  <body>
    <h1>Index of Average Annual Temperature Ratios</h1>
    <table class="data">
      <tr>
        <td class="country">AUS</td>
        <td class="country">CAN</td>
        <td class="revised">2013-09-12</td>
        <td class="download"><a href="http://my.site/tempratio/AUS-CAN.html">download</a></td>
      </tr>
      ...
      <tr>
        <td class="country">MYS</td>
        <td class="country">NOR</td>
        <td class="revised">2013-09-15</td>
        <td class="download"><a href="http://my.site/tempratio/MYS-NOR.html">download</a></td>
      </tr>
    </table>
  </body>
</html>
Strictly speaking,
we don't need to store the URLs in the index file:
we could instead tell people that if they got the index from
http://my.site/tempratio/index.html
,
then the data for AUS and CAN is in http://my.site/tempratio/AUS-CAN.html
,
and let them construct the URL themselves.
However,
that puts more of a burden on the user both in the short term
(since more coding is required)
and in the long term
(since the rule for constructing the URL for a particular data set could well change).
It also effectively hides our data from search engines,
since there's no way for them to know what our URL construction rule is.
Now, unlike our actual data files, this index file is added to incrementally: each time we generate a new version, we have to include all the data that was in the old version as well. We therefore need to remember what we've done. The usual way to do this in a real application is to use a database, but for our purposes, a plain old text file will suffice.
We could make up a format to store the information we need, such as:
Updated 2013-05-09
AUS CAN 2013-03-07
AUS NOR 2013-03-09
CAN NOR 2013-04-22
CAN MDG 2013-05-09
but it's much simpler just to use JSON:
{
    "updated" : "2013-05-09",
    "entries" : [
        ["AUS", "CAN", "2013-03-07"],
        ["AUS", "NOR", "2013-03-09"],
        ["CAN", "NOR", "2013-04-22"],
        ["CAN", "MDG", "2013-05-09"]
    ]
}
Loading this data is as simple as:
import json
reader = open('index.json', 'r')
check = json.load(reader)
print check
{u'updated': u'2013-05-09', u'entries': [[u'AUS', u'CAN', u'2013-03-07'], [u'AUS', u'NOR', u'2013-03-09'], [u'CAN', u'NOR', u'2013-04-22'], [u'CAN', u'MDG', u'2013-05-09']]}
(Remember, the 'u' in front of each string signals that these strings are actually stored as Unicode, but we can safely ignore that for now.) Let's rewrite the main function of our temperature ratio program so that it creates the index as well as the individual page:
import sys
import os
from datetime import date
import jinja2
import json
from temperatures import get_temps

INDIVIDUAL_PAGE = 'temp_ratio.html'
INDEX_PAGE = 'index.html'
INDEX_FILE = 'index.json'

def main(args):
    '''
    Create web page showing temperature ratios for two countries,
    and update the index.html page with the new entry.
    '''

    assert len(args) == 5, \
        'Usage: make_indexed_page url_base template_dir output_dir country_1 country_2'
    url_base, template_dir, output_dir, country_1, country_2 = args
    the_date = date.isoformat(date.today())

    loader = jinja2.FileSystemLoader([template_dir])
    environment = jinja2.Environment(loader=loader)

    page = make_page(environment, country_1, country_2, the_date)
    save_page(output_dir, '%s-%s.html' % (country_1, country_2), page)

    index_data = load_index(output_dir, INDEX_FILE)
    index_data['entries'].append([country_1, country_2, the_date])
    index_data['updated'] = the_date # Record when the index last changed.
    save_page(output_dir, INDEX_FILE, json.dumps(index_data))

    page = make_index(environment, url_base, index_data)
    save_page(output_dir, INDEX_PAGE, page)
Since we will be expanding templates in a couple of different functions,
we move the creation of the Jinja2 environment to the main program.
We then pass the environment into both make_page
and a new function called make_index
,
and use another new function save_page
to save generated pages where they need to go.
(Note that we update the index data before rewriting the index HTML page,
so that the updates to the index appear in the HTML.
We did these two steps in the wrong order
in the first version of this program that we wrote,
and it was several hours before we noticed the error...)
save_page
is the simplest function to write,
so let's do that:
def save_page(output_dir, page_name, content):
    '''Save text in a file output_dir/page_name.'''
    path = os.path.join(output_dir, page_name)
    writer = open(path, 'w')
    writer.write(content)
    writer.close()
Our revised make_page
function is shorter than our original,
since the environment is now being created in main
.
It is also now being passed the date
(since that is used to update the index as well),
and uses a fixed template specified by the global variable
INDIVIDUAL_PAGE
.
The result is:
def make_page(environment, country_1, country_2, the_date):
    '''Create page showing temperature ratios.'''

    # get_temps returns a list of {'year', 'data'} entries, so index them by year.
    data_1 = dict((e['year'], e['data']) for e in get_temps(country_1))
    data_2 = dict((e['year'], e['data']) for e in get_temps(country_2))
    years = data_1.keys()
    years.sort()

    template = environment.get_template(INDIVIDUAL_PAGE)
    result = template.render(country_1=country_1, data_1=data_1,
                             country_2=country_2, data_2=data_2,
                             years=years, the_date=the_date)

    return result
The function that loads existing index data is also pretty simple:
def load_index(output_dir, filename):
    '''Load index data from output_dir/filename.'''
    path = os.path.join(output_dir, filename)
    reader = open(path, 'r')
    result = json.load(reader)
    reader.close()
    return result
All that's left is the function that regenerates the HTML version of the index:
def make_index(environment, url_base, index_data):
    '''Refresh the HTML index page.'''
    template = environment.get_template(INDEX_PAGE)
    return template.render(url_base=url_base,
                           updated=index_data['updated'],
                           entries=index_data['entries'])
and the HTML template it relies on:
<!DOCTYPE html>
<html>
  <head>
    <title>Index of Average Annual Temperature Ratios</title>
    <meta name="revised" content="{{updated}}" />
  </head>
  <body>
    <h1>Index of Average Annual Temperature Ratios</h1>
    <table class="data">
      {% for entry in entries %}
      <tr>
        <td class="country">{{entry[0]}}</td>
        <td class="country">{{entry[1]}}</td>
        <td class="revised">{{entry[2]}}</td>
        <td class="download"><a href="{{url_base}}/{{entry[0]}}-{{entry[1]}}.html">download</a></td>
      </tr>
      {% endfor %}
    </table>
  </body>
</html>
FIXME
FIXME
We'll now use what we have learned to build a simple tool to download new temperature comparisons from a web site. In broad strokes, our program will keep a list of URLs to download data from, along with a timestamp showing when data was last downloaded. When we run the program, it will poll each site to see if any new data sets have been added since the last check. If any have, the program will display their URLs.
In order for this to work,
each of the sites that's providing data needs to be able to tell us
what data sets it has calculated,
and when they were created.
This information is in the site's index.html
file in human-readable form,
but it's also in the index.json
file each site is maintaining.
Client programs can load this file directly without having to do any parsing,
so we'll rely on that.
An earlier version of this tutorial loaded the HTML version of the index and extracted dates and URLs from it. Doing so only required twelve extra lines of code—but an extra 1200 words to explain how to read HTML into a program and find things in it. Storing information in machine-friendly formats for machines to use makes life a lot simpler...
The next step is to decide how to keep track of what we have downloaded and when.
The simplest thing is to create another JSON file
containing the timestamp and the list of URLs.
We'll call this sources.json
:
{
    "timestamp" : "2013-05-02:07:04:03",
    "sources" : [
        "http://software-carpentry.org/temperatures/index.json",
        "http://some.other.site/some/path/index.json"
    ]
}
(Again, a larger application would use a database of some kind,
but that's more than we need right now.)
Each time we run our program,
it will read this file,
then download each index.json
file.
If any of those files contain links to data sets that are newer than the timestamp,
it will print the data set's URL.
(A real data analysis program would download the data and do something with it.)
We will then save a fresh copy of sources.json
with an updated timestamp
(Figure XXX).
Our main program looks like this:
import datetime

TIMESTAMP_FORMAT = '%Y-%m-%d:%H:%M:%S' # Matches the timestamp stored in sources.json

def main(sources_path):
    '''Check all data sites in list, then update timestamp of sources.json.'''
    old_timestamp, all_sources = read_sources(sources_path)
    new_timestamp = datetime.datetime.now().strftime(TIMESTAMP_FORMAT)
    for source in all_sources:
        for url in get_new_datasets(old_timestamp, source):
            process(url)
    write_sources(sources_path, new_timestamp, all_sources)
That seems pretty simple; the only subtlety is that we calculate the new timestamp before we start checking for new datasets. The reason is that this check might take anything from a few seconds to a few hours, depending on how busy the Internet is and how much data we actually download. If we wait until we're done and then record that moment as the new timestamp, then the next time we run our program, we won't download any datasets that were created between the time we started the first run of our program and the time it finished (Figure XXX).
We now have four functions to write:
read_sources
,
write_sources
,
get_new_datasets
,
and
process
.
Reading and writing the sources.json
file is pretty simple:
import json

def read_sources(path):
    '''Read timestamp and data sources from JSON file.'''
    reader = open(path, 'r')
    data = json.load(reader)
    reader.close()
    timestamp = data['timestamp']
    sources = data['sources']
    return timestamp, sources

def write_sources(sources_path, timestamp, sources):
    '''Write timestamp and data sources to JSON file.'''
    data = {'timestamp' : timestamp, 'sources' : sources}
    writer = open(sources_path, 'w')
    json.dump(data, writer)
    writer.close()
What about processing a URL? Right now, we're just going to print it, though in a real application we would probably download the data and do some further calculations with it:
def process(url):
    '''Placeholder for processing a data set given its URL.'''
    print url
Finally,
we need to construct a list of dataset URLs
given the URL of an index.json
file:
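import datetime
import json
import requests

def get_new_datasets(last_checked, index_url):
    '''Return a list of URLs of datasets that are newer than the timestamp.'''
    # last_checked is a string like "2013-05-02:07:04:03", as stored in sources.json.
    last_checked = datetime.datetime.strptime(last_checked, '%Y-%m-%d:%H:%M:%S')
    response = requests.get(index_url)
    index_data = json.loads(response.text)
    result = []
    for (country_a, country_b, updated) in index_data['entries']:
        # Index entries record dates like "2013-05-09".
        dataset_timestamp = datetime.datetime.strptime(updated, '%Y-%m-%d')
        if dataset_timestamp >= last_checked:
            dataset_url = make_dataset_url(index_url, country_a, country_b)
            result.append(dataset_url)
    return result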
The logic here is straightforward: grab the index.json file, check each dataset to see if it's newer than the last time we checked, and if it is... hm. This code uses a not-yet-written function called make_dataset_url to construct the URL for the specific dataset from the URL of the index file and the two country codes, but as we discussed earlier, asking client programs to construct links themselves is a bad idea. Instead, we should modify the index.json files so that they include the URLs. Doing this is left as an exercise for the reader.
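As a hint, here is one possible shape for the client side once the index carries its own links. Everything about the entry layout is an assumption made for illustration: the keys 'updated' and 'url', the ISO 8601 dates, and the example.org addresses are not taken from the real index files. The function works on already-parsed index data, so it can be tried without a network connection.

```python
import datetime

def get_new_dataset_urls(last_checked, index_data):
    '''Return URLs of index entries updated at or after last_checked.'''
    result = []
    for entry in index_data:
        updated = datetime.datetime.fromisoformat(entry['updated'])
        if updated >= last_checked:
            # The producer supplies the link, so the client never builds one.
            result.append(entry['url'])
    return result

# Hypothetical index contents.
index_data = [
    {'country_a': 'AUS', 'country_b': 'CAN', 'updated': '2013-05-02',
     'url': 'http://example.org/data/AUS-CAN.json'},
    {'country_a': 'AUS', 'country_b': 'BRA', 'updated': '2013-01-11',
     'url': 'http://example.org/data/AUS-BRA.json'},
]

print(get_new_dataset_urls(datetime.datetime(2013, 3, 1), index_data))
# ['http://example.org/data/AUS-CAN.json']
```

With links stored in the index, the producer can reorganize its files without breaking any client.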
But hang on: what exactly are we downloading when we download data sets? Right now, our temperature ratio files are all HTML pages; if we want to use that information in programs, it would be a lot easier if producers generated JSON files that consumers could use directly. It's almost trivial to extend our original program to produce such a file each time it produces a new HTML file, and to include the URLs for both files in both versions of the index (Figure XXX). Once we've done that, we have a first-class data syndication system: human-friendly and machine-friendly formats live side by side, so scientists and programs all over the world can make use of our results as soon as they appear.
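Here is a rough sketch of what producing both formats side by side might look like. The page layout, the JSON keys, the function names, and the sample values are all invented for illustration. Our real program's HTML is more elaborate, but the idea is the same: render the same data twice and write the two files next to each other.

```python
import json

def render_html(title, ratios):
    '''Render (year, ratio) pairs as a bare-bones HTML table.'''
    rows = ''.join('<tr><td>{0}</td><td>{1}</td></tr>'.format(year, value)
                   for (year, value) in ratios)
    return ('<html><head><title>{0}</title></head>'
            '<body><table>{1}</table></body></html>').format(title, rows)

def render_json(title, ratios):
    '''Render the same pairs as JSON that programs can load directly.'''
    entries = [{'year': year, 'ratio': value} for (year, value) in ratios]
    return json.dumps({'title': title, 'ratios': entries})

def write_both(stem, title, ratios):
    '''Write stem.html and stem.json, returning the two paths.'''
    paths = []
    for (suffix, render) in (('.html', render_html), ('.json', render_json)):
        path = stem + suffix
        with open(path, 'w') as writer:
            writer.write(render(title, ratios))
        paths.append(path)
    return paths

# Made-up values, purely to show the two renderings.
sample = [(1901, 1.5), (1902, 1.25)]
print(render_html('AUS-CAN', sample))
print(render_json('AUS-CAN', sample))
```

The paths returned by write_both are what we would then record in both versions of the index.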
FIXME
The web has changed in many ways over the last 20 years, not all of them for the better. An HTML page on a modern commercial site is likely to include dozens or hundreds of lines of JavaScript that depend on several large, complicated libraries, and which generate the page's content on the fly inside the browser. Such a "page" is really a small (or not-so-small) program rather than a document in the classical sense of the word, and while that may produce a better experience for human users, it makes life more difficult for programs (and for people with disabilities, whose assistive aids are all too easy to confuse). And while XML is widely used for representing data, many people believe that younger alternatives like JSON do a better job of balancing the needs of human and computer readers.
Regardless of the technology used, though, the web's basic design principles are both simple and stable: tell people where data is, rather than giving them a copy; make the data itself and your names for it easy for both human beings and computers to understand; remix other people's data, and allow them to remix yours.