This is regarding script #62 (define a word): Looks like WordNet has changed their online version again and I tried the following replacement for the url=
http://wordnet.princeton.edu/perl/webwn?s=
But the script doesn’t return anything; it just goes back to the prompt. I tried the URL with a word in lynx and I got the source. Could you push me in a general direction?
While most of the content of my popular book Wicked Cool Shell Scripts has weathered the passage of time well, the scripts that scrape specific content off Web sites have had a harder time with the inevitable redesigns, restructuring and general changes. Scraping content is fraught with risk anyway, because you’re very dependent on the site’s current information architecture, which can change without warning.
Nonetheless, let’s dig into this. First off, if you’d like to follow along and don’t have my book (Shocking! Hey, just buy a copy at Amazon, it’s well worth it) you can view the script here: Script #62: define.sh.
The problem is that when you go to the given URL, you find out that: “WordNet 2.0 is no longer available.” Fortunately the message goes on to explain that you can access the latest version of this nifty utility at http://wordnet.princeton.edu/perl/webwn, so let’s go there and enter a search query with a standard Web browser like Firefox. I’ll search for “harmonious” because that’s just a word on my mind today. 🙂
The resultant URL from the Princeton tool is rather scarylong:
&o0=1&o7=&o5=&o1=1&o6=&o4=&o3=&h=
As with most of these, however, you can axe out any name=value pair where there’s no value specified, which immediately trims it down to:
o0=1&o1=1
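If you would rather not prune the query string by hand, the same trimming can be sketched in a few lines of shell. Treat this as an illustration only: the function name strip_empty_params is made up for this example and is not part of define.sh, and it assumes the pairs are separated by a plain & with no ampersands hiding inside the values.

```shell
#!/bin/sh
# strip_empty_params (hypothetical helper, not part of define.sh):
# drop every name=value pair whose value is empty.
strip_empty_params() {
  # 1. put one pair on each line
  # 2. discard pairs ending in a bare "=" and any blank lines
  # 3. glue the survivors back together with "&"
  printf '%s\n' "$1" |
    tr '&' '\n' |
    grep -v -e '=$' -e '^$' |
    paste -sd '&' -
}

# The option string from the long URL above
strip_empty_params '&o0=1&o7=&o5=&o1=1&o6=&o4=&o3=&h='
```

Run against the option string shown earlier, this should leave just the two pairs that actually carry a value.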
A little more fiddling reveals that if we want the default behavior, and we do, the URL can be hacked down to:
A result that is, well, more harmonious. 🙂
Now we can at least get definitions again with the script, but as for parsing the result to display it attractively within the shell, well, I think I’d do it differently now. To understand the challenge, here’s the WordNet definition of baroque:
The goal is to display both of these definition groups, but omit the material above and below them (as I have neatly done with the screenshot).
Here’s the good news, however, gleaned by reading the source code: parts of speech are signified by <h3> headers, so part of the source of the above is <h3>Noun</h3>. We can search for that, and that gives us the beginning of the definition.
The end turns out to be easy too: the line after the last definition line is:
so we can use that as the end marker too and let “sed” do the dirty work of chopping out what we don’t want to see.
That’s done, as readers would know, with something like:
The rest, I’ll leave as an exercise for enthused readers. 🙂
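For anyone who wants a starting point for that exercise, here is a minimal sketch of the sed range idea. It is not the code from the book. The start pattern, <h3>, comes from the page source described above, but the end marker END-OF-DEFINITIONS, the function names, and the sample page are all invented placeholders, so substitute whatever line really follows the last definition. The end pattern is dropped straight into a sed address, so it must not contain a slash.

```shell
#!/bin/sh
# extract_defs (hypothetical helper): print everything from the first <h3>
# header up to, but not including, the first line matching the pattern in $1,
# with HTML tags stripped. The pattern must not contain "/".
extract_defs() {
  sed -n "/<h3>/,/$1/{
/$1/d
s/<[^>]*>//g
p
}"
}

# sample_page: invented stand-in for the fetched HTML, so that the sketch
# runs without touching the network. The end marker here is a placeholder.
sample_page() {
  cat <<'EOF'
<html><body>
<p>Header material to skip</p>
<h3>Noun</h3>
<li>placeholder definition one</li>
<h3>Adjective</h3>
<li>placeholder definition two</li>
<!-- END-OF-DEFINITIONS -->
<p>Footer material to skip</p>
</body></html>
EOF
}

sample_page | extract_defs 'END-OF-DEFINITIONS'
```

In a real script you would replace sample_page with the lynx call that fetches the page source, and the placeholder with the actual end-marker text.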
wordnet.princeton is now wordnetweb.princeton, so the URL is now http://wordnetweb.princeton.edu/perl/webwn?s=harmonious if anyone is interested.
just tried http://wordnet.princeton.edu/perl/webwn?s=harmonious and got a 404. Thought you’d like to know.