Update to Wicked Cool Shell Script #62: define.sh

This is regarding script #62 (define a word): Looks like WordNet has changed their online version again, and I tried the following replacement for the url= http://wordnet.princeton.edu/perl/webwn?s= but the script doesn't return anything and just goes back to the prompt. I tried the URL with a word in lynx and I got the source. Could you push me in a general direction?

While most of the content of my popular book Wicked Cool Shell Scripts has weathered the passage of time well, the scripts that scrape specific content off Web sites have had a harder time with the inevitable redesigns, restructuring and general changes. In general, scraping content is fraught with risk anyway because you're very dependent on the current information architecture, which can change without warning.

Nonetheless, let's dig into this. First off, if you'd like to follow along and don't have my book (Shocking! Hey, just buy a copy at Amazon, it's well worth it) you can view the script here: Script #62: define.sh.

The problem is that when you go to the given URL, you find out that: "WordNet 2.0 is no longer available." Fortunately the message goes on to explain that you can access the latest version of this nifty utility at http://wordnet.princeton.edu/perl/webwn, so let's go there and enter a search query with a standard Web browser like Firefox. I'll search for "harmonious" because that's just a word on my mind today. :-)

The resultant URL from the Princeton tool is rather scary long:

http://wordnet.princeton.edu/perl/webwn?s=harmonious&sub=Search+WordNet&o2=&o0=1&o7=&o5=&o1=1&o6=&o4=&o3=&h=

As with most of these, however, you can axe out any name=value pair where there's no value specified, which immediately trims it down to:

http://wordnet.princeton.edu/perl/webwn?s=harmonious&sub=Search+WordNet&o0=1&o1=1

A little more fiddling reveals that if we want the default behavior (and we do), the URL can be hacked down to:

http://wordnet.princeton.edu/perl/webwn?s=harmonious
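If you'd rather not prune those empty name=value pairs by hand, here is a minimal sketch of one way to do it in the shell. The function name trim_query is made up for illustration and is not part of the book's script. It assumes the pairs are separated by & and simply drops any pair with nothing after the = sign.

```shell
#!/bin/sh
# trim_query: print a URL with every empty name=value pair removed.
# Hypothetical helper for illustration; not part of define.sh.
trim_query() {
  url=$1
  case $url in
    *\?*) ;;                              # has a query string, keep going
    *) printf '%s\n' "$url"; return ;;    # nothing to trim
  esac
  base=${url%%\?*}     # everything before the first ?
  query=${url#*\?}     # everything after the first ?
  # One pair per line, keep only pairs with a non-empty value, rejoin with &.
  kept=$(printf '%s\n' "$query" | tr '&' '\n' | grep '=.' | paste -s -d '&' -)
  printf '%s\n' "${base}${kept:+?$kept}"
}

# Example: the long URL the Princeton search form produced.
trim_query 'http://wordnet.princeton.edu/perl/webwn?s=harmonious&sub=Search+WordNet&o2=&o0=1&o7=&o5=&o1=1&o6=&o4=&o3=&h='
```

This only automates the first cut, the one that removes pairs with no value. The final step, trimming the query down to just s=harmonious, came from experimenting with the site and isn't something a filter like this can work out for itself.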
A result that is, well, more harmonious. :-)

Now we can at least get definitions again with the script, but as for parsing the result to display it attractively within the shell, well, I think I'd do it differently now. To understand the challenge, here's the WordNet definition of baroque:

[Screenshot: WordNet definition of "baroque"]

The goal is to display both of these definition groups, but omit the material above and below them (as I have neatly done with the screenshot).

Here's the good news, however, gleaned by reading the source code: parts of speech are signified by <h3> headers, so part of the source of the above is <h3>Noun</h3>. We can search for that, and that gives us the beginning of the definition. The end turns out to be easy too. The line after the last definition line is:

<a href="http://wordnet.princeton.edu">WordNet home page</a>

so we can use that as the end marker and let "sed" do the dirty work of chopping out what we don't want to see. That's done, as readers would know, with something like:

sed -n "/<h3>/,/wordnet/p"

The rest, I'll leave as an exercise for enthused readers. :-)
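For readers who'd like a head start on that exercise, here is a rough sketch of how the sed range might be wrapped into a filter. Everything beyond the sed range itself is an assumption: the function names extract_defs and sample_page are invented, and the sample page is a made-up stand-in with placeholder definitions, not the real WordNet markup. It also assumes each header and definition sits on its own line, which the live page may not honor.

```shell
#!/bin/sh
# extract_defs: read a results page on stdin, print only the definitions.
# Hypothetical helper; the tag stripping is deliberately naive.
extract_defs() {
  # 1. keep lines from the first <h3> through the "wordnet" end marker
  # 2. drop the end-marker line itself
  # 3. strip anything that looks like an HTML tag
  # 4. drop lines that are now empty
  sed -n '/<h3>/,/wordnet/p' |
    sed '/WordNet home page/d' |
    sed 's/<[^>]*>//g' |
    sed '/^[[:space:]]*$/d'
}

# sample_page: invented stand-in for the page source (NOT real WordNet output).
sample_page() {
  cat <<'EOF'
<html><body>
<form>search box and display options</form>
<h3>Noun</h3>
<ul><li>placeholder noun definition</li></ul>
<h3>Adjective</h3>
<ul><li>placeholder adjective definition</li></ul>
<a href="http://wordnet.princeton.edu">WordNet home page</a>
</body></html>
EOF
}

sample_page | extract_defs
```

Against the live site, the same filter would sit at the end of a pipeline that fetches the page source, for example lynx -source "http://wordnet.princeton.edu/perl/webwn?s=baroque" | extract_defs, assuming that URL still answers and the markup still matches the description above.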
Categorized: Shell Script Programming (Article 8003, Written by Dave Taylor)
Tagged: hacking, programming, shell scripting, wicked cool shell scripts
just tried http://wordnet.princeton.edu/perl/webwn?s=harmonious and got a 404. Thought you'd like to know.
Posted by: Steve O at July 31, 2009 1:32 PM

wordnet.princeton is now wordnetweb.princeton, so the url is now http://wordnetweb.princeton.edu/perl/webwn?s=harmonious if anyone is interested.
Posted by: Oval at October 27, 2009 6:57 AM