I want to know if there’s some way to automate searching the craigslist site with a shell script or similar so that I can keep an eye on it and know when certain rare auction items show up for sale?
This is a very fun question because it takes me back to one of my first startups, a company called iTrack, which was built around a Web-based service that automated searching and tracking auction items on all the major auction sites (eBay, Amazon, Yahoo, etc.). That was years ago, but the basic idea remains, and it’s delightfully simple.
Let’s dig into the craigslist site first, to see how it encodes searches.
Rather than a single site, Craigslist is really broken down into about 50 different regional datasets, all addressed by subdomain. For example, “denver.craigslist.org” and “boulder.craigslist.org” cover the Denver, Colorado and Boulder, Colorado markets, respectively. A search of one regional market, however, doesn’t reveal potential matches in any other. Is that a limitation? Doesn’t really matter, that’s just how the site’s designed.
This means that the very first step is to do a search and see what the resultant URL looks like. To search for “Roland Drum” on the Boulder site, for example, here’s the URL:
http://boulder.craigslist.org/search/sss?query=roland%20drum
As you can see, the regional domain name shows up here, and the query shows up at the end of the URL.
It gets slightly trickier, however, if you want to limit the search to just those Craigslist items where the title matches. A bit more experimentation shows that the URL then has a few more fields:
http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T&minAsk=min&maxAsk=max
Turns out that we aren’t specifying min or max price so those parameters can be scrubbed out, leaving the resultant URL:
http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T
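If you want to try other regions or search terms, the same URL can be assembled from its parts. Here’s a minimal sketch; the function name build_search_url is my own invention for illustration, and it only turns spaces into plus signs, so anything fancier in the query (ampersands, quotes, and so on) would need proper URL encoding:

```shell
#!/bin/sh
# build_search_url REGION QUERY
# Hypothetical helper (the name is made up for this example): prints a
# title-only search URL for one regional Craigslist subdomain.
build_search_url() {
    region=$1
    # Encode spaces as '+', matching the URL the site produced above.
    # No other characters are encoded in this sketch.
    query=$(printf '%s' "$2" | tr ' ' '+')
    printf 'http://%s.craigslist.org/search/sss?query=%s&srchType=T\n' "$region" "$query"
}

build_search_url boulder "roland drum"
# prints http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T
```

Called with boulder and "roland drum", it prints the very same title-only URL shown above.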
So one way you could do this is to simply bookmark this URL and any time you want to check, just click directly to the search results.
That’s not very automated, though, so let’s dig into what a simple shell script that does this search might look like. I like to use curl to grab web pages so the base script might be:
url="http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T"
curl -s "$url"
The problem is that the output is raw HTML, which isn’t designed to be easily parsed and analyzed. After looking closely at the code, though, it turns out that there are some patterns that let you pick out the matches and omit everything else. The key pattern is href="/msg/ which you can apply by using it in a “grep”. Then watch how I break every HTML tag onto its own line with a “sed” invocation too:
url="http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T"
curl -s "$url" | \
  grep 'href="/msg/' | \
  sed 's/</\
</g'
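If that “sed” looks odd, it may help to see it on its own, away from the network. The sketch below wraps it in a function (split_tags is a name I made up) and feeds it one line of HTML that I invented for illustration; it is not real Craigslist markup:

```shell
#!/bin/sh
# split_tags: read HTML on stdin and start a new line before every '<'.
# The backslash followed by a real newline inside the replacement is the
# portable way to make sed insert a line break.
split_tags() {
    sed 's/</\
</g'
}

# Made-up sample line, for illustration only.
sample='<p><a href="/msg/123.html">Snare drum</a> - $50</p>'
printf '%s\n' "$sample" | split_tags
```

Each tag now begins its own line (with an empty first line, because the sample starts with a tag), which is what lets a line-oriented tool like “grep” keep or discard tags one at a time.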
Once you’ve got that, you just need to screen out the HTML tags and lines that aren’t interesting, which can again be done with another “grep”, but this time we’ll use a regular expression. Ready? Here ya go:
url="http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T"
curl -s "$url" | \
  grep 'href="/msg/' | \
  sed 's/</\
</g' | \
  grep -v -E '(</a>|<i>|</font>|href="/msg/"|</i>|</p>|span>|class="p"|<p>|font size)' | \
  grep "<a href="
Looks rather complex, I admit, but if you run it with the broader search pattern “drums”, here’s what you get:
/msg/728672846.html -- Cuban Style Cajon Drums - $500 -
/msg/728630582.html -- African Dun Dun Drums - $450 -
/msg/724851366.html -- --------Drums and Drum Stuff!---------- -
/msg/706689489.html -- DRUM KIT, Yamaha ELECTRIC DRUMS, DTXPLORER, FUN! FUN! - $575 -
/msg/694513136.html -- Drums --Slingerland Drum Set - $350 -
/msg/679106717.html -- Kid Drums Traps kindly used - $50 -
My final script has a few additional refinements, like making the URL clickable, and it wouldn’t be too insanely difficult to save the search results each night and run a “diff” against the previous night’s output. If there’s something new, it could be emailed to you directly. It’s what we call an “exercise best left to the reader”.
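To get you started on that exercise, here’s one rough way the “diff” step might look. It is a sketch rather than my actual script: the function name new_listings, the file names, and the sample listings are all made up, and it leans on the normal output format of “diff”, in which lines added to the second file start with "> ". One consequence is that a listing which merely changes position in the results may show up as new.

```shell
#!/bin/sh
# new_listings OLD_FILE NEW_FILE
# Hypothetical helper: prints the lines that diff reports as added or
# changed in NEW_FILE relative to OLD_FILE.
new_listings() {
    # diff exits with status 1 when the files differ, which is the
    # interesting case here, so that status is deliberately ignored.
    { diff "$1" "$2" || true; } | sed -n 's/^> //p'
}

# Demonstration with made-up data in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' '/msg/111.html -- Old Snare - $40' > "$workdir/yesterday.txt"
printf '%s\n' '/msg/222.html -- New Kick Drum - $90' \
              '/msg/111.html -- Old Snare - $40' > "$workdir/today.txt"

fresh=$(new_listings "$workdir/yesterday.txt" "$workdir/today.txt")
if [ -n "$fresh" ]; then
    # A real script might hand this text to a mail program instead;
    # that part is left open because mail setups differ between systems.
    printf 'New listings:\n%s\n' "$fresh"
fi
rm -rf "$workdir"
```

In a nightly job, today’s file would be the saved output of the search pipeline, and it would be renamed to become yesterday’s file once the comparison is done.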
Hope that this gives you a productive path to travel on the way to creating the script you seek!
How can I get help with my Craigslist account? It says I need to authenticate it by phone. I don’t want to do this. How do I fix this problem?
Is there a way to run a search that does all the North Carolina sites (there are about 8)?
I read your excellent article at about error handling and stdout / stderr redirection. I have a script in which I want to analyze the error message. I noticed that my script finishes before it writes the error. Is the error message in an accessible variable and if not, how can I redirect the error message directly to a variable without going through a file?
Craigslist does support RSS feeds from searches. You can search for an item and then subscribe to that as a feed. I use it everyday with Bloglines.
Ah, I knew that small print was in their TOS somewhere or other. Of course, if they *had* a search monitoring service it wouldn’t be necessary for a third party to offer it. A surprising absence in my opinion. I mean, it *is* the 21st century, right? 🙂
I think the reason you haven’t seen this in the past is that it violates Craigslist’s terms of use, #7u.
It states:
“Additionally, you agree not to:…
u) use automated means, including spiders, robots, crawlers, data mining
tools, or the like to download data from the Service – unless expressly
permitted by craigslist;”
I don’t know how often they watch this, but just thought I should mention it.