Dave Taylor answers free tech support questions about a wide variety of business and technical topics, including blogging, iPhone help, iPod help, AdSense, MySpace, Sony PSP help, MP3 players, Windows XP, Windows Vista, Linux, SEO, Mac OS X, Facebook, Twitter and LinkedIn.

Can I automate craigslist searches?

I want to know if there's some way to automate searching the craigslist site with a shell script or similar so that I can keep an eye on it and know when certain rare auction items show up for sale?


Dave's Answer:

This is a very fun question because it flashes back to one of my first startups, a company called iTrack, which was built around a Web-based service that automated searching and tracking auction items available on all the major auction sites (eBay, Amazon, Yahoo, etc). That was years ago, but the basic idea remains, and it's delightfully simple.

Let's dig into the craigslist site first, to see how it encodes searches.

Rather than a single site, Craigslist is really broken down into about 50 different regional datasets, all addressed by subdomain. For example, "denver.craigslist.org" and "boulder.craigslist.org" cover the Denver, Colorado and Boulder, Colorado markets, respectively. A search of one regional market, however, doesn't reveal potential matches in any other. Is that a limitation? It doesn't really matter; that's just how the site's designed.

This means that the very first step is to do a search and see what the resultant URL looks like. To search for "Roland Drum" on the Boulder site, for example, here's the URL:

http://boulder.craigslist.org/search/sss?query=roland%20drum

As you can see, the regional domain name shows up here, and the query shows up at the end of the URL.

It gets slightly trickier if you want to limit the search to just those Craigslist items where the title matches. A bit more experimentation shows that the URL then picks up a few more fields:

http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T&minAsk=min&maxAsk=max

We aren't specifying a minimum or maximum price, so those parameters can be scrubbed out, leaving this URL:

http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T

So one way you could do this is to simply bookmark this URL and any time you want to check, just click directly to the search results.
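If you check more than one region or search phrase, it may be easier to build that URL than to edit it by hand. Here's a minimal sketch. The function name build_search_url is made up for illustration, and it only turns spaces into "+", so a query with other special characters would need proper URL encoding:

```shell
#!/bin/sh
# Hypothetical helper: build a title-only search URL.
#   $1 = regional subdomain (e.g. boulder)
#   $2 = search phrase
build_search_url() {
  region=$1
  # Turn spaces into "+", as seen in the title-search URL above.
  # Nothing else is escaped in this sketch.
  query=$(printf '%s' "$2" | tr ' ' '+')
  printf 'http://%s.craigslist.org/search/sss?query=%s&srchType=T\n' \
    "$region" "$query"
}

build_search_url boulder "roland drum"
```

Run as shown, it prints the same title-only Boulder URL as above.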

That's not very automated, though, so let's dig into what a simple shell script that does this search might look like. I like to use curl to grab web pages so the base script might be:

url="http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T"

curl -s "$url"

The problem with this is that the output is raw HTML and, needless to say, it's not designed to be easily parsed and analyzed. After looking closely at the code, though, it turns out that there are some patterns that let you pick out the matches and omit everything else. The key pattern is href="/msg/ which you can apply by using it in a "grep". Then watch how I break every HTML tag onto its own line with a "sed" invocation too:

url="http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T"

curl -s "$url" | \
  grep 'href="/msg/' | \
  sed 's/</\
</g'
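To see what those two steps do without hitting the site, you can feed them a single line by hand. The sample line below is invented for illustration (the real markup will differ), and split_tags is just a hypothetical wrapper around the same grep and sed:

```shell
#!/bin/sh
# Made-up line standing in for one row of search results.
sample='<p><a href="/msg/123456789.html">Roland Drum Kit - $400 -</a><font size="-1"> (Boulder)</font></p>'

# Keep only lines that link to a posting, then put a newline
# in front of every "<" so each tag starts its own line.
split_tags() {
  grep 'href="/msg/' | \
    sed 's/</\
</g'
}

printf '%s\n' "$sample" | split_tags
```

Each tag now starts its own line, and the link line carries the title text right after the anchor tag, which is what makes the next filtering step possible.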

Once you've got that, you just need to screen out the HTML tags and lines that aren't interesting, which can again be done with "grep", this time using a regular expression. Ready? Here ya go:

url="http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T"

curl -s "$url" | \
  grep 'href="/msg/' | \
  sed 's/</\
</g' | \
  grep -v -E '(</a>|<i>|</font>|href="/msg/"|</i>|</p>|span>|class="p"|<p>|font size)' | \
  grep "<a href="

Looks rather complex, I admit, but if you run it with the broader search pattern "drums", here's what you get:

/msg/728672846.html -- Cuban Style Cajon Drums - $500 -
/msg/728630582.html -- African Dun Dun Drums - $450 -
/msg/724851366.html -- --------Drums and Drum Stuff!---------- -
/msg/706689489.html -- DRUM KIT, Yamaha ELECTRIC DRUMS, DTXPLORER, FUN! FUN! - $575 -
/msg/694513136.html -- Drums --Slingerland Drum Set - $350 -
/msg/679106717.html -- Kid Drums Traps kindly used - $50 -
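The pipeline as written leaves lines that still start with an anchor tag, so one more step is needed to reach the tidy format in that listing. Here's one possible sketch, assuming each surviving line looks like <a href="/msg/ID.html">Title. The name format_match and the sample line are made up for illustration:

```shell
#!/bin/sh
# Turn '<a href="PATH">TITLE' into 'PATH -- TITLE'.
# Lines that don't start with an anchor tag pass through unchanged.
format_match() {
  sed 's/^<a href="\([^"]*\)">\(.*\)$/\1 -- \2/'
}

# Invented sample line, not real search output.
printf '%s\n' '<a href="/msg/123456789.html">Roland Drum Kit - $400 -' | format_match
```

Tacking this onto the end of the pipeline is one way to do it. Prepending the regional hostname to the path would give you a full URL instead.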

My final script has a few additional refinements, like making the URL clickable, and it wouldn't be too insanely difficult to save the search results each night and run a "diff" against the previous day's output. If there's something new, it could be emailed to you directly. It's what we call an "exercise best left to the reader".
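For readers who want a head start on that exercise, here's a rough sketch of the save-and-compare step. It uses "comm" rather than "diff" so that only the new lines come out, and it assumes the search results have already been saved to a file, one match per line. The function name report_new and the file names are hypothetical:

```shell
#!/bin/sh
# report_new CURRENT_FILE STATE_FILE
# Print lines in CURRENT_FILE that weren't there on the previous run,
# then remember the current results for next time.
report_new() {
  current=$1
  state=$2
  [ -f "$state" ] || : > "$state"     # first run: empty baseline
  sort "$current" > "$current.sorted" # comm needs sorted input
  comm -13 "$state" "$current.sorted" # lines only in the new results
  mv "$current.sorted" "$state"
}

# Demo with invented data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '/msg/1.html -- Drum A' > "$tmp/today"
report_new "$tmp/today" "$tmp/seen"   # first run: everything is new
printf '%s\n' '/msg/1.html -- Drum A' '/msg/2.html -- Drum B' > "$tmp/today"
report_new "$tmp/today" "$tmp/seen"   # second run: only Drum B is new
rm -rf "$tmp"
```

Run nightly from cron, the output of report_new could be piped to a mail program whenever it isn't empty.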

Hope that this gives you a productive path to travel on the way to creating the script you seek!




Comments

I think the reason you haven't seen this in the past is that it violates craigslist's terms of use, #7u.

It states:
"Additionally, you agree not to:...
u) use automated means, including spiders, robots, crawlers, data mining
tools, or the like to download data from the Service - unless expressly
permitted by craigslist;"

I don't know how often they watch this, but just thought I should mention it.

Posted by: Jared K. at June 27, 2008 9:40 AM

Ah, I knew that small print was in their TOS somewhere or other. Of course, if they *had* a search monitoring service it wouldn't be necessary for a third party to offer it. A surprising absence in my opinion. I mean, it *is* the 21st century, right? :-)

Posted by: Dave Taylor at June 27, 2008 10:46 AM

Craigslist does support RSS feeds from searches. You can search for an item and then subscribe to that as a feed. I use it every day with Bloglines.

Posted by: bg at June 27, 2008 10:54 AM

I read your excellent article about error handling and stdout / stderr redirection. I have a script in which I want to analyze the error message. I noticed that my script finishes before it writes the error. Is the error message in an accessible variable and if not, how can I redirect the error message directly to a variable without going through a file?

Posted by: Ric Turley at July 8, 2008 12:28 PM

© 2002 - 2008 by Dave Taylor. All Rights Reserved.

