Can I automate craigslist searches?

I want to know if there's some way to automate searching the craigslist site with a shell script or similar, so that I can keep an eye on it and know when certain rare auction items show up for sale.

This is a very fun question because it takes me back to one of my first startups, a company called iTrack, which was built around a Web-based service that automated searching and tracking auction items on all the major auction sites (eBay, Amazon, Yahoo, etc.). That was years ago, but the basic idea remains, and it's delightfully simple.

Let's dig into the craigslist site first to see how it encodes searches. Rather than a single site, Craigslist is really broken down into about 50 different regional datasets, all addressed by subdomain. For example, "denver.craigslist.org" and "boulder.craigslist.org" cover the Denver, Colorado and Boulder, Colorado markets, respectively. A search of one regional market doesn't reveal potential matches in any other. Is that a limitation? Doesn't really matter, that's just how the site's designed.

This means that the very first step is to do a search and see what the resultant URL looks like. To search for "Roland Drum" on the Boulder site, for example, here's the URL:

http://boulder.craigslist.org/search/sss?query=roland%20drum

As you can see, the regional domain name shows up at the start and the query shows up at the end of the URL. It gets slightly trickier if you want to limit the search to just those Craigslist items where the title matches. A bit more experimentation shows that the URL then picks up a few more fields:

http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T&minAsk=min&maxAsk=max

We aren't specifying a min or max price, so those parameters can be scrubbed out, leaving the resultant URL:

http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T
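Since the region and the query are the only parts of that URL that change, it can be handy to build it from arguments. Here's a minimal sketch, assuming the URL layout shown above still holds. The function name build_search_url is made up for illustration, and it only handles plain words separated by spaces, not punctuation or other characters that need URL encoding.

```shell
#!/bin/sh
# build_search_url REGION WORD...
# Hypothetical helper: prints a title-only search URL for one regional site,
# using the URL layout worked out above.
build_search_url() {
    region=$1
    shift
    # Join the remaining words with "+", which is how the site encodes spaces.
    query=$(printf '%s' "$*" | tr ' ' '+')
    printf 'http://%s.craigslist.org/search/sss?query=%s&srchType=T\n' \
        "$region" "$query"
}

# Example: the same search on two regional sites.
build_search_url boulder roland drum
build_search_url denver roland drum
```

Because each region is its own subdomain, searching several markets is just a matter of calling this once per region name.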
So one way you could do this is to simply bookmark this URL and, any time you want to check, just click directly to the search results. That's not very automated, though, so let's dig into what a simple shell script that does this search might look like. I like to use curl to grab web pages, so the base script might be:

    url="http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T"
    curl -s "$url"

The problem with this is that the output is raw HTML and, needless to say, it's not designed to be easily parsed and analyzed. After looking closely at the code, though, it turns out that there are some patterns that let you pick out the matches and omit everything we don't want. The key pattern is href="/msg/ which you can apply by using it in a "grep". Then watch how I break every HTML tag onto its own line with a "sed" invocation too (the replacement is a backslash followed by a literal newline):

    url="http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T"
    curl -s "$url" | \
    grep 'href="/msg/' | \
    sed 's/</\
    </g'

Once you've got that, you just need to screen out the HTML tags and lines that aren't interesting, which can be done with another "grep", but this time we'll use a regular expression. Ready? Here ya go:

    url="http://boulder.craigslist.org/search/sss?query=roland+drum&srchType=T"
    curl -s "$url" | \
    grep 'href="/msg/' | \
    sed 's/</\
    </g' | \
    grep -v -E '(</a>|<i>|</font>|href="/msg/"|</i>|</p>|span>|class="p"|<p>|font size)' | \
    grep "<a href="

Looks rather complex, I admit, but if you run it with the broader search pattern "drums", here's what you get:

    /msg/728672846.html -- Cuban Style Cajon Drums - $500 -
    /msg/728630582.html -- African Dun Dun Drums - $450 -
    /msg/724851366.html -- --------Drums and Drum Stuff!---------- -
    /msg/706689489.html -- DRUM KIT, Yamaha ELECTRIC DRUMS, DTXPLORER, FUN! FUN! - $575 -
    /msg/694513136.html -- Drums --Slingerland Drum Set - $350 -
    /msg/679106717.html -- Kid Drums Traps kindly used - $50 -

My final script has a few additional refinements, like making the URL clickable, and it wouldn't be too insanely difficult to save the search results each night and run a "diff" against the previous day's output. If there's something new, it could be emailed to you directly. It's what we call an "exercise best left to the reader".

Hope that this gives you a productive path to travel on the way to creating the script you seek!
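For readers who want a head start on that exercise, here's a minimal sketch of the save-and-diff idea. It isn't my final script: the function names and the file locations under $HOME/.craigslist-watch are made up for illustration, and it assumes the search results arrive one listing per line on standard input. The results are sorted before saving so that listings which merely change position don't show up as new.

```shell
#!/bin/sh
# Hypothetical locations for the saved results; adjust to taste.
state_dir="$HOME/.craigslist-watch"
previous="$state_dir/previous.txt"
current="$state_dir/current.txt"

# new_listings OLD NEW: print the lines that appear in NEW but not in OLD.
new_listings() {
    # diff prefixes lines found only in the second file with "> ".
    diff "$1" "$2" | sed -n 's/^> //p'
}

# check_for_new: read today's results on stdin, print anything that wasn't
# in the previous run, then keep today's results for the next comparison.
check_for_new() {
    mkdir -p "$state_dir"
    [ -f "$previous" ] || : > "$previous"   # first run: nothing seen yet
    sort > "$current"
    new_listings "$previous" "$current"
    mv "$current" "$previous"
}

# Usage, assuming the search pipeline is saved as search.sh (a made-up name):
#   sh search.sh | check_for_new
```

Run from a nightly cron job, this prints only the listings that weren't there the day before. If your system has a mail command set up, piping that output into it would take care of the emailing part.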
Categorized: Shell Script Programming (Article 8239, written by Dave Taylor)
Tagged: auction tracking, classified ad tracking, craigslist, shell script programming

Reader Comments To Date: 6

Dave Taylor said, on June 27, 2008 10:46 AM:
Ah, I knew that small print was in their TOS somewhere or other. Of course, if they *had* a search monitoring service it wouldn't be necessary for a third party to offer it. A surprising absence in my opinion. I mean, it *is* the 21st century, right? :-)

bg said, on June 27, 2008 10:54 AM:
Craigslist does support RSS feeds from searches. You can search for an item and then subscribe to that as a feed. I use it every day with Bloglines.

Ric Turley said, on July 8, 2008 12:28 PM:
I read your excellent article about error handling and stdout / stderr redirection. I have a script in which I want to analyze the error message. I noticed that my script finishes before it writes the error. Is the error message in an accessible variable and, if not, how can I redirect the error message directly to a variable without going through a file?

keith said, on August 22, 2008 1:48 PM:
Is there a way to run a search that does all the North Carolina sites (there are about 8)?

tylermiller said, on December 11, 2008 9:28 PM:
How can I get help with my craigslist account? It says I need to authenticate it by phone. I don't want to do this. How do I fix this problem?
I think the reason you haven't seen this in the past is that it violates Craigslist's terms of use, #7u.
It states:
"Additionally, you agree not to:...
u) use automated means, including spiders, robots, crawlers, data mining
tools, or the like to download data from the Service - unless expressly
permitted by craigslist;"
I don't know how often they watch this, but just thought I should mention it.