I was given the task of extracting content from the websites of local community colleges and four-year colleges. The goal was to collect faculty names, phone numbers, internal addresses, emails, departments, and mailing addresses for various colleges throughout Illinois. Originally, this was done by copying and pasting each entry from the website into an Excel sheet. The work may be easier and faster with a web crawler. I tried existing web crawlers that were free to download, but they did not accomplish what we needed. So we've turned to Nutch. Nutch is open source and will allow us to build our own web crawler.
Currently, I'm still working on designing the web crawler.
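
To give a rough idea of the extraction step that replaces the copy-and-paste work, here is a minimal sketch in Python. It is not the Nutch crawler itself, only an illustration of pulling fields out of one directory page and writing them as CSV, which Excel can open. The sample HTML, the column order (name first, department second), and the names `RowCollector`, `extract_faculty`, and `to_csv` are all assumptions made for this example; real college pages will be laid out differently.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Hypothetical directory page; real pages will need their own handling.
SAMPLE_HTML = """
<table>
  <tr><td>Jane Doe</td><td>Biology</td><td>jdoe@example.edu</td><td>(555) 010-1234</td></tr>
  <tr><td>John Roe</td><td>History</td><td>jroe@example.edu</td><td>555-010-5678</td></tr>
</table>
"""

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
FIELDS = ["name", "department", "email", "phone"]


class RowCollector(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def extract_faculty(html):
    """Return one dict per table row, assuming name and department come first."""
    parser = RowCollector()
    parser.feed(html)
    records = []
    for cells in parser.rows:
        text = " ".join(cells)
        email = EMAIL_RE.search(text)
        phone = PHONE_RE.search(text)
        records.append({
            "name": cells[0],
            "department": cells[1] if len(cells) > 1 else "",
            "email": email.group(0) if email else "",
            "phone": phone.group(0) if phone else "",
        })
    return records


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_faculty(SAMPLE_HTML)))
```

In a real crawl, the HTML would come from the pages the crawler fetches rather than a hard-coded string, and each college's directory would probably need its own extraction rules.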