How do you pull valuable, relevant information from hundreds of public web pages every week?

Overview

IDI was spending hundreds of hours manually parsing data on potential deals. Echelon DS developed a custom app that scrapes 100+ unique government entity websites and sends AI-generated summaries of all relevant newly published content, reducing IDI’s time-to-value on public information and freeing up their team for higher-value initiatives.

Development Summary

1. Initial Crawling & Targeting:

The project started with a list of 100+ public web pages for municipalities and counties that the client identified as high-potential data sources. A simple HTML scraper built with Beautiful Soup and Requests ran on an AWS EC2 instance and downloaded all content from the provided websites, including linked files and downstream URLs. The files were stored on AWS S3, using Boto3 to stream each file directly from memory and avoid unnecessary storage reads/writes. This initial scrape was analyzed to understand the web pages' topology and identify where relevant documents (e.g., meeting minutes) are housed and updated.
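The in-memory streaming step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the key layout, the `fetch` callable, and the function names are hypothetical, and `s3_client` is assumed to behave like a Boto3 S3 client (only its `upload_fileobj` method is used).

```python
import io
from urllib.parse import urlparse


def s3_key_for(url, prefix="raw"):
    """Map a source URL to a deterministic S3 key (hypothetical layout)."""
    parsed = urlparse(url)
    path = parsed.path.strip("/") or "index.html"
    return f"{prefix}/{parsed.netloc}/{path}"


def stream_to_s3(url, fetch, s3_client, bucket):
    """Download `url` and upload the bytes to S3 without writing to local disk.

    `fetch` is any callable that returns the response body as bytes, for
    example: lambda u: requests.get(u, timeout=30).content
    `s3_client` is assumed to behave like boto3.client("s3").
    """
    body = fetch(url)
    key = s3_key_for(url)
    # Wrap the bytes in a file-like object so they go straight from memory to S3.
    s3_client.upload_fileobj(io.BytesIO(body), bucket, key)
    return key
```

Passing the fetcher and the S3 client in as arguments keeps the sketch testable without network access or AWS credentials.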

2. Scraper Infrastructure:

  • JavaScript Handling: Some websites used JavaScript for interactive portals, so the scraper was enhanced to handle JavaScript using Selenium 4.

  • Selective Crawling: After identifying the key document areas, the scraper was optimized to only target pages housing relevant files, minimizing unnecessary downloads.

3. Efficient Data Handling:

  • Avoid Redundant Downloads: For each web page in S3, the scraper compares the download history against new files to ensure that only newly uploaded documents are downloaded, avoiding repetitive file retrieval and unnecessary storage costs.

  • Batch Processing & Parallelization: The scraping sub-problem is embarrassingly parallel, since each entity's pages can be scraped independently of every other entity's. To speed up processing, entities were batched into 6 groups that are scraped in parallel. While only 80 pages were targeted in this POC, this design could support a national-scale buildout for the client without a large increase in runtime or infrastructure.
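The two ideas in this section, skipping files that were already downloaded and scraping batches in parallel, can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, the download history is modeled as a plain collection of URLs, and a thread pool stands in for whatever parallelization mechanism the project actually used.

```python
from concurrent.futures import ThreadPoolExecutor


def new_files(listed_urls, download_history):
    """Return URLs not seen in any previous run, preserving page order."""
    seen = set(download_history)
    fresh = []
    for url in listed_urls:
        if url not in seen:
            seen.add(url)  # also drops duplicates within the same page
            fresh.append(url)
    return fresh


def make_batches(entities, n_batches=6):
    """Split entities round-robin into at most `n_batches` non-empty groups."""
    batches = [entities[i::n_batches] for i in range(n_batches)]
    return [batch for batch in batches if batch]


def scrape_all(entities, scrape_entity, n_batches=6):
    """Scrape each batch in its own worker; entities share no state."""
    def run_batch(batch):
        return [(entity, scrape_entity(entity)) for entity in batch]

    batches = make_batches(entities, n_batches)
    with ThreadPoolExecutor(max_workers=n_batches) as pool:
        results = list(pool.map(run_batch, batches))
    # Flatten the per-batch results into one {entity: result} mapping.
    return dict(pair for batch in results for pair in batch)
```

Because no batch depends on another, adding entities only lengthens each batch, and raising `n_batches` (or adding workers) is what keeps the total runtime from growing.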

4. Automation & Scheduling:

  • Weekly Automation: A cron job on the EC2 instance triggers the scraper weekly, so the system runs without manual intervention. The scraper inserts a random delay between HTTP requests to avoid being flagged as a denial-of-service (DoS) attack and blocked by the site.

  • Error Monitoring: A bot was developed using the Slack API to catch any errors during scraping or summarization. It tracks basic statistics like runtime and file downloads to identify potential issues before summary emails are sent to the client.

  • Robustness: The scraper was designed to keep running when an individual page fails, so a single error does not halt the whole run.
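The random delay, per-page error handling, and basic run statistics described above might look roughly like this. It is a sketch under stated assumptions: the delay bounds, the statistics fields, and the function name are hypothetical, and the delay is applied between pages here rather than between every individual HTTP request.

```python
import random
import time


def polite_scrape(pages, scrape_page, min_delay=1.0, max_delay=5.0, sleep=time.sleep):
    """Scrape pages one by one, pausing randomly and surviving failures."""
    stats = {"succeeded": 0, "failed": 0, "errors": []}
    start = time.monotonic()
    for i, page in enumerate(pages):
        if i > 0:
            # Random pause so the traffic does not look like a flood of requests.
            sleep(random.uniform(min_delay, max_delay))
        try:
            scrape_page(page)
            stats["succeeded"] += 1
        except Exception as exc:
            # Record the failure and move on; one bad page must not stop the run.
            stats["failed"] += 1
            stats["errors"].append(f"{page}: {exc!r}")
    stats["runtime_seconds"] = time.monotonic() - start
    return stats
```

The returned dictionary is the kind of summary a monitoring bot could post to a Slack channel before the client email goes out.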

5. Summarization Using LLM:

  • Document Relevance Assessment: Each week, new files (PDFs, text, HTML, DOCs) stored in S3 are converted to text using open-source loaders such as pypdf. Each text is then assessed for relevance by an LLM (GPT-4o, integrated via the OpenAI API) based on specific keywords and zoning codes provided by the client.

  • Document Summarization: If a text is deemed relevant (a minority of documents), it is summarized and saved in a dated JSON file alongside relevant metadata (source URL, county name, document type, government entity). LangChain was used as an orchestration tool to pipe text through the LLM, and prompt engineering was used to achieve the desired output. The JSON file is then saved to S3.
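The filter-then-summarize flow above can be sketched as follows. The LLM calls are deliberately left out: `is_relevant` and `summarize` are hypothetical callables standing in for the GPT-4o prompts, and the record fields and file naming are assumptions based on the metadata listed in the text.

```python
import json
from datetime import date


def build_weekly_summary(documents, is_relevant, summarize, run_date=None):
    """Keep only relevant documents and return (filename, JSON payload).

    `is_relevant` and `summarize` stand in for the LLM calls; any callables
    that take the document text will do.
    """
    run_date = run_date or date.today()
    records = []
    for doc in documents:
        if not is_relevant(doc["text"]):
            continue  # most documents are expected to be filtered out here
        records.append({
            "source_url": doc["source_url"],
            "county": doc["county"],
            "document_type": doc["document_type"],
            "entity": doc["entity"],
            "summary": summarize(doc["text"]),
        })
    filename = f"summaries_{run_date.isoformat()}.json"
    return filename, json.dumps(records, indent=2)
```

Running the cheaper relevance check first means the summarization prompt is only spent on the minority of documents that pass it.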

6. Output & Delivery:

  • HTML Generation: The JSON is converted into HTML using Markdown, formatted for delivery, and saved to S3. For each relevant file, the HTML includes a summary, a link to the source file, and the government entity. Files are grouped by entity for readability.

  • Automated Email: A Gmail API script sends the generated HTML to a predefined contact list. Two cron jobs handle this process: one sends a test email internally on Sundays, and the other sends the final version to the client on Mondays at 8 AM.
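The grouping of summaries by entity in the HTML generation step can be sketched like this. Note the swap: the project rendered HTML via Markdown, while this illustration builds the HTML directly with the standard library to stay self-contained. The record fields and the function name are hypothetical.

```python
from collections import defaultdict
from html import escape


def render_digest(records):
    """Render summary records as HTML, clustered by government entity."""
    by_entity = defaultdict(list)
    for rec in records:
        by_entity[rec["entity"]].append(rec)

    parts = []
    for entity in sorted(by_entity):
        parts.append(f"<h2>{escape(entity)}</h2>")
        parts.append("<ul>")
        for rec in by_entity[entity]:
            # Escape both the text and the URL so stray characters cannot break the markup.
            link = f'<a href="{escape(rec["source_url"], quote=True)}">source</a>'
            parts.append(f"<li>{escape(rec['summary'])} ({link})</li>")
        parts.append("</ul>")
    return "\n".join(parts)
```

Sorting the entity names gives the weekly email a stable order, which makes it easier for readers to find the same municipality from one week to the next.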

7. Robustness & Development Process:

  • All development was done in a shared GitHub repository, which provided version control and let multiple people work on the codebase simultaneously. The code was deployed on EC2 via SSH. The cron job configuration is maintained separately from the code.
