Since my last blog post, I've spent most of my time getting Puppeteer, a browser automation library I use for web scraping, to run in a production environment, and setting up a cron job to run the scraper periodically.
I initially tried following some tutorials using AWS SAM (Serverless Application Model), which is supposed to offer a great developer experience. With AWS SAM, you can create Lambdas, test them, and push them to a production environment directly from the command line, without relying on the AWS web console. However, I ran into issues with AWS's access, permission, and role management (IAM), so I ended up using the web console after all, where I successfully got a Puppeteer script to run.
When uploading the script via the AWS web console, the zipped code needed to be under 50MB. Otherwise, I had to upload it to AWS S3 (which is essentially Google Drive for AWS) and give the Lambda the S3 URL. This led me to explore how code, especially Node.js/JavaScript code, is prepared and built for production, which is an entirely different world in itself. It also got me investigating how my app's dependencies can come in different versions, and how a module sometimes requires a specific version of another module to work. For instance, I used a package called chromium, which provides a Chromium browser build that Puppeteer can control. Each Puppeteer release is meant to work with a particular Chromium version, so the two packages have to be kept in step.
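To make the upload rule concrete, here is a minimal sketch of the decision I was making by hand. The function name and the "direct"/"s3" labels are my own, and I'm assuming the 50MB limit means 50 × 1024 × 1024 bytes and that the zip has to be strictly under it; check the Lambda quotas page for the exact definition.

```typescript
// Assumption: 50MB is counted as 50 * 1024 * 1024 bytes.
const DIRECT_UPLOAD_LIMIT_BYTES = 50 * 1024 * 1024;

type UploadMethod = "direct" | "s3";

// Hypothetical helper: decides how a zipped Lambda package should be uploaded.
function chooseUploadMethod(zipSizeBytes: number): UploadMethod {
  if (!Number.isFinite(zipSizeBytes) || zipSizeBytes < 0) {
    throw new RangeError("zipSizeBytes must be a non-negative number");
  }
  // Small packages can be uploaded directly; larger ones go through S3.
  return zipSizeBytes < DIRECT_UPLOAD_LIMIT_BYTES ? "direct" : "s3";
}

console.log(chooseUploadMethod(10 * 1024 * 1024)); // a 10MB zip
console.log(chooseUploadMethod(80 * 1024 * 1024)); // an 80MB zip
```

A bundle that includes a browser binary can easily land in the second case, which is why I ended up learning about S3 at all.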
This experience made me realize that this work is highly specialized, and I might not use many of these intricate details in future projects. However, it has undoubtedly taught me perseverance.
Getting the scraping code to run periodically was another challenge, but I managed to hack my way to a solution. Cron is a scheduler found on Linux/Unix servers, and a cron job is a piece of code it triggers at scheduled intervals. Vercel, the platform I use to host my app, only allows cron jobs with a minimum interval of one day on my plan. Fortunately, I discovered that AWS offers a service called Amazon EventBridge, which lets you trigger a Lambda on much finer schedules, down to once a minute!
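EventBridge schedules are written as short expressions, either `rate(...)` or `cron(...)`. The sketch below builds both kinds as strings. The helper names are hypothetical (they are not part of any AWS SDK), and it only covers the simple cases I needed: a fixed rate, and a once-a-day run at a given UTC time.

```typescript
type RateUnit = "minute" | "hour" | "day";

// Hypothetical helper: builds an EventBridge rate expression.
function rateExpression(value: number, unit: RateUnit): string {
  if (!Number.isInteger(value) || value < 1) {
    throw new RangeError("value must be a positive integer");
  }
  // EventBridge expects a singular unit for 1 and a plural unit otherwise.
  const suffix = value === 1 ? unit : `${unit}s`;
  return `rate(${value} ${suffix})`;
}

// Hypothetical helper: builds an EventBridge cron expression for a daily run.
// EventBridge cron has six fields:
//   minutes hours day-of-month month day-of-week year
// One of day-of-month / day-of-week has to be "?".
function dailyCronExpression(hourUtc: number, minute: number): string {
  if (!Number.isInteger(hourUtc) || hourUtc < 0 || hourUtc > 23) {
    throw new RangeError("hourUtc must be an integer from 0 to 23");
  }
  if (!Number.isInteger(minute) || minute < 0 || minute > 59) {
    throw new RangeError("minute must be an integer from 0 to 59");
  }
  return `cron(${minute} ${hourUtc} * * ? *)`;
}

console.log(rateExpression(15, "minute")); // rate(15 minutes)
console.log(dailyCronExpression(9, 30));   // cron(30 9 * * ? *)
```

Note that this is not the same as the five-field cron syntax on a Linux server: EventBridge adds a year field and uses `?` in one of the two day fields.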