## CS 3700 Project 5: Web Crawler

#### Daniel Szyc and Michelle Zinger

--- 
## High-Level Approach

To begin, we took the time to thoroughly understand the starter code, since we would be building on it. We then read the "Implementation Details and Hints" section of the instructions to get a feel for the recommended approach. We also worked through the HTTP Made Really Easy tutorial linked in the instructions to understand HTTP and the differences between HTTP 1.0 and 1.1. Once we felt comfortable with the protocol, we began modifying the starter code to implement HTTP 1.1/HTTPS.

We decided to skip ``Connection: keep-alive`` and ``Accept-Encoding: gzip`` at first, reasoning that we could come back to them if necessary. The next logical step was logging in via a POST request. This required cookie management, so we implemented parsing to extract cookies from the response to the initial GET request and send them with the POST request. We also parsed the login page HTML for the CSRF middleware token so that it could be submitted with the POST. Next, we moved on to scraping.
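The login flow described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the helper names (`parse_cookies`, `find_csrf_token`, `build_login_post`), the form field names (`username`, `password`, `csrfmiddlewaretoken`), and the assumption that the hidden input lists `name` before `value` are all hypothetical.

```python
import re
from urllib.parse import urlencode


def parse_cookies(raw_headers: str) -> dict:
    """Collect name=value pairs from every Set-Cookie header line."""
    cookies = {}
    for line in raw_headers.split("\r\n"):
        name, _, value = line.partition(":")
        if name.strip().lower() == "set-cookie":
            # Keep only the cookie itself; drop attributes such as Path or HttpOnly.
            pair = value.strip().split(";", 1)[0]
            key, _, val = pair.partition("=")
            cookies[key.strip()] = val.strip()
    return cookies


def find_csrf_token(html: str):
    """Return the hidden token value, or None if the page has no such field.

    Assumes the attribute order name="..." value="..." (hypothetical markup).
    """
    match = re.search(r'name="csrfmiddlewaretoken"\s+value="([^"]+)"', html)
    return match.group(1) if match else None


def build_login_post(host, path, cookies, token, username, password):
    """Format an HTTP/1.1 POST request carrying the cookies and the token."""
    # Field names here are assumptions; a single misnamed field is enough
    # for the server to reject the login.
    body = urlencode({
        "username": username,
        "password": password,
        "csrfmiddlewaretoken": token,
    })
    cookie_header = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Cookie: {cookie_header}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
        f"{body}"
    )
```

The `Content-Length` value is computed from the encoded body, which is ASCII after `urlencode`, so the character count matches the byte count.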

We initially implemented frontier tracking, which simply searched for uncrawled URLs in the HTML and added them to a list until the crawler was ready to visit them. When we found that the program could run forever, we added HTTP status code handling and updated the frontier tracking to resolve the issue.
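As an illustration of what status code handling can look like, here is a small sketch that parses a raw response and maps the status to a crawler action. The function names, the action labels, and the particular set of codes (200, 301/302, 403/404, 503) are assumptions made for this example, not a record of the exact codes the assignment specifies.

```python
def parse_response(raw: str):
    """Split a raw HTTP response into (status_code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Status line looks like: HTTP/1.1 302 Found
    status = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Note: repeated headers (e.g. Set-Cookie) overwrite each other here.
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


def next_action(status, headers, url):
    """Map a status code to a hypothetical (action, target_url) pair."""
    if status == 200:
        return ("parse", url)                    # scan the body for links
    if status in (301, 302) and "location" in headers:
        return ("follow", headers["location"])   # request the new location
    if status in (403, 404):
        return ("drop", url)                     # abandon this URL
    if status == 503:
        return ("retry", url)                    # put the URL back and try again
    return ("drop", url)                         # anything unexpected is abandoned
```

Returning an explicit action for every status, including unexpected ones, keeps the crawl loop from spinning on a response it does not know how to handle.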

Once our initial implementation was complete, we walked back through our design to verify it. This included ensuring that we were correctly wrapping our socket, reading data, handling each of the specified status codes, accurately tracking the frontier, and only traversing URLs that point to pages on the server.
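One way to restrict traversal to pages on the same server is to resolve every link against the current page and compare hosts. The sketch below uses only the standard library; `LinkCollector` and `same_host_links` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_host_links(html: str, base_url: str) -> list:
    """Return absolute URLs from html that share base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)   # resolves relative paths
        if urlparse(absolute).netloc == host:
            links.append(absolute)
    return links
```

Links such as `mailto:` addresses resolve to an empty host, so they are filtered out along with links to other servers.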

---

## Challenges

There were two main challenges in this project. The first came down to attention to detail: we spent a handful of hours receiving mysterious ``403 Forbidden`` responses to our POST requests. After talking to multiple TAs to no avail, we concluded that the problem had to be in how we formatted our POST requests. We went through the request carefully, checking for extra spaces and typos, and eventually found the culprit: a misnamed field in the code that builds the POST request body.

The second challenge was implementing the frontier tracking logic correctly. We mistakenly checked the ``seen`` condition in the loop iterating over the frontier list. This led to an infinite loop and millions of requests to the server (sorry). Once that was fixed, the program worked as expected.
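A common way to avoid this class of bug is to record a URL as seen at the moment it is added to the frontier, so that each URL can enter the queue at most once. The sketch below illustrates that pattern under stated assumptions: `crawl`, `get_links`, and `max_pages` are hypothetical names, and `get_links` stands in for fetching a page and extracting its links.

```python
from collections import deque


def crawl(start, get_links, max_pages=10_000):
    """Breadth-first traversal that enqueues each URL at most once."""
    frontier = deque([start])
    seen = {start}          # every URL that has ever entered the frontier
    visited = []            # URLs in the order they were actually fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            # Check and mark at enqueue time, not while scanning the frontier.
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

Because `seen` is a set, the membership check stays cheap as the crawl grows, and the `max_pages` cap is a safety net that bounds the number of requests even if something else goes wrong.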

---


## Testing

We tested our code extensively, using debugging strategies to understand how the program handled different HTTP status codes and how quickly it crawled the website. We also used the built-in developer tools to inspect the HTTP requests and cross-checked them against the requests our crawler sends to confirm their accuracy. We each ran the program on our local machines and retrieved our own flags to build confidence in its correctness.