The Interview Cake Course (9. System design)

Design a URL shortener

You know, like bit.ly.

Let’s call it ca.ke!

Step 1 is to scope the project. System design questions like this are usually intentionally left open-ended, so you have to ask some questions and make some decisions about exactly what you’re building to get on the same page as your interviewer.

So, what are we building? What features might we need?

I’m making a search engine called MillionGazillion™.

I wrote a crawler that visits web pages, stores a few keywords in a database, and follows links to other web pages. I noticed that my crawler was wasting a lot of time visiting the same pages over and over, so I made a set, visited, where I’m storing URLs I’ve already visited. Now the crawler only visits a URL if it hasn’t already been visited.

Thing is, the crawler is running on my old desktop computer in my parents’ basement (where I totally don’t live anymore), and it keeps running out of memory because visited is getting so huge.

How can I trim down the amount of space taken up by visited?

You left your computer unlocked and your friend decided to troll you by copying a lot of your files to random spots all over your file system.

Even worse, she saved the duplicate files with random, embarrassing names (“this_is_like_a_digital_wedgie.txt” was clever, I’ll give her that).

Write a function that returns a list of all the duplicate files. We’ll check them by hand before actually deleting them, since programmatically deleting files is really scary. To help us confirm that two files are actually duplicates, return a list of tuples ↴ where:

  • the first item is the duplicate file
  • the second item is the original file

For example:

[('/tmp/parker_is_dumb.mpg', '/home/parker/secret_puppy_dance.mpg'),
 ('/home/trololol.mov', '/etc/apache2/httpd.conf')]

You can assume each file was only duplicated once.

This is incomplete. Please upload the full content.