Design Cache

  • 10

    dooby_do 4 months ago

    I was asked this problem in Google's interview for L5 position.

    reply
    • 0

      kaidul 2 months ago

      For curiosity, did you make it?

      reply
  • 2

    kumar955 4 months ago

I think in most interviews, the candidate is expected to calculate storage, RAM size, upstream and downstream bandwidth, number of servers, etc. (I mean we cannot assume most of these things, except one or two like the data size).

    reply
  • 1

    venknar 4 months ago

It is somewhat similar to designing a distributed hash table.

    reply
  • 1

    deep_saxena 4 months ago

How is the QPS calculated here? I am not following the exact calculation. Can somebody explain it?

    reply
    • 0

      venknar 4 months ago

The QPS is calculated as follows: earlier we mentioned that each machine would have 72 GB of RAM. To serve 30 TB of cache, the number of machines required would be 30 TB / 72 GB, which is close to 420. Assume we have 420 machines to serve 30 TB of distributed cache. Now, regarding the QPS, the requirement was 10 M.

      reply
    • 0

      venknar 4 months ago

Now, per machine, the QPS would be 10M / 420, which is approximately 23,000 QPS. So each machine should be able to handle about 23,000 QPS. The approach is similar to how we decided on the number of machines based on the per-machine RAM and the total cache size: the per-machine QPS is the total QPS divided by the number of machines.

      reply
    • 0

      venknar 4 months ago

Next, assuming that a machine has to serve 23,000 QPS and has 4 cores, we calculate the per-request time as: CPU time available per query = 4 * 1,000,000 / 23,000 microseconds = 174 us. (Note everything is converted to microseconds.) So each machine has to return the query within 174 us of CPU time. This is how the latency budget is derived from the QPS. Then, based on the read/write traffic and the latency numbers in https://gist.github.com/jboner/2841832, the estimate is further refined by increasing the number of machines.

      reply
      • 0

        skcoder 4 months ago

I am coming up with 23,000 queries / 1 sec, i.e. 1,000,000 us / 23,000 queries ≈ 43 us per query. So ~43 us per query of service time is required. I don't see why we would calculate this based on CPU time, or why my number is so far off. The only thing I can think of is that if I multiply by 4 CPUs, each CPU services 1 query every ~174 us (matching the figure above) — but why would we calculate this per CPU rather than per server?

        reply
        • 0

          skcoder 4 months ago

OK, I figured it out. Each CPU has to return a query in 174 us, but each CPU only has to service ~6K queries/sec, not 23K. 23K is per machine.

          reply
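The back-of-the-envelope arithmetic in the thread above can be checked with a few lines of Python. The inputs (30 TB cache, 72 GB RAM per machine, 10 M total QPS, 4 cores per machine) come from the comments; the rounding to "~420 machines" is the commenter's, not an exact figure.

```python
# Capacity-estimation sketch using the numbers quoted in the thread above.
# These inputs are the thread's assumptions, not fixed requirements.
TOTAL_CACHE_TB = 30
RAM_PER_MACHINE_GB = 72
TOTAL_QPS = 10_000_000
CORES_PER_MACHINE = 4

# Machines needed to hold the cache in RAM (30 TB / 72 GB ≈ 426, ~420 in the thread).
machines = (TOTAL_CACHE_TB * 1024) // RAM_PER_MACHINE_GB

# Per-machine load: total QPS spread evenly across machines.
qps_per_machine = TOTAL_QPS / machines

# CPU time available per query, in microseconds:
# 4 cores * 1,000,000 us/sec divided by the per-machine QPS (~174 us).
cpu_time_per_query_us = CORES_PER_MACHINE * 1_000_000 / qps_per_machine

print(machines, round(qps_per_machine), round(cpu_time_per_query_us))
```

Running this reproduces the thread's figures: roughly 420+ machines, ~23–24K QPS per machine, and a CPU budget of ~170 us per query.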
  • 3

    daniele_broccolo 4 months ago

I clicked on one of the links for the implementation of the least recently used cache... the timer started, but I did not have time to complete the task. I think you should warn people before opening the problem pages.

    reply
    • 0

      catlover 4 months ago

Yeah, that's right. There should be a warning. It also causes problems when landing from Google search results.

      reply
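For reference, the "least recently used cache" problem mentioned above can be sketched in a few lines with Python's `OrderedDict` (a minimal illustration, not a production or interview-complete solution; the class name and capacity handling are this sketch's choices):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache sketch: OrderedDict keeps insertion order,
    so the least recently used key is always at the front."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```

Both `get` and `put` are O(1); a hand-rolled version would pair a hash map with a doubly linked list to get the same bounds.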
  • 1

    amitvc 4 months ago

    What about sharding algorithms ? How does the caller know which server to go for a specific key ? Like memcache uses hashing to determine which server the key resides on.

    reply
    • 1

      kousik_nath 4 months ago

You can use consistent hashing to determine the exact shard for a key. But in case of a cache server failure, consistent hashing alone can't recover the data. You need replica cache servers so that disaster recovery can happen from the replicated cache data. You can google consistent hashing; there are a lot of articles about it.

      reply
      • 1

        amitvc 4 months ago

Thanks, Kousik. I am aware of consistent hashing. My question was more that the problem doesn't discuss sharding at all.

        reply
        • 1

          rashmiiiith 4 months ago

I don't know how relevant it is to your question, but this paper is worth reading.

          reply
        • 0

          rashmiiiith 4 months ago

          http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

          reply
          • 1

            amitvc 3 months ago

Thank you. I believe Cassandra also uses consistent hashing.

            reply
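The consistent hashing scheme discussed above can be sketched as a ring of hashed "virtual node" points (a toy illustration under this sketch's own naming, not memcache's or Cassandra's actual implementation): each server gets several points on the ring, and a key is owned by the first server clockwise from its hash, so removing a server only remaps the keys that sat on its points.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes.

    A key maps to the first server point clockwise from hash(key).
    Adding or removing a server only remaps keys adjacent to its points,
    unlike naive `hash(key) % num_servers` sharding.
    """

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash_value, server)
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server: str):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server: str):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def get(self, key: str) -> str:
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):  # wrap around the ring
            idx = 0
        return self._ring[idx][1]
```

Because only the departed server's ring points disappear, keys that were mapped to the surviving servers stay exactly where they were, which is the property that makes this useful for cache sharding.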
  • 0

    velupula 3 months ago

Partitioning and fault tolerance are important topics that should have been covered in this design.

    reply
  • 0

    aniket15 26 days ago

Latency is always an important metric for distributed, scalable systems. Can there be scenarios where it's not that important?

    reply