Search Typeahead

  • 5

    ran_biron 4 months ago

    Interesting. Some points worth mentioning about optimization: 1. The design uses server-side routing between shards. Instead, you could take a leaf from Google's own file storage mechanism and push "which branch lives on which shard" to the client (using logical names, of course, not IP addresses). That would save one critical hop on a cluster that is going to be super heavy (it receives all calls and forwards them). It's important to note that once a user typing a string is past the common prefix, every keystroke gets sent to the same shard - so we'd really be saving an intra-datacenter call for each query. You could do this after the fact - keeping the routing cluster up but letting the client see where the response came from, so the next iteration could be sent to the correct shard - or pre-fetch that information and make sure the client is notified on change.

    reply
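
The shard-aware routing idea above can be sketched as a client-side routing table that learns "which prefix lives on which shard" from response metadata; all names here (`ShardRouter`, `"shard-7"`, the `"router"` fallback) are illustrative, not from the design:

```python
# Hypothetical sketch of point 1: the client keeps a prefix -> logical
# shard map so follow-up keystrokes go straight to the owning shard,
# skipping the server-side routing hop.

class ShardRouter:
    def __init__(self):
        self.prefix_to_shard = {}  # learned from response metadata

    def shard_for(self, query):
        # Longest learned prefix wins; unknown prefixes fall back to
        # the front-door routing cluster.
        for i in range(len(query), 0, -1):
            shard = self.prefix_to_shard.get(query[:i])
            if shard is not None:
                return shard
        return "router"

    def learn(self, prefix, shard):
        # Called when a response reveals which shard actually served it.
        self.prefix_to_shard[prefix] = shard

router = ShardRouter()
router.learn("mi", "shard-7")
print(router.shard_for("mich"))  # learned prefix "mi" -> shard-7
print(router.shard_for("ab"))    # unknown prefix -> router
```

The "learn after the fact" path is the cheap variant: the table starts empty and only the first query per prefix pays the extra hop.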
  • 3

    ran_biron 4 months ago

    (more points): 2. You could send the common part to the client itself, including the most frequent queries. Let's assume a 10-letter query (kind of long) as the average and assume we're showing the top 5 queries. To cache the first level (first letter) we'd need about 26 (letters) * 5 (number of results) * 10 (length of query) bytes = 1,300 bytes, disregarding the trie structure itself (it can be optimized by using a fixed-width structure). Not too much. The 2nd level would be 1st level * 26 = 33,800 bytes. ~33 KiB to download is OK, but starting to approach noticeable. Of course, since you control the browser, you could pre-fetch it and even fetch deltas instead of re-fetching the full structure if needed. The next level is 878,800 bytes - almost a MiB. Kinda large, but it may be worth it to handle 3 levels of type-ahead on the client side. That would actually save you a lot of round trips, as user behavior with a good type-ahead is to accept the results offered. Note that 878,800 bytes (~858 KiB) is the worst case; since the distribution is not even, we'll likely end up with less. You could further optimize by sending an unbalanced tree that favors deeper levels on the more frequent type-ahead prefixes.

    reply
  • 2

    ran_biron 4 months ago

    (more points): 3. Just like point 2, you could send tree subsets per query response. That would eliminate some of the chattiness of the protocol and allow quicker response to the user after the initial loading. Since we have all the usage statistics, we could send the most likely to be hit subset (unbalanced).

    reply
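
Point 3 can be sketched as a response that bundles a small, frequency-weighted subtree alongside the current suggestions; the `stats` shape, `fanout`, and `depth` parameters are illustrative assumptions, not part of the original design:

```python
# Hypothetical sketch of point 3: alongside the suggestions for the
# current prefix, ship a small subtree of the likeliest next prefixes
# so the next few keystrokes can be answered without a round trip.

def build_response(prefix, suggestions, stats, fanout=3, depth=2):
    # stats maps prefix -> {next_char: frequency}; an assumed shape.
    def subtree(p, d):
        if d == 0 or p not in stats:
            return {}
        # Keep only the most frequent continuations: the unbalanced
        # tree the comment describes.
        top = sorted(stats[p].items(), key=lambda kv: -kv[1])[:fanout]
        return {c: subtree(p + c, d - 1) for c, _ in top}
    return {"suggestions": suggestions, "subtree": subtree(prefix, depth)}

stats = {"mi": {"c": 90, "k": 40, "l": 10, "n": 5},
         "mic": {"h": 80, "r": 20}}
resp = build_response("mi", ["michael", "microsoft"], stats)
print(sorted(resp["subtree"]))       # top-3 continuations of "mi"
print(sorted(resp["subtree"]["c"]))  # continuations of "mic"
```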
  • 2

    ran_biron 4 months ago

    (more points): 4. CDNs are a must here. Since the prefix tree rarely changes (relative to queries), we could make sure to be web-proxy and CDN friendly by tagging the responses with caching headers. Some queries may be answered at the CDN level itself. Note that unless we do some CDN log analysis, we'll be missing those queries in our statistics, which may affect the other optimizations - so we may want clients to send usage statistics over another route (periodic, in bulk).

    reply
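
A minimal sketch of the split point 4 describes: cacheable suggestion responses versus an uncached stats-upload path. The header values and the one-hour lifetime are assumptions for illustration:

```python
# Hypothetical sketch of point 4: suggestion responses carry long
# cache lifetimes so CDNs and web proxies can absorb repeated
# prefixes, while usage statistics travel on a separate, uncached
# bulk endpoint so the origin still sees them.

def suggestion_headers(max_age=3600):
    return {
        # "public" lets shared caches (CDNs) keep it; the prefix
        # tree changes slowly relative to query volume.
        "Cache-Control": f"public, max-age={max_age}",
        "Vary": "Accept-Encoding",
    }

def stats_upload_headers():
    # Statistics must always reach the origin, never a cache.
    return {"Cache-Control": "no-store"}

print(suggestion_headers()["Cache-Control"])
print(stats_upload_headers()["Cache-Control"])
```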
  • 0

    catlover 3 months ago

    Cool ideas @ran_biron!

    reply
  • 1

    adarsh_koyya 3 months ago

    But how would we store the trie in a DB?

    reply
    • 0

      sanketb 13 days ago

      Exactly my question.

      reply
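
One common answer to the question above (a sketch, not the article's design): flatten the trie into prefix -> top-k rows, which fit any key-value or relational store. `sqlite3` stands in for the database here, and `max_prefix`/`k` are illustrative parameters:

```python
# Persisting a typeahead trie as flat prefix -> top-k rows.
import json
import sqlite3

def store_trie(conn, queries, k=5, max_prefix=3):
    # queries maps full query string -> frequency (assumed shape).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trie (prefix TEXT PRIMARY KEY, topk TEXT)")
    by_prefix = {}
    for q, freq in queries.items():
        for i in range(1, min(len(q), max_prefix) + 1):
            by_prefix.setdefault(q[:i], []).append((freq, q))
    for prefix, items in by_prefix.items():
        # Keep only the k most frequent queries per prefix.
        top = [q for _, q in sorted(items, reverse=True)[:k]]
        conn.execute("INSERT OR REPLACE INTO trie VALUES (?, ?)",
                     (prefix, json.dumps(top)))

def lookup(conn, prefix):
    row = conn.execute(
        "SELECT topk FROM trie WHERE prefix = ?", (prefix,)).fetchone()
    return json.loads(row[0]) if row else []

conn = sqlite3.connect(":memory:")
store_trie(conn, {"michael": 90, "microsoft": 80, "milk": 10})
print(lookup(conn, "mi"))   # ['michael', 'microsoft', 'milk']
print(lookup(conn, "mil"))  # ['milk']
```

Precomputing the top-k per prefix trades write-time work for O(1) reads, which matches the read-heavy typeahead workload.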
  • 0

    mrincodi 2 months ago

    A correction: for the question "Do we need to account for spelling mistakes?" you give the answer:
    "Example: Should typing mik give michael as a suggestion because michael is really popular as a query?"
    But this is obviously not an answer to the question. You should say "Yes" or "No".

    reply
  • 0

    dricketts 8 days ago

    Why would a writer thread need to take a lock on a trie in order to avoid inconsistent reads? As stated in other parts of the question, consistency isn't very important for this application. Writers should have exclusive access over other writers, but it should be fine for reads to occur concurrently.

    reply
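
One way to get the semantics the comment above argues for - concurrent, never-blocking reads with mutually exclusive writers - is copy-on-write with an atomic reference swap. A sketch under assumed names (`TypeaheadIndex` is illustrative):

```python
# Readers never take a lock: they read the current snapshot
# reference. A writer builds a fresh copy under a lock that only
# excludes other writers, then swaps the root reference atomically.
import threading

class TypeaheadIndex:
    def __init__(self, suggestions):
        self._data = dict(suggestions)   # immutable-by-convention snapshot
        self._write_lock = threading.Lock()

    def read(self, prefix):
        # Lock-free: a reader sees either the old or new snapshot,
        # never a half-updated one.
        return self._data.get(prefix, [])

    def update(self, prefix, results):
        with self._write_lock:           # writers serialize here
            new_data = dict(self._data)
            new_data[prefix] = results
            self._data = new_data        # atomic reference swap

idx = TypeaheadIndex({"mi": ["michael"]})
idx.update("mi", ["michael", "microsoft"])
print(idx.read("mi"))  # ['michael', 'microsoft']
```

A reader mid-iteration may see slightly stale results, which is exactly the relaxed consistency the comment says is acceptable for typeahead.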