Search Typeahead

  • 16

    ran_biron about 1 year ago

    Interesting. Some points worth mentioning about optimization: 1. The design uses a server-side routing between shards. Instead, you could take a leaf from Google's own file storage mechanism and push "which branch is at which shard" to the client (using logical names of course, not IP addresses). That would save you one critical hop on a cluster that is going to be super heavy (receive all calls and forward them). It's important to note that a user starting to type a string will always get sent to the same shard once he's past the common prefix - so we'll really be saving an intra datacenter call for each query. You could do this after-the-fact - meaning keeping the cluster up but letting the client see where the response came from so the next iteration could be sent to the correct shard, or pre-fetch that information and make sure the client is notified on change.

    reply
  • 7

    ran_biron about 1 year ago

    (more points): 2. You could send the common part to the client itself, including the most frequent queries. Let's assume a 10 letter query (kind of long) as average and assume we're showing the top 5 queries. To cache the first level (first letter) we'll need about 26 (letters) * 5 (number of results) * 10 (length of query) bytes (disregarding the trie structure itself - it can be optimized by using a fixed width structure) = 1300 bytes. not too much. The 2nd level would be 1st level * 26 = 33800 bytes. 33Kib to download is ok, but starting to approach noticeable. Of course, since you control the browser you could pre-fetch it and even fetch deltas instead of re-fetching the full structure if needed. The next level is 878800 - almost a MiB. Kinda large - but may be worth it to save 3 levels of type-ahead on the client side. That will actually save you a lot of roundtrips as user behavior with a good type-ahead is to accept the results offered. Note that the 878KiB is the worst-case. Since distribution is not even, we'll likely end up with less. You could further optimize by sending an unbalanced tree that favor deeper levels on more frequent type-ahead prefixes.

    reply
    • 2

      rushi_agrawal 5 months ago

      I'd say, few hundreds of kilobytes for a typeahead is way too much data to transfer, especially when we also intend to cater to mobile devices (with expensive data plans).

      reply
  • 3

    ran_biron about 1 year ago

    (more points): 3. Just like point 2, you could send tree subsets per query response. That would eliminate some of the chattiness of the protocol and allow quicker response to the user after the initial loading. Since we have all the usage statistics, we could send the most likely to be hit subset (unbalanced).

    reply
  • 3

    ran_biron about 1 year ago

    (more points): 4. CDNs are a must here. Since the prefix tree rarely changes (relative to queries), we could make sure to be web-proxy and CDN friendly by loading the response with caching headers. Some queries may be eliminated at the CDN level itself. Note that unless we do some CDN analysis, we'll be missing queries for statistics which may affect other optimizations, so we may want clients to send usage statistics using another route (periodic, in bulks).

    reply
    • 0

      rushi_agrawal 5 months ago

      Question: won't involving CDN be problematic, as the typeahead results will need updation as time passes. Also, same in case in case of a very popular recent news.

      reply
  • 0

    catlover about 1 year ago

    Cool ideas @ran_biron!

    reply
  • 2

    adarsh_koyya about 1 year ago

    But How would we store Trie in DB ?

    reply
    • 0

      sanketb 11 months ago

      Exactly my question.

      reply
    • -2

      himanshu_setia 11 months ago

      Maintain a shard of 26 nodes for 26 characters. Each node in this shard maintains words from a to z respectively and also acts as a parent/routing node.

      reply
      • 0

        himanshu_setia 11 months ago

        As the count for a prefix (say "as") goes beyond a threshold, a new shard is added and the corresponding trie is moved there. What parent maintains is a reference to this shard now corresponding to its key.

        reply
      • 1

        himanshu_setia 11 months ago

        As the count for a prefix (say "as") goes beyond a threshold, a new shard is added and the corresponding trie is moved there. What parent maintains is a reference to this shard now corresponding to its key.

        reply
  • 0

    mrincodi about 1 year ago

    A correction:

    reply
  • 1

    mrincodi about 1 year ago

    A correction: In Q: Do we need to account for spelling mistakes ? you give:
    A: Example : Should typing mik give michael as a suggestion because michael is really popular as a query?
    But this is obviously not an answer to the question. You should say "Yes" or "No".

    reply
    • 0

      thecoder2016 3 months ago

      Spelling mistake will become an Edit distance problem. I think answer should be No.

      reply
  • 0

    dricketts 11 months ago

    Why would a writer thread need to take a lock on a trie in order to avoid inconsistent reads? As stated in other parts of the question, consistency isn't very important for this application. Writers should have exclusive access over other writers, but it should be fine for reads to occur concurrently.

    reply
    • 0

      syst3mw0rm 6 months ago

      When the article says inconsistency, it means that two users can see different results for some time before they become eventually consistent. But if two threads are doing some operations on data, you must ensure that operations don't interleave. You may end up in an illegal state, which is unacceptable.

      reply
  • 0

    himanshu_setia 11 months ago

    Maintain a shard of 26 nodes for 26 characters. Each node in this shard maintains words from a to z respectively and also acts as a parent/routing node.

    reply
  • 0

    hardyvey 9 months ago

    What would be the schema for storing the trie ? Would we use SQL or no-SQL database ?

    reply
    • 0

      ericshape 8 months ago

      trie can be design as multi level hashtable. thus, the k v data store is the choice.

      reply
      • 1

        thecoder2016 3 months ago

        My thought is that we should store every query by users in DB or NOSQL storage. Keep trie totally different as a cache. We can always read data from DB and build trie again. Typeahead is not a mission critical functionality in absense of which user will not be able to search

        reply
  • 0

    thecoder2016 3 months ago

    few additions would help clarify a design bit more

    reply
    • 0

      thecoder2016 3 months ago

      We should add a block diagram just for architecture purpose ... Also here we don't talk about already existing persistence store for Google's queries. Google I am sure does store queries for analytics purposes. We can extend the same store to keep frequencies and use worker threads to update top searches in trie at certain interval .. let say 1 hr

      reply
    • 0

      thecoder2016 3 months ago

      Also client side caching of partial trie can play an important role.

      reply
Click here to start solving coding interview questions