Sharding a Database

  • 23

    vkb087 about 1 year ago

    Great doc. You can find more information about consistent hashing implementations and details here:


    http://www.tom-e-white.com/2007/11/consistent-hashing.html

    reply
  • -11

    Ajit about 1 year ago

    Java implementation of sharding:

    reply
  • -6

    Ajit about 1 year ago

    http://sleeplessinslc.blogspot.in/2008/09/hibernate-shards-maven-simple-example.html

    reply
  • 15

    skcoder about 1 year ago

    It looks like this is not addressing the question of how data is actually relocated. I would like to understand how data is physically moved onto a new node when it joins the cluster. Which part of the system does this? Similarly, data would need to be relocated to the surviving nodes when a node leaves. Is this all done before the node comes online? Also, when a node leaves, I suppose this assumes there is a redundant copy of the data still online that can be moved to the surviving nodes. Would we be expected to come up with all of this in an interview as well?

    reply
    • 0

      prakhar_jain_803 15 days ago

      I think learning about internal architecture and design of a DB like Cassandra would address this.

      reply
  • 0

    swapnil_marghade about 1 year ago

    Storing data in multiple shards is the same as having a replication factor. We could address this concern by using the right term: "replication factor"!

    reply
    • 1

      karan3296 12 months ago

      Not sure they are replicating here; this is just distributing data. Replication is described in the Dynamo paper and is a feature of HDFS.

      reply
  • 0

    AB_kyusak about 1 year ago

    I can't understand what advantage we gain by maintaining multiple copies of a shard, or what advantage modified consistent hashing has over plain consistent hashing. It might sound like a dumb question, but I am new to the topic :)

    reply
    • 0

      Nivas 12 months ago

      All I can think of regarding maintaining multiple copies per shard is that it increases availability.

      reply
    • 2

      Nivas 12 months ago

      With respect to modified consistent hashing, it makes sure that the order in which shards follow each other on the ring is not deterministic. So if a shard fails, the load will not fall on a single shard; it will be shared by all the shards whose points follow the failed instance at different places on the ring.

      reply
    • 0

      karan3296 12 months ago

      It allows for heterogeneity too. Beefier nodes can be given more virtual points along the ring, whereas weaker servers can be less prominent on it.

      reply
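As a rough Python sketch of the idea in the two replies above (class and node names are made up for illustration; the per-weight point count of 100 is an arbitrary choice): each node owns many virtual points on the ring, a heavier `weight` means more points, and removing a node scatters its keys across many successors rather than dumping them on one neighbor.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a string to a point on the ring using MD5 (any stable hash works).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self):
        self._ring = []    # sorted list of (hash, node) virtual points
        self._points = {}  # node -> its hashes, kept so we can remove it

    def add_node(self, node: str, weight: int = 1):
        # A beefier node gets more virtual points, so it owns more keys.
        hashes = [_hash(f"{node}#{i}") for i in range(weight * 100)]
        self._points[node] = hashes
        for h in hashes:
            bisect.insort(self._ring, (h, node))

    def remove_node(self, node: str):
        # Dropping a node's points hands its keys to the *next* points on
        # the ring, which belong to many different nodes, not just one.
        for h in self._points.pop(node):
            self._ring.remove((h, node))

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first virtual point at or after the key.
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

Note the two properties the comments describe: keys whose owner survives are untouched by a removal, and a node with `weight=2` holds roughly twice the keys in expectation.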
  • 0

    kumar955 about 1 year ago

    I think he needs to expand the answer. What happens if a user's shard size increases? Do we split the data or not? Do we need to support replicas? How does data-center-to-data-center replication happen, etc.?

    reply
  • 0

    kulkav about 1 year ago

    http://www.project-voldemort.com/voldemort/design.html

    reply
  • 3

    Srivathsan_Venkatavaradhan about 1 year ago

    I think the modified consistent hashing explained above is not correct. Please refer to https://ihong5.wordpress.com/2014/08/19/consistent-hashing-algorithm/ . It has nothing to do with shards.

    reply
  • 1

    ramanatnsit 12 months ago

    How does the system remain available while a node failure occurs and the redistribution of keys is in progress? Is it sacrificing availability for consistency?

    reply
    • 0

      Nivas 12 months ago

      Each shard will have a leader/master and some replicas. During a failure or redistribution the leader is changed, so availability is not sacrificed. But this may result in increased load on the machine where the elected replica is located if it is not managed in time.

      reply
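A toy Python sketch of the failover described above (all names are hypothetical, and a real system would elect the new leader via a consensus protocol such as Raft or Paxos, not a random pick): as long as a replica survives, the shard stays available by promoting it.

```python
import random

class Shard:
    """Toy model of one shard: a leader plus its replicas."""

    def __init__(self, nodes):
        # First node starts as leader; the rest are replicas.
        self.leader, *self.replicas = nodes

    def fail_leader(self):
        # On leader failure, promote a replica so reads and writes keep
        # flowing; the shard only becomes unavailable with no replicas left.
        if not self.replicas:
            raise RuntimeError("no replica left; shard is unavailable")
        self.leader = self.replicas.pop(random.randrange(len(self.replicas)))
```

This also illustrates the load caveat in the reply above: whichever machine hosts the promoted replica suddenly serves all of the shard's traffic.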
  • 0

    eipie 11 months ago

    What is this obsession with 72 GB of RAM? Isn't it better/easier to pick powers of two? Or perhaps 100 GB, treating 1 GB = 1000 MB, for simplicity of calculation?

    reply
    • 0

      sunnyk 7 months ago

      I had the same thought. 64/128/256 makes so much more sense.

      reply
  • 0

    shaily_mittal 10 months ago

    Great doc, I didn't know about the concept of consistent hashing before.

    reply
  • 1

    ishtiaque_hussain 7 months ago

    I found the following YouTube tutorial by Curtis on Consistent Hashing interesting, easy to understand: https://www.youtube.com/watch?v=jznJKL0CrxM

    reply
  • 0

    mohammed_alhfian 5 months ago

    yes

    reply
  • 0

    cnachiketa07 4 months ago

    Found this article to be good

    reply
  • 2

    cnachiketa07 4 months ago

    https://www.toptal.com/big-data/consistent-hashing

    reply
    • 0

      zonker 3 months ago

      Best article so far on explaining Consistent hashing.

      reply
  • 0

    shivendra_panicker 4 months ago

    https://www.youtube.com/watch?v=--4UgUPCuFM
    An easy-to-understand explanation of consistent hashing!

    reply
  • 0

    surabhi_gupta 3 months ago

    I have a doubt. We can store 10 TB of data per machine, so we start with 10 machines, but each machine has only 72 GB of RAM. This means we cannot load the complete data into RAM and hence cannot return data in O(1) time. So do you think that with this solution we should also implement internal caching, so that data is retrieved at a very fast rate?

    reply