

SRE Interview Questions

Last Updated: Jan 02, 2024


Computer systems are built to be reliable. A system is reliable if it performs as intended most of the time and is not prone to unexpected failures and bugs.

The site reliability engineer is responsible for the stability and performance of websites, mobile applications, web services, and other online services. They're in charge of monitoring the performance of websites and apps to check for issues and make sure they’re running smoothly.

These engineers also design systems that are resilient to failure, so that users can continue to access the site even when individual components fail. The role requires expertise in software development methods, including test-driven development and software design patterns. It's a flexible role that can involve working onsite with a team or remotely via chat or video conferencing software; in most cases you'll need to be comfortable with both. The role also requires excellent analytical skills and the ability to reason about data coming from many different sources. The job may require working with other teams as well; you may have to communicate with operations engineers, data analysts, project managers, and others, depending on the size of your company and the nature of your software.

This article covers everything you need to know about becoming a Site Reliability Engineer, along with common SRE interview questions and answers.

SRE Interview Questions for Freshers

1. What is SRE?

  • SRE's full form is Site Reliability Engineer. A Site Reliability Engineer is a software engineer who specializes in building and maintaining a reliable system that can handle unexpected changes in the environment. They typically work on large web applications, but they also work with other types of software systems.
  • They are responsible for making sure that their system is able to handle all of the possible variations that might occur in the world. For example, if one server goes down, they need to make sure that their system can continue running without any problems. They also need to make sure that the site is secure against hackers and other attackers.
  • Many sites are built using a combination of technologies, such as web apps, databases, and other systems. A Site Reliability Engineer needs to be familiar with all of these different components so that they can make sure that everything is working properly together.
  • The work of a DevOps engineer sounds similar to that of a site reliability engineer, but there are differences between them. So let's first understand DevOps, and then the difference between the two, in the follow-up questions.

Responsibilities of Site Reliability Engineer

  • Site reliability engineers collaborate with other engineers, product owners, and customers to develop goals and metrics. This assists them in ensuring system availability. Once everyone has agreed on a system's uptime and availability, it is simple to determine the best moment to act.
  • Site Reliability Engineer implements error budgets to assess risk, balance availability, and drive feature development. When there are no unreasonable reliability expectations, a team has the freedom to make system upgrades and changes.
  • SRE is committed to reducing toil (manual, repetitive operational work). As a consequence, tasks that would otherwise require a human operator to perform manually are automated.
  • A site reliability engineer should be well-versed in the systems and their interconnections.
  • The objective of site reliability engineers is to detect problems early in order to decrease the cost of failure.

2. What is DevOps?

DevOps (Dev for Development, Ops for Operations) is a software development process that involves collaboration between software engineers and IT operations staff. This collaboration helps to improve overall productivity, while also providing better quality assurance and faster time to market. DevOps is a movement that seeks to bring developers and IT operations staff together, in order to make the two groups work more closely.

DevOps is a relatively new concept, but it's quickly becoming one of the most important aspects of modern software development. In recent years, we've seen a number of enterprises adopt DevOps practices as part of their software development lifecycle (SDLC). This has helped organizations become more efficient and effective, by increasing the overall speed and quality of their products. As such, it's clear that there's plenty of value in the DevOps model today.

3. SRE vs DevOps: What's the Difference Between Them?

  • DevOps and Site Reliability Engineering both describe roles focused on improving applications and services while they are in use.
  • DevOps and site reliability engineering are both important roles in modern IT organizations. However, there are notable differences between them:

| DevOps | SRE |
| --- | --- |
| DevOps involves the development of software that can be updated and modified while it is running. | A site reliability engineer, on the other hand, focuses on keeping an application or service up and running. |
| DevOps teams often use automation tools to improve their workflow. | Site reliability engineers work with both automation tools and humans to ensure the service continues to operate smoothly. |
| DevOps deals with when and how software is built. | The site reliability engineer focuses on what happens once it is built. |

Refer to this blog for a more detailed understanding of the difference between SRE and DevOps.


4. Can you explain data structures and also describe the physical data structure and logical data structure?

Data structures are a set of rules for organizing and storing data in a computer. Data structures are used to structure databases, manage memory, and organize data. Data structures allow for easy organization of data, easy retrieval of data, and efficient use of resources.

  • Physical data structures are arrays and linked lists. They are called physical because they describe how data is actually laid out in physical memory. An array is a collection of contiguous data elements of the same type. A linked list is also a collection of data elements, but its elements may or may not be contiguous in memory; it consists of nodes, each storing data along with a pointer to the next node.
  • Logical data structures are all the data structures built on top of the two physical ones: stack, queue, tree, graph, etc. A logical data structure defines a discipline (for example, LIFO for a stack) and stores its data in memory using arrays or linked lists (a short sketch follows this list).
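
As an illustration, here is a minimal sketch of a logical structure (a stack) built on top of a physical one (an array); the class name ArrayStack and the fixed capacity are arbitrary choices for the example:

public class ArrayStack {
    //A stack (logical structure) backed by an array (physical structure).
    private final int[] data = new int[16]; //fixed capacity for brevity
    private int top = 0;                    //index of the next free slot

    public void push(int value) {
        if (top == data.length) throw new IllegalStateException("stack overflow");
        data[top++] = value;
    }

    public int pop() {
        if (top == 0) throw new IllegalStateException("stack underflow");
        return data[--top];
    }

    public static void main(String[] args) {
        ArrayStack s = new ArrayStack();
        s.push(1);
        s.push(2);
        System.out.println(s.pop()); //prints 2 (LIFO order)
    }
}

The same LIFO discipline could equally be implemented over a linked list; the logic stays the same while the physical storage changes.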

5. What is cloud computing?

  • Cloud computing is the delivery of IT services, such as servers, storage, and software as a service (SaaS), through network-connected cloud infrastructure. The term can refer to both private clouds, which are managed by a single organization and shared among internal users, and public clouds, which are owned by third parties (e.g., Amazon Web Services) that rent out computing power and storage capacity to companies or individuals on a subscription basis. Cloud computing has the potential to transform IT infrastructure and delivery models across industries but faces challenges in terms of security and regulation.
  • The “cloud” in “cloud computing” refers to the Internet itself and the networked computers and software that make up the Internet infrastructure. Cloud computing allows organizations to offload workloads from their data centers and focus more resources on applications and business processes. In addition, it enables them to create hybrid environments that combine elements of on-premises data centers with those hosted in cloud environments. This can be especially helpful for companies that need to scale quickly and want to reduce costs.
  • Cloud computing also has the potential to revolutionize IT operations by allowing organizations to deliver IT services through a flexible, scalable model that reduces costs while improving service quality. For example, it can allow organizations to integrate legacy systems with newer ones (such as mobile applications), reduce complexity and risk by automating routine tasks and streamline the management of remote assets. Cloud computing can also help organizations save money by reducing the costs of leasing or purchasing IT equipment compared to buying it outright.

6. What is DHCP, and for what is it used?

DHCP stands for Dynamic Host Configuration Protocol. It is a protocol that allows networks to dynamically allocate IP addresses to hosts on the network. DHCP is used to assign IP addresses to devices such as PCs and routers. When a device is installed, it may need an IP address in order to access the Internet. So when a new device is installed, it will get an IP address from DHCP so that it can connect to the network.

When a device connects to a network, it first needs an IP address so that it can communicate with other hosts on the network. And since each device on the network must have its own unique address, there must be some mechanism for dynamically allocating those addresses.

In order for a DHCP server to work, it must have at least two parts: a network interface (usually Ethernet or WiFi) and some sort of database that stores information about connections and leases. Since every connected device needs an address, this database must record which devices are connected and which addresses they hold. All of this data is consulted when a new connection is requested, so that the server hands out an address that is not already in use.

7. Explain DNS and its importance.

DNS stands for Domain Name System. It is a system that maps hostnames to IP addresses so that you can find the correct server when you type a website address into your browser. The DNS system associates each domain name with one or more IP addresses; the servers that look up these mappings on your behalf are called "resolvers."

When you type in a URL (e.g., www.google.com) into your browser, the computer sends a request to the DNS resolver for the IP address associated with that domain name. The DNS resolver then returns an IP address to the browser, which is either the IP address of a local computer or of another server that has been configured to return that particular IP address.


DNS is necessary because humans remember human-readable names like google.com, while machines locate each other using numeric IP addresses like 192.0.2.44 (an address reserved for documentation examples). Without DNS, you would need to know the IP address of every site you wanted to reach, which would be impractical without a centralized directory to help you out!
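
To see DNS resolution in action, Java's standard InetAddress API asks the system's configured resolver for the addresses behind a name (the hostname here is just an example):

import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsLookup {
    public static void main(String[] args) throws UnknownHostException {
        //Ask the configured DNS resolver for all addresses behind a hostname.
        for (InetAddress addr : InetAddress.getAllByName("google.com")) {
            System.out.println(addr.getHostAddress());
        }
    }
}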


8. Explain APR. Also, what are its stages?

In the context of Site Reliability Engineering, Accelerated Problem Resolution (APR) is crucial for quickly addressing and resolving issues that affect system performance and reliability. Here are five main points about APR in Site Reliability Engineering:

  1. **Monitoring and Alerting**: Continuous monitoring is fundamental in APR. It involves actively observing system metrics to detect anomalies or performance degradation. When an anomaly is detected, alerts are generated to notify the Site Reliability Engineers.
  2. **Rapid Diagnosis**: Speed is crucial in problem resolution to minimize downtime. SREs perform a quick initial assessment to understand the nature and severity of the issue. They gather data, logs, and other diagnostic information to pinpoint the root cause.
  3. **Issue Resolution and Mitigation**: Once the root cause is identified, the SREs focus on resolving the issue. Depending on the nature of the problem, this can involve applying hotfixes, rerouting network traffic, or scaling resources. In addition to resolution, mitigation strategies might be used to reduce the impact of the issue on the system and users.
  4. **Post-mortem Analysis and Documentation**: After resolving the issue, a thorough post-mortem analysis is conducted to understand the cause, how it was addressed, and the impact it had. This information is documented for future reference, learning, and improving response strategies.
  5. **Continuous Improvement**: Insights from post-mortem analysis are used to improve the system and the incident response process. This includes implementing preventive measures, enhancing monitoring tools, improving alerting mechanisms, and refining protocols for quicker and more efficient resolution of future incidents.

9. Define Hardlink and Softlink.

  • Hard links and soft (symbolic) links are two different types of file system links used to refer to the same file from more than one place.
  • A hard link is an additional directory entry that points to the same inode as the original file, so both names refer to the same data on disk. A soft link is a separate small file that stores the path of its target.
  • Because hard links share the same inode, every hard link reports the same size and content as the original file. A soft link's size is just the length of the path it stores, and a soft link can exist even when its target is missing (a dangling link).
  • Hard links cannot span filesystems or point to directories, while soft links can do both. Deleting the original name does not destroy the data as long as another hard link to it remains, but it breaks any soft link pointing to it (see the sketch after this list).
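
Both kinds of links can be created from Java via the standard java.nio.file API, which makes the difference easy to observe. This is a minimal sketch assuming a writable current directory and a POSIX filesystem (creating symbolic links may require extra privileges on Windows):

import java.nio.file.Files;
import java.nio.file.Path;

public class Links {
    public static void main(String[] args) throws Exception {
        Path original = Files.writeString(Path.of("original.txt"), "hello\n");

        //Hard link: a second directory entry for the same inode.
        Path hard = Files.createLink(Path.of("hard.txt"), original);

        //Soft link: a small file that stores the target's path.
        Path soft = Files.createSymbolicLink(Path.of("soft.txt"), original);

        Files.delete(original);
        System.out.println(Files.readString(hard)); //still prints "hello"
        System.out.println(Files.exists(soft));     //false: dangling link
    }
}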

10. What is Multithreading? What are the benefits of this?

Multithreading is a programming technique that allows multiple tasks to be executed at the same time within a single program. Each task runs on its own thread, and the operating system schedules those threads across the available processors or processor cores. By splitting the workload across threads, it is possible to process several tasks concurrently. This can be helpful for processing large amounts of data, or when running many short-lived tasks.

Multithreading can be implemented in different ways, depending on the underlying technology used. For example, threads can run truly in parallel on separate processors, or be interleaved (time-sliced) on a single processor.

Multithreading has many benefits. It allows for increased performance and reduced execution time of long-running computations. Also, it can improve the responsiveness of applications and reduce latency. As such, multithreaded applications are well suited to IoT environments, where there is constant network traffic and battery drain due to sensor readings and other processes being executed within the device.
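
As a minimal illustration, the sketch below uses Java's standard ExecutorService to spread independent tasks across a small pool of threads; the pool size and task bodies are arbitrary:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MultithreadingDemo {
    public static void main(String[] args) {
        //A fixed pool of 4 worker threads shared by all submitted tasks.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 8; i++) {
            final int taskId = i;
            pool.submit(() ->
                System.out.println("task " + taskId + " on " + Thread.currentThread().getName()));
        }

        pool.shutdown(); //stop accepting new tasks; finish the queued ones
    }
}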

11. What are the states that the process could be in?

A process is a computer program that is being executed by the CPU. During its execution cycle, a process passes through various stages; each stage is a process state. The process states are -

  • New - A new process is a program that will be loaded into the main memory by the operating system.
  • Ready - When a process is created, it immediately enters the ready state and waits for the CPU to be assigned. The operating system selects new processes from secondary memory and places them in main memory. Ready-state processes are processes that sit in main memory, ready for execution. Many processes may be present in the ready stage at once; they are all placed in a queue, each waiting for a chance to execute.
  • Running - The OS will select one of the processes from the ready state based on the scheduling mechanism. As a result, if we only have one CPU in our system, the number of operating processes at any given time will always be one. If we have n processors in the system, we can run n tasks at the same time.
  • Block/Wait - Depending on the scheduling method or the inherent behavior of the process, a process can migrate from the Running state to the Block/Wait state. When a process waits for a specific resource to become available or for user input, the operating system moves it to the Block/Wait state and assigns the CPU to other processes.
  • Terminated - The termination state is reached when a process completes its execution. The process's context (Process Control Block) will likewise be removed, and the process will be terminated by the operating system.
  • Suspend Block/Wait - Rather than removing the process from the ready queue, it is preferable to delete the stalled process that is waiting for resources in the main memory. Because it is already waiting for a resource to become available, it is preferable if it waits in secondary memory to create a way for the higher priority process. These processes conclude their execution when the main memory becomes accessible and their wait is over.
  • Suspend Ready - A process in the ready state that is transferred to secondary memory from main memory owing to a shortage of resources (mostly primary memory) is referred to as being in the suspend ready state.
    If the main memory is full and a higher-priority process arrives for execution, the OS must free up space in the main memory by moving the lower-priority process to secondary memory. Suspend-ready processes are kept in secondary memory until the main memory becomes accessible.

12. What is RAID?

  • “Redundant Array of Independent Disks” (RAID) describes a type of storage system that uses more than one hard disk to provide redundancy in case one disk fails. RAID is commonly used in networks and server farms.
  • RAID systems are routinely used in data centres; they provide a second disk drive in a single physical system so that if the first disk fails, the user can continue working by accessing the second disk drive. This extra protection means users don’t have to worry about losing data if a drive fails.
  • RAID systems can be implemented as a single controller with multiple drives, or as multiple interconnected controllers, each housing a single drive. The resulting configuration can be optimized for throughput or for redundancy.
  • This type of storage system is available from many vendors and can be found in medium-sized and large-scale enterprise environments, where it's essential for ensuring the availability of critical data.

13. What are Vertical and Horizontal Scaling? Which is more preferable? And list some advantages and disadvantages of Horizontal Scaling.

  • Vertical scaling increases a system's capacity by adding more resources (CPU, memory, storage) to a single machine. It is often used to increase capacity, performance, and throughput. It is also called scale-up, because the size of one system grows.
  • Horizontal scaling increases a system's capacity by adding more machines. This can be done by adding more virtual machines or containers per host, or by adding additional hosts altogether. It is also called scale-out, because the number of systems grows.

Horizontal scaling is generally preferable, because as load grows over time, the system can keep scaling by adding machines. There are several advantages to horizontal scaling (scale-out):

  • It requires less upfront investment.
  • It reduces operational overhead and
  • It allows for easier scaling as demand increases.

However, there are also some disadvantages: 

  • Horizontal scaling requires careful planning and coordination between all parties involved, which can be a challenge in large multi-tenant environments where different tenants have different needs and requirements. Also, it can result in increased complexity and security risk if not done carefully.
  • Horizontal scaling can also lead to scalability problems if one component causes issues for multiple other components, so it’s important to monitor each component closely during the entire process from start to finish.

14. What is LILO?

LILO (Linux Loader) is a bootloader for Linux that is used to load Linux into memory and start the operating system. It is also known as a boot manager, since it allows a computer to dual boot. It can act as a master boot program or a secondary boot program, and it performs a variety of tasks such as locating the kernel, identifying other supporting programs, loading them into memory, and launching the kernel. To boot Linux this way, the LILO bootloader must be installed on the system.

15. What do you know about Linux Shell? List Different types of Shell.

Linux Shell is an integral part of the Linux OS. The Linux OS is a free and open-source OS developed by Linus Torvalds. It is the most popular OS to run on servers and embedded devices. A Linux shell is a command line interface that allows the user to interact with the system. The command line interface (CLI) of Linux provides a text-based interface for executing commands, performing file management tasks, and issuing other system commands. There are two types of shells in Linux – 

  • Interactive shell - It starts automatically when a user logs into their computer.
  • Non-Interactive shell -  It can be started manually for the execution of any program.

These two types allow different users to have access to different sets of commands, depending on whether they are logged in or not. In most cases, non-interactive shells are used for administrative tasks such as managing user accounts and managing applications or services.

On a typical Linux system, the following shells are widely used:

  • KSH (Korn Shell)
  • BASH (Bourne Again Shell)
  • TCSH
  • CSH (C Shell)
  • Bourne Shell
  • ZSH

16. What is a “/proc” file system?

The “/proc” file system is a special, virtual file system that the kernel uses to expose information about the running system as files. It is mounted at /proc on Linux systems and contains information about the current state of the system, such as memory usage and CPU details, along with one subdirectory per running process, named after the process ID. Taking the process with PID 1 (the init process) as an example:

  • /proc/1: The directory for the process with PID 1. Every running process has such a directory under /proc.
  • /proc/1/cmdline: This file contains the command line parameters passed to the process when it was started.
  • /proc/1/maps: This file contains the virtual memory map of the process. It can be used to determine which parts of memory are being used by the process.

17. What is Linux Kill Command?

The Linux kill command is an easy way to terminate running processes. With this command, you can kill a process, e.g., a program, a service, or a misbehaving background job, on a Linux system. In other words, it will bring down or terminate a process running on the system. By using the Linux kill command, you can close down a malfunctioning application or stop a misbehaving service. You can also use the kill command to terminate misbehaving jobs in batch scripts.

Under the hood, kill works by sending a signal (SIGTERM by default, or SIGKILL to force termination) to the target process; the same signalling mechanism is what the system uses to stop services when a server is halted or rebooted.
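
From Java, a comparable capability is available through the standard ProcessHandle API. This is a hedged sketch, not the kill command itself, and it assumes the PID passed on the command line belongs to a process you are allowed to signal:

public class KillDemo {
    public static void main(String[] args) {
        long pid = Long.parseLong(args[0]); //PID supplied by the caller

        //destroy() asks the OS to terminate the process (like `kill <pid>`).
        ProcessHandle.of(pid).ifPresentOrElse(
            handle -> System.out.println("terminate requested: " + handle.destroy()),
            () -> System.out.println("no such process"));
    }
}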

18. How can you use OOPs in designing a Server?

OOPs is a programming paradigm that encourages the creation of objects to represent real-world entities and these objects are then used to perform tasks. These can be useful in designing a Server because they allow you to break down the tasks into manageable chunks, which will help you to keep your Server under control. As well as this, OOPs allows you to create reusable code which will save time and money. When designing a Server using OOPs, it’s important to follow some basic design principles. 

  • The first of these is the Single Responsibility Principle (SRP). This states that each object should have one and only one reason to exist. For example, if you’re creating an Order Repository, it should only be responsible for one thing -- processing orders. This will help ensure that your code is easy to read and maintain.
  • The second principle is the Open/Closed Principle (OCP), which states that a module should be open for extension but closed for modification. For example, if you’re creating an Order Repository, you should be able to add a new storage backend by adding a new class rather than by changing existing, tested code. A minimal sketch of both principles follows.
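
Here is a minimal sketch of both principles using hypothetical names (OrderRepository, InMemoryOrderRepository): the interface has the single responsibility of storing orders, and new storage backends extend the design without modifying existing code:

import java.util.ArrayList;
import java.util.List;

//SRP: this type exists for exactly one reason -- persisting orders.
interface OrderRepository {
    void save(String orderId);
    List<String> findAll();
}

//OCP: adding a database- or file-backed repository later means adding a
//new class that implements OrderRepository, not editing this one.
class InMemoryOrderRepository implements OrderRepository {
    private final List<String> orders = new ArrayList<>();

    @Override public void save(String orderId) { orders.add(orderId); }
    @Override public List<String> findAll() { return List.copyOf(orders); }
}

public class ServerDesignDemo {
    public static void main(String[] args) {
        OrderRepository repo = new InMemoryOrderRepository();
        repo.save("order-42");
        System.out.println(repo.findAll()); //[order-42]
    }
}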

19. Explain CDN.

A CDN (Content Delivery Network) is a network of servers that stores and distributes content to clients. These servers are typically located in data centres, and they can be used to improve performance by reducing latency, ensuring that the content is available at the right time, and ensuring that the content is delivered in a timely manner.

CDNs are most commonly used to store static content, such as images and videos, but they can also be used to store dynamic content, such as HTML or JavaScript. CDNs can also be used to deliver content from one location to another, such as from a website to a mobile device.

CDNs are an important part of the Internet infrastructure because they allow content to be stored and distributed in a more efficient way. They also allow content to be served from multiple locations, which can improve performance and reduce latency.

A CDN can be used in many different ways, including

  • Providing a central location for static content.
  • Providing a central location for dynamic content.
  • Providing a central location for content from multiple locations.
  • Providing a central location for content from multiple data centers.
  • Providing redundancy for critical infrastructure components such as servers and routers.

CDNs are also an important part of the Internet infrastructure because they help to ensure that the Internet works well for everyone. They help to ensure that everyone has access to the same content at the same time, with access prioritized equally.

SRE Interview Questions for Experienced

1. How will you secure your Docker containers?

Follow these instructions to secure your Docker container:

  • Choose third-party containers with caution.
  • Turn on Docker content trust.
  • Limit the resources available to your containers.
  • Consider utilizing a third-party security product.
  • Docker Bench Security should be used.

Other than these, there are also questions based on your personal understanding of systems, asked if you are an experienced candidate. The questions can be like this -

  • How can you strengthen the bond between the operations and IT teams?
  • What is the distinction between site reliability engineers and development operations?
  • What actions would you take to develop a monitoring strategy for a service that does not have one?
  • How can information technology infrastructure be scaled?
  • What type of experience do you have building deployment automation code?
  • Why would you want to be an SRE rather than an SDE? What piques your interest in this role? etc.

2. Explain in detail the working of ARP.

Most computer applications use IP addresses (logical addresses) to send and receive messages, but the actual communication on a local network occurs via physical addresses (MAC addresses). The goal of ARP (Address Resolution Protocol) is therefore to determine the destination's MAC address, which allows us to communicate with other devices. ARP is truly necessary in this scenario, since it translates an IP address into a physical address.

  • When the source wishes to communicate with the destination at the network layer, it must first determine the destination's MAC address (physical address). The source looks in its ARP cache (ARP table) for the destination's MAC address; if it is found there, the source uses that MAC address for communication.
  • If the destination's MAC address is not in the ARP cache or table, the source sends an ARP Request message. The ARP Request contains the source's MAC address and IP address, as well as the destination's IP address. The destination's MAC address field is left blank, since that is the information being requested.
  • The source computer broadcasts the ARP Request message on the local network, and every device on the LAN receives it. Each device compares its own IP address with the destination IP address in the request. If they do not match, the device drops the packet automatically; if they match, the device prepares an ARP Reply.
  • The matching destination sends an ARP Reply packet that includes its own MAC address. The destination also updates its ARP table with the source's MAC address, since it will be required for the return communication.
  • For the reply, the roles reverse: the destination acts as the sender of the ARP Reply message, and the original source is its target.
  • The ARP Reply message is sent unicast rather than broadcast. This is because the device sending the ARP Reply already knows the MAC address of the device to which the reply is delivered.
  • When the source device receives the ARP Reply, it learns the destination's MAC address, since the ARP Reply packet contains it along with the other addresses. The source updates the destination's MAC address in its ARP cache, and the sender can now communicate directly with the recipient.

3. What is Consistent Hashing?

Consistent hashing is a technique for distributing keys across a set of nodes (for example, cache servers or database shards) so that when nodes are added or removed, only a small fraction of the keys need to move.

With a naive scheme such as hash(key) % n, changing the number of nodes n remaps almost every key, which forces massive data movement. Consistent hashing instead places both nodes and keys on a logical ring of hash values, and each key is assigned to the first node found moving clockwise around the ring from the key's position. When a node joins or leaves, only the keys in its neighbouring segment of the ring move: on average about K/n of the K keys. In practice, each physical node is placed on the ring several times as "virtual nodes" to even out the load, as sketched below.
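
A minimal sketch of such a ring using Java's sorted TreeMap follows; the node names, the number of virtual nodes, and the use of String.hashCode are arbitrary simplifications (a real ring would use a stronger hash function):

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private static final int VNODES = 3; //virtual nodes per server, for balance

    private int hash(String s) {
        return s.hashCode() & 0x7fffffff; //non-negative; simplification for the sketch
    }

    public void addNode(String node) {
        for (int i = 0; i < VNODES; i++) ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < VNODES; i++) ring.remove(hash(node + "#" + i));
    }

    //Walk clockwise from the key's position; wrap around to the first entry.
    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing();
        r.addNode("cache-a"); r.addNode("cache-b"); r.addNode("cache-c");
        System.out.println(r.nodeFor("user:42"));
        r.removeNode("cache-b"); //only keys that mapped to cache-b move
        System.out.println(r.nodeFor("user:42"));
    }
}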

4. Where does caching take place in servers? And what is cache invalidation?

Caching is the act of storing data that changes infrequently in memory so that it can be used later. It's often used to speed up performance and reduce network traffic.

Caching can take place at different levels within a server:

  • In front-end web servers, when a page is requested, the page's content is cached in memory.
  • In back-end web servers, when a page is requested, the contents of the cache are checked to see if the contents are still valid. If they are, then no request needs to be made. Instead, the cached data can be served right away. If the cached data has changed since being stored in the cache, then it needs to be updated before it can be served.

Cache invalidation is also an important part of caching in servers. Cache invalidation involves checking to see if the cached content still holds true and if it needs to be updated before serving it again.

Caching can improve performance for any application that uses persistent data or handles a heavy number of requests per second (RPS). By serving repeated requests from the cache, your server can complete more requests per second without having to spend as much time loading data into memory and parsing it. A small sketch of an in-process cache with invalidation follows.
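
As an in-process illustration, the sketch below builds a small LRU cache on Java's standard LinkedHashMap and adds an explicit invalidate operation; the capacity and keys are arbitrary:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); //true = access order, giving LRU eviction
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; //evict least-recently-used entry when full
    }

    //Cache invalidation: drop an entry when the underlying data changes.
    public void invalidate(K key) { remove(key); }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("/index.html", "<html>v1</html>");
        cache.put("/about.html", "<html>about</html>");
        cache.invalidate("/index.html");              //page changed upstream
        System.out.println(cache.get("/index.html")); //null -> must re-fetch
    }
}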

5. Describe the Sharding process. How does sharding improve performance?

Sharding is a method of dividing a database into multiple pieces. Each piece stores a subset of the data, which can be used to run different types of queries.

Sharding makes it possible to distribute the workload across many more servers. This can reduce the time it takes to process queries and improve performance.

Sharding is also useful when you need to store a very large number of objects that no single machine can hold. Each object is stored in exactly one shard, and a query typically needs to read only the shard that holds the relevant keys.

Sharding can be used to improve performance in two main ways:

  • By running several smaller jobs on a single machine, it becomes possible to spread out the load between many machines.
  • By storing objects in separate pieces, it becomes possible to read only the piece that needs to be accessed at any given time (a routing sketch follows below).
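
A minimal sketch of key-based shard routing follows; the shard names and modulo scheme are arbitrary simplifications (a production system would more likely use consistent hashing, as described earlier, to avoid remapping keys when shards are added):

public class ShardRouter {
    private final String[] shards = {"db-shard-0", "db-shard-1", "db-shard-2"};

    //Route each key to a fixed shard based on its hash.
    public String shardFor(String key) {
        int bucket = (key.hashCode() & 0x7fffffff) % shards.length;
        return shards[bucket];
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter();
        System.out.println(router.shardFor("user:1001")); //always the same shard
        System.out.println(router.shardFor("user:1002"));
    }
}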

6. Explain three-tier architecture along with its real-time uses.

  • A three-tier architecture is a type of architecture in which the application is separated into three tiers: presentation, application logic, and data storage and retrieval. The three-tier architecture can be implemented in a wide range of business applications, including CRM, e-commerce, and enterprise resource planning (ERP).
  • The three-tier architecture is often used when there are many different types of data that need to be stored, such as customer data and product data. By separating the different types of data into different tiers, it becomes easier to manage and maintain the data.
  • A three-tier architecture can be a useful tool for monitoring IT systems. As each tier in the architecture has its own distinct purpose, it can be easier to keep track of what’s happening within each tier. This makes it easier to detect problems that might have otherwise gone unnoticed.
  • In addition, a three-tier architecture can help provide better visibility into how all the tiers are working together. For example, if you need to troubleshoot an issue with your company’s website, it will be easier to do so if you have easy access to all the information that needs to be looked at as a separate logic.

7. What are containers in servers?

Containers on a server are like virtual machines that run an application. A container can be compared with a virtual machine because it provides an environment for running applications. However, containers are different from virtual machines in many ways. First, containers are much more lightweight than virtual machines: they take up far less space on disk and use fewer CPU resources. Second, containers don’t need to be preinstalled on a server, so they can be deployed quickly and easily. Third, containers can run on any type of hardware, from desktop computers to high-end servers. Finally, containers are intended for running specific applications rather than for general-purpose computing tasks like email or word processing.

Having said all these differences between containers and virtual machines, one thing is certain: Containers are the future of server infrastructure!

When it comes to deploying modern enterprise applications in today's digital world, container technology has proven itself to be the most reliable solution. From deployment speed to stability to security controls, container technology offers unparalleled advantages over traditional virtualization methods. While there are numerous vendors providing solutions that enable the creation of containers (e.g., Docker), there is no single standard or protocol that governs container technology. This lack of standardization presents challenges when trying to deploy containerized applications across multiple organizations or even within an organization's own data centers.

8. What does Virtualization mean?

Virtualization is the process of using one physical system to run multiple virtual machines. It is commonly used by companies that want to consolidate computing resources and keep them running 24/7 without having to buy more hardware. Virtualization can also be used for testing purposes, such as for software development or system performance testing.

Virtualization can be used in a number of different ways, from simple setups where multiple virtual machines run on the same physical server, to complex setups that use multiple servers and virtual networks. The end goal is always the same: reducing overhead costs and improving overall IT infrastructure efficiency. Virtualization can also be used to create hybrid environments where physical servers are augmented by cloud-based services.

There are many different types of virtualization technology available today, including:

  • VMware - This is one of the most popular virtualization technologies available today. It runs on almost any platform and is easy to install and manage. It’s also very cost-effective because it leverages a lot of existing hardware and software infrastructure already in place.
  • Windows Server - Windows Server is a common choice for virtualizing Microsoft applications because it has built-in support for Hyper-V, making it easy to deploy and manage. There are also several third-party solutions available to further augment administrator capabilities.
  • Hyper-V - This is another option that’s popular with organizations looking to virtualize their servers. While it’s not as widely used as VMware, it’s still an option worth exploring if you’re looking for a low-cost way to virtualize. It’s one of the newer options available, so it might not be as widely adopted as the others, but it’s still a valid option.

9. What are SLA and SLI?

  • A service-level agreement (SLA) is a commitment we make to a client about uptime. These are frequently legally specified, with consequences for failing to meet the desired availability. As a result, SLAs are typically established with values that are simpler to satisfy than SLOs.
  • A service-level indicator (SLI) is anything that can be precisely measured to assist you in thinking about, defining, and determining whether you are satisfying SLOs and SLAs. They are commonly presented as the ratio of the number of good events to the total number of events. A simple example would be the number of successful HTTP requests divided by the total number of HTTP requests, as computed in the sketch below. SLIs are typically stated as a percentage, with 0 indicating that everything is broken and 100 indicating that everything is operating flawlessly.
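
As a tiny illustration, the sketch below computes an availability SLI from made-up request counts:

public class SliDemo {
    public static void main(String[] args) {
        long totalRequests = 1_000_000;
        long successfulRequests = 999_412; //e.g., HTTP responses that were not 5xx

        //SLI expressed as a percentage of good events over total events.
        double sli = 100.0 * successfulRequests / totalRequests;
        System.out.printf("availability SLI = %.4f%%%n", sli); //99.9412%
    }
}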

10. What are SNAT and DNAT?

Source Network Address Translation (SNAT) 

  • It is a network function that maps an internal IP address to an external IP address. It often occurs at the edge of the network, where a device is connected to the public Internet. SNAT enables a device to “see” the outside world by translating its internal IP address into the external IP address of the router or server that serves it.
  • With SNAT enabled, a device can use the public Internet to communicate with other devices on the Internet.
  • SNAT also allows a device to receive data sent by other devices on the Internet, even if they are behind a firewall that blocks all incoming connections.

Destination network address translation (DNAT) 

  • It is a technique that rewrites the destination IP address of incoming packets, allowing a server to be reachable at one IP address while actually residing at another. DNAT allows a server to be located in one network but be addressed via the IP address of another. DNAT can be used for many purposes, including load balancing, site-to-site VPN connectivity, and security.
  • The primary benefit of DNAT is that it can be used to load balance traffic across multiple servers. By translating the server’s public IP address into multiple private IP addresses, it is possible to have multiple servers at the same location function as though they were all located elsewhere. This allows for failover and redundancy without adding additional hardware or network infrastructure.

11. What is Observability? What are the different types of Observability? And how can you improve the observability of a system?

Observability is the term used to describe the ability of an organization to track real-time events and metrics within a system. Systems that are more Observable are able to capture data from devices within the organization, such as smartphones and tablets. This data can then be used to track activities within the organization, such as the number of employees who log into work each day.

There are many different types of observability within an organization, including:

  • Real-time monitoring: This type of observability allows users in the organization to monitor what is happening in real time. This includes things like the number of people who visit a website on their phone or tablet.
  • Historical monitoring: This type of observability allows users in the organization to view data from previous periods. This type of observability may be most useful when tracking financial transactions, such as how much money has been spent over time.
  • System-wide monitoring: This type of observability can be used across all devices in an organization, including phones and computers. System-wide monitoring allows users in the organization to view data across all devices within the organization.

We can increase the observability of the organization by -

  • Recognize the sorts of data that flow from an environment and which of those data types are relevant and valuable to your observability goals.
  • Determine how your strategy is making sense of data by distilling, filtering, and translating it into actionable insights regarding the performance of your systems.

Observability can provide helpful information about an organization's DevOps maturity level.

12. Define inodes. Also, state why they are important.

Inodes are the data structures a Linux filesystem uses to describe its files. Every file, directory, and block device has an inode associated with it, which is essentially a record of where the file's data is located in the filesystem. Inodes also hold other properties, such as the file's size and its owner and group IDs. If a file or directory is deleted and no other links to its inode remain, the inode is marked as free and the data associated with it is released as well.

Inodes are an important resource for both performance and security. There are a number of reasons why they can be important:

  • For performance, inodes are used to determine how much space a file occupies, so they can be used to optimize the placement of files that are likely to change frequently. When a file is created or moved between partitions, it must go through the inode stage first.
  • For security, there are two main roles for inodes: indexing and ACLs (access control lists). Indexing allows tools like locate or grep to quickly find files by name or location. ACLs allow users to control access to their files based on permissions assigned by their system administrator. In addition, having all files written to disk as soon as they are modified can help prevent data loss due to power outages or other unforeseen events.

Finally, while most people might assume that inodes are used primarily for storing data on disk drives, Inodes are also used to track metadata about every file on your computer, as well as directories and other objects stored on your computer’s hard drive. This data is used to keep track of which files have been deleted, modified, or copied, and can also be used to determine the overall health and performance of your computer.
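
On Linux, a file's inode number can even be read from Java through the "unix" attribute view. This is a sketch that assumes a POSIX filesystem; the attribute view is not available on all platforms:

import java.nio.file.Files;
import java.nio.file.Path;

public class InodeDemo {
    public static void main(String[] args) throws Exception {
        Path p = Files.writeString(Path.of("demo.txt"), "data\n");

        //The "unix:ino" attribute exposes the inode number on POSIX systems.
        Object inode = Files.getAttribute(p, "unix:ino");
        System.out.println("inode of demo.txt = " + inode);
    }
}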

13. Explain TCP. Also, different TCP connection states.

A TCP connection state is a relationship between a client TCP endpoint and a server TCP endpoint. These states are defined by the TCP three-way handshake process. The three-way handshake allows TCP to establish a connection between two endpoints: one side initiates a connection setup using a SYN packet, the other side responds with a SYN-ACK, and the initiator completes the handshake with an ACK. Once both sides have sent and received their respective packets, an established connection is created and data can flow in both directions. When a side has finished sending, it initiates connection teardown with a FIN packet, and the other side responds with an ACK indicating that all outstanding data has been successfully received. This exchange works as long as there is no unexpected network congestion or other unforeseen event that causes either side to disconnect.

The different states of a TCP connection are defined as follows:

  • LISTEN - The server is listening on a certain port, such as port 80 for HTTP.
  • SYN-SENT - The client has sent a SYN request and is awaiting a response.
  • SYN-RECEIVED - (Server) The server has received a SYN and replied with a SYN-ACK; it is waiting for the final ACK.
  • ESTABLISHED - The three-way TCP handshake has been completed (see the socket sketch after this list).
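
A minimal client/server sketch whose standard socket calls drive exactly these states (LISTEN on the server socket, SYN-SENT during connect, ESTABLISHED once accept returns); the port number is arbitrary:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class TcpDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        //The server socket sits in LISTEN until a client connects.
        try (ServerSocket server = new ServerSocket(9090)) {
            Thread client = new Thread(() -> {
                //connect() drives SYN-SENT -> ESTABLISHED on the client side.
                try (Socket s = new Socket("localhost", 9090)) {
                    System.out.println("client connected: " + s.isConnected());
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            client.start();

            //accept() returns once the handshake completes (ESTABLISHED).
            try (Socket conn = server.accept()) {
                System.out.println("server accepted: " + conn.getRemoteSocketAddress());
            }
            client.join();
        }
    }
}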

14. Define Service Level Indicators.

Service Level Indicators are the key measurements that show whether a service is on track. Without them, it’s difficult to know if the organization is meeting its objectives.

There are three main types of SLIs: Availability, Response Time, and Quality of Service. 

  • Availability measures how often a given service can be provided without causing downtime.
  • Response time measures how quickly service is delivered.
  • And the quality of service measures how well a given effort meets certain standards of quality.

In addition to these three main types of SLIs, there are also limits on usage and capacity, which measure how much a given resource can be used at any given time. This can be useful for determining if there is enough capacity in the system to handle the additional demand.

15. Explain the term SLO.

A Service Level Objective (SLO) is a measure of how good or bad the service quality is, and it is usually expressed as a percentage. It shows how close the actual performance of the service level is to what was expected. An SLO is typically set by the customer, but can also be set by management as a way to monitor performance.

SLOs are important because they can help organizations understand when they are underperforming, and they can also help them set targets for improvement. By setting targets, managers have something to strive toward and can motivate employees to work harder.

When you’re setting up an SLO, remember that it’s not just about what your customers are getting right now; it’s also about what they could be getting in the future. So think about both short-term and long-term goals when making your SLO.

The main objective of SLO is to ensure that customers receive quality service, as measured by the:

  • Completeness of order fulfilment.
  • Quality of product.
  • Timeliness of delivery.
  • Accuracy and completeness of the information provided to customers.
  • Communication and support provided by employees.

SRE Coding Interview Questions

1. The Pacific and Atlantic oceans both border an m x n rectangular island. The Pacific Ocean touches the island's left and top edges, while the Atlantic Ocean touches the right and bottom edges.

The island is divided into square cells by a grid. You are given an m x n integer matrix heights, where heights[r][c] represents the height of cell (r, c) above sea level.

The island receives a lot of rain, and the rainwater can flow to the adjacent cells immediately north, south, east, and west if the height of the adjoining cell is less than or equal to the height of the present cell. Water can flow into an ocean from any cell adjacent to one.

Write an algorithm that returns the indices (row, column) such that from each such location, water can flow to both the Pacific and Atlantic Oceans. (Asked in a LinkedIn interview.)

Example -

In the original illustration (not reproduced here), the highlighted cells are the mountains from which water can flow to both oceans, so we need to return the list of indices as - [[0,4],[1,3],[1,4],[2,2],[3,0],[3,1],[4,0]]

This is a graph problem. To solve it, we need to keep track of the cells that can reach the Pacific and the Atlantic Oceans separately. The steps to solve the problem are -

  • Create two boolean matrices, one for reaching the Pacific and the other for reaching the Atlantic, and first mark the border cells that drain directly into each ocean.
  • Then perform a Breadth-First Search from each of those border positions, marking every cell from which water can reach that ocean.
  • Finally, check both matrices: any cell marked in both can reach both oceans and is added to the answer list.
public List<List<Integer>> pacificAtlantic(int[][] heights) {
        int m = heights.length, n = heights[0].length;
        
        //Grid that keeps track of the cells from which water can reach
        //the Pacific Ocean.
        boolean[][] reachPacific = new boolean[m][n];
        
        //Grid that keeps track of the cells from which water can reach
        //the Atlantic Ocean.
        boolean[][] reachAtlantic = new boolean[m][n];
        
        //Queues that help with breadth-first traversal of the matrix.
        Queue<int[]> queuePacific = new LinkedList<>();
        Queue<int[]> queueAtlantic = new LinkedList<>();
        
        //Marking the border rows and columns as true, from where we can
        //reach the Pacific or Atlantic Ocean initially.
        for(int i = 0; i < m; i++){
            reachPacific[i][0] = true;
            queuePacific.add(new int[]{i,0});
            reachAtlantic[i][n-1] = true;
            queueAtlantic.add(new int[]{i, n-1});
        }
        for(int i = 0; i < n; i++){
            reachPacific[0][i] = true;
            queuePacific.add(new int[]{0,i});
            reachAtlantic[m-1][i] = true;
            queueAtlantic.add(new int[]{m-1,i});
        }
        
        //BFS on the grid to mark all the cells from which water can
        //reach the Pacific Ocean.
        while(queuePacific.size() > 0){
            int[] val = queuePacific.poll();
            int i = val[0], j = val[1];
            if(i-1 >= 0 && !reachPacific[i-1][j] && heights[i-1][j] >= heights[i][j]){
                reachPacific[i-1][j] = true;
                queuePacific.add(new int[]{i-1, j});
            }
            if(i+1 < m && !reachPacific[i+1][j] && heights[i+1][j] >= heights[i][j]){
                reachPacific[i+1][j] = true;
                queuePacific.add(new int[]{i+1, j});
            }
            if(j-1 >= 0 && !reachPacific[i][j-1] && heights[i][j-1] >= heights[i][j]){
                reachPacific[i][j-1] = true;
                queuePacific.add(new int[]{i, j-1});
            }
            if(j+1 < n && !reachPacific[i][j+1] && heights[i][j+1] >= heights[i][j]){
                reachPacific[i][j+1] = true;
                queuePacific.add(new int[]{i, j+1});
            }
        }
        
        //BFS on the grid to mark all the cells from which water can
        //reach the Atlantic Ocean.
        while(queueAtlantic.size() > 0){
            int[] val = queueAtlantic.poll();
            int i = val[0], j = val[1];
            if(i-1 >= 0 && !reachAtlantic[i-1][j] && heights[i-1][j] >= heights[i][j]){
                reachAtlantic[i-1][j] = true;
                queueAtlantic.add(new int[]{i-1, j});
            }
            if(i+1 < m && !reachAtlantic[i+1][j] && heights[i+1][j] >= heights[i][j]){
                reachAtlantic[i+1][j] = true;
                queueAtlantic.add(new int[]{i+1, j});
            }
            if(j-1 >= 0 && !reachAtlantic[i][j-1] && heights[i][j-1] >= heights[i][j]){
                reachAtlantic[i][j-1] = true;
                queueAtlantic.add(new int[]{i, j-1});
            }
            if(j+1 < n && !reachAtlantic[i][j+1] && heights[i][j+1] >= heights[i][j]){
                reachAtlantic[i][j+1] = true;
                queueAtlantic.add(new int[]{i, j+1});
            }
        }
        
        //List that stores all the indices of the places.
        List<List<Integer>> ans = new ArrayList<>(); 
        
        //Traversing the grid to check the cells from which water can reach
        //both the Pacific and Atlantic Oceans, and adding them to the answer list.
        for(int i = 0; i < m; i++)
            for(int j = 0; j < n; j++)
                if(reachAtlantic[i][j] && reachPacific[i][j])
                    ans.add(new ArrayList<Integer>(Arrays.asList(i, j)));
           
        return ans;
    }

The time complexity for the above algorithm is O(m*n): each cell is added to each BFS queue at most once, so every cell is visited only a constant number of times, and the total work is proportional to m*n.

Conclusion

All software fails at some point in time; the question is when and where. Site reliability engineers ensure systems fail in an expected way and fail gracefully. They also ensure that systems are resilient against failure.

The job of a site reliability engineer involves implementing automation tools to improve the workflow, while also working with people in order to ensure service continues to operate smoothly and successfully. You will also have to analyze data and make improvements based on it.

Join the growing number of professionals who are finding success in careers in site reliability engineering. You will work with the latest technologies, collaborate with a team and your peers, and be rewarded financially for your efforts.

Additional Interview Resources

For the role of SRE, you need roughly half the skills of a software engineer and half those of an operations engineer. The materials listed below can assist you in preparing better for the SRE interview, whether you are a fresher or experienced.

2. Write a program that returns the leftmost value in the final row of a binary tree given the root.

Example - In the original illustration (not reproduced here), the leftmost node in the last row of the tree is 7, so we need to return 7.

We can solve this problem recursively by traversing to the last row and returning the leftmost node's value. Because we don't know in advance which row is the last, we track the height of each leaf we reach; the first leaf found at a new maximum height is the leftmost one in that row, since the left subtree is always visited before the right.

So the code of this approach will be -

class Solution {
    int maxHeight, ans;
    private void solution(TreeNode root, int height){
        //Checking if it is the leaf node and also if it is the last row. 
        //We are checking the last row based on the height of the tree.
        if(root.left == null && 
          root.right == null){
            if(height > maxHeight){
                maxHeight = height;
                ans = root.val;
            }
            return;
        }
        
        //Recursively traversing for the final row if child exists.
        if(root.left != null)
            solution(root.left, height+1);
        if(root.right != null)
            solution(root.right, height+1);
        
    }
    public int findBottomLeftValue(TreeNode root) {
        maxHeight = -1;
        //Calling helper method that finds the leftmost node in the tree.
        solution(root, 0);
        return ans;
    }
}

The time complexity for the above approach is O(n) because each node is visited exactly once. The space complexity is O(n) in the worst case because of the recursion call stack (for a skewed tree).
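
Note - if recursion depth is a concern on a very deep (skewed) tree, a level-order traversal is a common alternative. The sketch below is not part of the solution above and assumes the same TreeNode class; by enqueuing the right child before the left, the last node polled from the queue is the leftmost node of the last row -

import java.util.ArrayDeque;
import java.util.Queue;

class SolutionBFS {
    public int findBottomLeftValue(TreeNode root) {
        Queue<TreeNode> queue = new ArrayDeque<>();
        queue.add(root);
        TreeNode node = root;
        while(!queue.isEmpty()){
            node = queue.poll();
            //Right child first, so the last node polled is the
            //leftmost node of the final row.
            if(node.right != null) queue.add(node.right);
            if(node.left != null) queue.add(node.left);
        }
        return node.val;
    }
}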

3. Given a root of the binary tree, a node X in the tree is called good if there are no nodes with values larger than X along the route from root to X.

Write a program in which the number of good nodes in the binary tree should be returned.

Example - Consider a tree whose good nodes have the values [7, 8, 9, 7]: for each of these nodes, no node on the path from the root to it has a larger value, so the program returns 4.

To solve this problem, we traverse every node while passing down the maximum value seen so far on the path from the root. At each node, we compare the node's value with this maximum: if the node's value is greater than or equal to it, we increment the count and pass the node's value down to both children as the new maximum. So the code for this approach will be -

class Solution {
    //Global variable that keeps count of the good nodes.
    int ans;
    private void solution(TreeNode root, int val){
        //When the node's value is at least the maximum value on the path so far
        if(root.val >= val){
            ans++;
            val = root.val;
        }
        //Recursively calling the solution if the child node exists.
        if(root.left != null)
            solution(root.left, val);
        if(root.right != null)
            solution(root.right, val);
    }
    public int goodNodes(TreeNode root) {
        //Calling  helper method to count the good node.
        solution(root, root.val);
        return ans;
    }
}

The time complexity for the above approach is O(n) because every node is visited exactly once. Because of the recursion call stack, the space complexity is O(n) in the worst case.
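
A small hypothetical usage sketch; it assumes a TreeNode(int val, TreeNode left, TreeNode right) constructor, which is not shown above -

//Tree:    3
//        / \
//       1   4
//      /   / \
//     3   1   5
//The good nodes are the root 3, the leaf 3, 4, and 5, so the answer is 4.
TreeNode root = new TreeNode(3,
        new TreeNode(1, new TreeNode(3, null, null), null),
        new TreeNode(4, new TreeNode(1, null, null), new TreeNode(5, null, null)));
System.out.println(new Solution().goodNodes(root)); //prints 4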

4. Create a simple version of Twitter in which users may submit tweets, follow/unfollow other users, and view the 10 most recent tweets in their news feed.

Use the Twitter class as follows:

1. Twitter() creates a new Twitter object.

2. void postTweet(int userId, int tweetId) Creates a new tweet with ID tweetId posted by the user userId. Each call to this method uses a distinct tweetId.

3. List<Integer> getNewsFeed(int userId) Returns the ten most recent tweet IDs in the user's news feed. Each item in the news feed must have been posted either by a user whom this user follows or by the user themselves. Tweets must be ordered from most recent to least recent.

4. void follow(int followerId, int followeeId) The user with ID followerId starts following the user with ID followeeId.

5. void unfollow(int followerId, int followeeId) The user with ID followerId unfollows the user with ID followeeId.

The classes and methods are already defined; we only need to implement the logic. We can use a HashMap from user ID to a User node, so any user can be looked up in constant time. Similarly, each tweet is represented by a Tweet node that records the tweet ID and the ID of the user who posted it. So the solution can be -

class Twitter {
    //Represents an individual user and the set of users they follow.
    private class User{
        int userID;
        HashMap<Integer, Boolean> followings;
        User(int id){
            userID = id;
            followings = new HashMap<>();
        }
    }
    
    //Represents an individual tweet and the user it belongs to.
    private class Tweet{
        int tweetID, userID;
        Tweet(int userID, int tweetID){
            this.tweetID = tweetID;
            this.userID = userID;
        }
    }
    
    //List that stores every tweet, in posting order.
    List<Tweet> tweets;
    
    //Map to get the user details in constant time.
    HashMap<Integer, User> map;
    
    public Twitter() {
        map = new HashMap<>();
        tweets = new ArrayList<>();
    }
    
    public void postTweet(int userId, int tweetId) {
        //If the user doesn't exist yet, create one.
        if(!map.containsKey(userId))
            map.put(userId, new User(userId));
        
        //Adding the tweet to the global list of tweets.
        tweets.add(new Tweet(userId, tweetId));
    }
    
    public List<Integer> getNewsFeed(int userId) {
        List<Integer> feeds = new ArrayList<>();
        int n = tweets.size()-1;
        int count = 0;
        
        //Loop that collects up to 10 of the most recent tweets posted
        //by the user or by users they follow.
        
        while(n >= 0 && count < 10){
            int tweetID = tweets.get(n).tweetID;
            int userID = tweets.get(n).userID;
            
            //Checking whether the querying user follows the user to whom
            //the tweet belongs (guarding against a user never created).
            User user = map.get(userId);
            boolean exist = user != null && user.followings.containsKey(userID);
            if(userId == userID || exist){
                feeds.add(tweetID);
                count++;
            }
            n--;
        }
        return feeds;
    }
    
    public void follow(int followerId, int followeeId) {
        
        //Creating the follower and the followee if they do not exist
        //yet, then recording the follow relationship.
        if(!map.containsKey(followerId))
            map.put(followerId, new User(followerId));
        
        if(!map.containsKey(followeeId))
            map.put(followeeId, new User(followeeId));
        
        (map.get(followerId)).followings.put(followeeId, true);
    }
    
    public void unfollow(int followerId, int followeeId) {
        //Creating the follower and the followee if they do not exist
        //yet, then removing the follow relationship if present.
        if(!map.containsKey(followerId))
            map.put(followerId, new User(followerId));
        
        if(!map.containsKey(followeeId))
            map.put(followeeId, new User(followeeId));
        
        (map.get(followerId)).followings.remove(followeeId);
    }
}

postTweet, follow, and unfollow run in O(1) expected time. getNewsFeed scans the tweet list from the most recent tweet backwards until it has collected 10 matching tweets, so although at most 10 tweets are returned, the scan is O(T) in the worst case, where T is the total number of tweets posted.
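
A short usage sketch of the class above -

Twitter twitter = new Twitter();
twitter.postTweet(1, 5);                     //user 1 posts tweet 5
System.out.println(twitter.getNewsFeed(1));  //prints [5]
twitter.follow(1, 2);                        //user 1 follows user 2
twitter.postTweet(2, 6);                     //user 2 posts tweet 6
System.out.println(twitter.getNewsFeed(1));  //prints [6, 5]
twitter.unfollow(1, 2);                      //user 1 unfollows user 2
System.out.println(twitter.getNewsFeed(1));  //prints [5]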

5. Write a program that checks whether all asteroids can be eliminated: if so, return true; otherwise, return false.

You are given an integer mass that represents a planet's initial mass. You are also provided with an integer array called asteroids, where asteroids[i] represents the mass of the ith asteroid.

You may smash the planet into the asteroids in any order you like. If the planet's mass is greater than or equal to an asteroid's mass, the asteroid is destroyed and the planet gains the asteroid's mass. Otherwise, the planet is destroyed.

One solution is to sort the asteroid array. After sorting, we always smash the planet into the smallest remaining asteroid, so the planet gains as much mass as possible before facing the larger ones. If at any point the planet's mass is less than the smallest remaining asteroid, the planet is destroyed and we return false. So the solution can be -

public boolean asteroidsDestroyed(int mass, int[] asteroids) {
        //Sorting so the planet always faces the smallest remaining asteroid.
        Arrays.sort(asteroids);
        //Using long because the accumulated mass may exceed the int range.
        long planetMass = mass;
        for(int i = 0; i < asteroids.length; i++){
            //If the planet cannot destroy even the smallest remaining
            //asteroid, its mass can never grow further, so it is destroyed.
            if(planetMass < asteroids[i])
                return false;
            //The asteroid is destroyed and the planet absorbs its mass.
            planetMass += asteroids[i];
        }
        //Every asteroid has been destroyed.
        return true;
    }

Sorting takes O(n*log n) time and the subsequent scan is O(n), so the overall time complexity of the solution is O(n*log n).
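
Two quick checks of the method above (assuming it is called from within the same class) -

//Sorted order is [3, 5, 9, 19, 21]; the mass grows 10 -> 13 -> 18 -> 27
//-> 46 -> 67, so every asteroid is destroyed.
System.out.println(asteroidsDestroyed(10, new int[]{3, 9, 19, 5, 21})); //true
//The planet reaches mass 22 but cannot destroy the asteroid of mass 23.
System.out.println(asteroidsDestroyed(5, new int[]{4, 9, 23, 4}));      //false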

6. You have a single-tab browser in which you begin on the homepage and can navigate to another URL, go back in time a certain number of steps, or move ahead in time a certain number of steps.

Implement the BrowserHistory class as follows:

1. BrowserHistory(String homepage) initializes the object using the browser's homepage.

2. void visit(String URL) Visits URL from the current page. It clears all of the forward history.

3. String back(int steps) Moves steps back in history. If you can only go back x steps and steps > x, you go back only x steps. Returns the current URL after moving back at most steps.

4. String forward(int steps) Moves steps forward in history. If you can only go forward x steps and steps > x, you go forward only x steps. Returns the current URL after moving forward at most steps.

Since the class and method signatures are already given, we only need to implement the logic. We can store the URLs in an array used as a stack and, on each operation, adjust two pointers into it. So, the solution is -

class BrowserHistory {
    //Stack that stores the URL.
    String[] stack;
    //additional pointer curr, used to manage back and forward.
    int top, curr;
    public BrowserHistory(String homepage) {
        //The original problem guarantees at most 5000 total calls,
        //so 5001 slots are enough to hold every visited URL.
        stack = new String[5001];
        stack[top] = homepage;
    }
    
    public void visit(String URL) {
        //Pushing the new URL; setting top = curr discards any forward history.
        stack[++curr] = URL;
        top = curr;
    }
    
    public String back(int steps) {
        //Adjusting the pointer while Going Backward.
        while(curr > 0 && steps > 0){
            curr--;
            steps--;
        }
        return stack[curr]; 
    }
    
    public String forward(int steps) {
        //Adjusting the pointer while Going Forward.
        while(curr < top && steps > 0){
            curr++;
            steps--;
        }
        return stack[curr];
    }
}

The time complexity of back and forward is O(steps), because the pointer moves at most steps positions; visit runs in O(1).
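
A short usage sketch of the class above -

BrowserHistory browserHistory = new BrowserHistory("leetcode.com");
browserHistory.visit("google.com");      //from "leetcode.com", visit "google.com"
browserHistory.visit("facebook.com");    //from "google.com", visit "facebook.com"
browserHistory.visit("youtube.com");     //from "facebook.com", visit "youtube.com"
System.out.println(browserHistory.back(1));    //prints "facebook.com"
System.out.println(browserHistory.back(1));    //prints "google.com"
System.out.println(browserHistory.forward(1)); //prints "facebook.com"
browserHistory.visit("linkedin.com");    //visiting clears the forward history
System.out.println(browserHistory.forward(2)); //prints "linkedin.com" (cannot go forward)
System.out.println(browserHistory.back(2));    //prints "google.com"
System.out.println(browserHistory.back(7));    //prints "leetcode.com" (only 1 step available)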

Frequently Asked Questions

1. How do you prepare for an SRE interview?

An applicant for an SRE post must have a fundamental understanding of programming as well as data structures and algorithms. Aside from that, the candidate must have a solid grasp of operating system concepts, networking concepts, and basic database concepts. Experienced applicants must also be familiar with System Design in addition to the competencies listed above. As a result, you should prepare these topics thoroughly before attending the interview. The resources are freely accessible at InterviewBit.

2. What is the salary of SRE?

For freshers, site reliability engineer salaries in India typically range from 6 LPA to 15 LPA. For experienced applicants, the figure varies with the number of years of experience; the typical remuneration ranges from 13 LPA to 26 LPA.

3. What is the future of SRE?

No system is considered excellent if it is not reliable, so it is the Site Reliability Engineer's responsibility to make systems dependable and efficient. Companies now want the best available technology at the lowest possible cost, which gives SREs an excellent opportunity to learn and to demonstrate their abilities.

4. Does SRE require coding?

SREs do not code entire systems the way software developers do, but they must write code to automate specific operations. So yes, SREs write code, though typically for system automation rather than system development.

SRE MCQ Questions

1. What exactly is the need for DevOps?

2. __________ is based on CPU scheduling.

3. "mknod myfifo b 4 16" is a command that-

4. How do you determine the session's resource limits?

5. Systems that allow just one process to run at a time are known as -

6. TCP processes may not be able to write and read data at the same rate. As a result, we require _________ for storage.

7. What are the advantages of automated testing?

8. Which NetWare protocol operates on the OSI model's layer 3 network layer?

9. Which of the following statements is correct?

10. Which of the following statements is false?
