In my last post How to prepare for Data Engineer Interviews, I wrote about how one can prepare for the Data Engineer Interviews, and in this blog post, I am going to provide the Top 25 Basic data engineer interview questions asked frequently and their brief answers. This is typically the first round of the Interview where the interviewer just wants to access whether you are aware of basic concepts or not and therefore you don't need to explain it in detail. Just a single statement would be sufficient. Let's get started
Checkout the 5 Key Skills Data Engineers need in 2023
A. Programming
1. What is the Static method in Python?
Static methods are the methods that are bound to the Class rather than the Class's Object. Thus, it can be called without creating objects of the class. We can just call it using the reference of the class. Also, all the objects of the class share only one copy of the static method.
2. What is a Decorator in Python?
Decorators provide additional functionality to the functions without directly changing their definition of them. You can define a decorator for a function using the @ symbol. Learn more about Decorators in Python.
3. What is a Dictionary in Python?
Dictionaries are key-value pairs data structures in python. It's built on the concept of Hash Tables. It is quite efficient for search, deletion, and insertion as its time complexity is O(1). Key must be a single immutable element like string, tuples, or numbers and value can be any python object such as list, tuples, or integers.
4. Difference between Tuples and List?
The first important difference is List are mutable and Tuple are immutable i.e. you cannot change tuples. Secondly, Lists have dynamic and Tuples have static characteristics and because of that tuples are much faster than Lists.
5. Difference between Array and Lists?
Lists can store heterogeneous elements i.e. elements with different datatypes whereas Arrays can only store the homogenous elements and if the data type of elements in Arrays is different then an "incompatible data type" exception will be thrown.
6. What is NamedTuple and DefaultDict in Python?
Namedtuple are containers like Dictionaries in collections modules, it is similar to Dictionaries except NamedTuple supports both access from key-value and iteration which dictionary lacks. DefaultDict is also a container like Dictionaries except DefaultDict doesn't throw KeyError exception like Dictionaries and it provides a default value for the key that doesn't exist.
7. Explain Generators functions in Python.
Generator functions are like normal functions but rather than return statements, they use the yield keyword to generate the value. It returns a generator object which is iterable and which can be used as an iterator.
B. Data Structures
1. What is Binary Search? What is its time complexity?
Binary search is a searching algorithm on sorted arrays, where if the search key is less than the middle element then we search the key in the left interval otherwise in the right interval. The time complexity of Binary Search is O(logn).
2. What is the time complexity of Merge sort and Quicksort?
The time complexity of both Merge sort and Quicksort is O(nlogn)
3. What are BFS and DFS?
BFS (Breadth-First Search) is a technique to find the shortest path in the graph. It uses a concept of FIFO that uses Queue to store the nodes in the Graphs. When one node is visited and marked, it is dequeued from Queue and its adjacent nodes are stored in Queue and the process repeats until there are no adjacent unvisited nodes. On the other hand, DFS (Depth First Search) uses the concept of LIFO that uses Stack and stores unvisited nodes until there are no further adjacent unvisited nodes for any particular node. At this point, it starts backtracking to traverse all the unvisited nodes.
4. What is memoization?
Memoization is a simple optimization technique that stores the results for expensive function calls in cache and uses it directly from memory rather than computing it again. In Dynamic programming, memoization is a top-down approach that is used to store the most recent state values in cache and use it without having to calculate them again leading to an increased performance of your program.
5. Explain Dynamic Programming in simple words.
Dynamic Programming is basically solving sub-problems and using its results later without having to re-calculate them.
C. Distributed Systems/Databases
1. What is the CAP Theorem?
CAP theorem in distributed systems states that during network failure you can only have Availability (A) or Consistency (C) but cannot have both along with Partition Tolerance (P). So basically, in the Big Data world, you always need to do a trade-off between Availability or Consistency of the systems.
2. What is Sharding in Distributed systems?
Sharding is a process of distributing the data among multiple databases on multiple machines in the cluster. It is necessary when the dataset is too large to just store it in a single database.
3. Explain Master and Slave architecture in distributed systems.
In a distributed system, a cluster is a set of machines connected to each other and sharing the resources and computations among themselves. This setup follows the master-slave architecture in which there is one master node and several slaves nodes. The master node is responsible for distributing resources and tasks among different slave nodes and slave nodes work accordingly. In master-slave databases architecture, all writes go to the master node and reads from slave nodes.
4. What is a NoSQL database and how it is different from Relational Databases?
Checkout What is NoSQL database
5. What are the Columnar Databases? Give Examples.
A columnar database or column-oriented database is a database management system that stores the data in columns rather than rows. So basically in the columnar database, the columns are multiple files like structure and each of the entries in the files represents rows of the table. For Example C1, C2,....,CN are columns files, and one row consists of R1, R2,....RN entries in respective C1, C2...CN columns files. Columnar Databases are beneficial for OLAP (Online Analytical Processing) where it provides faster performance as compared to row-oriented databases.
6. Explain the scenario when you want to use De-Normalized tables.
De-Normalized tables are beneficial for storing and processing Big Data when you need faster and more efficient analytics performance. By De-Normalizing the tables we avoid a lot of complex joins between different tables that can affect the performance especially when the tables have a huge amount of data. Although it is going to consume a huge amount of Disk Data because De- Normalized tables contain lots of redundant data, then again it is a trade-off between performance and disk space. And you should decide depending on your use case.
D. Data Modelling
1. What is Star Schema in Data Modelling?
Star Schema is widely used to develop a Data Warehouse or Data Mart. It is a data modeling concept where you have a big Facts table and various smaller Dimensions tables. This schema model looks like a star as the big Facts table lies in the center surrounded by various smaller Dimensions tables. Basically, the Facts table stores all the quantitative business data, and Dimensions tables store the characteristics of the facts data.
2. What is Snowflake Schema in Data Modelling?
A snowflake schema is a variant of Star Schema. It is also represented in the form of Facts and Dimensions tables just like Star Schema except for the fact that in Snowflake Schema the Dimensions tables are present in Normalized form in multiple related tables to make it less redundant. A snowflake schema is used when you want to save the disk space and when you want to achieve Data Integrity in your model. Although, having Normalized data means low performance due to complex joins between multiple tables.
3. When to use Star schema and when to use Snowflake schema?
The decision of choosing schemas highly depends on your Business use-case. If your use-case demands faster performance and doesn't care much about the Disk Space then Star Schema would be the best choice. On the other hand, if you want to have Data Integrity, more structured Data, and low Disk Space then Snowflake would be the better choice.
4. What are fact and dimensions tables?
The facts table stores all the Quantitative Business Data which can be literally seen as "facts" in Businesses and Dimensions tables store the dimensions or characteristics of these "facts". Primary keys in the Dimensions tables are stored as Foreign keys in the Facts table.
E. Data Engineering/SQL
1. Explain how would you design end-to-end ETL pipelines.
The first step in designing ETL pipelines is to Extract the raw source of the data and store it in the Data lake, the next step is the transformation of the data where important and useful data is extracted and transformed according to the requirement of business use-case and then finally the transformed data is loaded in the Data Warehouse or it is consumed by different applications. One must design reliable and fault-tolerant pipelines, such that you don't lose the important data if in case the pipeline breaks.
2. How would you process a huge volume of data?
To process a huge volume of data one should use an efficient distributed data processing framework like Hadoop, Spark, or Beam (Dataflow). These frameworks will distribute your data and code among different machines in a cluster and process it in parallel. These frameworks can process petabytes of data efficiently. Moreover, some of the frameworks are totally no-ops, so you don't need to manage the clusters and it will auto-scale your clusters with workers based on the volume of the data.
3. What is the difference between GROUP BY and PARTITION BY in SQL?
A GROUP BY normally reduces the number of rows after performing aggregation on the groups whereas PARTITION BY gives aggregated columns for each row in the tables. So if the table has 10 rows and you perform window aggregation using PARTITION BY then the resulting table has 10 rows each with aggregated values. On the other hand, GROUP BY will reduce the rows based on the number of groups formed.
4. Explain different types of JOINS in SQL.
The most common JOINS in SQL are namely INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN. INNER JOIN only joins the tables with the matching rows in both the sides of the join, LEFT JOIN returns all the rows from the left side of the join and matching rows from the right side of the join, the rows for which no matching rows exist on the right side the result set will contain NULL values. RIGHT JOIN is just the opposite of LEFT JOIN where it returns all the rows from the right side of the join. FULL JOIN is the combination of LEFT and RIGHT JOIN where it returns all the rows from the left side of the join as well as the right side of the join.
Also, check out Top 10 SQL Interview Questions to know about the most frequent SQL questions asked in Data Engineer Interviews.
Comments
Post a Comment