What is a Data Structure?

March 26, 2023

Overview

Everything in the world can be modeled by some type or combination of data structures. In a physical device, data structures are represented by containers in memory with references that connect them to create the overarching structure. Data structures are used to model different types of relationships between objects (e.g. containers). They could model actual objects in the real world (like a queue of people or a map of roads), abstract objects in the real world (like events in a schedule), or they could model other digital objects (like application logs).

In-Memory Representation of Data Structures

Different data structures are different ways of organizing related data. The physical representation in memory that was mentioned above is just one way to think of a data structure. This knowledge definitely helps in understanding the speed of traversing different data structures because of the difference between sequential and random access speed of physical memory. It's always much faster to traverse a data structure that exists as contiguous blocks in memory rather than blocks that are scattered all over the physical memory. In most programming languages, arrays exist contiguously, whereas sets (and most data structures) are not.

In-memory Representation of a Poor Implementation of a Set Using a Hash Table that Solves Collisions Using Linked Lists:

Below is an in-memory representation of a Set backed by a Hash Table. A Hash Table is basically an array with buckets. These buckets are linked lists that exist for the purpose of solving hash collisions. A hash collision occurs when two objects are processed through a hash table's hash function and resolve to the same index in the array. When this happens, some solution is required to fit multiple elements in the same slot.

Representation in memory of a poor implementation of a set using a hash table

* Units here can be thought of as addressable units, however, not to scale (since we don't know the size of each object). Note: the smallest addressable unit in RAM (via the interface with the operating system) is a byte, whereas the smallest addressable unit (via the OS) in an SSD is a sector (0.5-4kb).

Explaining the Numbers Above:

1. Set's metadata, such as:

a. Total number of elements in the set.
b. Number of buckets (size of array).
c. Load factor -- total number of elements in the set divided by the size of the array (number of buckets) before the size of the array is increased.
d. Hash function -- determines which slot in the hash table (array of linked lists) an element belongs.
e. Other state information like flags, version numbers, etc.

2.-7. Array containing heads of linked lists where the heads are the first element in the bucket, linking to subsequent elements via linked list.

8.-16. Linked list nodes that connect directly or indirectly to the head node contained in the array which, alongside the hash function in the metadata memory unit (unit number 1) altogether makes a hash table, which backs this implementation of a Set data structure.

This above implementation of a hash table (backing our Set data structure) is terrible because its load factor is above 2.0 whereas a good load factor would be closer to 0.75 (more slots in the array than elements in the array). Anytime there's a hash collision, requiring multiple elements in the same slot by creating a linked list (as seen above), the actual access time of the hash table is degraded to above O(1).

In-memory Representation of a Good Implementation of a Set Using a Hash Table that Currently Doesn't have any Hash Collisions:

Representation in memory of a good implementation of a set using a hash table

Thinking about the physical representation in memory is not always how one thinks of data structures. This is especially true for advanced data structures like graphs and trees where diagrams rarely depict the physical representation and typically focus on the references connecting the nodes of data (and the nodes themselves). In these more advanced data structures, it's more helpful to think of data structures by their conceptual structures, as in the diagram below of a graph that includes a cycle and bi-directional edges between its nodes:

Nuances

Software.Land blog posts make an effort to avoid paraphrasing or regurgitating existing content. Instead, the aim is to shed light on nuances and gaps of knowledge that exist on the internet, and in the zeitgeist. Across the internet, data structures are often broken down into two main categories: linear and non-linear. However, in practice, this distinction is often an afterthought.

The best way (and possibly, the only way) to develop an intuition for solving problems is practice. Patterns, rhythms, and similarities begin to emerge with enough time and breadth of problems solved. You eventually realize that the only special thing about computer science is how basic or fundamental it is (based on how easily our world can be modeled by it). With enough practice, it can even feel one step away from common sense.

Approaching Algorithmic Problems

Simpler Problems

Simpler problems can be approached from the lower-level decisions of which data structure or which data structure traversal method. It's even very common (especially in embedded systems in the real-world) to have a data structure provided as input that must be operated on in-place. This could be an easy or difficult task depending on the problem. One step away from an in-place algorithm is creating a data structure that matches the input, but is used for work and returned as response.

If Building a Data Structure is Required for the Solution

Very often, the problem will require building a new data structure from the input. Clues for the correct type of data structure should be discoverable from either:

the structure of the input data, or
the relations that are being expressed in the problem.

The second step in this approach is typically deciding how to traverse this data structure. Each data structure has its own set of well-known traversal methods. These traversal methods can always be adjusted or combined given the use case. One may try to find a way to limit the search space by creating a condition to avoid traversing some nodes or omitting them entirely during building of the data structure (from the input data). These are just some very basic ways one can increase the efficiency of traversal over a data structure to solve some problem being modeled. Acquiring full breadth and depth of strategies is outside the scope of this simple blog post.

Further Learning

To learn more about Algorithms and Data Structures (including strategies for tackling advanced problems), the following courses are excellent resources:

CS50: Introduction to Computer Science | Harvard (Basic)
Data Structures and Performance | UC San Diego (Basic)
Advanced Data Structures in Java | UC San Diego (Intermediate)
Algorithms Specialization | Stanford (Advanced)
Introduction To Algorithms | MIT (Basic to Advanced)

For lists of data structures: