Rethinking Data: A Journey of Relationships
Most data can be separated into two categories: data that is related to other data in the dataset, and data that is not. For years, we've organized data into hierarchical and relational databases. While this generally works, putting that data into a structure built around relationships, like a directed graph, lets us generate insights in a simpler and more logical fashion.
In this post, we will use two different elements: a Person element which represents a person and an Address element which represents a Person's address.
In a relational database, we'd set up two tables, where the Person table would have a foreign key into the Address table.
| Table | Person |
|---|---|
| id | primary key |
| name | varchar |
| address_id | foreign key on (Address) |

| Table | Address |
|---|---|
| id | primary key |
| street | varchar |
| city | varchar |
| state | varchar |
| post_code | varchar |

We could then get each person's address by writing a fairly simple SQL query, like this one:
SELECT p.*, a.street, a.city, a.state, a.post_code
FROM Person p
JOIN Address a
ON a.id = p.address_id
Or we could find all persons living within a specified post code:
SELECT p.*
FROM Person p
JOIN Address a
ON a.id = p.address_id
WHERE a.post_code = '12345'
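To try the two queries above end to end, here is a minimal sketch using Python's built-in `sqlite3` module. The sample rows, the helper names `build_db`, `persons_with_addresses` and `persons_in_post_code`, and the use of `TEXT` in place of `varchar` are assumptions made for illustration; only the table and column names come from the schema above.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create the Person/Address schema in memory with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Address (
            id INTEGER PRIMARY KEY,
            street TEXT, city TEXT, state TEXT, post_code TEXT
        );
        CREATE TABLE Person (
            id INTEGER PRIMARY KEY,
            name TEXT,
            address_id INTEGER REFERENCES Address(id)
        );
        INSERT INTO Address VALUES
            (1, '1 Main St', 'Springfield', 'IL', '12345'),
            (2, '9 Oak Ave', 'Shelbyville', 'IL', '67890');
        INSERT INTO Person VALUES
            (1, 'Ada', 1), (2, 'Grace', 1), (3, 'Linus', 2);
        """
    )
    return conn


def persons_with_addresses(conn):
    """First query: every person joined to their address."""
    return conn.execute(
        """
        SELECT p.name, a.street, a.city, a.state, a.post_code
        FROM Person p
        JOIN Address a ON a.id = p.address_id
        ORDER BY p.id
        """
    ).fetchall()


def persons_in_post_code(conn, post_code):
    """Second query: names of persons living in the given post code."""
    rows = conn.execute(
        """
        SELECT p.name
        FROM Person p
        JOIN Address a ON a.id = p.address_id
        WHERE a.post_code = ?
        ORDER BY p.name
        """,
        (post_code,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_db()
    print(persons_with_addresses(db))
    print(persons_in_post_code(db, "12345"))
```

The post code is passed as a bound parameter rather than written into the SQL string, which is the usual way to filter on user-supplied values.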
This is simple enough, but adding information to either table can require major changes. Changing an existing field can likewise become an extremely complex and expensive issue. My company recently had a case where a development team needed to update a field in an existing database that had over 2 billion rows. That update took 18 hours to complete and required our entire production application to be offline for that time.
Enter graph theory. Graphs use nodes, which describe data, and edges, which describe the relationships between data. This seemingly simple difference lets us reason about data in a different way and gain insights that we might otherwise miss.
We will use a slightly more complex definition for our graph, in which Person and Address are nodes and City, State and PostCode become nodes of their own rather than columns on Address.
The graph is slightly more complicated than the equivalent relational schema would be, but it is also easier to reason about. We can clearly see that a Person HAS_A Address, and that Addresses are IN Cities, States and PostCodes.
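The shape described here can be written down as a plain list of edges. The following is only a sketch in Python, not how a graph database stores data: each node is a hypothetical `(label, value)` pair, each edge is a `(source, relationship, target)` triple, and all sample values are made up.

```python
from collections import defaultdict

# Nodes are (label, value) pairs; the values are invented sample data.
ada = ("Person", "Ada")
home = ("Address", "1 Main St")

# Edges are (source, relationship, target) triples.
EDGES = [
    (ada, "HAS_A", home),
    (home, "IN", ("City", "Springfield")),
    (home, "IN", ("State", "IL")),
    (home, "IN", ("PostCode", "12345")),
]


def outgoing(edges):
    """Index edges by source node: node -> [(relationship, target), ...]."""
    index = defaultdict(list)
    for source, relationship, target in edges:
        index[source].append((relationship, target))
    return index


def describe(edges):
    """Render each edge as a label-level pattern such as (Person) -[:HAS_A]-> (Address)."""
    return [
        f"({source[0]}) -[:{relationship}]-> ({target[0]})"
        for source, relationship, target in edges
    ]


if __name__ == "__main__":
    for line in describe(EDGES):
        print(line)
```

Reading the edge list top to bottom gives the same sentences as the prose: a Person HAS_A Address, and that Address is IN a City, a State and a PostCode.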
Selecting items is slightly easier:
SELECT a, p
WHERE (p:Person) -[:HAS_A]-> (a:Address)
And finding all Persons living within a specified PostCode looks like this:
SELECT p
WHERE (p:Person) -[:HAS_A]-> (a:Address) -[:IN]-> (pc:PostCode {value:'12345'})
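To make the pattern above concrete, here is a rough sketch of what matching it amounts to, written in plain Python over a list of edge triples. This is an illustration of the idea only, not how any particular graph engine evaluates the query; the `(label, value)` node encoding, the sample data and the function name `persons_in_post_code` are all assumptions.

```python
# Nodes are (label, value) pairs; edges are (source, relationship, target).
# All sample data is invented for illustration.
ADDR_1 = ("Address", "1 Main St")
ADDR_2 = ("Address", "9 Oak Ave")

EDGES = [
    (("Person", "Ada"), "HAS_A", ADDR_1),
    (("Person", "Grace"), "HAS_A", ADDR_1),
    (("Person", "Linus"), "HAS_A", ADDR_2),
    (ADDR_1, "IN", ("PostCode", "12345")),
    (ADDR_2, "IN", ("PostCode", "67890")),
]


def persons_in_post_code(edges, post_code):
    """Match (p:Person) -[:HAS_A]-> (a:Address) -[:IN]-> (pc:PostCode {value: post_code})."""
    wanted = ("PostCode", post_code)

    # Second hop of the pattern: Address nodes with an IN edge to the PostCode.
    addresses = {
        source
        for source, relationship, target in edges
        if relationship == "IN" and source[0] == "Address" and target == wanted
    }

    # First hop of the pattern: Person nodes with a HAS_A edge to one of those.
    return sorted(
        source[1]
        for source, relationship, target in edges
        if relationship == "HAS_A" and source[0] == "Person" and target in addresses
    )


if __name__ == "__main__":
    print(persons_in_post_code(EDGES, "12345"))
```

Each arrow in the query corresponds to one filtering step over the edges, which is part of why the pattern reads so directly.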
This is human-readable and easy to reason about, compared to JOINs, which require understanding how SQL JOIN statements work.
There are definitely challenges to rethinking how data works and interacts in a graph compared to similar approaches in a relational database.
First, designing the graph requires a better understanding of how your data is being used. Some data can be stored as metadata on Nodes, while other information can be stored as separate Nodes. Having to make that choice adds complexity to modeling the data.
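As a small illustration of that design choice, the sketch below models a post code both ways in plain Python: once as metadata on each Address node, and once as a separate node reached through an IN edge. The node keys, sample values and function names are hypothetical, chosen only to show the trade-off.

```python
# Option 1: the post code is metadata (a property) on each Address node.
ADDRESS_NODES = {
    "addr-1": {"street": "1 Main St", "post_code": "12345"},
    "addr-2": {"street": "9 Oak Ave", "post_code": "67890"},
    "addr-3": {"street": "4 Elm Rd", "post_code": "12345"},
}


def addresses_by_property(nodes, post_code):
    """Scan every Address node and compare its post_code property."""
    return sorted(
        key for key, props in nodes.items() if props["post_code"] == post_code
    )


# Option 2: each post code is its own node, linked from Addresses by IN edges.
IN_EDGES = [
    ("addr-1", "IN", "postcode-12345"),
    ("addr-2", "IN", "postcode-67890"),
    ("addr-3", "IN", "postcode-12345"),
]


def addresses_by_node(edges, post_code):
    """Collect the sources of IN edges that end at the requested PostCode node."""
    target = f"postcode-{post_code}"
    return sorted(
        source
        for source, relationship, dest in edges
        if relationship == "IN" and dest == target
    )


if __name__ == "__main__":
    print(addresses_by_property(ADDRESS_NODES, "12345"))
    print(addresses_by_node(IN_EDGES, "12345"))
```

Both functions return the same addresses. The difference is whether the post code is a value you filter on or a node you can start a traversal from, and which one fits depends on how the data is queried.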
Second, learning a new syntax and understanding the underlying concepts can be both daunting and difficult. It may be easier to reason about the data once you see it, but designing and implementing those queries can be more difficult than just writing SQL queries.
Moving forward, rethinking how data works, interacts and evolves is going to be an important step for most software professionals. This is not a be-all and end-all solution for every problem, but if you find that you need better insight into your data, a graph-based solution might be worth considering. You will have to weigh the pros and cons no matter what.
In a future article, I will go into more detail on what types of information you can glean from a graph, and give an overview of interesting insights that can help drive innovation across the spectrum.