Learn how to *really* use a relational database, relational data modeling, and S...

Mister_Snuggles · on Jan 5, 2020

> Learn how to really use a relational database, relational data modeling, and SQL.

I have to second this. There is so much power in relational databases that is untapped by most developers. The best part about this is that, for the most part, this type of knowledge can apply to multiple databases.

Some specific things that I want to gain a deeper understanding of are window functions[0] and recursive CTEs[1]. In particular, I've used window functions to identify peaks in sensor data (e.g., finding spikes in temperature, water level, etc), which would otherwise require iterating through rows and maintaining a bunch of state. I've never actually written a recursive CTE, but I'm pretty sure it would simplify virtually anything dealing with a hierarchy.

[0] https://www.postgresql.org/docs/12/tutorial-window.html

[1] https://www.postgresql.org/docs/12/queries-with.html

Dowwie · on Jan 5, 2020

This post I made almost six years ago to the day on SO remains relevant: https://stackoverflow.com/questions/20979831/recursive-query...

I hope it helps you.

Mister_Snuggles · on Jan 5, 2020

This is great - thank you!

This is almost exactly what I'd like to do with recursive CTEs in one of my current projects.

mongol · on Jan 5, 2020

SQL seems to be the most long-lasting skill in the IT industry. Definitely worthwhile to learn well.

LandR · on Jan 5, 2020

In the first 15+ years of my career I never used or understood recursive ctes in sql. Then finally learned it and used them multiple times in last year or so.

They can be incredibly helpful once you grok them! And recognise when they can be used.

kragen · on Jan 6, 2020

I'm not sure recursive CTEs simplify things. Yes, they allow you to express Turing-complete computations in SQL, including indefinite and parallel iteration. But if you weren't going to express those computations in SQL, you'd probably express them in Python or Lua or JS. Is doing them in SQL really going to be an improvement? So far I have not impressed with the results.

shantly · on Jan 6, 2020

My experience has been, especially if your application layer's in a scripting language, pushing as much data-fiddling to the DB as possible will save you serious performance headaches down the road, even if the application layer implementation looks OK at first. They're all really slow and memory-inefficient, and often moving that stuff to the application layer also means more queries (else, typically, why not do it in the DB?) which means more network latency, which an be a real killer. In a lot of cases fixing the performance means, at the very lease, re-implementing a bunch of what your DB already does to support fast & efficient data manipulation.

I've also seen the application change on top of the DB way more than I've seen the reverse, so I'm inclined to avoid putting data manipulation in the application when possible. That way it's there, for free, when we need a second application to access the same thing, or when we break off some chunk of a program into its a separate service and re-write it in Go because it turns out to be a performance bottleneck, or WTF ever.

UK-Al05 · on Jan 6, 2020

The problem is when the database becomes the performance bottleneck.

Scaling a database is incredibly difficult, and requires a lot of expertise.

Some of the biggest engineering projects I've been a part of is removing a central db that everybody connects to in large companies.

kragen · on Jan 6, 2020

My past experience generally agrees with yours, but I'm not confident that it generalizes to recursive CTEs.

cjf4 · on Jan 5, 2020

Third- for an analysis project, I had to look ahead to the next transaction. Tried a bunch of implementations, ranging from offset self joins to iterators to recursion.

Turns out the fastest and simplest solution, by far, was a mix of window functions and CTEs directly in the database.

steve_adams_86 · on Jan 5, 2020

I inherited a code base that has major issues stemming from the developers not realizing what postgres can do.

It's reliant on Kafka with a bizarrely complex queue system for answering questions from users automatically in a distributed, scalable fashion. It has schema-less messages shooting around referencing database rows and tables with zero certainty that anything will exist when the message arrives. It works, but it's a real mess and it breaks remarkably easily.

There's nothing about it that couldn't have been done more easily and perfectly scalably enough with a single database. The product will probably never be large enough to need something like Kafka.

I really, really agree with you. Better database knowledge would have put us weeks ahead on this project already, and it's still very early.

alimw · on Jan 6, 2020

This is exactly what I'd expect of something called Kafka.

throw51319 · on Jan 6, 2020

Isn't the DB completely separate from solving this problem? Why was Kafka used?

steve_adams_86 · on Jan 8, 2020

My thinking here is that had the original devs had a better grasp on redis and postgres, they never would have tried using Kafka in the first place. I can't imagine the problem ever requiring the throughput of Kafka, and there would likely be several other scaling issues in the way of utilizing Kafka to its full potential anyway.

I'm pretty sure a redis-based queue like Bull (https://github.com/OptimalBits/bull) would have sufficed for queuing message responses directly on the server (or multiple instances of the server), and while Kafka works fine for long term storage of logs, our use case for the data makes it so it would be far better stored directly in postgres.

Postgres is apparently also a decent pub/sub solution, though I'm not sure if it's superior to Kafka in this case.

The worst part is that the alternative architecture using a redis queue and postgres for message history is very simple, easy to maintain, benefits from the ability to normalize data, and is comfortably boring. Kafka is not that. It's a miserable beast sometimes, and it presents hurdles all the time for many of us. It's good at what it does and people should consider it (or Pulsar) if their problem requires a high throughput message broker. For everyone else, it's a really risky investment for small or no returns over alternatives. It's the worst decision the developers made in this application by a wide margin.

vishnu_ks · on Jan 5, 2020

Since Zulip was mentioned, I like to point folks who are interested more to the Architecture overview docs of Zulip. The docs has details on how Zulip make use of Django, PostgreSQL, Redis, Tornado, RabbitMQ etc for building a scalable chat application.

https://zulip.readthedocs.io/en/latest/overview/architecture...

Zulip is Open Source, so do take a look at our GitHub page if you folks want to dive deeper or want to get some hands on experience. We are a welcoming community to new contributors :)

https://github.com/zulip/zulip

Disclaimar: I work at Zulip.

ptr · on Jan 5, 2020

I'd start with CMU's "Intro to Database Systems", their lectures are on youtube. Highly recommended both for the depth and how Andy Pavlo presents the topic. https://www.youtube.com/watch?v=oeYBdghaIjc&list=PLSE8ODhjZX...

skovorodkin · on Jan 5, 2020

This amazing course is not about using database systems, but about making them.

wolfoftheweb · on Jan 5, 2020

What do you recommend for aggregating and scraping the data? I’ve been working with PyCharm and BeautifulSoup4.

Also, any suggestions for the best ways to apply the data to a website if the data is being refreshed daily? I’ve been using csv files to pass the variables into a Wordpress theme / post but it seems like building something from scratch would be more efficient in the long term.

scarface74 · on Jan 5, 2020

Python isn't known for high performance

As was said by one of the original Twitter architects, defending the choice of Ruby against people who were saying that it was at the root of all of their performance problems, for any well capitalized company, the language rarely makes a difference.

A stateless web server is a “embarrassingly parallelizable”. The speed or lack there of your runtime is usually not a make or break business decision.

ipsun4 · on Jan 6, 2020

That's all fine if you stay stateless. Once a well-meaning developer introduces local application state into your web app or adds a feature that locks your database, your web server is no longer "embarrassingly parallelizable". This doesn't even start to handle issues you get when you use a single-threaded langugage that cannot handle multitasking well. Sidekiq makes money purely because Ruby is single threaded, and its thread will lock if you give it a task that takes too long.

The microservices movement seems to be a misguided reaction to these self-imposed issues where instead of handling proper task management on a process level or with async/concurrency, functionality is split between servers, codebases and infrastructure. This problem was solved with Erlang decades ago with the actor model and supervision, and newer BEAM languages like Elixir and LFE are a pleasure to work in.

You even have this model and concurrency ported to JVM with Akka, C++ with CAF. Granted, the Actor model and the field of concurrency as a whole is solving the problem of enforcing statelessness in a way such that tasks can be efficiently distributed multiple cores, and that no single task locks up your machine for too long.

scarface74 · on Jan 6, 2020

Once a well-meaning developer introduces local application state

And this would break the minute you have more than one web server. How many websites of any consequence are running on only one web server?

Having server side session state that can be shared across servers is a solved problem as is having a load balancer that handles “sticky sessions”. I’m not saying either is a good idea.

Also if you are “locking your database” even if you use a faster runtime, you’re just delaying the inevitable of your scaling limits.

mperham · on Jan 6, 2020

Maybe you don't understand persistent background jobs or maybe I don't understand Erlang.

What happens if you have a bug in a task and it takes a week for your development team to develop a fix? Does that Erlang task live in memory all that time? That's the point of Sidekiq's retry subsystem and persistence in Redis.

Ruby is multi-threaded. My customers buy my commercial versions because they want the more complex features and support.

bufferoverflow · on Jan 5, 2020

So your proposed solution is to throw more hardware at the problem? It surely can work at the small scale, but why do it when you're talking about hundreds of thousands of dollars / mo in servers? Why not choose a proper high-performance language, at least for the parts that are slow?

tomnipotent · on Jan 5, 2020

> So your proposed solution is to throw more hardware at the problem?

Yes. It's usually cheaper and better for the business, as has been proven time and time again. There's a reason the phrase "cheaper to throw hardware at it" is kinda a thing in our industry. It took Facebook a LONG time and many hundreds of developers before they needed to create HipHip/HHVM.

Even at a small startup, my team of 6 costs over $1MM/yr while our two dozen or so EC2 instances and other AWS resources cost less than $25k/yr. Hell yes I'm going to throw hardware at it.

scarface74 · on Jan 6, 2020

Do you know how much hardware you can buy for the fully allocated cost of one developer?

In reality how many companies in the world have hundreds of thousands of dollars a month in servers?

Why not choose a high performance language? Maybe it’s easier to find developers in a certain language, maybe the developers are cheaper for a certain language or it may have a better ecosystem.

If I just needed a simple CRUD app and thought I could get a lot of cheap developers I might choose PHP (hypothetically) because I know PHP developers are cheap.

OkGoDoIt · on Jan 5, 2020

Do you have any examples of a good book that will take me from intermediate to advanced? Most of the guides I’ve found online either assume you’re an absolute beginner or already quite advanced. I’m quite competent with SQL and relational databases, but nearly all of my experience is on Microsoft SQL server. I’ve heard postgresql has a lot of really cool functionality but I would really love a nice, professional, in-depth book that will help me get fully up to speed.

mongol · on Jan 5, 2020

I recommend watching Markus Winand on youtube. Eye-opening for me

dvko · on Jan 5, 2020

Seconded. His book and blog (which is called something like “use the index, Luke”) are really good too.

chirau · on Jan 5, 2020

Database Systems Concepts https://www.db-book.com/db7/index.html

Great book, that's the one that is also used in the reputable CMU Database Systems course, which you can also find on Youtube.

https://www.youtube.com/playlist?list=PLSE8ODhjZXjbohkNBWQs_...

DudeWhoCode · on Jan 6, 2020

I'll highly recommend "A Curious Moon" by Rob Conery [0].

In this book you'll load Cassini space mission data from NASA into Postgres and analyze one of the Saturn's moon. I learnt a lot about Postgres and also about satellite data.

[0] https://bigmachine.io/products/a-curious-moon/

LiamPa · on Jan 5, 2020

Recommend ‘Designing data intensive applications’

https://www.oreilly.com/library/view/designing-data-intensiv...

james_s_tayler · on Jan 5, 2020

This is a great book, but it's not a book on SQL.

throw1234651234 · on Jan 5, 2020

My first day of Postgres experience is that the first query to fetch 1 record takes 5 seconds and subsequent queries take 50ms. There are a 1000 explanations as to why, but I have no idea which is correct. I hate Postgres. I heard it does JSON or something well.

rapphil · on Jan 5, 2020

This presentation will answer all your questions.

https://youtu.be/0cLIhoXjgDE

I love Postgres.

inimino · on Jan 5, 2020

Some power tools take more than a day to learn. Deal with it if you need the power. If you don't need an RDBMS, don't use one.

Nican · on Jan 5, 2020

+1 for relational models. All data access is relational in some way, and knowing how to efficiently access/index the data is an important step for building efficient applications.

james_s_tayler · on Jan 5, 2020

I've come to a realization. Domain data is basically always relational. Configuration data may or may not be. Where it's not, NoSQL is good for configuration data.

jlelonm · on Jan 5, 2020

What resources do you recommend for learning relational databases and key value stores really well?

wenc · on Jan 5, 2020

Relational: Jennifer Widom's Stanford MOOC is often highly recommended.

https://lagunita.stanford.edu/courses/DB/2014/SelfPaced/abou...

radiowave · on Jan 5, 2020

Yup. In terms of the amount of practical benefit it's turned into, it's got to be the best online course I've ever taken.

psp219 · on Jan 5, 2020

Second this. Looking for resources as well. Been working as an engineer for 3 years but still feel this is my weakness due to ORMs

smoe · on Jan 5, 2020

What helped me a lot getting away from being limited by ORM capabilites while on the job:

- Get the raw SQL of some slow or memory intensive queries the ORM produces and try to optimize them by hand. Try different approaches to get the same result, measure and understand them by using 'explain' and visualizers like http://tatiyants.com/pev/

This works great together with libraries like sqlalchemy where the ORM is optional and build upon a abstraction of SQL that you can use directly. This way you can use the ORM for the 80% where it works just fine and hand write the rest in Python without having to deal with raw, fairly inflexible SQL in application code.

- Try moving workload from the application to the database. In the past I often ended up doing ORM queries to get a large numbers of objects and then further process and even join them in Python. In Most of these cases doing it in the database is way more efficient and lets you get away with a slow language, synchronous requests running on small servers for a surprisingly long time.

- Do business intelligence type queries for reports and monitoring. Through doing this I discovered a lot of database features that I didn't encountered commonly in web application development but that nonetheless came in handy several times for it. Also since you often need to combine data in ways that it wasn't necessarily originally designed for, you really need to start thinking about how your data is structured and how to get it in and out efficiently.

- Don't immediately dismiss relational databases for tasks where they might not be the infamous "best tool for the job". Chances are, that the relational database you already have in place is good enough for your use case and that it will safe you headaches of setting up, understanding, synchronizing and maintaining a entirely different db system. E.g. Vanilla Postgres for timeseries data worked just fine for us for years before moving to a more specialized solution with TimescaleDB. Also used it with success for non-relational data, simple graphs, key/value stores and for queues.

darkteflon · on Jan 5, 2020

This is a great answer - lots of practical advice. Thanks.

ghoshbishakh · on Jan 5, 2020

ORMs are great for fast prototyping in my opinion. With Django and its ORM I can build a web app in a couple of days. Scaling is of course a totally different challenge.

bosser · on Jan 5, 2020

+1 started my third month as data engineer straight out of college and notice I am missing some solid RDB resources

crease · on Jan 5, 2020

What topics/ideas would say someone needs to understand in order to really understand relational databases? Also, can you recommend any resources?

Mister_Snuggles · on Jan 5, 2020

I don't have any good resources on the data modeling side, but on the SQL side the PostgreSQL manual[0] is really good. Even when I'm working with an Oracle database, I often find myself looking at the PostgreSQL documentation.

[0] https://www.postgresql.org/docs/12/index.html

bufferoverflow · on Jan 5, 2020

Using Python for high-performance anything is a bad choice. You will quickly bottleneck at the code execution speed, even if it's just to query some cache.

If you don't agree, make a simple "hello world" endpoint and see how many req/sec you get. Then compare to Rust / Java / C++ / Go. It will be radically different.

tomnipotent · on Jan 5, 2020

Someone already has, and frankly Python does just find for a large class of problems. Also, "high performance" is a poorly ambiguous term.

https://www.techempower.com/benchmarks/#section=data-r18&hw=...

wmock · on Jan 7, 2020

What are good resources for learning the basics of relational databases, and then learning the more intricate parts of it?

unk1102 · on Jan 6, 2020

+1 My focus is on OLAP data modelling SQL. Just curious do you know any practical data modelling learning tools?

morenoh149 · on Jan 6, 2020

check out https://theartofpostgresql.com/ as another resource

tus88 · on Jan 5, 2020

What year is it? :P