Formulus Black’s software, Forsa, enables any application to run fully in memory without modification, which can significantly enhance performance and decrease latency. To see how Forsa could speed up a typical data science project, we built the familiar New York City Taxi data set in both a Forsa-enabled and a non-Forsa environment on identical hardware and compared the execution times for a variety of queries. We found that Forsa enabled many queries to execute dramatically faster, in some cases by more than 80x.
In data science, the process of discovering interesting results is often iterative. Sure, you might get exceptionally lucky and find the single most interesting fact in your data set on your first exploration of the data, but it’s vastly more likely that you’ll reach into the data over and over again to develop more accurate conclusions or to refine your models. Or you might start exploring one aspect of the data only to find that your results suggest that other aspects of the data are vital to your model, or that a related topic is vastly more interesting to explore. This iterative process of analysis and exploration is a major part of the data science process.
In other words, one of the main tasks a data scientist will spend time on is directly and repeatedly accessing and re-accessing the data, which may reside in a variety of file systems, databases, and other data silos in today’s large enterprises. Furthermore, data from these disparate sources must often be cleaned, transformed, and moved before a data science team can even begin developing models. Arguably, the easiest way to speed up the data science process, then, is to simply speed up accessing and processing the source data. Formulus Black’s software, Forsa, does this by enabling all applications and data, including any database, to run fully in-memory without modification.
To see just how much Forsa can speed up analytics, we built an experiment to compare SQL query execution speed when using conventional solid-state storage (a modern NVMe SSD) versus in-memory storage (provisioned by Forsa). To control for other variables, we built two identical databases on otherwise identical virtual machines (VMs), each with 16 GB of system RAM and 16 virtual CPU cores. The VMs were hosted on two separate, identical server environments (full details below). The only difference between the two databases was their storage media: the conventional VM’s storage was taken from two modern NVMe solid-state drives, while the Forsa VM’s storage was provisioned by Forsa from additional server RAM (see Fig. 1).
For the database, we chose the New York City Taxi and Limousine Commission Trip Record Data, a well-known, publicly available data set. We used the GitHub project by Todd Schneider to download and import the roughly 350 GB data set into both PostgreSQL databases. The database includes all the taxi trip data for 2009-2018 in a single table with more than one and a half billion rows, along with some additional related data in other tables. While the database and the queries tested here do not constitute a standardized analytics database benchmark (e.g., TPC-H), working with a real-world data set should make the results a bit more interesting.
For the first round of testing, we used Looker, a popular online data analytics platform, for exploratory analysis and visualization, and to measure the over-network query response times an analyst might experience. We chose Looker because its simple point-and-click interface enables easy data exploration without having to hand-write tedious SQL queries. It also produces nice data visualizations (maps, scatter plots, bar graphs) without the need to set up custom plotting scripts for each different query result. For the purposes of our query timing tests, it is worth noting that Looker does not impact SQL database performance: it configures and sends standard SQL queries as an ordinary database user but does not affect the performance of the database itself. Honestly, we also used Looker because it was quick and easy to set up and use.
We pulled the query response times from Looker’s collected metadata. Because we used the standard, online Looker instance for this test, rather than an on-premises installation of Looker, the Looker metadata of our query timings includes network latencies (for both sending queries and receiving results) in the measurements. Because Looker over the network performed a trivial query (returning the row count for a two-row table in the database) in only 0.2 seconds, we can safely assume the network latencies are generally less than a half-second total.
We set up a query that filtered the taxi drop-off latitude and longitude data according to some likely drop-off locations during the month of April in 2015 to find the average trip distance versus likely airport destination. According to our results, it appears that taxi trips to JFK international airport were, on average, over 4x farther than taxi trips to other destinations.
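A query of roughly this shape could be written as follows. This is an illustrative sketch, not our exact query: the table and column names follow the nyc-taxi-data import schema, and the JFK bounding box coordinates are approximate assumptions.

```sql
-- Sketch: average trip distance for drop-offs near JFK vs. all other
-- drop-offs in April 2015. Coordinates and schema are illustrative.
SELECT
  CASE
    WHEN dropoff_latitude  BETWEEN 40.63  AND 40.66
     AND dropoff_longitude BETWEEN -73.80 AND -73.77
    THEN 'JFK'
    ELSE 'Other'
  END AS destination,
  AVG(trip_distance) AS avg_trip_distance
FROM trips
WHERE pickup_datetime >= '2015-04-01'
  AND pickup_datetime <  '2015-05-01'
GROUP BY 1;
```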
But what we’re really interested in here is how long it took for the query to return results. For the SSD-hosted database, the query returned in 212.07 seconds (3 minutes, 32 seconds), while the Forsa-based database returned the query results in only 5.90 seconds. That’s nearly 36x faster!
We ran a query counting trips by approximate location (latitude and longitude) for the taxi trip pickup locations on a single day in April 2015. This kind of simple query and visualization might be an intermediate step for someone figuring out exactly why the average JFK trip is so much farther than the average trip to other locations. Even without running more complex queries to correlate the pickup and drop-off locations, it is clear that most of the NYC taxi pickups are centered around the two NYC airports and Manhattan island. Likely, the vast majority of trips ending at JFK start in Manhattan, leading to a pretty lengthy trip compared to most of the other destinations on the map.
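The location-count query can be sketched by rounding coordinates into buckets. Again, the table and column names are assumptions based on the nyc-taxi-data schema, and the date is illustrative:

```sql
-- Sketch: trip counts bucketed to ~0.01 degrees of latitude/longitude
-- for pickups on a single day. Schema and date are illustrative.
SELECT
  ROUND(pickup_latitude::numeric,  2) AS lat_bucket,
  ROUND(pickup_longitude::numeric, 2) AS lon_bucket,
  COUNT(*) AS trips
FROM trips
WHERE pickup_datetime >= '2015-04-15'
  AND pickup_datetime <  '2015-04-16'
GROUP BY 1, 2
ORDER BY trips DESC;
```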
As before, our real goal here is to examine the query timing performance. So how did the Forsa in-memory storage perform in comparison to the SSD? In this trial, the SSD-hosted database returned the query in 200.6 seconds (3 minutes, 20 seconds) while the Forsa database returned the query in only 2.38 seconds, or 84.3x faster than the SSD! Because the Forsa timing is dramatically faster than the SSD system, any tiny variation in the Forsa performance (or network delays) can cause a pretty dramatic change in the performance ratios, but over multiple trials on this particular query, we have observed the Forsa system performing 65x-95x faster.
Now, to be fair, the query times vary significantly depending on the exact query made, and not all queries exhibit a greater than 60x speedup. While systematically testing all possible queries is a bit impractical, merely reporting a few nice query timing results alone doesn’t explain much about this wide performance gap.
To explore this gap further, we devised additional tests based on some basic knowledge of PostgreSQL performance, focusing particularly on parallelization. In general, using multiple parallel workers can improve performance by distributing the work across multiple CPU cores. However, the gains from distributing work among parallel workers come with an additional, non-zero CPU cost of aggregating and processing the results collected by all the workers, so the benefits of parallelization don’t always outweigh the costs. To balance the two, a parallel-enabled PostgreSQL database uses a planning function that decides how many parallel workers to launch for each query, based on the configured limits combined with a simple internal predictive model. For our purposes, however, the exact details of the planning function are not important. Instead, we used the following general guideline to design the test: parallelization is especially likely to improve performance when only a small fraction of the data is queried. When the data is heavily filtered, the overhead cost of aggregating multiple worker results is typically smaller than the gains made by spreading the work among many workers.
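The parallelism limits mentioned above are ordinary PostgreSQL settings, and the planner’s choice of workers can be inspected with EXPLAIN. The values below are illustrative, not our exact test matrix:

```sql
-- Standard PostgreSQL parallelism settings (PostgreSQL 10+);
-- values here are examples only.
SET max_parallel_workers = 8;             -- total workers available system-wide
SET max_parallel_workers_per_gather = 4;  -- workers one Gather node may launch

-- EXPLAIN (without executing) shows how many workers the planner chose:
EXPLAIN SELECT COUNT(*) FROM trips
WHERE pickup_datetime >= '2018-01-01';
```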
To test the Forsa-enabled advantage, we varied the fraction of the database filtered by the same type of query and examined how the CPU-overhead cost balances against the faster parallel read performance of the in-memory database storage. For this set of tests, we used 5 simple queries (row counts filtered by 5 different date ranges) to return results for different fractions of the database, and then repeated the same queries under a variety of parallel configurations (we varied max_parallel_workers and max_parallel_workers_per_gather) on both databases. Plots of query timings versus the maximum number of allowed parallel workers for each of the five queries are shown in Figure 3, beneath the table describing the queries.
We used PostgreSQL’s EXPLAIN ANALYZE command to report query execution times in order to avoid including network delays in these measurements. The query planning times were, at most, a little over 10 milliseconds, so they have not been added to the execution times.
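This measurement approach can be reproduced with any of the row-count queries; the date range below is illustrative. EXPLAIN ANALYZE executes the query on the server and reports planning and execution time separately in its output, so network round-trip time is excluded:

```sql
-- Executes the query and reports "Planning Time" and "Execution Time"
-- in the output. The filter range here is an example only.
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM trips
WHERE pickup_datetime >= '2015-01-01'
  AND pickup_datetime <  '2015-02-01';
```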
As can be seen in Fig. 3, parallelization settings can significantly improve the speed of queries run on the SSD database. For example, increasing the total parallel workers and the number of parallel workers per gather improved some SSD query times here by over 100 seconds. But one of the striking outcomes of this test is that none of the tested parallel configurations for the SSD database was ever able to outperform any of the parallel configurations on the identical database run on Forsa. The performance gap is particularly striking in the case of Query 1, where the fraction of the database returned by the query is small: Forsa queries returned in fewer than 3 seconds for all the parallel configurations tested, while none of the SSD parallel configurations returned Query 1 in fewer than 198 seconds. The fastest Query 1 test on Forsa executed in only 1.5 seconds.
And even more impressively, we can scale back the raw computing resources for Forsa significantly and still outperform the 16-CPU database on the SSD. For a final test, we reconfigured the Forsa virtual machine to have only 4 virtual CPU cores and 4 GB of system RAM (a 75% reduction in both CPU cores and system RAM for the database) and reran the same queries with various parallel settings (up to 4 parallel workers, and 4 workers per gather). All tested query times for the 16-core/16 GB RAM SSD database and the 4-core/4 GB RAM Forsa database are shown in the figure below versus the percentage of the database queried (the five columns of data points represent the 5 queries, in order). Even with only a quarter of the computational resources, the Forsa database still outperformed ALL parallel configurations of the 16-core SSD database! For the business audience, note that most commercial databases are licensed per core, so achieving higher performance on fewer cores allows for considerable cost savings.
So how is this kind of performance gap possible, when both systems are using identical hardware? The key is that database queries are typically bottlenecked by the database’s storage read operation performance. In other words, any conventional storage system’s ability to deliver information to the CPU cores for processing places a limit on the rate that the data can be processed. In contrast to conventional systems, Forsa enables the database to be stored entirely on DDR4 memory, which is a fundamentally faster medium for all read and write operations.
In addition to being inherently faster, the DDR4 memory channels on a server also feature much higher throughput read/write communication pipelines than any conventional solid-state or spinning-disk drive. With typical SSD storage, the workers cannot gather data to be processed any faster than the storage device can move data into memory. In contrast, using DDR4 RAM as a storage device enables the CPU cores to access stored data over a much wider communication bus. With Forsa, the high-throughput memory bus of a modern server increases CPU efficiency by reducing the amount of time each core spends idly waiting for data to be accessed from conventional storage. In this test, that enhanced CPU processing efficiency translates directly into much faster data queries while using significantly fewer CPU cores.
At Formulus Black, we believe exceptional performance doesn’t need to come at the cost of code modifications or specialized hardware. We are unlocking the power of In-Memory Compute for ALL applications. By breaking the barriers between memory and storage, we free workloads from slow I/O and peripheral buses. With Forsa, all data resides in-memory at all times, for a level of performance unmatchable by any SSD or other I/O-bound technology.
For more information visit: www.formulusblack.com.
Looker is a unified Platform for Data that delivers actionable business insights to every employee at the point of decision. Looker integrates data into the daily workflows of users to allow organizations to extract value from data at web scale. Over 1600 industry-leading and innovative companies such as Sony, Amazon, The Economist, IBM, Spotify, Etsy, Lyft and Kickstarter have trusted Looker to power their data-driven cultures. The company is headquartered in Santa Cruz, California, with offices in San Francisco, New York, Chicago, Boulder, London, Tokyo and Dublin, Ireland. Investors include CapitalG, Kleiner Perkins Caufield & Byers, Meritech Capital Partners, Redpoint Ventures and Goldman Sachs. For more information, connect with us on LinkedIn, Twitter, Facebook and YouTube or visit looker.com.
RAM: 187 GB RAM as system memory, allocated from 12x2x64 GB DDR4 Samsung 2400 MT/s RDIMMs (1.5 TB total installed memory)
Additional storage drives: none during this test. Additional SSDs would be required for data persistence, but they do not play a role in query performance.
Prior to each SQL query, the VM cache was wiped to ensure a consistent test environment.