How to optimize query performance in a large fact table with billions of rows? - Stack Overflow
I'm thinking of building a data warehouse for a retail company, and we have a fact table called Sales that contains billions of rows. The table stores transaction-level data with columns like transaction_id, product_id, customer_id, store_id, date, quantity_sold, and total_amount.
We also have several dimension tables: Products, Customers, Stores, and Dates.
The issue is that some of the analytical queries we run on this fact table are taking too long. For example, this query to calculate total sales by product for a specific date range runs for over 10 minutes:
SELECT
p.product_name,
SUM(s.total_amount) AS total_sales
FROM
Sales s
JOIN
Products p ON s.product_id = p.product_id
WHERE
s.date BETWEEN '2014-01-01' AND '2024-12-31'
GROUP BY
p.product_name
ORDER BY
total_sales DESC;
What I have tried so far:
- Partitioned the Sales table by the date column.
- Added indexes on commonly queried columns like product_id, customer_id, and date.
- Aggregated older data into summary tables for quick reporting.
1 Answer
This reformulation may help:
SELECT ( SELECT p.product_name FROM Products AS p
WHERE s.product_id = p.product_id ) AS product_name,
SUM(s.total_amount) AS total_sales
FROM Sales s
WHERE s.date BETWEEN '2014-01-01' AND '2024-12-31'
GROUP BY s.product_id
ORDER BY total_sales DESC;
This is more likely to help:
SELECT p.product_name,
ts.total_sales
FROM (
SELECT s.product_id,
SUM(s.total_amount) AS total_sales
FROM Sales s
WHERE s.date BETWEEN '2014-01-01' AND '2024-12-31'
GROUP BY s.product_id ) AS ts
JOIN Products AS p
USING (product_id)
ORDER BY total_sales DESC;
11 years of data? That sounds like a full table scan of all the billions of rows in Sales, then some easy lookups into Products (assuming PRIMARY KEY(product_id)), and finally a sort.
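One way to check that guess is to ask the optimizer for its plan. This is only a sketch: it assumes MySQL 5.7 or later (where plain EXPLAIN also reports partitions) and the column names from the question. The real output depends on the actual schema, partitioning, and indexes.

```sql
-- Show the execution plan without running the query.
-- In the result, type = ALL on Sales indicates a full table scan,
-- and the partitions column lists which partitions would be read.
EXPLAIN
SELECT s.product_id,
       SUM(s.total_amount) AS total_sales
FROM Sales AS s
WHERE s.date BETWEEN '2014-01-01' AND '2024-12-31'
GROUP BY s.product_id;
```

If the partitions column lists nearly every partition, pruning is not narrowing anything, because the date range covers almost all of the data.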
For next time, a much better approach (perhaps 10x faster) is to build and maintain a summary table. It should look something like this:
CREATE TABLE DailySummary (
product_id ... NOT NULL,
product_name ... NOT NULL,
dy DATE NOT NULL, -- or maybe grouped by weeks
days_sales DECIMAL(...) NOT NULL, -- SUM(s.total_amount) for the one dy
PRIMARY KEY(product_id, dy),
INDEX(dy, product_id)
) ENGINE=InnoDB;
It will take a big query (or a bunch of smaller queries) to initialize such a summary table, but after that, the desired query would be something like:
SELECT product_name,
SUM(days_sales) AS total_sales
FROM DailySummary
WHERE dy BETWEEN '2014-01-01' AND '2024-12-31'
GROUP BY product_name
ORDER BY total_sales DESC;
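As a rough sketch of the initial load and the ongoing upkeep, something like the following might work. It assumes MySQL, that Sales.date is a DATE column (a DATETIME would need DATE(s.date) instead), and that the job runs once a day. The one-year chunk size and the "yesterday" refresh window are placeholders, not part of the answer above.

```sql
-- Initial load, done in chunks (here, one year) to keep each transaction small.
-- Repeat with the next date range until all history is covered.
INSERT INTO DailySummary (product_id, product_name, dy, days_sales)
SELECT s.product_id,
       p.product_name,
       s.date,
       SUM(s.total_amount)
FROM Sales AS s
JOIN Products AS p USING (product_id)
WHERE s.date >= '2014-01-01'
  AND s.date <  '2015-01-01'
GROUP BY s.product_id, p.product_name, s.date;

-- Daily upkeep: recompute yesterday's totals and overwrite rows that already exist.
-- VALUES() in this position is deprecated in recent MySQL 8.0 releases but still accepted.
INSERT INTO DailySummary (product_id, product_name, dy, days_sales)
SELECT s.product_id,
       p.product_name,
       s.date,
       SUM(s.total_amount)
FROM Sales AS s
JOIN Products AS p USING (product_id)
WHERE s.date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY s.product_id, p.product_name, s.date
ON DUPLICATE KEY UPDATE days_sales = VALUES(days_sales);
```

Recomputing a whole day and overwriting it keeps the job safe to rerun, at the cost of rescanning that day's Sales rows each time.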