How to optimize query performance in a large fact table with billions of rows? - Stack Overflow


I'm thinking of building a data warehouse for a retail company, and we have a fact table called Sales that contains billions of rows. The table stores transaction-level data with columns like transaction_id, product_id, customer_id, store_id, date, quantity_sold, and total_amount.

We also have several dimension tables: Products, Customers, Stores, and Dates.

The issue is that some of the analytical queries we run on this fact table are taking too long. For example, this query to calculate total sales by product for a specific date range runs for over 10 minutes:

SELECT 
    p.product_name, 
    SUM(s.total_amount) AS total_sales
FROM 
    Sales s
JOIN 
    Products p ON s.product_id = p.product_id
WHERE 
    s.date BETWEEN '2014-01-01' AND '2024-12-31'
GROUP BY 
    p.product_name
ORDER BY 
    total_sales DESC;

I have tried:

  • Partitioned the Sales table by the date column (roughly as in the sketch after this list).
  • Added indexes on commonly queried columns like product_id, customer_id, and date.
  • Aggregated older data into summary tables for quick reporting.
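
Concretely, the partitioning and indexing look roughly like this (a simplified sketch, assuming MySQL/InnoDB; real column types and partition boundaries differ, and the primary key had to be widened because MySQL requires the partitioning column in every unique key):

-- Widen the PK so the partitioning column is part of it (MySQL requirement):
ALTER TABLE Sales
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (transaction_id, date);

-- Range-partition by date, one partition per year:
ALTER TABLE Sales
    PARTITION BY RANGE COLUMNS(date) (
        PARTITION p2014 VALUES LESS THAN ('2015-01-01'),
        PARTITION p2015 VALUES LESS THAN ('2016-01-01'),
        -- one partition per year through 2024, then a catch-all:
        PARTITION pmax  VALUES LESS THAN (MAXVALUE)
    );

-- Secondary indexes on the commonly queried columns:
ALTER TABLE Sales
    ADD INDEX idx_product  (product_id),
    ADD INDEX idx_customer (customer_id),
    ADD INDEX idx_date     (date);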


1 Answer

This reformulation may help; it groups by product_id first, so product_name is looked up once per product rather than once per row:

SELECT  ( SELECT p.product_name FROM Products AS p
              WHERE s.product_id = p.product_id ) AS product_name,
        SUM(s.total_amount) AS total_sales
    FROM  Sales s
    WHERE s.date BETWEEN '2014-01-01' AND '2024-12-31'
    GROUP BY s.product_id
    ORDER BY total_sales DESC;

This is more likely to help:

SELECT p.product_name,
       ts.total_sales
    FROM  (
        SELECT  s.product_id,
                SUM(s.total_amount) AS total_sales
            FROM  Sales s
        WHERE s.date BETWEEN '2014-01-01' AND '2024-12-31'
        GROUP BY s.product_id ) AS ts
    JOIN Products AS p
    USING (product_id)
    ORDER BY total_sales DESC;

11 years of data? That sounds like a full table scan of all of the billions of rows of Sales, then some easy lookups into Products (assuming PRIMARY KEY(product_id)), and finally a sort.
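
To see where the time actually goes, run the aggregation part under EXPLAIN (or EXPLAIN ANALYZE on MySQL 8.0.18+); it should show whether you get a partition-pruned range scan or a full scan plus filesort:

EXPLAIN ANALYZE
SELECT  s.product_id,
        SUM(s.total_amount) AS total_sales
    FROM  Sales s
    WHERE s.date BETWEEN '2014-01-01' AND '2024-12-31'
    GROUP BY s.product_id
    ORDER BY total_sales DESC;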

For next time, a much better approach (perhaps 10x faster) is to build and maintain a Summary Table. It should contain something like:

CREATE TABLE DailySummary (
    product_id ... NOT NULL,
    product_name ... NOT NULL,
    dy DATE NOT NULL,          -- or maybe grouped by weeks
    days_sales DECIMAL(...) NOT NULL,  -- SUM(s.total_amount) for the one dy
    PRIMARY KEY(product_id, dy),
    INDEX(dy, product_id)
) ENGINE=InnoDB;

It will take a big query (or a bunch of smaller queries) to initialize such a summary table, but after that, the desired query would be something like:

SELECT  product_name,
        SUM(days_sales) AS total_sales
    FROM  DailySummary
WHERE dy BETWEEN '2014-01-01' AND '2024-12-31'
GROUP BY product_name
ORDER BY total_sales DESC;
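
Initializing and then maintaining such a table might look roughly like this (a sketch; for billions of rows, back-fill in smaller date ranges rather than one giant statement):

-- One-time back-fill (run in chunks by date range in practice):
INSERT INTO DailySummary (product_id, product_name, dy, days_sales)
    SELECT  s.product_id,
            p.product_name,
            s.date,
            SUM(s.total_amount)
        FROM  Sales s
        JOIN  Products p  USING (product_id)
        GROUP BY s.product_id, s.date;

-- Nightly maintenance: fold in yesterday's transactions (re-runnable thanks to the PK):
INSERT INTO DailySummary (product_id, product_name, dy, days_sales)
    SELECT  s.product_id,
            p.product_name,
            s.date,
            SUM(s.total_amount)
        FROM  Sales s
        JOIN  Products p  USING (product_id)
        WHERE s.date = CURDATE() - INTERVAL 1 DAY
        GROUP BY s.product_id, s.date
    ON DUPLICATE KEY UPDATE days_sales = VALUES(days_sales);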