Alibaba Picks ParadeDB to Bring Full Text Search to its Postgres-Based Data Warehouse

By Ming Ying on September 24, 2024

Overview

About Alibaba Cloud

Alibaba Cloud is the cloud computing arm of Chinese tech giant Alibaba Group and is the largest Asia Pacific cloud provider.

One of Alibaba Cloud’s products, AnalyticDB for PostgreSQL, is a data warehouse built on Postgres. AnalyticDB for PostgreSQL uses a distributed architecture known as MPP (massively parallel processing) to process petabytes of data with high concurrency and low latency.

Challenge

Alibaba Needed a Full Text Search Solution That Could Handle Petabytes of Data

Prior to integrating with ParadeDB, AnalyticDB for PostgreSQL’s full text search (FTS) capabilities were limited to Postgres native full text search, which uses the tsvector type and GIN indexing.

As large enterprise customers onboarded with AnalyticDB for PostgreSQL, many of them had sophisticated full text search needs that were not met by tsvector:

  1. Over multi-terabyte tables, tsvector performance degraded significantly. Alibaba's customers needed a search engine that could meet or exceed the query latency and throughput of Lucene in high concurrency scenarios over multi-terabyte tables.
  2. Without BM25 scoring, tsvector ranked results poorly because it does not factor in important variables like term frequency.
  3. Common advanced search queries like fuzzy search, relevance tuning, and faceting were not supported.
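
To make the ranking point in item 2 concrete, here is a minimal sketch of Okapi BM25 scoring in Python. It is a textbook formulation with commonly used defaults (`k1=1.2`, `b=0.75`), not the code that ParadeDB or Lucene runs, and the function name and toy documents are made up for illustration. It shows the three signals BM25 combines: how often a term appears in a document, how rare the term is across the corpus, and how long the document is relative to the average.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Illustrative only: k1 and b are common defaults, not values taken
    from the article.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, already tokenized.
docs = [
    "postgres full text search".split(),
    "search search search engine".split(),
    "data warehouse on postgres".split(),
]
scores = bm25_scores(["postgres", "search"], docs)
```

In this toy corpus the first document, which matches both query terms once, outranks the second document, which repeats a single query term three times. Repetition helps, but with diminishing returns.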

Solution

ParadeDB Delivers Full Postgres Compatibility with Zero Infrastructure Overhead

Because AnalyticDB for PostgreSQL is fully Postgres-compatible, offloading search to an external service like Elasticsearch was not a viable option.

With ParadeDB, AnalyticDB's search indexes are baked directly into Postgres. This means that ParadeDB gives Alibaba Elastic-quality search and all the features of Postgres, including:

  1. Full integration with Postgres’ native backup, restore, and high availability functionalities.
  2. First-class JOIN support so data does not need to be denormalized.
  3. Careful integration with Postgres’ query planner so that complex full text search queries can be inspected and optimized.

ParadeDB's SQL-like query syntax is friendly to our users. Data development engineers can quickly master and apply it to application systems.

Pang Bo, Product Manager

Outcomes

ParadeDB Delivers 5X Better Performance per Core Compared to Lucene

Alibaba ran an extensive 60-day evaluation process, during which they benchmarked ParadeDB against Lucene (Elasticsearch's underlying search engine) over a corpus of 100 million Wikipedia documents. Alibaba measured four criteria: index build time, index size on disk, throughput (queries per second), and latency (round trip time in milliseconds).
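
As a rough illustration of how throughput and latency can be measured under a fixed number of concurrent readers, here is a minimal Python sketch. It is not Alibaba's benchmark harness: `run_query` is a hypothetical stand-in that only sleeps, and the query count is arbitrary. A real test would replace it with a call to the search engine under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query():
    """Hypothetical stand-in for one search request (assumption)."""
    time.sleep(0.001)


def timed_call(fn):
    """Run fn once and return its round trip time in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000.0


def benchmark(fn, concurrency, total_queries):
    """Issue total_queries calls from `concurrency` worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, [fn] * total_queries))
    elapsed = time.perf_counter() - start
    return {
        "qps": total_queries / elapsed,
        "mean_latency_ms": sum(latencies) / len(latencies),
    }


result = benchmark(run_query, concurrency=40, total_queries=400)
print(result)
```

Repeating the run at several concurrency levels gives the two curves the case study reports: queries per second versus concurrency, and round trip time versus concurrency.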

Test Environment

Both Lucene and ParadeDB were run on identical machines with 4 CPU cores, 16GB RAM, and PL1 ESSDs. Each system was given four dedicated data nodes.

Index Build Time

ParadeDB indexed 100 million documents more than twice as quickly as Lucene.

Build Time (Minutes)

Index Size

Lucene and ParadeDB indexes consumed a similar amount of disk space.

Index Size (GB)

Throughput

With 40 concurrent readers, ParadeDB processed 5X as many queries per second as Lucene. The gap widened as the number of concurrent readers grew.

Queries per Second (QPS) vs. Concurrency

Latency

With 40 concurrent readers, ParadeDB’s round trip query times were 5X faster than Lucene’s.

Round Trip Time (ms) vs. Concurrency

This means that, with ParadeDB, Alibaba is able to meet the business-critical workloads of its most demanding customers.

ParadeDB has excellent performance and throughput in the field of Full Text Search, helping our clients achieve structured analysis and full-text retrieval using a pure Postgres engine.

Pang Bo, Product Manager