polyzos commented on code in PR #2669:
URL: https://github.com/apache/fluss/pull/2669#discussion_r3050020150


##########
website/docs/quickstart/User-Profile.md:
##########
@@ -0,0 +1,350 @@
+---
+title: Real-Time User Profile
+sidebar_position: 4
+---
+
+# Real-Time User Profile
+
+This tutorial demonstrates how to build a production-grade, real-time user 
profiling system using Apache Fluss. You will learn how to map high-cardinality 
string identifiers (like emails) to compact integers and aggregate user 
behavior directly in the storage layer with exactly-once guarantees.
+
+![arch](/img/user-profile.png)
+
+## How the System Works
+
+### Core Concepts
+
+- **Identity Mapping**: High-cardinality strings (emails) → compact 32-bit `INT` UIDs.
+- **Offloaded Aggregation**: Computation happens in Fluss TabletServers, 
keeping Flink state-free.
+- **Optimized Storage**: Native [RoaringBitmap](https://roaringbitmap.org/) 
support for sub-second unique visitor (UV) counts.
+
+### Data Flow
+
+1. **Ingestion**: Raw event streams (e.g., clicks, page views) are generated 
by the Faker connector.
+2. **Mapping**: The [Apache Flink](https://flink.apache.org/) job performs a 
temporal lookup join against the `user_dict` table. If a user is new, the 
`insert-if-not-exists` hint triggers Fluss to automatically generate a unique 
`INT` UID using its Auto-Increment feature.
+3. **Merge**: The **Aggregation Merge Engine** updates clicks and bitmaps in 
the storage layer.
+4. **Recovery**: The **Undo Recovery** mechanism ensures exactly-once accuracy 
during failovers.
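+
+The insert-if-not-exists mapping in step 2 can be sketched outside Fluss: a UID is minted only the first time an email is seen, and every later sighting reuses it. Here is a toy shell/awk analogy (the sample emails are made up; the real mapping runs inside Fluss with transactional guarantees):
+
+```shell
+# First sighting of an email mints a new UID; repeats reuse the existing one.
+printf '%s\n' alice@example.com bob@example.com alice@example.com \
+  | awk '{ if (!($0 in uid)) uid[$0] = ++n; print $0, uid[$0] }'
+```
+
+Here `alice@example.com` maps to UID 1 on both sightings, while `bob@example.com` gets UID 2.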
+
+## Environment Setup
+
+### Prerequisites
+
+Before proceeding, ensure that 
[Docker](https://docs.docker.com/engine/install/) and the [Docker Compose 
plugin](https://docs.docker.com/compose/install/linux/) are installed.
+
+### Starting the Playground
+
+1. Create a working directory.
+   ```shell
+   mkdir fluss-user-profile
+   cd fluss-user-profile
+   ```
+
+2. Set the version environment variables.
+   ```shell
+   export FLUSS_DOCKER_VERSION=0.9.0-incubating
+   export FLINK_VERSION="1.20"
+   ```
+   :::note
+   If you open a new terminal window, remember to re-run these export commands.
+   :::
+
+3. Create a `lib` directory and download the required JARs.
+   ```shell
+   mkdir lib
+
+   # Download Flink Faker for data generation
+   curl -fL -o lib/flink-faker-0.5.3.jar \
+     
https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar
+
+   # Download Fluss Connector
+   curl -fL -o "lib/fluss-flink-${FLINK_VERSION}-${FLUSS_DOCKER_VERSION}.jar" \
+     
"https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-${FLINK_VERSION}/${FLUSS_DOCKER_VERSION}/fluss-flink-${FLINK_VERSION}-${FLUSS_DOCKER_VERSION}.jar";
+
+   # Download Bitmap UDFs (TO_RBM and RB_CARDINALITY)
+   curl -fL -o lib/fluss-bitmap-udfs-1.0.0.jar \
+     
https://github.com/Prajwal-banakar/flink-roaringbitmap/releases/download/v1.0.0/fluss-bitmap-udfs-1.0.0.jar
+   ```
+
+   :::note
+   The `flink-roaringbitmap` JAR provides the `TO_RBM` and `RB_CARDINALITY` 
functions needed to work with RoaringBitmap columns. The source code is 
available at 
[flink-roaringbitmap](https://github.com/Prajwal-banakar/flink-roaringbitmap).
+   :::
+
+4. Verify all three JARs downloaded correctly.
+   ```shell
+   ls -lh lib/
+   ```
+   You should see three files: `flink-faker-0.5.3.jar`, 
`fluss-flink-1.20-0.9.0-incubating.jar`, and `fluss-bitmap-udfs-1.0.0.jar`.
+
+5. Create a `docker-compose.yml` file with the following content.
+
+:::tip
+Create the file using the `cat` heredoc command to avoid indentation issues:
+:::
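+
+For example, the heredoc pattern looks like this; the quoted `'EOF'` delimiter prevents the shell from expanding `${FLUSS_DOCKER_VERSION}` and `${FLINK_VERSION}` at file-creation time, leaving them in the file for Docker Compose to resolve from your exported environment (the placeholder comment stands in for the full YAML below):
+
+```shell
+# Write docker-compose.yml without shell expansion: the quoted 'EOF' delimiter
+# leaves ${FLUSS_DOCKER_VERSION} and ${FLINK_VERSION} for Docker Compose to resolve.
+cat > docker-compose.yml <<'EOF'
+# paste the full YAML from this step here, keeping the indentation exactly
+EOF
+```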
+
+   ```yaml
+   services:
+     coordinator-server:
+       image: apache/fluss:${FLUSS_DOCKER_VERSION}
+       command: coordinatorServer
+       depends_on:
+         - zookeeper
+       environment:
+         - |
+           FLUSS_PROPERTIES=
+           zookeeper.address: zookeeper:2181
+           bind.listeners: FLUSS://coordinator-server:9123
+           remote.data.dir: /remote-data
+       volumes:
+         - fluss-remote-data:/remote-data
+     tablet-server:
+       image: apache/fluss:${FLUSS_DOCKER_VERSION}
+       command: tabletServer
+       depends_on:
+         - coordinator-server
+       environment:
+         - |
+           FLUSS_PROPERTIES=
+           zookeeper.address: zookeeper:2181
+           bind.listeners: FLUSS://tablet-server:9123
+           data.dir: /tmp/fluss/data
+           remote.data.dir: /remote-data
+       volumes:
+         - fluss-remote-data:/remote-data
+     zookeeper:
+       restart: always
+       image: zookeeper:3.9.2
+     jobmanager:
+       image: flink:${FLINK_VERSION}
+       ports:
+         - "8081:8081"
+       environment:
+         - |
+           FLINK_PROPERTIES=
+           jobmanager.rpc.address: jobmanager
+       entrypoint: ["sh", "-c", "cp -v /tmp/lib/*.jar /opt/flink/lib && exec 
/docker-entrypoint.sh jobmanager"]
+       volumes:
+         - ./lib:/tmp/lib
+         - fluss-remote-data:/remote-data
+     taskmanager:
+       image: flink:${FLINK_VERSION}
+       depends_on:
+         - jobmanager
+       environment:
+         - |
+           FLINK_PROPERTIES=
+           jobmanager.rpc.address: jobmanager
+           taskmanager.numberOfTaskSlots: 2
+       entrypoint: ["sh", "-c", "cp -v /tmp/lib/*.jar /opt/flink/lib && exec 
/docker-entrypoint.sh taskmanager"]
+       volumes:
+         - ./lib:/tmp/lib
+         - fluss-remote-data:/remote-data
+     sql-client:
+       image: flink:${FLINK_VERSION}
+       depends_on:
+         - jobmanager
+       environment:
+         - |
+           FLINK_PROPERTIES=
+           jobmanager.rpc.address: jobmanager
+           rest.address: jobmanager
+       entrypoint: ["sh", "-c", "cp -v /tmp/lib/*.jar /opt/flink/lib && exec 
/docker-entrypoint.sh bin/sql-client.sh"]
+       volumes:
+         - ./lib:/tmp/lib
+         - fluss-remote-data:/remote-data
+
+   volumes:
+     fluss-remote-data:
+   ```
+
+   :::note
+   Make sure the `volumes:` section at the bottom has **no leading spaces**; it must be flush with the left margin.
+   :::
+
+6. Start the environment.
+   ```shell
+   docker compose up -d
+   ```
+
+7. Confirm all containers are running.
+   ```shell
+   docker compose ps
+   ```
+   You should see `coordinator-server`, `tablet-server`, `zookeeper`, 
`jobmanager`, and `taskmanager` all in the `running` state.
+
+8. Launch the Flink SQL Client.
+   ```shell
+   docker compose run sql-client
+   ```
+
+## Step 1: Create the Fluss Catalog
+
+In the SQL Client, run the following statements.
+
+:::tip
+Run SQL statements one by one to avoid errors.
+:::
+
+```sql
+CREATE CATALOG fluss_catalog WITH (
+    'type' = 'fluss',
+    'bootstrap.servers' = 'coordinator-server:9123'
+);
+```
+
+```sql
+USE CATALOG fluss_catalog;
+```
+
+## Step 2: Register the Bitmap UDFs
+
+Register the `TO_RBM` and `RB_CARDINALITY` functions. These are required to 
correctly serialize user IDs into RoaringBitmap format before inserting into 
Fluss, and to read back the unique visitor count as a human-readable integer.
+
+```sql
+CREATE TEMPORARY FUNCTION TO_RBM
+    AS 'org.apache.fluss.udfs.ToRbm';
+```
+
+```sql
+CREATE TEMPORARY FUNCTION RB_CARDINALITY
+    AS 'org.apache.fluss.udfs.RbCardinality';
+```
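+
+Semantically, `RB_CARDINALITY` returns how many distinct UIDs a bitmap holds, which is exactly the UV count. A rough shell analogy of that distinct count (the real function operates on serialized RoaringBitmap bytes inside Flink):
+
+```shell
+# UV count = number of distinct UIDs, no matter how often each one appears.
+printf '%s\n' 101 102 101 103 102 | sort -u | wc -l
+```
+
+The input contains three distinct UIDs (101, 102, 103), so the count is 3.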
+
+## Step 3: Create the User Dictionary Table
+
+Create the `user_dict` table to map emails to UIDs. The 
`auto-increment.fields` property instructs Fluss to automatically generate a 
unique `INT` UID for every new email it receives.
+
+```sql
+CREATE TABLE user_dict (
+    email STRING,
+    uid   INT,
+    PRIMARY KEY (email) NOT ENFORCED
+) WITH (
+    'connector'             = 'fluss',

Review Comment:
   Since we are already in the catalog, we don't need to specify the connector 
for every `CREATE TABLE` statement.



##########
website/docs/quickstart/User-Profile.md:
##########
@@ -0,0 +1,350 @@
+## Step 3: Create the User Dictionary Table
+
+Create the `user_dict` table to map emails to UIDs. The 
`auto-increment.fields` property instructs Fluss to automatically generate a 
unique `INT` UID for every new email it receives.
+
+```sql
+CREATE TABLE user_dict (
+    email STRING,
+    uid   INT,
+    PRIMARY KEY (email) NOT ENFORCED
+) WITH (
+    'connector'             = 'fluss',
+    'auto-increment.fields' = 'uid',
+    'bucket.num'            = '1'

Review Comment:
   `bucket.num = 1` is the default for every table, so again we can remove this 
from every `CREATE TABLE` statement.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to