(camel-website) 02/02: Revert "Add blog post about unstructured data extraction experiments with Camel"

aldettinger Fri, 19 Jul 2024 07:00:12 -0700

This is an automated email from the ASF dual-hosted git repository.

aldettinger pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/camel-website.git


commit 7799958fc0850f8e64c7eb9134b4ef1e255058c2
Author: aldettinger <aldettin...@gmail.com>
AuthorDate: Fri Jul 19 15:59:29 2024 +0200

    Revert "Add blog post about unstructured data extraction experiments with 
Camel"
    
    This reverts commit 69eba96986d99264495b9edf7cea94438dfddf46.
---
 .../data-extraction-first-experiment/featured.jpg  | Bin 370479 -> 0 bytes
 .../07/data-extraction-first-experiment/index.md   | 176 ---------------------
 2 files changed, 176 deletions(-)

diff --git a/content/blog/2024/07/data-extraction-first-experiment/featured.jpg 
b/content/blog/2024/07/data-extraction-first-experiment/featured.jpg
deleted file mode 100644
index bf2f03f3..00000000
Binary files 
a/content/blog/2024/07/data-extraction-first-experiment/featured.jpg and 
/dev/null differ
diff --git a/content/blog/2024/07/data-extraction-first-experiment/index.md 
b/content/blog/2024/07/data-extraction-first-experiment/index.md
deleted file mode 100644
index 4519cecd..00000000
--- a/content/blog/2024/07/data-extraction-first-experiment/index.md
+++ /dev/null
@@ -1,176 +0,0 @@
----
-title: "Experimenting extraction from unstructured data with Apache Camel and 
Langchain4j"
-date: 2024-07-19
-draft: false
-authors: [aldettinger]
-categories: ["Camel", "AI"]
-preview: "Give directions about how to turn unstructured data into structured 
data with Camel and Langchain4j."
----
-
-This blog is based on experiments done about extracting structured data into 
its structured counterpart. More precisely in this post, we'll give
-directions about how to convert a conversation transcript into a Java object.
-
-# Overview
-
-Reading articles like 
[this](https://www.perfect-memory.com/unlock-the-potential-of-unstructured-data/)
 over the net, it seems that folks have a lot of unstructured data at disposal 
while not being able to take advantage on it. So probably, in the future we 
might expect to deal more and more with unstructured data extraction in 
integration flow.
-
-From there, I started to experiment about ways to do it with Apache Camel. In 
this article, I don't come up with full packaged examples but still can share 
some directions about how to do it. Precisely, we'll use the 
[langchain4j](https://github.com/langchain4j/langchain4j) high level api in 
conjunction with camel [bean binding 
capabilities](https://camel.apache.org/manual/bean-binding.html).
-
-Let's share some directions.
-
-# Serve the model from a local container
-
-Throughout this article, we'll stress the importance of JSON to achieve our 
goal.
-And it starts here by choosing a model that has knowledge about JSON.
-
-Let's run a `codellama` container locally:
-
-```bash
-docker run -p 11434:11434 langchain4j/ollama-codellama:latest
-```
-
-# Set up the langchain4j chat model
-
-In order to request the served model from our Camel application, we need to 
setup the chat model based on [langchain4j 
instructions](https://docs.langchain4j.dev/integrations/language-models/ollama/).
-
-Mainly, we add the `langchain4j-ollama` dependency:
-
-```xml
-<dependency>
-            <groupId>dev.langchain4j</groupId>
-            <artifactId>langchain4j-ollama</artifactId>
-            <version>${langchain.version}</version>
-</dependency>
-```
-
-Then create a chat model:
-
-```java
-ChatLanguageModel model = OllamaChatModel.builder()
-                .baseUrl(modelServingUrl)
-                .modelName(MODEL_NAME)
-                .temperature(0.0)
-                .format("json")
-                .timeout(Duration.ofMinutes(1L))
-                .build();
-```
-
-See how we lower the temperature to `0.0` in order to reduce the variability 
of the LLM answers.
-Another key aspect, is that we configure the model to output JSON only which 
greatly reduces the problem space.
-
-# Define the extraction service
-
-Langchain4j offers [some 
examples](https://docs.langchain4j.dev/tutorials/structured-data-extraction/) 
about how to declare data extraction service with the high level api.
-
-When extracting POJOs, we need to define the structure of the data we would 
like to extract as a class:
-
-```java
-static class CustomPojo {
-  private boolean customerSatisfied;
-  private String customerName;
-  private LocalDate customerBirthday;
-  private String summary;
-}
-```
-
-See how we could mix different sort of information that langchain4j will stuff 
from the JSON output produced by the served model.
-
-Then, we define the extraction service contract:
-
-```java
-interface CamelCustomPojoExtractor {
-  @UserMessage(
-    "Extract information about a customer from the text delimited by triple 
backticks: ```{{text}}```." +
-    "The customerBirthday field should be formatted as YYYY-MM-DD." +
-    "The summary field should concisely relate the customer main ask."
-  )
-  CustomPojo extractFromText(@V("text") String text);
-}
-```
-
-As we return a custom POJO, langchain4j will automatically instructs the LLM 
to produce a valid JSON according to the needed schema.
-Notice how we are able to complete langchain4j instructions with the 
`@UserMessage` annotation where we define the date format output.
-
-As a last step, we create the extraction service and register it in the 
registry:
-
-```java
-@Override
-protected RouteBuilder createRouteBuilder() {
-  ...
-  CamelCustomPojoExtractor extractionService = 
AiServices.create(CamelCustomPojoExtractor.class, chatLanguageModel);
-  this.context.getRegistry().bind("extractionService", extractionService);
-  ...
-}
-```
-
-# Invoke the extraction service from a route
-
-```
-@Override
-protected RouteBuilder createRouteBuilder() {
-...
-  from("...")
-    .bean(extractionService)
-    .bean(prettyPrintCustomPojo);
-...
-}
-```
-
-Now, using bean binding, Camel able to map any textual incoming body and 
inject it as the first parameter of the extraction service.
-The extracted `CustomPojo` could then be pretty printed with any home defined 
method.
-
-# Let's send a conversation transcript to the route
-
-The goodness with Camel is that conversation transcript could originate from a 
lot of systems given the high number of [components 
available](https://camel.apache.org/components/4.4.x/index.html).
-
-So, let's send a conversation transcript into the route:
-
-```
-Operator: Hello, how may I help you?
-Customer: Hello, I am currently at the police station because I've got an 
accident. The police would need a proof that I have an insurance. Could you 
please help me?
-Operator: Sure, could you please remind me your name and birth date?
-Customer: Of course, my name is Kate Hart and I was born on August the 
thirteen in the year nineteen ninety nine.
-Operator: I'm sorry Kate, but we don't have any contract in our records.
-Customer: Oh, I'm sorry that I've made a mistake. Actually, my last name is 
not Hart, but Boss. It changed since I'm married.
-Operator: Indeed, I have now found your contract and everything looks good. 
Shall I send the proof of insurance to the police station?
-Customer: Oh, if possible, my husband will go directly to your office in order 
to get it.
-Operator: Yes, that's possible. I will let the paper at the entrance. Your 
beloved could just ask it to the front desk.
-Customer: Many thanks. That's so helpful. I'm a bit more relieved now.
-Operator: Sure, you're welcome Kate. Please come back to us any time in case 
more information is needed. Bye.
-Customer: Bye.
-```
-
-Reading the whole discussion, we could realize that we have a few challenges 
at hand.
-
-For instance, the customer name is spread in different parts of the text.
-This is what is called the co-reference problem in the data extraction field.
-Worst than that, the customer is giving a wrong name at first, and then 
correcting it later on.
-So, we really need semantic capabilities here to unravel the situation.
-
-Let's process this conversation, after more or less 20 seconds on my machine, 
I'm provided with the result below:
-
-```
-customerSatisfied: true
-customerName: Kate Boss
-customerBirthday: 13 August 1999
-summary: Customer Kate Boss is satisfied with the assistance provided by the 
operator. The customer was able to provide their name and birth date correctly, 
and the operator was able to locate their insurance contract.
-```
-
-Hey, it looks pretty decent:
- + the customer satisfaction was detected has positive
- + the customer name was successfully corrected
- + the challenges around the date format were addressed and we have a valid 
`LocalDate` object
- + the summary is quite relevant
-
-# Conclusion
-
-At the end of the day, we were able to convert an unstructured conversation 
transcript into a structured POJO.
-
-The process under the hood contains multiple steps:
- + Camel receives the conversation transcript as a `String`
- + Camel bean invokes the `extractFromText` method passing the conversation as 
first parameter
- + Langchain4j injects the conversation into the LLM prompt via the 
`@V("text")` annotation and `{{text}}` placeholder
- + Langchain4j completes the prompt with the JSON schema automatically 
generated from the `CustomPojo` class
- + The codellama model served from the container generate the JSON completion
- + Langchain4j maps the provided JSON output into a `CustomPojo` instance
- + Camel bean is now able to pretty print the `CustomPojo` instance helped 
with the `prettyPrintCustomPojo` bean

(camel-website) 02/02: Revert "Add blog post about unstructured data extraction experiments with Camel"

Reply via email to