Decentralized RAG 101 with OriginTrail DKG and Google Gemini

OriginTrail
11 min read · Apr 22, 2024

Retrieval Augmented Generation (RAG) has established itself as a key paradigm for builders in the AI space looking to feed LLMs with a specific context and datasets. The term RAG was coined by Patrick Lewis in the Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks paper, introducing it as a technique for enhancing the accuracy and reliability of GenAI models with facts fetched from external sources. It allows AI solutions to dynamically fetch relevant information before the generation process, enhancing the accuracy of responses by limiting the generation to re-working the retrieved inputs.

As a growing number of AI systems are driven to utilize RAG, their builders are creating and curating valuable knowledge bases for their RAG pipelines. This opens up a tremendous opportunity to connect these individual knowledge bases, enabling the sharing of knowledge and value between AI systems in a decentralized way. The opportunity is analogous to those seized by earlier advances in networking, such as Ethernet and the Internet, which demonstrated the tremendous value generated through network effects, famously articulated in Metcalfe’s law.

This vision, embodied in the concept of the Verifiable Internet for AI, was articulated in a recent white paper. It is centered around the paradigm of Decentralized Retrieval Augmented Generation (dRAG) as a way to publish and retrieve structured knowledge from multiple sources for AI systems. dRAG focuses on the discoverability of knowledge across knowledge bases, cryptographic verifiability of data, and maintaining ownership of knowledge assets with user-defined access control.

In this blog post, we will showcase how to implement a basic Decentralized Retrieval Augmented Generation system using Google Gemini and the OriginTrail Decentralized Knowledge Graph. This approach was recently showcased at the Google + OriginTrail meetup at Google’s Amsterdam offices, with the full recording available below, including a special appearance from Dr. Bob Metcalfe himself, who joined to offer his perspective.

Enter Decentralized Retrieval Augmented Generation (dRAG)

The dRAG framework advances the RAG approach by leveraging the Decentralized Knowledge Graph (DKG), organizing knowledge in a network of Knowledge Assets. Each Knowledge Asset contains graph data and/or vector embeddings, immutability proofs, a Decentralized Identifier (DID), and an ownership NFT. When connected in the permissionless DKG, the following capabilities are enabled:

  • Knowledge Graphs — structural knowledge in knowledge graphs allows a hybrid of neural and symbolic AI methodologies, enhancing the GenAI models with deterministic inputs.
  • Ownership — dRAG uses input from Knowledge Assets that have an owner that can manage access to the data contained in the Knowledge Asset.
  • Verifiability — every piece of knowledge on the DKG has cryptographic proofs published ensuring that no tampering has occurred since it was published.
dRAG framework application architecture

dRAG with Google Gemini and OriginTrail DKG

Today we will focus our dRAG example on art assets, using Google Gemini via the Google Vertex AI platform. We will query data from the OriginTrail Decentralized Knowledge Graph (DKG), which contains easily discoverable and verifiable Knowledge Assets. Each Knowledge Asset contains graph data, immutability proofs, a Decentralized Identifier (DID), and an NFT, which means that you can track the complete history of a Knowledge Asset on the blockchain. Once we retrieve relevant Knowledge Assets from the DKG, we will feed them to the Google Gemini LLM to generate a response.

Prerequisites

  • A GCP/Google Vertex AI account.
  • Access to an OriginTrail DKG node. Please visit the official docs to learn how to run one.
  • A Python project with a virtual environment set up.

Step 1 — Setting up a Python Project

In this step, assuming you have an empty Python project ready, as well as an empty Google Cloud project, you’ll install the necessary packages using pip and set up the credentials for Google Cloud.

Navigate to your Python project’s environment and run the following command to install the packages:

pip install dkg requests google-cloud-aiplatform python-dotenv

You’ll then need to authenticate with Google Cloud. Run the following commands in your shell, replacing the value with your project ID:

gcloud config set project "your-project-goes-here"
gcloud auth application-default login
gcloud init

You will be asked to choose the Google Vertex project you want to use and to confirm that you have the required permissions in it. The roles you need are:

  • AI Platform Admin
  • Vertex AI Administrator

You can now move on to setting up dkg.py, the Python SDK for connecting to the OriginTrail Decentralized Knowledge Graph.

Step 2 — Connecting to the DKG using dkg.py

In this step, you’ll set up the environment variables which will hold the necessary keys for connecting to the OriginTrail DKG using dkg.py. Then, you’ll connect to the DKG and print the version of the node you’re connected to.

Create a .env file and add the following lines:

JWT_TOKEN="your_jwt_token"
OT_NODE_HOSTNAME="your_ot_node_hostname"
PRIVATE_KEY="your_private_key"

The JWT_TOKEN is used to authenticate to your DKG node, the OT_NODE_HOSTNAME is the API endpoint for the node, and the PRIVATE_KEY represents the private key of a blockchain address that is equipped with TRAC tokens and appropriate gas tokens (NEURO for Neuroweb, xDAI for Gnosis, etc). For more information on how to obtain tokens, refer to the documentation.

Replace the values with your own: the token and hostname can be found in the configuration file of your OT Node, and the private key is your wallet’s, needed to publish Knowledge Assets. Keep in mind that this file should be kept private, as it contains private keys. When you’re done, save and close the file.
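Since a missing or empty variable will only surface later as a confusing connection error, it can help to fail fast on incomplete configuration. Below is a minimal sketch; the check_env helper is our own illustration, not part of dkg.py:

```python
import os

REQUIRED_VARS = ["JWT_TOKEN", "OT_NODE_HOSTNAME", "PRIVATE_KEY"]

def check_env(env, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]

# Warn early instead of failing deep inside the DKG connection logic
missing = check_env(dict(os.environ))
if missing:
    print(f"Missing configuration: {', '.join(missing)}")
```

Call this right after load_dotenv() so a typo in the .env file is caught before any network call is made.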

Then, create a Python file and add the following code to connect to the DKG:

from dkg import DKG
from dkg.providers import BlockchainProvider, NodeHTTPProvider
from dotenv import load_dotenv
import os
import json

dotenv_path = './.env' # Update if placed somewhere else
load_dotenv(dotenv_path)
jwt_token = os.getenv('JWT_TOKEN')
ot_node_hostname = os.getenv('OT_NODE_HOSTNAME')
private_key = os.getenv('PRIVATE_KEY')

node_provider = NodeHTTPProvider(ot_node_hostname, jwt_token)
blockchain_provider = BlockchainProvider("mainnet", "otp:2043", private_key=private_key)

dkg = DKG(node_provider, blockchain_provider)
print(dkg.node.info)

Here, you first import the required classes and packages. Then, you load the values from .env and instantiate a NodeHTTPProvider and BlockchainProvider with those values, which you pass into the DKG constructor, creating the dkg object for communicating with the graph.

If all credentials and values are correct, the output will show you the version that your OT Node is running on:

{'version': '6.2.3'}

If you see such a version response, that means you have successfully connected to the DKG!

Step 3 — Making a retrieval query with Gemini

For this dRAG example, we will use the Google Gemini LLM to generate a SPARQL query for retrieving relevant knowledge from the DKG. SPARQL is a standardized query language for knowledge graphs, similar in spirit to SQL, and you can use it to query connected public data across all the nodes on the DKG. Just like SQL, it has SELECT and WHERE clauses, so as long as you’re familiar with SQL, you should be able to follow the structure of the queries quite well.

The data that you’ll be querying is related to artworks, stored in the DKG as Knowledge Assets. Each Knowledge Asset contains information such as name, description, artform, and author.
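If you would like to publish artwork Knowledge Assets of your own, the content is plain JSON-LD assembled as a Python dictionary. The sketch below uses made-up sample values; the publishing call is shown only as a comment, because the exact dkg.py API (here assumed to be dkg.asset.create) and its parameters may differ between SDK versions, so consult the official docs:

```python
# Assemble the JSON-LD content for an artwork Knowledge Asset
# (all field values below are hypothetical samples).
artwork = {
    "@context": "http://schema.org",
    "@type": "VisualArtwork",
    "name": "Example Artwork",
    "description": "A sample artwork used for illustration.",
    "artform": "Painting",
    "author": {"@type": "Person", "name": "Example Artist"},
}

# Publishing to the DKG would then look roughly like this
# (API shape assumed -- check the dkg.py documentation):
# dkg.asset.create({"public": artwork}, epochs_num=2)
```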

First, you’ll need to instruct the Google Gemini LLM on what to do:

instruction_message = '''
I am working on a project involving artworks and their related data. I have a schema in JSON-LD format that outlines the structure and relationships of the data I am dealing with. Based on this schema, I need to construct a SPARQL query to retrieve specific information from a dataset that follows this schema.

The schema is focused on artworks and includes various properties such as the artist, description, artform and author among others. My goal with the SPARQL queries is to retrieve data from the graph about the artworks, based on the natural language question that the user posed.

Here's an example of an artwork in JSON-LD format:

{
    "@context": "http://schema.org",
    "@type": "VisualArtwork",
    "@id": "https://origintrail.io/images/otworld/1fc7cb79f299ee4.jpg",
    "name": "The Last Supper",
    "description": "The Last Supper is an iconic Renaissance fresco by Leonardo Da Vinci.",
    "artform": "Painting",
    "author": {
        "@type": "Person",
        "name": "Leonardo da Vinci"
    },
    "image": "https://origintrail.io/images/otworld/1fc7cb79f299ee4.jpg",
    "keywords": [
        "The Last Supper",
        "Leonardo da Vinci",
        "Renaissance",
        "fresco",
        "religious art"
    ],
    "publisher": {
        "@type": "Person",
        "name": "dkgbrka"
    }
}

Here's an example of a query to find artworks from publisher "BranaRakic":

```sparql
PREFIX schema: <http://schema.org/>

SELECT ?artwork ?name ?ual
WHERE {
    ?artwork a schema:VisualArtwork ;
    GRAPH ?g {
        ?artwork schema:publisher/schema:name "BranaRakic" ;
                 schema:name ?name .
    }
    ?ual schema:assertion ?g
    FILTER(CONTAINS(str(?ual), "2043"))
}```

Pay attention to retrieving the UAL; this is a mandatory step in all of your queries. After getting the artwork with '?artwork a schema:VisualArtwork ;' you should wrap the following conditions in GRAPH ?g { }, and later use the retrieved graph (g) to get the UAL, as in the example above.

Make sure you ALWAYS retrieve the UAL no matter what the user asks for and filter whether it contains "2043".

Make sure you ONLY return the SPARQL query without any extra output.

If you understood the assignment, say 'Yes' and I will proceed with a natural language question which you should convert to a SPARQL query.'''

instruction_understood_message = "Yes."

The instruction_message prompt contains the instructions in natural language. Here we provide the model with the expected schema of an artwork object (in JSON-LD notation, based on Schema.org) and an example SPARQL query. We also instruct it to pay attention to the examples and to return nothing except the SPARQL query. Feel free to try other queries on your own and filter on any property, including the identity of the Knowledge Asset’s owner, its publisher, and so on.

You can now define the chat history to let Gemini know that the instructions precede the actual prompts:

from vertexai.preview.generative_models import GenerativeModel, ChatSession, Content, Part, GenerationConfig

chat_history = [
    Content(parts=[Part.from_text(instruction_message)], role="user"),
    Content(parts=[Part.from_text(instruction_understood_message)], role="model"),
]

Then, define two helper functions and instantiate the Gemini model with the chat history. The temperature is set to 0 to reduce the model’s creativity and minimize hallucination:

def get_chat_response(chat: ChatSession, prompt: str) -> str:
    response = chat.send_message(prompt, generation_config=GenerationConfig(temperature=0))
    print(response)

    return response.candidates[0].content.parts[0].text

def clean_sparql_query(input_string):
    if input_string.startswith("```sparql") and input_string.endswith("```"):
        return input_string[9:-3].strip()
    elif input_string.startswith("```") and input_string.endswith("```"):
        return input_string[3:-3].strip()
    else:
        return input_string

gemini_pro_model = GenerativeModel("gemini-1.0-pro-001", generation_config=GenerationConfig(temperature=0))
chat = gemini_pro_model.start_chat(history=chat_history)

The clean_sparql_query() function removes the Markdown code fences (backticks) that the model may wrap around the returned query.

You can now generate SPARQL for searching the DKG using natural language prompts:

question = "Provide me with all the artworks published by Google Demo Amsterdam"
print(get_chat_response(chat, question))

query = clean_sparql_query(get_chat_response(chat, question))
print(query)

Now that you have a query, you can get results from the DKG. An example query that would be returned for the shown prompt looks like this:

PREFIX schema: <http://schema.org/>

SELECT ?artwork ?name ?ual
WHERE {
    ?artwork a schema:VisualArtwork ;
    GRAPH ?g {
        ?artwork schema:publisher/schema:name "Google Demo Amsterdam" ;
                 schema:name ?name .
    }
    ?ual schema:assertion ?g
    FILTER(CONTAINS(str(?ual), "2043"))
}
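LLM output can drift from the prompt’s requirements, so before executing a generated query it can be worth checking that it actually selects the UAL and applies the "2043" filter, as the instructions demand. The helper below is a hypothetical sketch of our own, not part of dkg.py:

```python
def query_follows_instructions(sparql: str) -> bool:
    """Heuristic check that a generated SPARQL query selects the UAL
    variable and applies the blockchain ID filter the prompt asked for."""
    return "?ual" in sparql and '"2043"' in sparql

example = '''PREFIX schema: <http://schema.org/>
SELECT ?artwork ?name ?ual
WHERE { ?artwork a schema:VisualArtwork ;
    GRAPH ?g { ?artwork schema:name ?name . }
    ?ual schema:assertion ?g
    FILTER(CONTAINS(str(?ual), "2043")) }'''

print(query_follows_instructions(example))  # True
```

If the check fails, you could re-prompt the model rather than send a malformed query to the node.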

Step 4 — Retrieval from the DKG with the generated query

Querying the DKG is very easy with SPARQL. You only need to specify the query and the repository to search:

query_result = dkg.graph.query(query, "privateCurrent")
print(query_result)

For completeness, we use the privateCurrent repository, which ensures that the SPARQL query retrieves both public and private data (if any is present on our node) from Knowledge Assets in the DKG.

An example result for the above query, which is looking for artworks published in the DKG by “Google Demo Amsterdam” publisher, looks like this:

[{
    'artwork': 'https://i.gyazo.com/a59c65f0a0dde03314d6ebcedec008cb.jpg',
    'ual': 'did:dkg:otp:2043/0x5cac41237127f94c2d21dae0b14bfefa99880630/4606152',
    'name': '"NeuroWeb Logo"'
}, {
    'artwork': 'https://cryptologos.cc/logos/origintrail-trac-logo.png',
    'ual': 'did:dkg:otp:2043/0x5cac41237127f94c2d21dae0b14bfefa99880630/4606876',
    'name': '"OriginTrail Logo"'
}, {
    'artwork': 'https://i.gyazo.com/72b6cd16e0d5b2e131b0311456dcdefc.png',
    'ual': 'did:dkg:otp:2043/0x5cac41237127f94c2d21dae0b14bfefa99880630/4608743',
    'name': '"Decentralized Hexagon"'
}]

Each of the entries above is a Knowledge Asset with its UAL (Uniform Asset Locator), which is its unique, dereferenceable address in the Decentralized Knowledge Graph. These Knowledge Assets can be crowdsourced from different individual knowledge bases, so querying the DKG is effectively equivalent to executing a search over multiple data sources (e.g. RAG backends).
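A UAL encodes the blockchain (otp:2043 here), the smart contract address, and the NFT token ID of the Knowledge Asset, so you can take one apart programmatically. The parse_ual function below is a sketch of our own, not part of dkg.py:

```python
def parse_ual(ual: str) -> dict:
    """Split a UAL such as did:dkg:otp:2043/<contract>/<tokenId>
    into its blockchain, contract address, and token ID parts."""
    prefix, _, rest = ual.partition("did:dkg:")
    if prefix or not rest:
        raise ValueError(f"Not a DKG UAL: {ual}")
    blockchain, contract, token_id = rest.split("/")
    return {
        "blockchain": blockchain,   # e.g. "otp:2043" (chain name + chain ID)
        "contract": contract,       # contract holding the ownership NFT
        "token_id": int(token_id),  # token ID of the Knowledge Asset
    }

ual = "did:dkg:otp:2043/0x5cac41237127f94c2d21dae0b14bfefa99880630/4606152"
print(parse_ual(ual)["token_id"])  # 4606152
```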

Step 5 — Augmented Generation with Gemini

We will now feed the extracted Knowledge Assets to Gemini to answer our question, providing it with the artwork data you’ve queried from the DKG. First, preprocess the data into a format Gemini can digest more easily:

formatted_results = "\n".join([f"- Title: {artwork['name']}, UAL: {artwork['ual']}" for artwork in query_result])
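For illustration, running the same formatting over one sample row copied from the Step 4 example results produces one bullet line per artwork (this is a standalone demo; in the full script, query_result comes from the DKG query):

```python
# One sample row copied from the Step 4 example results
query_result = [{
    'artwork': 'https://i.gyazo.com/a59c65f0a0dde03314d6ebcedec008cb.jpg',
    'ual': 'did:dkg:otp:2043/0x5cac41237127f94c2d21dae0b14bfefa99880630/4606152',
    'name': '"NeuroWeb Logo"',
}]

formatted_results = "\n".join(
    f"- Title: {artwork['name']}, UAL: {artwork['ual']}" for artwork in query_result
)
print(formatted_results)
# - Title: "NeuroWeb Logo", UAL: did:dkg:otp:2043/0x5cac41237127f94c2d21dae0b14bfefa99880630/4606152
```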

Then, define the prompt which is asking the model to answer artwork-related questions based on the knowledge you passed in:

prompt = (
    f"I have retrieved the following information from the Decentralized Knowledge Graph based on the query '{query}':\n"
    f"{formatted_results}\n\n"
    "Imagine you're guiding a tour through a virtual gallery featuring some of the most iconic artworks linked to detailed records in the Decentralized Knowledge Graph.\n"
    "As you introduce these artworks to the audience, delve into the stories behind them. What inspired these pieces? How do they reflect the emotions and techniques of the artist?\n"
    f"Question: {question}\n"
    "Answer:"
)

Finally, run the prompt and get back the answer:

llm_response = gemini_pro_model.generate_content(prompt)
print(llm_response)

Gemini’s response will look similar to this:

candidates {
content {
role: "model"
parts {
text: "**Artwork 1:**\n\n* **Title:** NeuroWeb Logo\n* **UAL:** did:dkg:otp:2043/0x5cac41237127f94c2d21dae0b14bfefa99880630/4606152\n\nThis striking logo captures the essence of NeuroWeb, a cutting-edge platform that harnesses the power of artificial intelligence to revolutionize the way we interact with the digital world. The vibrant colors and intricate design evoke a sense of innovation and boundless possibilities.\n\n**Artwork 2:**\n\n* **Title:** OriginTrail Logo\n* **UAL:** did:dkg:otp:2043/0x5cac41237127f94c2d21dae0b14bfefa99880630/4606876\n\nOriginTrail\'s logo is a testament to the company\'s mission of bringing transparency and traceability to global supply chains. The interlocking circles symbolize the interconnectedness of the world, while the vibrant green hue represents growth and sustainability.\n\n**Artwork 3:**\n\n* **Title:** Decentralized Hexagon\n* **UAL:** did:dkg:otp:2043/0x5cac41237127f94c2d21dae0b14bfefa99880630/4608743\n\nThis abstract artwork embodies the spirit of decentralization, a fundamental principle of the blockchain revolution. The hexagonal shape represents the interconnectedness of nodes in a decentralized network, while the vibrant colors evoke the diversity and resilience of the community."
}
}
finish_reason: STOP
safety_ratings {
category: HARM_CATEGORY_HATE_SPEECH
probability: NEGLIGIBLE
}
safety_ratings {
category: HARM_CATEGORY_DANGEROUS_CONTENT
probability: NEGLIGIBLE
}
safety_ratings {
category: HARM_CATEGORY_HARASSMENT
probability: NEGLIGIBLE
}
safety_ratings {
category: HARM_CATEGORY_SEXUALLY_EXPLICIT
probability: NEGLIGIBLE
}
}
usage_metadata {
prompt_token_count: 384
candidates_token_count: 364
total_token_count: 748
}

The text section contains the actual answer, while the usage_metadata part reveals how many tokens were used for generating the answer.

Using this dRAG code snippet, you could build a full-stack chatbot application that relies on trusted data verified on the DKG. Below is an example of such an application UI, similar to the one found on OriginTrail World.

In the example above, each answer corresponds to a specific source artwork Knowledge Asset published to the OriginTrail DKG. As the DKG is a constantly growing graph of signed Knowledge Assets, you can leverage all of this knowledge in your dRAG applications.

Conclusion

We’ve showcased a basic dRAG implementation today: you’ve created Knowledge Assets on the OriginTrail DKG and queried them by converting natural language questions into SPARQL with the assistance of Google Gemini. Find the full code here, and let us know your comments in our Discord channel.

Build transformative AI solutions on OriginTrail by diving into ChatDKG.ai and joining our Inception program today.


OriginTrail

OriginTrail is the Decentralized Knowledge Graph that organizes AI-grade knowledge assets, making them discoverable and verifiable for a sustainable global economy.