Decentralized RAG with OriginTrail DKG and NVIDIA Build ecosystem

Introduction

Generative Artificial Intelligence (AI) is already seeing meaningful adoption across multiple fields; however, some of its limitations significantly hold back mainstream adoption and the improvements it could deliver. For GenAI to be production-ready at that scale of impact, we need to limit hallucinations, manage bias, and prevent intellectual property (or data ownership) infringements. The promise of the Verifiable Internet for AI is to address these shortfalls by providing information provenance in model outputs, ensuring verifiability of presented information, respecting data ownership, and incentivizing new knowledge creation.

Below we showcase an implementation framework called Decentralized Retrieval-Augmented Generation (dRAG) on the NVIDIA Build ecosystem, which brings together a wide range of powerful models across industries and modalities. dRAG advances the Retrieval-Augmented Generation (RAG) framework proposed by Patrick Lewis et al., which aims to increase the accuracy and reliability of GenAI models with facts fetched from external sources. RAG has gained prominence among AI developers and industry leaders alike, including NVIDIA’s CEO Jensen Huang.

dRAG advances the RAG system by leveraging the Decentralized Knowledge Graph (DKG), a permissionless network of Knowledge Assets. Each Knowledge Asset contains graph data and/or vector embeddings, immutability proofs, a Decentralized Identifier (DID), and an ownership NFT. When Knowledge Assets are connected in one permissionless DKG, the following capabilities are enabled:

  • Knowledge Graphs — structured knowledge in knowledge graphs enables a hybrid of neural and symbolic AI methodologies, enhancing GenAI models with deterministic inputs.
  • Ownership — dRAG uses input from Knowledge Assets, each of which has an owner who can manage access to the data the Knowledge Asset contains.
  • Verifiability — every piece of knowledge on the DKG carries published cryptographic proofs, ensuring that no tampering has occurred since it was published.

In this tutorial, you will learn how to create Knowledge Assets on the OriginTrail DKG and how to query it to retrieve verified knowledge.

Prerequisites

  • An NVIDIA Build platform account and API key.
  • A DKG node. Please visit the official docs to learn how to set one up.
  • A Python project with a virtual environment set up.

Step 1 — Installing packages and setting up dkg.py

In this step, you’ll install the necessary packages using pip and set up the credentials for dkg.py.

Navigate to your Python project’s environment and run the following command to install the packages:

pip install openai dkg python-dotenv annoy

The OpenAI client is going to act as an intermediary for interacting with the NVIDIA API. You’ll store the environment variables in a file called .env. Create and open it for editing in your favorite editor:

nano .env

Add the following lines:

OT_NODE_HOSTNAME="your_ot_node_hostname"
PRIVATE_KEY="your_private_key"
NVIDIA_API_TOKEN="your_nvidia_api_token"

Replace the placeholder values with your own: the hostname can be found in the configuration file of your OT Node, and the private key belongs to the wallet you’ll use to perform the Knowledge Asset create operation, which must be funded with TRAC tokens (more information is available in the OriginTrail documentation). Keep this information private, especially your wallet’s key. When you’re done, save and close the file.

Then, create a Python file where you’ll store the code for connecting to the DKG:

nano dkg_version.py

Add the following code to the file:

from dkg import DKG
from dkg.providers import BlockchainProvider, NodeHTTPProvider
from dotenv import load_dotenv
import os
import json

dotenv_path = './.env' # Replace with the path to your .env file
load_dotenv(dotenv_path)
ot_node_hostname = os.getenv('OT_NODE_HOSTNAME')
private_key = os.getenv('PRIVATE_KEY')

node_provider = NodeHTTPProvider(ot_node_hostname)
blockchain_provider = BlockchainProvider("testnet", "otp:20430", private_key=private_key)

dkg = DKG(node_provider, blockchain_provider)
print(dkg.node.info)

Here, you first import the required classes and packages. Then, you load the values from .env and instantiate a NodeHTTPProvider and BlockchainProvider with those values, which you pass in to the DKG constructor, creating the dkg object for communicating with the graph.

If all credentials and values are correct, the output will show the version your OT Node is running:

{'version': '6.2.3'}

That’s all you have to do to be connected to the DKG!

Step 2 — Instructing the LLM to create Knowledge Assets on the DKG

In this step, you’ll connect to the NVIDIA API using the OpenAI Python library. Then, you’ll instruct an LLM to generate a Knowledge Asset in JSON-LD format and publish it to the DKG.

First, you need to initialize the OpenAI client, passing in the NVIDIA API endpoint as the base_url along with your API key. The OpenAI library acts as an intermediary to the NVIDIA API here, letting you use multiple LLMs, such as Google’s Gemma and Meta’s Llama, both of which are used in this tutorial.

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.getenv('NVIDIA_API_TOKEN')
)

Then, you define the instructions, telling the model what to do:

instruction_message = '''
Your task is the following:

Construct a JSON object following the Product JSON-LD schema based on the provided information by the user.
The user will provide the name, description, tags, category and deployer of the product, as well as the URL which you will use as the '@id'.

Here's an example of a Product that corresponds to the mentioned JSON-LD schema:
{
    "@context": "http://schema.org",
    "@type": "Product",
    "@id": "https://build.nvidia.com/nvidia/ai-weather-forecasting",
    "name": "ai-weather-forecasting",
    "description": "AI-based weather prediction pipeline with global models and downscaling models.",
    "tags": [
        "ai weather prediction",
        "climate science"
    ],
    "category": "Industrial",
    "deployer": "nvidia"
}

Follow the provided JSON-LD schema, using the provided properties, and DO NOT add or remove any of them.
Output the JSON as a string, between ```json and ```.
'''

chat_history = [{"role":"system","content":instruction_message}]

As part of the instructions, you provide the model with an example Product definition, according to which a new one should be generated. The goal is to create a Knowledge Asset representing the ‘rerank-qa-mistral-4b’ model from the NVIDIA Build platform. You add the contents of that message to chat_history with a system role, meaning it instructs the model before the user comes in with actionable prompts.

Then, you define an example user_instruction for testing the model:

user_instruction = '''I want to create a product (model) with name 'rerank-qa-mistral-4b', which is a GPU-accelerated model optimized for providing a probability score
that a given passage contains the information to answer a question. It's in category Retrieval and deployed by nvidia.
It's used for ranking and retrieval augmented generation. You can reach it at https://build.nvidia.com/nvidia/rerank-qa-mistral-4b. Give me the schema JSON LD object.'''

This user prompt asks the LLM to output a Product with the given name and provides information about where the model can be found.

Finally, you can ask the LLM to compute the output and print it:

completion = client.chat.completions.create(
    model="google/gemma-7b",
    messages=chat_history + [{"role": "user", "content": user_instruction}],
    temperature=0,
    top_p=1,
    max_tokens=1024,
)

generated_json = completion.choices[0].message.content
print(generated_json)

The output will look like this:

```json
{
    "@context": "http://schema.org",
    "@type": "Product",
    "@id": "https://build.nvidia.com/nvidia/rerank-qa-mistral-4b",
    "name": "rerank-qa-mistral-4b",
    "description": "GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.",
    "tags": [
        "rerank-qa-mistral-4b",
        "information retrieval",
        "retrieval augmentation"
    ],
    "category": "Retrieval",
    "deployer": "nvidia"
}
```

The LLM has returned a JSON-LD structure that can be added to the DKG.

def clean_json_string(input_string):
    if input_string.startswith("```json") and input_string.endswith("```"):
        cleaned_query = input_string[7:-3].strip()
        return cleaned_query
    elif input_string.startswith("```") and input_string.endswith("```"):
        cleaned_query = input_string[3:-3].strip()
        return cleaned_query
    else:
        return input_string

product = json.loads(clean_json_string(generated_json))

content = {"public": product}
create_asset_result = dkg.asset.create(content, 2)  # the second argument is the number of epochs to store the asset for
print('Asset created!')
print(json.dumps(create_asset_result, indent=4))
print(create_asset_result["UAL"])

Here you first define a function (clean_json_string) that will clean up the JSON string and remove the Markdown code markup. Then, you load the product by deserializing the JSON and add it to the DKG by calling dkg.asset.create().

The output will look like this:

Asset created!
{
    "publicAssertionId": "0x09d8d7c5b82bd09bc3f51770f575e15f1157c6292652d977afbe453932e270ef",
    "operation": {
        "mintKnowledgeAsset": {
            "transactionHash": "0x6fb8a6039f97cf3c0d8cb8b1a221be405d4a7cbdeab7f27240ae9848322cad98",
            "transactionIndex": 0,
            "blockHash": "0x7f73351931b89c4c75d3bf66206df69fc55cb3a0f6e126545d82cfe3e81c0d6a",
            "from": "0xD988B6fd921CFab980a7f2F60B9aC9F7918D7F71",
            "to": "0xB25D47412721f681f1EaffD1b67ff0638C06f2B7",
            "blockNumber": 3674556,
            "cumulativeGasUsed": 397582,
            "gasUsed": 397582,
            "contractAddress": null,
            "logs": [
                {
                    "address": "0x1A061136Ed9f5eD69395f18961a0a535EF4B3E5f",
                    "topics": [
                        "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
                        "0x0000000000000000000000000000000000000000000000000000000000000000",
                        "0x000000000000000000000000d988b6fd921cfab980a7f2f60b9ac9f7918d7f71",
                        "0x000000000000000000000000000000000000000000000000000000000027fb68"
                    ],
                    "data": "0x",
                    "blockHash": "0x7f73351931b89c4c75d3bf66206df69fc55cb3a0f6e126545d82cfe3e81c0d6a",
                    "blockNumber": 3674556,
                    "transactionHash": "0x6fb8a6039f97cf3c0d8cb8b1a221be405d4a7cbdeab7f27240ae9848322cad98",
                    "transactionIndex": 0,
                    "logIndex": 0,
                    "transactionLogIndex": "0x0",
                    "removed": false
                },
                {
                    "address": "0xf305D2d97C7201Cea2A54A2B074baC2EdfCE7E45",
                    "topics": [
                        "0x6228bc6c1a8f028a2e3476a455a34f5fa23b4387611f3c147a965e375ebd17ba",
                        "0x09d8d7c5b82bd09bc3f51770f575e15f1157c6292652d977afbe453932e270ef"
                    ],
                    "data": "0x00000000000000000000000000000000000000000000000000000000000003e700000000000000000000000000000000000000000000000000000000000000080000000000000000000000000000000000000000000000000000000000000008",
                    "blockHash": "0x7f73351931b89c4c75d3bf66206df69fc55cb3a0f6e126545d82cfe3e81c0d6a",
                    "blockNumber": 3674556,
                    "transactionHash": "0x6fb8a6039f97cf3c0d8cb8b1a221be405d4a7cbdeab7f27240ae9848322cad98",
                    "transactionIndex": 0,
                    "logIndex": 1,
                    "transactionLogIndex": "0x1",
                    "removed": false
                },
                {
                    "address": "0xFfFFFFff00000000000000000000000000000001",
                    "topics": [
                        "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
                        "0x000000000000000000000000d988b6fd921cfab980a7f2f60b9ac9f7918d7f71",
                        "0x000000000000000000000000f43b6a63f3f6479c8f972d95858a1684d5f129f5"
                    ],
                    "data": "0x0000000000000000000000000000000000000000000000000000000000000006",
                    "blockHash": "0x7f73351931b89c4c75d3bf66206df69fc55cb3a0f6e126545d82cfe3e81c0d6a",
                    "blockNumber": 3674556,
                    "transactionHash": "0x6fb8a6039f97cf3c0d8cb8b1a221be405d4a7cbdeab7f27240ae9848322cad98",
                    "transactionIndex": 0,
                    "logIndex": 2,
                    "transactionLogIndex": "0x2",
                    "removed": false
                },
                {
                    "address": "0x082AC991000F6e8aF99679f5A2F46cB2Be4E101B",
                    "topics": [
                        "0x4b81188c3c973dd634ec0dae5b7e72f92bb03834c830739d63935923950d6f64",
                        "0x0000000000000000000000001a061136ed9f5ed69395f18961a0a535ef4b3e5f",
                        "0x000000000000000000000000000000000000000000000000000000000027fb68"
                    ],
                    "data": "0x00000000000000000000000000000000000000000000000000000000000000c000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000065fc48a00000000000000000000000000000000000000000000000000000000000000002000000000000000000000000000000000000000000000000000000000076a700000000000000000000000000000000000000000000000000000000000000000600000000000000000000000000000000000000000000000000000000000000341a061136ed9f5ed69395f18961a0a535ef4b3e5f09d8d7c5b82bd09bc3f51770f575e15f1157c6292652d977afbe453932e270ef000000000000000000000000",
                    "blockHash": "0x7f73351931b89c4c75d3bf66206df69fc55cb3a0f6e126545d82cfe3e81c0d6a",
                    "blockNumber": 3674556,
                    "transactionHash": "0x6fb8a6039f97cf3c0d8cb8b1a221be405d4a7cbdeab7f27240ae9848322cad98",
                    "transactionIndex": 0,
                    "logIndex": 3,
                    "transactionLogIndex": "0x3",
                    "removed": false
                },
                {
                    "address": "0xB25D47412721f681f1EaffD1b67ff0638C06f2B7",
                    "topics": [
                        "0x60e45db7c8cb9f55f92f3de18053b0b426eb919a763a1daca0ea9ad20961e878",
                        "0x0000000000000000000000001a061136ed9f5ed69395f18961a0a535ef4b3e5f",
                        "0x000000000000000000000000000000000000000000000000000000000027fb68",
                        "0x09d8d7c5b82bd09bc3f51770f575e15f1157c6292652d977afbe453932e270ef"
                    ],
                    "data": "0x",
                    "blockHash": "0x7f73351931b89c4c75d3bf66206df69fc55cb3a0f6e126545d82cfe3e81c0d6a",
                    "blockNumber": 3674556,
                    "transactionHash": "0x6fb8a6039f97cf3c0d8cb8b1a221be405d4a7cbdeab7f27240ae9848322cad98",
                    "transactionIndex": 0,
                    "logIndex": 4,
                    "transactionLogIndex": "0x4",
                    "removed": false
                }
            ],
            "logsBloom": "0x00000100400000000000800000000000000000000000000000000000000000000000010020000000000000000000000000000000000010800000000000001000000040000000400040000008002400000080000000004000000000000000000000040000020000000000000000000a00000000008000020000000010000210015000000000000000000080000000001000000000000000000000000200000000040000001020002002000000000000000000000000000000000000000000000000000002000000000000000000008004000000000000010000000000000020000000000000002800000000000000000000000000000000100000000000010000",
            "status": 1,
            "effectiveGasPrice": 40,
            "type": 0
        },
        "publish": {
            "operationId": "1bb622c7-8fa1-4414-b39e-0aaf3f5465f9",
            "status": "COMPLETED"
        }
    },
    "UAL": "did:dkg:otp:20430/0x1a061136ed9f5ed69395f18961a0a535ef4b3e5f/2620264"
}
did:dkg:otp:20430/0x1a061136ed9f5ed69395f18961a0a535ef4b3e5f/2620264

Here we can see a lot of useful information, such as the Knowledge Asset issuer, transaction IDs from the blockchain, and the status of the operation, which was completed. The UAL returned is the Uniform Asset Locator, a decentralized identifier connected to each Knowledge Asset on the DKG.
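
Since this tutorial runs on the OriginTrail testnet, the UAL above encodes the blockchain ("otp:20430"), the Knowledge Asset contract address, and the asset's token ID. As a quick illustration, you can pull a UAL apart with a small helper (a hypothetical convenience function, not part of dkg.py):

def parse_ual(ual):
    # UAL layout: did:dkg:<blockchain>/<contract address>/<token id>
    # e.g. did:dkg:otp:20430/0x1a061136ed9f5ed69395f18961a0a535ef4b3e5f/2620264
    blockchain, contract, token_id = ual[len("did:dkg:"):].split("/")
    return {"blockchain": blockchain, "contract": contract, "token_id": int(token_id)}

print(parse_ual(create_asset_result["UAL"]))
# {'blockchain': 'otp:20430', 'contract': '0x1a06...3e5f', 'token_id': 2620264}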

Then, you can retrieve the same product from the DKG by passing the UAL to dkg.asset.get():

get_asset_result = dkg.asset.get(create_asset_result["UAL"])
print(json.dumps(get_asset_result, indent=4))

The output will be:

{
    "operation": {
        "publicGet": {
            "operationId": "c138515a-d82c-45a8-bef9-82c7edf2ef6b",
            "status": "COMPLETED"
        }
    },
    "public": {
        "assertion": "<https://build.nvidia.com/nvidia/rerank-qa-mistral-4b> <http://schema.org/category> \"Retrieval\" .\n<https://build.nvidia.com/nvidia/rerank-qa-mistral-4b> <http://schema.org/deployer> \"nvidia\" .\n<https://build.nvidia.com/nvidia/rerank-qa-mistral-4b> <http://schema.org/description> \"GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.\" .\n<https://build.nvidia.com/nvidia/rerank-qa-mistral-4b> <http://schema.org/name> \"rerank-qa-mistral-4b\" .\n<https://build.nvidia.com/nvidia/rerank-qa-mistral-4b> <http://schema.org/tags> \"information retrieval\" .\n<https://build.nvidia.com/nvidia/rerank-qa-mistral-4b> <http://schema.org/tags> \"rerank-qa-mistral-4b\" .\n<https://build.nvidia.com/nvidia/rerank-qa-mistral-4b> <http://schema.org/tags> \"text retrieval\" .\n<https://build.nvidia.com/nvidia/rerank-qa-mistral-4b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .",
        "assertionId": "0x09d8d7c5b82bd09bc3f51770f575e15f1157c6292652d977afbe453932e270ef"
    }
}

In this step, you’ve seen how to instruct the NVIDIA LLM to generate Product entities according to user prompts, and how to insert them into the DKG. You’ll now learn how to generate SPARQL queries for products using the LLM.

Step 3 — Generating SPARQL with the AI model

In this step, you’ll use the NVIDIA LLM to generate a SPARQL query for retrieving results from the DKG. The data that we’ll be querying consists of Knowledge Assets that represent each of the models from the NVIDIA Build platform — with the same properties as the one created in Step 2.

SPARQL is a query language for graphs and is very similar to SQL. Just like SQL, it has a SELECT and a WHERE clause, so as long as you’re familiar with SQL you should be able to understand the structure of the queries pretty well.
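
For example, a minimal query over the Product data used in this tutorial could look like the following (a simplified sketch; the queries run against the DKG below also retrieve the UAL):

PREFIX schema: <http://schema.org/>

SELECT ?name ?description
WHERE {
    ?product a schema:Product ;
             schema:name ?name ;
             schema:description ?description .
}

Just like a SQL SELECT picks columns and WHERE filters rows, the SPARQL SELECT picks variables (prefixed with ?) and WHERE matches graph patterns that bind those variables.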

The data that you’ll be querying is related to Products, stored in the DKG as Knowledge Assets.

Similarly to before, you’ll need to instruct the LLM on what to do:

all_categories = ["Biology", "Gaming", "Visual Design", "Industrial", "Reasoning", "Retrieval", "Speech"]
all_tags = ["3d-generation", "automatic speech recognition", "chat", "digital humans", "docking", "drug discovery", "embeddings", "gaming", "healthcare", "image generation", "image modification", "image understanding", "language generation", "molecule generation", "nvidia nim", "protein folding", "ranking", "retrieval augmented generation", "route optimization", "text-to-3d", "advanced reasoning", "ai weather prediction", "climate science"]

instruction_message = '''
You have access to data connected to the new NVIDIA Build platform and the products available there.
You have a schema in JSON-LD format that outlines the structure and relationships of the data you are dealing with.
Based on this schema, you need to construct a SPARQL query to retrieve specific information from the NVIDIA products dataset that follows this schema.

The schema is focused on AI products and includes various properties such as name, description, category, deployer, URL and tags related to the product.
My goal with the SPARQL queries is to retrieve data from the graph about the products, based on the natural language question that the user posed.

Here's an example of a query to find products tagged "ai weather prediction":
```sparql
PREFIX schema: <http://schema.org/>

SELECT ?product ?name ?description ?ual

WHERE { ?product a schema:Product .
    GRAPH ?g
    { ?product schema:tags "ai weather prediction" ;
               schema:name ?name ;
               schema:description ?description }

    ?ual schema:assertion ?g
    FILTER(CONTAINS(str(?ual), "20430")) }```

Pay attention to retrieving the UAL; this is a mandatory step in all your queries. After matching the product with '?product a schema:Product .' you should wrap the next conditions in GRAPH ?g { }, and later use the retrieved graph (?g) to get the UAL like in the example above.

Make sure you ALWAYS retrieve the UAL no matter what the user asks for, and filter on whether it contains "20430".
Make sure you always retrieve the NAME and the DESCRIPTION of the products.

Only return the SPARQL query wrapped in ```sparql ``` and DO NOT return anything extra.
'''

The instruction_message prompt contains the instructions in natural language. You provide the model with the structure of a Product object and an example SPARQL query in the appropriate format for the DKG. You also instruct it to always retrieve the UAL and to return nothing except the SPARQL query.

You can now define the chat history and pass in a user prompt to get the resulting code:

limitations_instruction = '''\nThe existing categories are: {}. The existing tags are: {}'''.format(all_categories, all_tags)
user_instruction = '''Give me all NVIDIA tools which I can use for use cases related to biology.'''

chat_history = [{"role":"system","content":instruction_message + limitations_instruction}, {"role":"user","content":user_instruction}]

completion = client.chat.completions.create(
    model="meta/llama2-70b",  # NVIDIA lets you choose any LLM from the platform
    messages=chat_history,
    temperature=0,
    top_p=1,
    max_tokens=1024,
)

answer = completion.choices[0].message.content
print(answer)

The output will look similar to this:

```sparql
PREFIX schema: <http://schema.org/>

SELECT ?product ?name ?description ?ual

WHERE { ?product a schema:Product .
    GRAPH ?g
    { ?product schema:category "Biology" ;
               schema:name ?name ;
               schema:description ?description }

    ?ual schema:assertion ?g
    FILTER(CONTAINS(str(?ual), "20430")) }
```

This SPARQL query retrieves all products that have the category "Biology" and returns their names, descriptions, and UALs. The `GRAPH ?g` clause is used to retrieve the graph that contains the product information, and the `FILTER` clause is used to filter the results to only include products that have a UAL that contains "20430".

You can employ a similar strategy to clean the result from the Markdown code formatting:

def clean_sparql_query(input_string):
    start_index = input_string.find("```sparql")
    end_index = input_string.find("```", start_index + 1)
    if start_index != -1 and end_index != -1:
        cleaned_query = input_string[start_index + 9:end_index].strip()
        return cleaned_query
    else:
        return input_string

query = clean_sparql_query(answer)
print(query)

The output will now be clean SPARQL:

PREFIX schema: <http://schema.org/>

SELECT ?product ?name ?description ?ual

WHERE { ?product a schema:Product .
    GRAPH ?g
    { ?product schema:category "Biology" ;
               schema:name ?name ;
               schema:description ?description }

    ?ual schema:assertion ?g
    FILTER(CONTAINS(str(?ual), "20430")) }

Step 4 — Querying the OriginTrail DKG

Querying the DKG is very easy with SPARQL. You only need to specify the query and the repository to search:

query_result = dkg.graph.query(query, "privateCurrent")
print(query_result)

The privateCurrent option ensures that the SPARQL query retrieves the latest state of Knowledge Assets in the DKG, as it includes the private and public data of the latest finalized state of the Graph.
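
If you only want the public, replicated state instead, dkg.py accepts a different repository name in the same call. The snippet below is an assumption based on the OT Node's default triple-store repository naming — verify the available repositories against the docs for your node version:

# Query only the public current state (repository name assumed; check your node's docs)
public_result = dkg.graph.query(query, "publicCurrent")
print(public_result)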

An example result for the above query looks like this:

[
    {
        'product': 'https://build.nvidia.com/nvidia/molmim-generate',
        'description': '"MolMIM performs controlled generation, finding molecules with the right properties."',
        'name': '"molmim-generate"',
        'ual': 'did:dkg:otp:20430/0x1a061136ed9f5ed69395f18961a0a535ef4b3e5f/2619549'
    },
    {
        'product': 'https://build.nvidia.com/meta/esmfold',
        'description': '"Predicts the 3D structure of a protein from its amino acid sequence."',
        'name': '"esmfold"',
        'ual': 'did:dkg:otp:20430/0x1a061136ed9f5ed69395f18961a0a535ef4b3e5f/2619597'
    },
    {
        'product': 'https://build.nvidia.com/mit/diffdock',
        'description': '"Predicts the 3D structure of how a molecule interacts with a protein."',
        'name': '"diffdock"',
        'ual': 'did:dkg:otp:20430/0x1a061136ed9f5ed69395f18961a0a535ef4b3e5f/2619643'
    }
]

You can now utilize the DKG to reduce the runtime cost of the LLM, as well as have it rely on trusted data stored in Knowledge Assets.

Step 5 — Vector search with the NVIDIA embed-qa-4 model and the DKG

In this step, you’ll build an in-memory vector DB based on the verified data queried from the DKG and invoke the NVIDIA model with it to generate more accurate results for the end user. Sometimes SPARQL queries alone may not be enough to answer a question, and you can use a vector database to extract specific Knowledge Assets by semantic similarity.

First, you initialize the NVIDIA embed-qa-4 model that you’ll use to generate the vector embeddings:

import requests

invoke_url = "https://ai.api.nvidia.com/v1/retrieval/nvidia/embeddings"

headers = {
    "Authorization": f"Bearer {os.getenv('NVIDIA_API_TOKEN')}",
    "Accept": "application/json",
}

def get_embeddings(text_input):
    payload = {
        "input": text_input,
        "input_type": "query",
        "model": "NV-Embed-QA"
    }

    session = requests.Session()
    response = session.post(invoke_url, headers=headers, json=payload)

    response.raise_for_status()
    response_body = response.json()
    # response.json() returns a dict, so index into it rather than using attribute access
    return response_body["data"][0]["embedding"]
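
As a quick sanity check, you can embed a short string and inspect the returned vector (illustrative; the exact dimensionality depends on the embedding model):

# Embed a sample query and check the vector's length
sample_vector = get_embeddings("What does this model do?")
print(len(sample_vector))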

Then, you build the vector DB in-memory by making embeddings based on the Product description:

from annoy import AnnoyIndex

def build_embeddings_index(embeddings, n_trees=10):
    dim = len(embeddings[0])
    index = AnnoyIndex(dim, 'angular')  # Using angular (cosine-like) distance

    for i, vector in enumerate(embeddings):
        index.add_item(i, vector)

    index.build(n_trees)
    return index

def add_text_embeddings(products):
    for product in products:
        product["embedding"] = get_embeddings([product["description"]])

products = query_result  # the list of products retrieved from the DKG in Step 4
add_text_embeddings(products)

Then, you can retrieve the Product that is semantically nearest to the user’s prompt and use it to answer their question:

index = build_embeddings_index([product["embedding"] for product in products])
question = "I would like a model which will help me find the molecules with the chosen properties."

nearest_neighbors = index.get_nns_by_vector(get_embeddings(question), 1, include_distances=True)
index_of_nearest_neighbor = nearest_neighbors[0][0]

print(f"Vector search result: {products[index_of_nearest_neighbor]['description']}")
print(f"Product name: {products[index_of_nearest_neighbor]['name']}")
print(f"https://dkg.origintrail.io/explore?ual={products[index_of_nearest_neighbor]['ual']}")

The output will be similar to this:

Vector search result: Predicts the 3D structure of how a molecule interacts with a protein.
Product name: diffdock
https://dkg-testnet.origintrail.io/explore?ual=did:dkg:otp:20430/0x1a061136ed9f5ed69395f18961a0a535ef4b3e5f/2619643
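
To close the dRAG loop, you can feed the retrieved Knowledge Asset back to an LLM as grounding context for the final answer. The snippet below is a minimal sketch of that generation step (the prompt wording and model choice are illustrative, not prescribed by the tutorial):

# Ground the LLM's answer in the Knowledge Asset found by vector search
retrieved = products[index_of_nearest_neighbor]
rag_prompt = f'''Answer the user's question using ONLY the product information below.

Product name: {retrieved["name"]}
Product description: {retrieved["description"]}
Source (UAL): {retrieved["ual"]}

Question: {question}'''

completion = client.chat.completions.create(
    model="meta/llama2-70b",  # any chat model from the NVIDIA Build platform works here
    messages=[{"role": "user", "content": rag_prompt}],
    temperature=0,
    top_p=1,
    max_tokens=512,
)
print(completion.choices[0].message.content)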

Conclusion

You have now created a Python project which uses tools from the NVIDIA Build platform to create and query verifiable Knowledge Assets on the OriginTrail DKG. You’ve seen how to instruct an LLM to generate SPARQL queries from natural language inputs and query the DKG with the resulting code, as well as how to create embeddings and use vector similarity search to find the right Knowledge Assets.

Additionally, you’ve explored the capabilities of the NVIDIA Build platform and how to use it with the DKG, offering versatile options for both structured data querying with SPARQL and semantic similarity search with vectors. With these tools at your disposal, you’re well-equipped to tackle a wide range of tasks requiring knowledge discovery and retrieval by using the decentralized RAG (dRAG).

Build transformative AI solutions on OriginTrail by diving into ChatDKG.ai and joining our Inception program today.
