📖 Introduction

This tutorial walks you through building a complete knowledge graph, from modeling to deployment. Together we will build a graph about historical orators and their speeches.

💡 Prerequisites:

Basic knowledge of RDF and SPARQL
Python for the examples
Protégé installed (optional)

📊 Creation pipeline:

[Sources] → [Extraction] → [RDF Transformation] → [Triple Store] → [SPARQL Endpoint] → [Application]


📐 Step 1: Define the scope

Questions to ask

❓ Which domain? (e.g. historical speeches)
❓ Which main entities? (orators, speeches, events)
❓ Which relations? (delivered, is about, cites)
❓ What scale? (proof of concept vs. production)

📝 Our case study:

Domain: French-language historical speeches
Entities: Orators (Orateur), Speeches (Discours), Events (Evenement), Themes (Theme)
Relations: aPrononce (delivered), concerne (is about), cite (cites), seDerouleLe (takes place on)
Target volume: 50 orators, 100 speeches
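
To get a feel for the proof-of-concept versus production question, it can help to turn the target volume into a rough triple count. The sketch below is a back-of-the-envelope estimate only: the per-entity counts (`TRIPLES_PER_ORATOR`, `TRIPLES_PER_SPEECH`) are assumptions based on the properties used in the extraction step of this tutorial, not measured figures.

```python
# Rough triple-count estimate for the case study (all constants are assumptions).

# Assumed triples per orator: rdf:type, schema:name, schema:birthDate
TRIPLES_PER_ORATOR = 3
# Assumed triples per speech: rdf:type, schema:title, schema:date, ex:aPrononce link
TRIPLES_PER_SPEECH = 4


def estimate_triples(n_orators: int, n_speeches: int) -> int:
    """Return an approximate number of triples for the given entity counts."""
    return n_orators * TRIPLES_PER_ORATOR + n_speeches * TRIPLES_PER_SPEECH


estimate = estimate_triples(50, 100)
print(estimate)  # 550
```

A graph of a few hundred triples fits comfortably in a single Turtle file, which is why the storage step starts with that option.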


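🏗️ Step 2: Model the ontology

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix ex: <https://lemondesemantique.fr/onto/> .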

# Classes
ex:Orateur a rdfs:Class ;
    rdfs:subClassOf schema:Person .
ex:Discours a rdfs:Class ;
    rdfs:subClassOf schema:CreativeWork .
ex:Theme a rdfs:Class .
ex:Evenement a rdfs:Class .

# Properties
ex:aPrononce a rdf:Property ;
    rdfs:domain ex:Orateur ;
    rdfs:range ex:Discours .
ex:concerne a rdf:Property ;
    rdfs:domain ex:Discours ;
    rdfs:range ex:Theme .
ex:cite a rdf:Property ;
    rdfs:domain ex:Discours ;
    rdfs:range ex:Discours .

💡 Best practices:

Reuse existing vocabularies (Schema.org, FOAF, Dublin Core)
Document your classes and properties with rdfs:comment
Validate your ontology with Protégé
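
As an illustration of the documentation practice, the sketch below generates Turtle declarations carrying `rdfs:label` and `rdfs:comment` from a small Python list. The helper names (`escape_literal`, `term_to_turtle`) and the English labels and comments are hypothetical; only the class and property names come from the ontology above. It handles single-line strings only.

```python
# Hypothetical helpers: emit a Turtle declaration with rdfs:label and rdfs:comment.

def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a single-line Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def term_to_turtle(name: str, kind: str, label: str, comment: str) -> str:
    """Return a documented Turtle declaration for one class or property."""
    return (
        f"ex:{name} a {kind} ;\n"
        f'    rdfs:label "{escape_literal(label)}"@en ;\n'
        f'    rdfs:comment "{escape_literal(comment)}"@en .'
    )


# Illustrative documentation strings (assumptions, not part of the original ontology)
TERMS = [
    ("Orateur", "rdfs:Class", "Orator", "A person who delivered at least one speech."),
    ("aPrononce", "rdf:Property", "delivered", "Links an orator to a speech they delivered."),
]

for term in TERMS:
    print(term_to_turtle(*term))
```

The generated text can be pasted under the prefix declarations of the ontology file and then checked in Protégé.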


📥 Step 3: Extract and transform the data

From a CSV file

import csv
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

# Create an RDF graph and declare the namespaces
g = Graph()
ex = Namespace("https://lemondesemantique.fr/onto/")
schema = Namespace("https://schema.org/")
g.bind("ex", ex)
g.bind("schema", schema)

# Read the CSV (expected columns: id, nom, date_naissance)
with open('orateurs.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        orateur = URIRef(ex + row['id'])
        g.add((orateur, RDF.type, ex.Orateur))
        g.add((orateur, schema.name, Literal(row['nom'])))
        g.add((orateur, schema.birthDate, Literal(row['date_naissance'], datatype=XSD.date)))

# Save as Turtle (the .ttl extension matches the serialization format)
g.serialize(destination="orateurs.ttl", format="turtle")
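
Real CSV files tend to contain identifiers with spaces or accents, and dates in inconsistent formats. The sketch below shows one way to guard against both before adding triples. `mint_uri` and `parse_iso_date` are hypothetical helper names, and the sample values are only there for illustration.

```python
from datetime import date
from typing import Optional
from urllib.parse import quote

BASE = "https://lemondesemantique.fr/onto/"


def mint_uri(local_id: str, base: str = BASE) -> str:
    """Build a URI string from a raw identifier, percent-encoding unsafe characters."""
    return base + quote(local_id.strip(), safe="")


def parse_iso_date(value: str) -> Optional[str]:
    """Return the date normalised to YYYY-MM-DD, or None if it is not a valid ISO date."""
    try:
        return date.fromisoformat(value.strip()).isoformat()
    except ValueError:
        return None


print(mint_uri("Charles de Gaulle"))
print(parse_iso_date("1890-11-22"))
print(parse_iso_date("22/11/1890"))
```

In the CSV loop, the `schema.birthDate` triple would then be added only when `parse_iso_date` returns a value, so that malformed dates never end up typed as `xsd:date`.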

From a JSON file

import json

# Reuses g, ex, schema and the rdflib imports from the CSV example above
with open('discours.json', 'r', encoding='utf-8') as f:
    discours_data = json.load(f)

for discours in discours_data:
    discours_uri = URIRef(ex + discours['id'])
    g.add((discours_uri, RDF.type, ex.Discours))
    g.add((discours_uri, schema.title, Literal(discours['titre'])))
    g.add((discours_uri, schema.date, Literal(discours['date'], datatype=XSD.date)))

    # Link the speech to its orator
    orateur_uri = URIRef(ex + discours['orateur_id'])
    g.add((orateur_uri, ex.aPrononce, discours_uri))
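
Because each speech refers to its orator by identifier, a speech whose `orateur_id` is missing from the CSV would silently create a link to an orator with no description. A minimal check is sketched below, assuming the field names used above; `find_dangling_speeches` is a hypothetical helper and the sample records are invented for the example.

```python
def find_dangling_speeches(speeches, known_orator_ids):
    """Return the ids of speeches whose orateur_id is not a known orator."""
    known = set(known_orator_ids)
    return [s["id"] for s in speeches if s.get("orateur_id") not in known]


# Invented sample data using the same field names as discours.json
orator_ids = ["CharlesDeGaulle", "OrateurB"]
speeches = [
    {"id": "d1", "titre": "Discours A", "date": "1940-06-18", "orateur_id": "CharlesDeGaulle"},
    {"id": "d2", "titre": "Discours B", "date": "1900-01-01", "orateur_id": "Inconnu"},
]

print(find_dangling_speeches(speeches, orator_ids))  # ['d2']
```

Running such a check before building the graph makes it easier to decide whether to fix the source data or skip the affected speeches.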


💾 Step 4: Store the graph (triple store)

Option 1: RDF file (quick start)

# Save as Turtle
g.serialize(destination="knowledge_graph.ttl", format="turtle")

# Save as RDF/XML
g.serialize(destination="knowledge_graph.rdf", format="xml")

Option 2: Apache Jena Fuseki (SPARQL server)

# Start Fuseki with an in-memory dataset named /ds, updates enabled
# (in-memory data is lost when the server stops)
./fuseki-server --update --mem /ds

# Upload the RDF file through the web interface
# http://localhost:3030

# Or with curl
curl -X POST http://localhost:3030/ds/data \
  -F "file=@knowledge_graph.ttl"
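
The same upload can be scripted from Python. The sketch below only builds the HTTP request with the standard library. It assumes the `/ds` dataset from the command above and posts Turtle to the default graph through the dataset's `/data` endpoint. `build_upload_request` is a hypothetical helper name.

```python
import urllib.request


def build_upload_request(
    turtle_bytes: bytes,
    endpoint: str = "http://localhost:3030/ds/data",  # assumed Fuseki data endpoint
) -> urllib.request.Request:
    """Prepare a POST request that adds Turtle data to the dataset."""
    return urllib.request.Request(
        endpoint,
        data=turtle_bytes,
        headers={"Content-Type": "text/turtle; charset=utf-8"},
        method="POST",
    )


sample = (
    b"<https://lemondesemantique.fr/onto/CharlesDeGaulle> "
    b"a <https://lemondesemantique.fr/onto/Orateur> ."
)
req = build_upload_request(sample)
print(req.get_method(), req.full_url)

# To actually send it, with Fuseki running:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

In practice the bytes would come from reading `knowledge_graph.ttl` in binary mode rather than from an inline sample.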

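Option 3: Neo4j with the neosemantics plugin

// Install the neosemantics (n10s) plugin, then initialise it once per database
// (constraint syntax shown for Neo4j 5; older versions use a different form)
CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE;
CALL n10s.graphconfig.init();
// Import the RDF file
CALL n10s.rdf.import.fetch("file:///knowledge_graph.ttl", "Turtle");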

Storage options

RDF file: simple, suited to small volumes
Fuseki: open-source triple store
GraphDB: enterprise-grade, performant
Neo4j: property graph with RDF support
Oxigraph: lightweight, written in Rust

Selection criteria

Number of triples
Required performance
Budget
Team skills


🔍 Step 5: Query the graph with SPARQL

# Query: all speeches by Charles de Gaulle
PREFIX ex: <https://lemondesemantique.fr/onto/>
PREFIX schema: <https://schema.org/>

SELECT ?titre ?date
WHERE {
  ex:CharlesDeGaulle ex:aPrononce ?discours .
  ?discours schema:title ?titre ;
            schema:date ?date .
}
ORDER BY ?date

# Query: citation graph (2 levels)
PREFIX ex: <https://lemondesemantique.fr/onto/>
SELECT ?discours1 ?discours2 ?discours3
WHERE {
  ?discours1 ex:cite ?discours2 .
  ?discours2 ex:cite ?discours3 .
}
LIMIT 20

# Query: orators and their number of speeches
PREFIX ex: <https://lemondesemantique.fr/onto/>
PREFIX schema: <https://schema.org/>
SELECT ?nom (COUNT(?discours) AS ?nbDiscours)
WHERE {
  ?orateur a ex:Orateur ;
           schema:name ?nom ;
           ex:aPrononce ?discours .
}
GROUP BY ?orateur ?nom
ORDER BY DESC(?nbDiscours)

💡 Test your queries: use the Fuseki web interface (http://localhost:3030) or an online SPARQL playground.
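
Queries can also be sent to the endpoint from a script. Under the SPARQL protocol, a query may be passed as the `query` parameter of a GET request. The sketch below builds such a URL with the standard library, assuming the Fuseki dataset `/ds` started in Step 4. `build_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

ENDPOINT = "http://localhost:3030/ds/sparql"  # assumed Fuseki query endpoint

QUERY = """
PREFIX ex: <https://lemondesemantique.fr/onto/>
PREFIX schema: <https://schema.org/>
SELECT ?nom (COUNT(?discours) AS ?nbDiscours)
WHERE {
  ?orateur a ex:Orateur ;
           schema:name ?nom ;
           ex:aPrononce ?discours .
}
GROUP BY ?orateur ?nom
ORDER BY DESC(?nbDiscours)
"""


def build_query_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL carrying the query as a URL-encoded parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_query_url(QUERY)
print(url[:60] + "...")
```

When fetching this URL, sending the header `Accept: application/sparql-results+json` asks the server for JSON results.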


📊 Step 6: Visualize the graph

With Python (NetworkX + matplotlib)

import networkx as nx
import matplotlib.pyplot as plt
from rdflib import Graph

# Load the RDF graph
g = Graph()
g.parse("knowledge_graph.ttl", format="turtle")

# Convert to NetworkX: every triple becomes a labelled edge (literals become nodes too)
nx_graph = nx.DiGraph()
for s, p, o in g:
    nx_graph.add_edge(str(s), str(o), label=str(p))

# Draw the graph
pos = nx.spring_layout(nx_graph, k=2, iterations=50)
plt.figure(figsize=(20, 15))
nx.draw(nx_graph, pos, node_size=500, font_size=8, with_labels=True)
plt.show()

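With pyvis (interactive)

from pyvis.network import Network

# Reuses nx_graph from the NetworkX example above
# notebook=True targets Jupyter; set notebook=False when running as a plain script
net = Network(height="750px", width="100%", notebook=True)
for u, v, data in nx_graph.edges(data=True):
    # Label each node with the last path segment of its URI
    net.add_node(str(u), label=u.split('/')[-1])
    net.add_node(str(v), label=v.split('/')[-1])
    net.add_edge(str(u), str(v), title=data['label'])

# Write the interactive visualization to an HTML file
net.show("knowledge_graph.html")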
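
Splitting on `/` works for the `ex:` URIs, but nodes from vocabularies that use a `#` separator (for example `http://www.w3.org/2000/01/rdf-schema#Class`) keep a long, noisy label. A slightly more robust labelling helper is sketched below. `short_label` is a hypothetical name.

```python
def short_label(uri: str) -> str:
    """Return the part of a URI after its last '#' or '/', or the input if neither is found."""
    trimmed = uri.rstrip("/#")
    cut = max(trimmed.rfind("#"), trimmed.rfind("/"))
    return trimmed[cut + 1:] if cut >= 0 else uri


print(short_label("https://lemondesemantique.fr/onto/CharlesDeGaulle"))  # CharlesDeGaulle
print(short_label("http://www.w3.org/2000/01/rdf-schema#Class"))         # Class
print(short_label("1890-11-22"))                                         # 1890-11-22
```

It could replace the `split('/')[-1]` expressions when adding nodes, and works unchanged on literal values such as dates.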

📊 Visualization tools:

WebVOWL: ontology visualization
Neo4j Browser: built into Neo4j
GraphDB Workbench: interactive visualization
Pyvis: interactive graphs in HTML


🚀 Step 7: Deploy to production

REST API with Python

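import re

from flask import Flask, abort, jsonify
from SPARQLWrapper import SPARQLWrapper, JSON

app = Flask(__name__)
ENDPOINT = "http://localhost:3030/ds/sparql"

# Prefix declarations shared by every query
PREFIXES = """
PREFIX ex: <https://lemondesemantique.fr/onto/>
PREFIX schema: <https://schema.org/>
"""

def run_select(query):
    """Send a SELECT query to the endpoint and return its result bindings."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(PREFIXES + query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()['results']['bindings']

@app.route('/api/speakers')
def get_speakers():
    # ?id is the local part of the orator URI
    query = """
    SELECT ?id ?nom WHERE {
      ?orateur a ex:Orateur ;
               schema:name ?nom .
      BIND(STRAFTER(STR(?orateur), "https://lemondesemantique.fr/onto/") AS ?id)
    }
    """
    return jsonify(run_select(query))

@app.route('/api/speeches/<speaker_id>')
def get_speeches(speaker_id):
    # Accept only simple identifiers so that user input cannot alter the query
    if not re.fullmatch(r"[A-Za-z0-9_-]+", speaker_id):
        abort(400)
    query = f"""
    SELECT ?titre ?date WHERE {{
      <https://lemondesemantique.fr/onto/{speaker_id}> ex:aPrononce ?discours .
      ?discours schema:title ?titre ;
                schema:date ?date .
    }}
    """
    return jsonify(run_select(query))

if __name__ == '__main__':
    # Flask's built-in server is meant for development; use a WSGI server in production
    app.run(port=5000)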
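
The API above returns raw SPARQL JSON bindings, where every value is wrapped in an object with `type` and `value` keys. Clients often prefer flat records, and a minimal sketch of that conversion follows. `flatten_bindings` is a hypothetical helper and the sample response is invented, although it follows the standard SPARQL 1.1 JSON results layout.

```python
def flatten_bindings(results: dict) -> list:
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


# Invented sample in the SPARQL 1.1 Query Results JSON format
sample = {
    "head": {"vars": ["id", "nom"]},
    "results": {"bindings": [
        {"id": {"type": "literal", "value": "CharlesDeGaulle"},
         "nom": {"type": "literal", "value": "Charles de Gaulle"}},
    ]},
}

print(flatten_bindings(sample))
```

Note that flattening discards datatype and language information, which may matter for dates or multilingual labels.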

🚀 Final tutorial: 🔗 Connecting an LLM to a knowledge graph →
