Editor's note: This post was written by Tomaz Bratanic of the Neo4j team.

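Extracting structured information from unstructured data such as text has been around for a while and is nothing new. However, LLMs have brought a significant shift to the field of information extraction. Where you previously needed a team of machine learning experts to curate datasets and train custom models, you now only need access to an LLM. The barrier to entry has dropped significantly, so a technique that only a few years ago was reserved for domain experts is now accessible even to non-technical people.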

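The image depicts the transformation of unstructured text into structured information. This process, labeled as the information extraction pipeline, results in a graph representation of the information. The nodes represent key entities, while the connecting lines denote the relationships between those entities. Knowledge graphs are useful for multi-hop question answering, real-time analytics, or when you want to combine structured and unstructured data in a single database.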

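Although extracting structured information from text has become easier thanks to LLMs, it is by no means a solved problem. In this blog post, we will use OpenAI functions in combination with LangChain to construct a knowledge graph from a sample Wikipedia page. Along the way, we will discuss best practices as well as some limitations of current LLMs.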

tl;dr: The code is available on GitHub.

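Neo4j Environment Setup

You need to set up Neo4j to follow along with the examples in this blog post. The easiest way is to start a free instance on Neo4j Aura, which offers cloud instances of the Neo4j database. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.

The following code instantiates a LangChain wrapper to connect to a Neo4j database.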

from langchain.graphs import Neo4jGraph

url = "neo4j+s://databases.neo4j.io"
username = "neo4j"
password = ""
graph = Neo4jGraph(
    url=url,
    username=username,
    password=password
)
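Rather than hard-coding the password in a script, the connection settings can be read from the environment. The following is a minimal sketch; the variable names NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD, the helper name, and the example URL are assumptions of this example, not something the post or the library prescribes.

```python
import os


def neo4j_config_from_env(env=os.environ):
    """Collect Neo4j connection settings from environment variables.

    The variable names used here are an assumption of this sketch.
    """
    return {
        "url": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
    }


# A plain dict stands in for the real environment in this example.
config = neo4j_config_from_env(
    {"NEO4J_URI": "neo4j+s://example.databases.neo4j.io", "NEO4J_PASSWORD": "secret"}
)
print(config["url"], config["username"])
```

The resulting dictionary uses the same keyword names as the wrapper above, so it could be unpacked as Neo4jGraph(**config).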

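Information Extraction Pipeline

A typical information extraction pipeline contains the following steps.

In the first step, we run the input text through a coreference resolution model. Coreference resolution is the task of finding all expressions that refer to a specific entity. Simply put, it links all pronouns to the entity they refer to. In the named entity recognition part of the pipeline, we try to extract all the mentioned entities. The above example contains three entities: Tomaz, Blog, and Diagram. The next step is entity disambiguation, an essential but often overlooked part of an information extraction pipeline. Entity disambiguation is the process of accurately identifying and distinguishing between entities with similar names or references to ensure the correct entity is recognized in a given context. In the last step, the model tries to identify the various relationships between entities. For example, it could find the LIKES relationship between the Tomaz and Blog entities.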
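To make the four steps tangible, here is a deliberately naive, rule-based toy version of the pipeline. Everything in it is an assumption for illustration: the input sentence is made up, the pronoun rule and the gazetteer stand in for trained models, and the single regular expression only knows about the LIKES relationship.

```python
import re

# Hypothetical input text, chosen to match the entities named above.
TEXT = "Tomaz likes to write blog posts. He is interested in drawing diagrams."

# A tiny gazetteer stands in for trained NER and entity disambiguation models:
# it maps surface forms to canonical entity ids.
GAZETTEER = {"tomaz": "Tomaz", "blog": "Blog", "diagram": "Diagram", "diagrams": "Diagram"}


def resolve_coreferences(text, pronoun="He", referent="Tomaz"):
    """Step 1 (toy): replace a known pronoun with the entity it refers to."""
    return re.sub(rf"\b{pronoun}\b", referent, text)


def recognize_entities(text):
    """Steps 2 and 3 (toy): find mentions and map them to canonical ids."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text):
        entity = GAZETTEER.get(token.lower())
        if entity and entity not in found:
            found.append(entity)
    return found


def extract_relationships(text):
    """Step 4 (toy): one hand-written pattern for the LIKES relationship."""
    triples = []
    for sentence in text.split("."):
        match = re.search(r"(\w+) likes .*?\b(blog|diagram)", sentence, re.IGNORECASE)
        if match and match.group(1).lower() in GAZETTEER:
            source = GAZETTEER[match.group(1).lower()]
            target = GAZETTEER[match.group(2).lower()]
            triples.append((source, "LIKES", target))
    return triples


resolved = resolve_coreferences(TEXT)
entities = recognize_entities(resolved)
relationships = extract_relationships(resolved)
print(entities, relationships)
```

A real pipeline would replace each of these functions with a model; the point here is only the shape of the data that flows from one step to the next.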

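Extracting Structured Information Using OpenAI Functions

OpenAI functions are a great fit for extracting structured information from natural language. The idea behind OpenAI functions is to have an LLM output a predefined JSON object populated with values. The predefined JSON object can be used as input to other functions in so-called RAG applications, or it can be used to extract predefined structured information from text.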
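As a rough illustration of what "a predefined JSON object" means, the sketch below shows a function definition in the JSON Schema shape used by OpenAI function calling, together with a hand-written stand-in for a model reply. The function name extract_person and its fields are hypothetical, and no API call is made; the reply is hard-coded to show that the arguments arrive as a JSON string that has to be parsed.

```python
import json

# A function definition: a name, a description, and a JSON Schema for the arguments.
extract_person = {
    "name": "extract_person",  # hypothetical function name
    "description": "Extract information about a person mentioned in the text.",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Full name of the person"},
            "birthDate": {"type": "string", "description": "Date of birth, if mentioned"},
        },
        "required": ["name"],
    },
}

# Hand-written stand-in for a model reply (not real API output).
simulated_function_call = {
    "name": "extract_person",
    "arguments": '{"name": "Walt Disney", "birthDate": "1901-12-05"}',
}

# Parse the JSON string and check that all required fields are present.
arguments = json.loads(simulated_function_call["arguments"])
missing = [key for key in extract_person["parameters"]["required"] if key not in arguments]
print(arguments, missing)
```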

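In LangChain, you can pass a Pydantic class as the description of the desired JSON object for the OpenAI functions feature. Therefore, we will start by defining the desired structure of the information we want to extract from text. LangChain already has definitions of nodes and relationships as Pydantic classes that we can reuse.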

class Node(Serializable):
    """Represents a node in a graph with associated properties.

    Attributes:
        id (Union[str, int]): A unique identifier for the node.
        type (str): The type or label of the node, default is "Node".
        properties (dict): Additional properties and metadata associated with the node.
    """

    id: Union[str, int]
    type: str = "Node"
    properties: dict = Field(default_factory=dict)


class Relationship(Serializable):
    """Represents a directed relationship between two nodes in a graph.

    Attributes:
        source (Node): The source node of the relationship.
        target (Node): The target node of the relationship.
        type (str): The type of the relationship.
        properties (dict): Additional properties associated with the relationship.
    """

    source: Node
    target: Node
    type: str
    properties: dict = Field(default_factory=dict)

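Unfortunately, it turns out that OpenAI functions don't currently support a dictionary object as a value. Therefore, we have to overwrite the properties definition to adhere to the limitations of the functions endpoint.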

from langchain.graphs.graph_document import (
    Node as BaseNode,
    Relationship as BaseRelationship
)
from typing import List, Dict, Any, Optional
from langchain.pydantic_v1 import Field, BaseModel

class Property(BaseModel):
    """A single property consisting of key and value"""
    key: str = Field(..., description="key")
    value: str = Field(..., description="value")

class Node(BaseNode):
    properties: Optional[List[Property]] = Field(
        None, description="List of node properties")

class Relationship(BaseRelationship):
    properties: Optional[List[Property]] = Field(
        None, description="List of relationship properties"
    )

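Here, we have overwritten the properties value to be a list of Property classes instead of a dictionary to overcome the limitations of the API. Because you can only pass a single object to the API, we need to combine the nodes and relationships in a single class called KnowledgeGraph.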

class KnowledgeGraph(BaseModel):
    """Generate a knowledge graph with entities and relationships."""
    nodes: List[Node] = Field(
        ..., description="List of nodes in the knowledge graph")
    rels: List[Relationship] = Field(
        ..., description="List of relationships in the knowledge graph"
    )

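The only thing left is to do a bit of prompt engineering, and we are good to go. This is how I usually go about prompt engineering:

Iterate over the prompt and improve results using natural language
If something doesn't work as intended, ask ChatGPT to make it clearer so that the LLM understands the task
Finally, when the prompt has all the instructions needed, ask ChatGPT to summarize the instructions in markdown format to save tokens and perhaps get clearer instructions

I specifically chose the markdown format because I have seen somewhere that OpenAI models respond better to markdown syntax in prompts, and it seems at least plausible from my experience.

Iterating over prompt engineering, I came up with the following system prompt for an information extraction pipeline.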

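from langchain.chains.openai_functions import create_structured_output_chain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)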

def get_extraction_chain(
    allowed_nodes: Optional[List[str]] = None,
    allowed_rels: Optional[List[str]] = None
    ):
    prompt = ChatPromptTemplate.from_messages(
    [(
      "system",
      f"""# Knowledge Graph Instructions for GPT-4
## 1. Overview
You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.
- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.
## 2. Labeling Nodes
- **Consistency**: Ensure you use basic or elementary types for node labels.
  - For example, when you identify an entity representing a person, always label it as **"person"**. Avoid using more specific terms like "mathematician" or "scientist".
- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
{'- **Allowed Node Labels:**' + ", ".join(allowed_nodes) if allowed_nodes else ""}
{'- **Allowed Relationship Types**:' + ", ".join(allowed_rels) if allowed_rels else ""}
## 3. Handling Numerical Data and Dates
- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.
- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.
- **Property Format**: Properties must be in a key-value format.
- **Quotation Marks**: Never use escaped single or double quotes within property values.
- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.
## 4. Coreference Resolution
- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"), 
always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the entity ID.  
Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial. 
## 5. Strict Compliance
Adhere to the rules strictly. Non-compliance will result in termination."""),
        ("human", "Use the given format to extract information from the following input: {input}"),
        ("human", "Tip: Make sure to answer in the correct format"),
    ])
    return create_structured_output_chain(KnowledgeGraph, llm, prompt, verbose=False)

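You can see that we are using the 16k version of the GPT-3.5 model. The main reason is that the OpenAI function output is a structured JSON object, and structured JSON syntax adds a lot of token overhead to the result. Essentially, you are paying for the convenience of structured output with extra token space.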
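The overhead is easy to see on a small example. The sketch below serializes a tiny, hand-written graph in the nested JSON shape described above and compares it with a compact one-line notation for the same facts. It counts characters as a crude proxy for tokens, which is an assumption of this sketch; a tokenizer would give different absolute numbers, and the BORN_IN relationship type is made up for the example.

```python
import json

# A tiny hand-written graph in the nested shape the structured output uses.
graph = {
    "nodes": [
        {"id": "Walt Disney", "type": "Person",
         "properties": [{"key": "birthDate", "value": "1901-12-05"}]},
        {"id": "Chicago", "type": "Location", "properties": []},
    ],
    "rels": [
        {"source": {"id": "Walt Disney", "type": "Person"},
         "target": {"id": "Chicago", "type": "Location"},
         "type": "BORN_IN"},
    ],
}

as_json = json.dumps(graph)
# The same facts in an ad hoc compact notation, for comparison only.
as_text = "Walt Disney (Person, birthDate=1901-12-05) -BORN_IN-> Chicago (Location)"

overhead = len(as_json) / len(as_text)
print(f"JSON: {len(as_json)} chars, compact: {len(as_text)} chars, ratio: {overhead:.1f}x")
```

Note that every relationship repeats its source and target nodes in full, which is a large part of why the structured form grows so quickly.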

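Besides the general instructions, I have also added the option to limit which node or relationship types should be extracted from text. You will see through examples why this might come in handy.

We have the Neo4j connection and the LLM prompt ready, which means we can define the information extraction pipeline as a single function.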

from langchain.graphs.graph_document import GraphDocument
from langchain.schema import Document

def extract_and_store_graph(
    document: Document,
    nodes: Optional[List[str]] = None,
    rels: Optional[List[str]] = None) -> None:
    # Extract graph data using OpenAI functions
    extract_chain = get_extraction_chain(nodes, rels)
    data = extract_chain.run(document.page_content)
    # Construct a graph document (the map_to_base_* helpers are not defined in this post)
    graph_document = GraphDocument(
        nodes=[map_to_base_node(node) for node in data.nodes],
        relationships=[map_to_base_relationship(rel) for rel in data.rels],
        source=document
    )
    # Store information into a graph
    graph.add_graph_documents([graph_document])

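The function takes in a LangChain document as well as optional nodes and relationships parameters, which are used to limit the types of objects we want the LLM to identify and extract. About a month ago, we added the add_graph_documents method to the Neo4j graph object, which we can use here to seamlessly import the graph.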
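The function above relies on map_to_base_node and map_to_base_relationship to turn the list-based properties returned by the LLM back into the dictionary-based LangChain classes. A minimal sketch of what such helpers could look like follows. To keep it self-contained, plain dataclasses stand in for the Pydantic classes; the names ExtractedNode and ExtractedRelationship are hypothetical, and the actual helpers in the accompanying code may do more, such as normalizing names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Property:
    key: str
    value: str


@dataclass
class ExtractedNode:
    """Stand-in for the Node subclass whose properties are a list of Property."""
    id: str
    type: str = "Node"
    properties: Optional[List[Property]] = None


@dataclass
class BaseNode:
    """Stand-in for LangChain's base Node, whose properties are a dict."""
    id: str
    type: str = "Node"
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class ExtractedRelationship:
    source: ExtractedNode
    target: ExtractedNode
    type: str
    properties: Optional[List[Property]] = None


@dataclass
class BaseRelationship:
    source: BaseNode
    target: BaseNode
    type: str
    properties: Dict[str, str] = field(default_factory=dict)


def props_to_dict(props: Optional[List[Property]]) -> Dict[str, str]:
    """Turn the key/value list returned by the LLM back into a dictionary."""
    return {p.key: p.value for p in (props or [])}


def map_to_base_node(node: ExtractedNode) -> BaseNode:
    """Convert an extracted node into the dict-based node representation."""
    return BaseNode(id=node.id, type=node.type, properties=props_to_dict(node.properties))


def map_to_base_relationship(rel: ExtractedRelationship) -> BaseRelationship:
    """Convert an extracted relationship, including both of its end nodes."""
    return BaseRelationship(
        source=map_to_base_node(rel.source),
        target=map_to_base_node(rel.target),
        type=rel.type,
        properties=props_to_dict(rel.properties),
    )


walt = ExtractedNode("Walt Disney", "Person", [Property("birthDate", "1901-12-05")])
print(map_to_base_node(walt))
```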

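Evaluation

We will extract information from the Walt Disney Wikipedia page and construct a knowledge graph to test the pipeline. Here, we will use the Wikipedia loader and text chunking modules provided by LangChain.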

from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import TokenTextSplitter

# Read the wikipedia article
raw_documents = WikipediaLoader(query="Walt Disney").load()
# Define chunking strategy
text_splitter = TokenTextSplitter(chunk_size=2048, chunk_overlap=24)

# Only take the first three raw_documents
documents = text_splitter.split_documents(raw_documents[:3])

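You might have noticed that we use a relatively large chunk_size value. The reason is that we want to provide as much context as possible around a single sentence so that the coreference resolution part works as well as possible. Remember, the coreference step only works if the entity and its reference appear in the same chunk; otherwise, the LLM doesn't have enough information to link the two.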
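The interplay of chunk size and overlap can be sketched with a simple sliding window. This is not LangChain's implementation: the function name is hypothetical, and a list of integers stands in for the tokenizer output that TokenTextSplitter works on.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(10))  # stand-in for real token ids
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)  # → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Only the tokens in the overlap are visible to two neighbouring chunks, so a pronoun near the start of a chunk can still lose its referent if the entity was mentioned further back than the overlap reaches.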

Now we can go ahead and run the documents through the information extraction pipeline.

from tqdm import tqdm

for i, d in tqdm(enumerate(documents), total=len(documents)):
    extract_and_store_graph(d)

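The process takes around 5 minutes, which is relatively slow. Therefore, you would probably want to make parallel API calls in production to deal with this problem and achieve some sort of scalability.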
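One simple way to parallelize I/O-bound API calls is a thread pool from the standard library. The sketch below shows the pattern with a stand-in worker; process_in_parallel and fake_extract are hypothetical names, and in practice the worker would be a function such as extract_and_store_graph. Rate limits and concurrent writes to the database are not handled here and would need attention in a real deployment.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_parallel(items, worker, max_workers=4):
    """Apply worker to every item using a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))


def fake_extract(chunk):
    """Stand-in for an LLM call; it only returns the chunk length."""
    return len(chunk)


results = process_in_parallel(["a", "bb", "ccc"], fake_extract)
print(results)  # → [1, 2, 3]
```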

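Let's first look at the types of nodes and relationships the LLM identified.

Since the graph schema is not provided, the LLM decides on the fly which node labels and relationship types it will use. For example, we can observe that there are both Company and Organization node labels. Those two things are probably semantically similar or identical, so we would want only a single node label representing the two. The problem is more obvious with relationship types. For example, we have CO-FOUNDER and COFOUNDEROF relationships, as well as DEVELOPER and DEVELOPEDBY.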
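If a graph has already been extracted without a schema, one possible cleanup is to map near-duplicate names onto a canonical form after the fact. The following sketch is only an illustration of that idea: the alias tables are hand-written assumptions, and pairs such as DEVELOPER and DEVELOPEDBY are deliberately left alone because they may point in opposite directions.

```python
import re

# Hand-written alias tables (assumptions for this example).
LABEL_ALIASES = {"organization": "Company"}
REL_ALIASES = {"COFOUNDEROF": "COFOUNDER"}


def canonical_label(label):
    """Map a node label onto its canonical form, if an alias is known."""
    return LABEL_ALIASES.get(label.lower(), label)


def canonical_rel_type(rel_type):
    """Strip punctuation, upper-case the type, then apply known aliases."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "", rel_type).upper()
    return REL_ALIASES.get(cleaned, cleaned)


print(canonical_label("Organization"))    # → Company
print(canonical_rel_type("CO-FOUNDER"))   # → COFOUNDER
print(canonical_rel_type("COFOUNDEROF"))  # → COFOUNDER
```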

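For any more serious project, you should define the node labels and relationship types the LLM should extract. Luckily, we have added the option to limit the types in the prompt by passing additional parameters.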

# Specify which node labels should be extracted by the LLM
allowed_nodes = ["Person", "Company", "Location", "Event", "Movie", "Service", "Award"]

for i, d in tqdm(enumerate(documents), total=len(documents)):
    extract_and_store_graph(d, allowed_nodes)

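In this example, I have only limited the node labels, but you can easily limit the relationship types by passing another argument to the extract_and_store_graph function.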
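To see what these arguments do to the prompt, the sketch below reproduces the two conditional lines of the system prompt as a standalone function, with the formatting slightly tidied. The function name render_schema_constraints and the relationship types FOUNDED and BORN_IN are hypothetical.

```python
from typing import List, Optional


def render_schema_constraints(
    allowed_nodes: Optional[List[str]] = None,
    allowed_rels: Optional[List[str]] = None,
) -> str:
    """Build the optional prompt lines that restrict node labels and relationship types."""
    lines = []
    if allowed_nodes:
        lines.append("- **Allowed Node Labels:** " + ", ".join(allowed_nodes))
    if allowed_rels:
        lines.append("- **Allowed Relationship Types:** " + ", ".join(allowed_rels))
    return "\n".join(lines)


print(render_schema_constraints(["Person", "Company"], ["FOUNDED", "BORN_IN"]))
```

When both arguments are omitted, the function returns an empty string and the LLM is left free to invent its own types.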

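The visualization of the extracted subgraph has the following structure.

The graph turned out better than expected (after five iterations :) ). I couldn't capture the whole graph nicely in the visualization, but you can explore it on your own in Neo4j Browser or other tools.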

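Entity Disambiguation

One thing I should mention is that we partly skipped the entity disambiguation part. We used a large chunk size and added a specific instruction for coreference resolution and entity disambiguation to the system prompt. However, since each chunk is processed separately, there is no way to ensure consistency of entities between different text chunks. For example, you could end up with two nodes representing the same person.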

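In this example, Walt Disney and Walter Elias Disney refer to the same real-world person. The entity disambiguation problem is nothing new, and various solutions have been proposed to solve it:

Using entity linking or entity disambiguation NLP models
Doing a second pass through an LLM and asking it to perform entity disambiguation
Graph-based approaches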

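Which solution you should use depends on your domain and use case. However, keep in mind that the entity disambiguation step should not be overlooked, as it can have a significant impact on the accuracy and effectiveness of your RAG applications.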
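As a very rough illustration of the problem, the sketch below flags two person names as candidates for merging when they share a last name and one first name is a prefix of the other. This is a naive heuristic invented for this example, not one of the approaches listed above, and it will produce false positives and negatives on real data.

```python
def likely_same_person(name_a, name_b):
    """Naive check: same last token, and one first token is a prefix of the other."""
    tokens_a = name_a.lower().split()
    tokens_b = name_b.lower().split()
    if not tokens_a or not tokens_b:
        return False
    same_last = tokens_a[-1] == tokens_b[-1]
    first_a, first_b = tokens_a[0], tokens_b[0]
    compatible_first = first_a.startswith(first_b) or first_b.startswith(first_a)
    return same_last and compatible_first


print(likely_same_person("Walt Disney", "Walter Elias Disney"))  # → True
print(likely_same_person("Walt Disney", "Roy Disney"))           # → False
```

Candidate pairs found this way would still need review, or a stronger model, before the corresponding nodes are merged in the graph.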

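RAG Application

The last thing we will do is show you how to browse the information in the knowledge graph by constructing Cypher statements. Cypher is a structured query language for working with graph databases, similar to how SQL is used for relational databases. LangChain has a GraphCypherQAChain that reads the schema of the graph and constructs appropriate Cypher statements based on the user input.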
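For a sense of what such a statement looks like, here is a hand-written, parameterized Cypher query for a birth-date question. It is a sketch under assumptions: it presumes the person is stored as a Person node identified by an id property and that the LLM extracted a birthDate property, neither of which is guaranteed for a given run. The helper name birth_date_query is hypothetical.

```python
def birth_date_query(person_name):
    """Build a parameterized Cypher statement and its parameters."""
    cypher = (
        "MATCH (p:Person {id: $name}) "
        "RETURN p.birthDate AS birthDate"
    )
    return cypher, {"name": person_name}


cypher, params = birth_date_query("Walter Elias Disney")
print(cypher)
print(params)
# With a live database, the statement could be run as: graph.query(cypher, params=params)
```

Passing the name as a parameter rather than formatting it into the string avoids quoting problems with names that contain apostrophes.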

# Query the knowledge graph in a RAG application
from langchain.chains import GraphCypherQAChain

graph.refresh_schema()

cypher_chain = GraphCypherQAChain.from_llm(
    graph=graph,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-4"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    validate_cypher=True, # Validate relationship directions
    verbose=True
)
cypher_chain.run("When was Walter Elias Disney born?")

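The result is as follows.

Summary

Knowledge graphs are a great fit when you need a combination of structured and unstructured data to power your RAG applications. In this blog post, you have learned how to construct a knowledge graph in Neo4j from arbitrary text using OpenAI functions. OpenAI functions provide the convenience of neatly structured outputs, making them an ideal fit for extracting structured information. To have a great experience constructing graphs with LLMs, make sure to define the graph schema in as much detail as possible and to add an entity disambiguation step after the extraction.

If you are eager to learn more about building AI applications with graphs, join us at NODES, the online, 24-hour conference organized by Neo4j on October 26, 2023.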

The code is available on GitHub.
