Skip to content
Go back

Code Review Practice Based on Large Models + Knowledge Bases

Published:  at  16:00

This is meant to spark discussion. The author has limited understanding of large language models (LLMs), so please feel free to point out anything inappropriate!

Background

💡The idea came from a Code Review where I asked Claude which coding style was more elegant. At the time, I wondered: can we let AI help us with Code Review?

1.png

Pain Points

Introduction

In one sentence: this is a Code Review practice based on open-source large language models + a knowledge base, similar to a code review assistant (CR Copilot).

Features

Complies with company security standards: all code data stays within the intranet, and all inference processes are completed within the intranet

Terminology

TermDefinition
CR / Code ReviewMore and more companies require R&D teams to perform Code Review (CR for short) during code development. It helps ensure code quality while promoting communication among team members and improving coding skills.
llm / Large Language ModelLarge Language Models (LLMs) are neural network models trained on large amounts of text data in natural language processing. They can generate high-quality text and understand language, such as GPT, BERT, etc.
AIGCUses NLP, NLG, computer vision, speech technologies, etc. to generate text, images, videos, and other content. The full name is Artificial Intelligence Generated Content; after UGC and PGC, it is a production method that uses AI technology to automatically generate content. The development of underlying AIGC technologies is driving the accelerated emergence of applications around different content types (modalities) and vertical domains.
LLaMAMeta’s (Facebook’s) large multimodal language model.
ChatGLMChatGLM is an open-source conversational language model that supports both Chinese and English, with the GLM language model as its foundation.
BaichuanBaichuan 2 is a next-generation open-source large language model released by Baichuan Intelligence, trained on 2.6 trillion Tokens of high-quality corpus data.
PromptA piece of text or a statement used to guide a machine learning model to generate output of a specific type, topic, or format. In natural language processing, a Prompt usually consists of a question or task description, such as “write me an article about artificial intelligence” or “translate this English sentence into French.” In image recognition, a Prompt can be an image description, tag, or classification information.
langchainLangChain is an open-source Python library developed by Harrison Chase, designed to support developing applications using large language models (LLMs) and external resources (such as data sources or language processing systems). It provides standard interfaces, integrates with other tools, and offers end-to-end chains for common applications.
embeddingMaps arbitrary text into a fixed-dimensional vector space. Texts with similar semantics have vectors located closer together in that space. In LLM applications, it is commonly used for similarity-based text search.
Vector storesDatabases that store vector representations, used for similarity search. Examples include Milvus, Pinecone, etc.
Similarity SearchSearches for vectors closest to a query vector in a vector database, used to retrieve similar items.
Knowledge BaseA database that stores structured knowledge; LLMs can use this knowledge to enhance their understanding.
In-context LearningIn-Context Learning is a concept in machine learning. It refers to the ability to solve new problems without adjusting the model’s own parameters, by including information related to the specific problem in the Prompt context.
Finetune / Fine-tuningFine-tunes a pretrained model on a specific dataset to improve the model’s performance on a given task.

Implementation Approach

Flowchart

image.png

System Architecture

To complete a CR process, the following technical modules are needed:

4.png

LLMs / Open-source Large Language Model Selection

The core of CR Copilot lies in the large language model foundation. The quality of CR generated based on different model foundations also varies. For the CR scenario, the model we choose needs to meet the following conditions:

image

FlagEval large model evaluation ranking in August(https://flageval.baai.ac.cn/#/trending)

The -{n}b after a model name means n*10 hundred million parameters. For example, 13b means 13 billion parameters. In my personal trial, parameter count does not determine how good the results are; it should be judged based on actual circumstances.

Initially, among many large language models, I selected “Llama2-Chinese-13b-Chat”, “chatglm2-6b”, and “Baichuan2-13B-Chat”. After racing the models for a period of time 🐎, I subjectively felt that Llama2 is more suitable for CR scenarios, while ChatGLM2 is more like a liberal arts student: it does not offer many constructive suggestions for code review, but it has more advantages in Chinese AIGC!

image.png

Logs from the execution process of the two models

Due to compliance issues around large language models, CR Copilot uses ChatGLM2-6B by default. If you need to use the Llama2 model, you need to apply to Meta, and use it after approval.

image

Llama 2 requires that an enterprise have no more than 700 million monthly active users

Currently supported model options are listed below, with scores for reference only:

8.png

Knowledge Base Design

Why do we need a knowledge base?

The large model foundation only contains public data from the internet, and does not understand internal company framework knowledge and usage documentation.

For example 🌰: suppose there is an internal framework called Lynx. We want the large language model to learn from internal documentation: “What is Lynx?” and “How do you write Lynx?

image.png

A picture is worth a thousand words

The “enhanced mode” here uses a vector database, generates a Prompt from the matched knowledge base snippets and the question “What is Lynx?”, and sends it to the LLM for execution.

How do we find highly relevant knowledge?

Once we have a knowledge base, how do we find the “most relevant content” in the “knowledge base” for the “search question/code” we provide?

The answer is through three processes:

  1. Text Embeddings
  2. Vector Stores
  3. Similarity Search

image

Text similarity matching flowchart, image source: Langchain-Chatchat

Text Embeddings

Unlike fuzzy search/keyword matching in traditional databases, we need semantic/feature matching.

For example: if you search for “cat”, you can only get results matching the keyword “cat”. You cannot get results such as “Ragdoll” or “blue-white”. A traditional database treats “Ragdoll” as “Ragdoll” and “cat” as “cat”. To implement associative semantic search, features need to be manually tagged. This process is also known as Feature Engineering.

image

How can we automatically extract these features from text? This is achieved through Vector Embedding. Currently, the community commonly uses OpenAI’s text-embedding-ada-002 model to generate embeddings, which raises two issues:

  1. Data security issue: OpenAI’s API needs to be called to perform vectorization image.png
  2. Cost: roughly 3,000 pages per dollar

image.png

We use the domestic text similarity calculation model bge-large-zh and deploy it privately on the company intranet. A single embedding vectorization basically takes milliseconds.

image.png

Vector Stores

We perform Vector Embeddings on official documentation in advance, and then store them in a vector database. The vector database we chose here is Qdrant, mainly because it is written in Rust, so storage and queries may be faster! Here is a quoted comparison of several dimensions for selecting a vector database:

Vector DatabaseURLGitHub** **StarLanguageCloud
chromahttps://github.com/chroma-core/chroma8.5KPython
milvushttps://github.com/milvus-io/milvus22.8KGo/Python/C++
pineconehttps://www.pinecone.io/
qdranthttps://github.com/qdrant/qdrant12.7KRust
typesensehttps://github.com/typesense/typesense14.4KC++
weaviatehttps://github.com/weaviate/weaviate7.4KGo

Data as of September 10, 2023

The principle is to determine similarity by comparing the distance between vectors

image

So once we have the “vector of the query question” and the “knowledge base vectors stored in the database”, we can directly use the Similarity Search method provided by the vector database to match relevant content.

image.png

Loading the Knowledge Base

The CR Copilot knowledge base is divided into an “built-in official documentation knowledge base” and a “custom knowledge base”. For the query input, we first take the first half of the complete code + an LLM-generated summary, then perform similarity matching against the knowledge base for context. The matching process is as follows:

image.png

The reason we take the first half of the complete code as query input is that most languages declare modules and packages in the first half. This improves the similarity matching rate of the knowledge base.

Official Documentation Knowledge Base (Built-in)

To avoid everyone repeatedly importing and embedding official documentation, CR Copilot has built-in official documentation, including:

ContentData Source
React official documentationhttps://react.dev/learn
TypeScript official documentationhttps://www.typescriptlang.org/docs/
Rspack official documentationhttps://www.rspack.dev/zh/guide/introduction.html
Garfishhttps://github.com/web-infra-dev/garfish
Internal company programming standards for Go / Python / Rust, etc.

And the built-in knowledge base is managed through a simple CRUD

image.png

Custom Knowledge Base - Feishu Documents (Custom)

Feishu documents have no formatting requirements; as long as one can understand what correct code looks like, that’s enough

Here we directly use the LarkSuite document loader provided by LangChain to retrieve Feishu documents for which we have permissions. We use CharacterTextSplitter / RecursiveCharacterTextSplitter to split text into fixed-length chunks. The method has two main parameters:

image.png

Prompt Instruction Design

Because large language models have enough data, if we want them to execute according to requirements, we need to use a “Prompt”.

image.png

(Image source: Stephen Wolfram)

Code Summary Instruction

Have the LLM analyze the current code’s knowledge points from the file code, for subsequent similarity matching against the knowledge base:

prefix = "user: " if model == "chatglm2" else "<s>Human: "
suffix = "assistant(用中文): let's think step by step." if model == "chatglm2" else "\n</s><s>Assistant(用中文): let's think step by step."

return f"""{prefix}根据这段 {language} 代码,列出关于这段 {language} 代码用到的工具库、模块包。
{language} 代码:
```{language}
{source_code}
```diff
请注意:
- 知识列表中的每一项都不要有类似或者重复的内容
- 列出的内容要和代码密切相关
- 最少列出 3 个, 最多不要超过 6 个
- 知识列表中的每一项要具体
- 列出列表,不要对工具库、模块做解释
- 输出中文
{suffix}"""

Where:

CR Instruction

If the model used (such as LLaMA 2) has relatively poor support for Chinese Prompts, the Prompt needs to be designed in the form of “English input” and “Chinese output”, namely:

# llama2
f"""Human: please briefly review the {language}code changes by learning the provided context to do a brief code review feedback and suggestions. if any bug risk and improvement suggestion are welcome(no more than six)
<context>
{context}
</context>

<code_changes>
{diff_code}
</code_changes>\n</s><s>Assistant: """

# chatglm2
f"""user: 【指令】请根据所提供的上下文信息来简要审查{language} 变更代码,进行简短的代码审查和建议,变更代码有任何 bug 缺陷和改进建议请指出(不超过 6 条)。
【已知信息】:{context}

【变更代码】:{diff_code}

assistant: """

Where:

Commenting on Changed Code Lines

To calculate the changed code lines, I wrote a function that parses the diff and outputs the changed line numbers:

import re


def parse_diff(input):
    if not input:
        return []
    if not isinstance(input, str) or re.match(r"^\s+$", input):
        return []

    lines = input.split("\n")
    if not lines:
        return []

    result = []
    current_file = None
    current_chunk = None
    deleted_line_counter = 0
    added_line_counter = 0
    current_file_changes = None

    def normal(line):
        nonlocal deleted_line_counter, added_line_counter
        current_chunk["changes"].append({
            "type": "normal",
            "normal": True,
            "ln1": deleted_line_counter,
            "ln2": added_line_counter,
            "content": line
        })
        deleted_line_counter += 1
        added_line_counter += 1
        current_file_changes["old_lines"] -= 1
        current_file_changes["new_lines"] -= 1

    def start(line):
        nonlocal current_file, result
        current_file = {
            "chunks": [],
            "deletions": 0,
            "additions": 0
        }
        result.append(current_file)

    def to_num_of_lines(number):
        return int(number) if number else 1

    def chunk(line, match):
        nonlocal current_file, current_chunk, deleted_line_counter, added_line_counter, current_file_changes
        if not current_file:
            start(line)
        old_start, old_num_lines, new_start, new_num_lines = match.group(1), match.group(2), match.group(
            3), match.group(4)

        deleted_line_counter = int(old_start)
        added_line_counter = int(new_start)
        current_chunk = {
            "content": line,
            "changes": [],
            "old_start": int(old_start),
            "old_lines": to_num_of_lines(old_num_lines),
            "new_start": int(new_start),
            "new_lines": to_num_of_lines(new_num_lines),
        }
        current_file_changes = {
            "old_lines": to_num_of_lines(old_num_lines),
            "new_lines": to_num_of_lines(new_num_lines),
        }
        current_file["chunks"].append(current_chunk)

    def delete(line):
        nonlocal deleted_line_counter
        if not current_chunk:
            return

        current_chunk["changes"].append({
            "type": "del",
            "del": True,
            "ln": deleted_line_counter,
            "content": line
        })
        deleted_line_counter += 1
        current_file["deletions"] += 1
        current_file_changes["old_lines"] -= 1

    def add(line):
        nonlocal added_line_counter
        if not current_chunk:
            return
        current_chunk["changes"].append({
            "type": "add",
            "add": True,
            "ln": added_line_counter,
            "content": line
        })
        added_line_counter += 1
        current_file["additions"] += 1
        current_file_changes["new_lines"] -= 1

    def eof(line):
        if not current_chunk:
            return
        most_recent_change = current_chunk["changes"][-1]
        current_chunk["changes"].append({
            "type": most_recent_change["type"],
            most_recent_change["type"]: True,
            "ln1": most_recent_change["ln1"],
            "ln2": most_recent_change["ln2"],
            "ln": most_recent_change["ln"],
            "content": line
        })

    header_patterns = [
        (re.compile(r"^@@\s+-(\d+),?(\d+)?\s++(\d+),?(\d+)?\s@@"), chunk)
    ]

    content_patterns = [
        (re.compile(r"^\ No newline at end of file$"), eof),
        (re.compile(r"^-"), delete),
        (re.compile(r"^+"), add),
        (re.compile(r"^\s+"), normal)
    ]

    def parse_content_line(line):
        nonlocal current_file_changes
        for pattern, handler in content_patterns:
            match = re.search(pattern, line)
            if match:
                handler(line)
                break
        if current_file_changes["old_lines"] == 0 and current_file_changes["new_lines"] == 0:
            current_file_changes = None

    def parse_header_line(line):
        for pattern, handler in header_patterns:
            match = re.search(pattern, line)
            if match:
                handler(line, match)
                break

    def parse_line(line):
        if current_file_changes:
            parse_content_line(line)
        else:
            parse_header_line(line)

    for line in lines:
        parse_line(line)

    return result

image.png

The comments here are made by a bot account calling the Gitlab API, and will be Resolved by default. This avoids having too many CR Copilot comments that each require manually clicking Resolved

Some Thoughts


Share this post on:

Previous Post
Possibly the Best Tab Management Extension for Chrome
Next Post
Using ChatGPT to Customize a Team-Specific ESLint Ruleset