When it comes to LLM evaluations, there are 4 pieces-
a dataset for evaluation, such as question-answer pairs (1); evaluation functions (2), which may not only check against ground truth but also compare multiple results against each other; the LLM task itself (3); and an interpretation of the evaluation results (4) so they are easy to understand. At a more granular level, these map onto traces, runs, datasets and evaluators, described below.
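To make the four pieces concrete, here is a minimal, framework-agnostic sketch; the dataset, the answer_question task, and the exact_match metric are illustrative stand-ins, not taken from any particular library.

```python
# (1) dataset: question / ground-truth answer pairs (illustrative)
dataset = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]

# (3) LLM task: the thing being evaluated (stubbed out here)
def answer_question(question: str) -> str:
    # in practice this would call your model or chain
    return "Paris" if "France" in question else "unknown"

# (2) evaluation function: compare a prediction with the ground truth
def exact_match(prediction: str, reference: str) -> int:
    return int(prediction.strip().lower() == reference.strip().lower())

# (4) interpretation: aggregate per-example scores into something readable
scores = [exact_match(answer_question(ex["question"]), ex["answer"]) for ex in dataset]
print(f"accuracy: {sum(scores) / len(scores):.2f}")  # 0.50 with the stub task above
```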
What are traces, runs and units of work?
All the calls to the LLM, from input to output, are captured as traces. This is similar to a logger in software engineering, where various pieces of information are saved. In the diagram above, a project is a collection of traces, and a trace may have multiple runs, which represent all the calls it makes, including calls to external tools. In the case of RAG, for example, a trace would have 2 runs: one from the question to the documents (retrieval) and one producing the answer (generation). A run can be considered a single unit of work (a single call), as sketched below.
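A minimal sketch of this nesting using LangSmith's @traceable decorator; the retrieval and generation functions here are stand-ins, and logging the trace assumes tracing is enabled with a LANGSMITH_API_KEY in the environment.

```python
from langsmith import traceable

@traceable(run_type="retriever")
def retrieve_docs(question: str) -> list[str]:
    # stand-in for a real vector-store lookup
    return ["Paris is the capital of France."]

@traceable(run_type="llm")
def generate_answer(question: str, docs: list[str]) -> str:
    # stand-in for a real LLM call
    return f"Answer based on {len(docs)} retrieved document(s): Paris."

@traceable  # parent run: one call to the RAG pipeline becomes one trace
def rag_pipeline(question: str) -> str:
    docs = retrieve_docs(question)          # child run 1: retrieval
    return generate_answer(question, docs)  # child run 2: generation

rag_pipeline("What is the capital of France?")
```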
What are datasets- the data to be evaluated, i.e. input and output values.
In the image above, you can see that the ground truth (the expected output) is compared with the output from the LLM (obtained through a run), and the comparison is expressed as a metric.
Datasets can be stored with multiple versions, and various experiments can be run on these datasets.
Datasets can also be created from traces, by capturing inputs and outputs and keeping them for further training or evaluation of the model. Once the questions have been collected this way, the right answers can be added manually so the data can be reused. In LangSmith, datasets are stored at the project level, so they can be used in multiple experiments, as sketched below.
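A sketch of storing such a dataset in LangSmith; it assumes a LANGSMITH_API_KEY is configured, and the dataset name and fields are illustrative.

```python
from langsmith import Client

client = Client()

# create a named, versioned dataset
dataset = client.create_dataset(
    dataset_name="qa-eval-demo",
    description="Question/answer pairs with manually written ground truth",
)

# add input/output examples (these could equally be captured from traces)
client.create_examples(
    inputs=[{"question": "What is the capital of France?"}],
    outputs=[{"answer": "Paris"}],
    dataset_id=dataset.id,
)
```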
What are evaluations- https://docs.smith.langchain.com/old/evaluation/faq/evaluator-implementations
An already stored dataset can be used for different evaluations. As shown in the diagram above, there are various types of evaluators-
Labelled- used when the expected output value is provided.
Criteria- default criteria are implemented for conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality.
JSON evaluator- provides functionality to check your model's output consistency.
Embedding- measures the similarity between a predicted string and a reference.
Regex- evaluates predictions against a reference regular expression pattern.
User defined criteria- a custom metric, for example 1 if an answer is present and 0 otherwise, to check at which points the LLM is not giving an output. Other options include string matching or using another LLM inside the custom metric (see the sketch after this list).
A/B testing- the same evaluators can be used to compare results from multiple models or multiple parameter settings.
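As a sketch of the user-defined criteria idea above (1 if an answer is present, else 0), a custom evaluator function can be passed to LangSmith's evaluate helper; the my_llm_task function and the "qa-eval-demo" dataset name are assumptions carried over from the earlier examples.

```python
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

def answer_present(run: Run, example: Example) -> dict:
    # custom metric: 1 if the run produced a non-empty answer, else 0
    answer = (run.outputs or {}).get("answer", "")
    return {"key": "answer_present", "score": int(bool(answer.strip()))}

def my_llm_task(inputs: dict) -> dict:
    # stand-in for the chain/model being evaluated
    return {"answer": "Paris"}

evaluate(
    my_llm_task,
    data="qa-eval-demo",               # dataset stored earlier in LangSmith
    evaluators=[answer_present],
    experiment_prefix="answer-present-check",
)
```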
Prompt Playground/experiments- to see outputs for various prompts, across different models and evaluators-
In the example above, we can see that the online evaluator provides a correctness metric for every experiment, so one can look further only at the incorrect ones.
Adding Unit Tests - unit tests are assertions run as part of CI. In LangSmith, unit tests can be written in a separate file, just as with pytest; the only requirement is to add the 'unit' decorator from langsmith so that the test is logged and linked back to the main LLM code. For example, in a code-generation use case the output must follow a pydantic data structure. [Pydantic allows you to define data structures using Python classes, ensuring that the data conforms to the specified types and constraints.] Here a test case can confirm that the output follows this structure, as sketched below.
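A sketch of such a test; the generate_code function and the GeneratedFunction schema are assumptions, and @unit is the LangSmith decorator mentioned above, so the result is logged to LangSmith when run under pytest with an API key configured.

```python
from langsmith import unit
from pydantic import BaseModel, ValidationError

class GeneratedFunction(BaseModel):
    # assumed schema the generated code must conform to
    name: str
    arguments: list[str]
    body: str

def generate_code(prompt: str) -> dict:
    # stand-in for the real code-generation chain
    return {"name": "add", "arguments": ["a", "b"], "body": "return a + b"}

@unit
def test_output_follows_schema():
    output = generate_code("write a function that adds two numbers")
    try:
        GeneratedFunction(**output)  # raises ValidationError if the structure is wrong
    except ValidationError as exc:
        raise AssertionError(f"output does not follow the expected structure: {exc}")
```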
To be continued...