ChatGPT

ChatGPT does not need an introduction. More information can be found on the official website. Or on the promptingguide site

Below is information about OpenAI models in the context of programming.

Benchmark

The HumanEval benchmark was created by OpenAI for measuring functional correctness for synthesizing programs from docstrings. It consists of 164 Python programming problems. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. Programming tasks in the HumanEval dataset assess language comprehension, reasoning, algorithms, and simple mathematics.

Leaderboard

While various prompting techniques allow models to approach 95% accuracy, let's consider the performance of models in zero-shot prompting.

GPT-4(zero-shot)	67.0
GPT-3.5(zero-shot)	48.1

GPT-4 Technical Report

Temperature and code generation

The article discusses temperature coefficients in LLMs. The temperature coefficient (T) impacts randomness in code generation; a higher T enables more exploration, whereas a lower T reduces randomness noise. The study shows that raising T can improve code generation, but may cause more errors. This leads to the introduction of Adaptive Temperature (AdapT) sampling, which adjusts T based on token type. Experiments demonstrate AdapT outperforms standard techniques across several metrics. Another article asserts that setting T to 0, contrary to popular belief, does not ensure determinism in code generation, but only reduces it.

We recommend using the default temperature value of 1 until further research is conducted.

Context window

The context window is an invaluable resource. GPT-4 models offer a substantial context window of up to 128,000 tokens for the gpt-4-1106-preview. However, this may be misleading as the graph demonstrates the increase in lost facts as context grows.

Evaluating the size of your context is crucial, and when feasible, employing gpt-4-1106-preview, which deteriorates at a slower pace, is advisable. When handling a large context, the usage of RAG technology should be considered.

System tokens(TODO)

I found it challenging to understand what the author meant by just one tweet.

I presume that "<⏐im_start⏐>" and "<⏐im_end⏐>" are system tokens indicating the beginning and end of messages. The author seems to suggest that using these tokens in API prompts enhances accuracy, likely because the model was also trained on user inputs via chat. However, roles and messages can already be specified in prompts.

OpenAI Function Calling Tutorial(TODO)

Learn how OpenAI's new Function Calling capability enables GPT models to generate structured JSON output, resolving common dev issues caused by irregular outputs.

https://www.datacamp.com/tutorial/open-ai-function-calling-tutorial