A basic string struct consists of three members:
struct string {
char* mPtr; // dynamically allocated memory
size_t mSize; // the length of the string
size_t mCapacity; // the size of allocated memory
};
Allocating memory for small strings (e.g., an empty string that holds only the null character \0) is wasteful. To avoid this waste, most string implementations apply Small String Optimization (SSO), which stores small strings directly inside the string object itself (on the stack, for a local variable) rather than in dynamically allocated heap memory. I find this trick rather interesting, as it showcases how different C++ compilers and their standard libraries implement the same concept in different ways.
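To see SSO at work in whichever standard library you are using, you can check whether a string's characters live inside the string object. The sketch below is only an illustration: the helper names `stored_inline` and `inline_capacity` are made up for this post, and the approach assumes that an empty string's capacity equals the size of the inline buffer, which holds on the mainstream implementations as far as I know.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns true when the characters of `s` live inside
// the std::string object itself, i.e. the string currently uses its inline
// (SSO) buffer instead of heap memory.
inline bool stored_inline(const std::string& s) {
    const auto object_begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto object_end   = object_begin + sizeof(std::string);
    const auto data_address = reinterpret_cast<std::uintptr_t>(s.data());
    return data_address >= object_begin && data_address < object_end;
}

// Hypothetical helper: capacity of an empty string, i.e. how many characters
// fit before the library has to allocate (assumption: the library uses SSO).
inline std::size_t inline_capacity() {
    return std::string{}.capacity();
}
```

Calling `stored_inline` on a two-character string and on a 100-character string should give different answers, and `inline_capacity()` tells you where the boundary lies for your toolchain.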
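There are four main implementations of SSO: GCC, MSVC, Clang, and Facebook. The buffer sizes in the table are those of the simplified sketches in this post; the shipped libraries may use different sizes.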
Implementation | Buffer Size | Key Design Choice | Trade-offs |
---|---|---|---|
GCC | 8 bytes | Consistent pointer approach | Simple but smaller buffer |
MSVC | 8 bytes | Union of pointer/buffer | Extra checks needed |
Clang | 16 bytes | Single bit flag | Larger buffer, though 7 bytes are wasted on padding |
Facebook | 23 bytes | Last byte stores metadata | Largest buffer, complex byte-bit metadata |
Let’s take a look at them one by one.
libstdc++ from GCC
The `GCC_String` class is a minimal implementation of a string structure with Small String Optimization (SSO), following GCC. Note that the actual GCC implementation takes 32 bytes: its inline buffer holds up to 15 characters plus the null terminator, which makes it larger than the 8-byte `mCapacity` it shares the union with.
- Data Members:
  - `char* mPtr`: Points to the memory holding the string data.
  - `size_t mSize`: Stores the size (length) of the string.
  - `union { size_t mCapacity; char mBuf[8]; }`:
    - For small strings, `mBuf` acts as an internal buffer to store the string directly (SSO).
    - For larger strings, `mCapacity` holds the capacity of the dynamically allocated memory.
- SSO Logic:
  - If the string is small (at most 7 characters, which leaves one byte for the null terminator), the string data is stored in `mBuf`.
  - For longer strings, memory is dynamically allocated on the heap, and `mPtr` points to it.
  - GCC uses a consistent pointer: `mPtr` always points either to the internal buffer or to the heap memory.
#include <algorithm> // std::copy_n
#include <cstddef>
#include <iostream>
#include <string>
#include <cstdio>
struct GCC_String {
char* mPtr;
size_t mSize{};
union {
size_t mCapacity;
char mBuf[8];
};
static_assert((sizeof(mBuf) + sizeof(mPtr) + sizeof(mSize)) == 24);
// The mPtr{mBuf} sets the mPtr to point to the internal buffer mBuf,
// indicating that this string instance will use SSO.
// mBuf{} zero-initializes mBuf, ensuring all characters in the buffer
// are set to '\0' by default.
constexpr GCC_String(): mPtr{mBuf}, mBuf{} {}
constexpr GCC_String(const char* data)
:GCC_String(data, data? std::char_traits<char>::length(data): 0){}
constexpr GCC_String(const char* data, size_t len)
: mPtr{fits_into_small_string(len)? mBuf: new char[len + 1]}, // +1 for '\0'
mSize{len},
mBuf{} {
if(!is_small_string()) {
mCapacity = len;
}
std::copy_n(data, len, mPtr);
mPtr[len] = '\0'; // heap memory is not zeroed, so terminate explicitly
}
constexpr const char* data() const {
return mPtr;
}
constexpr bool fits_into_small_string(size_t len) {
return len <= small_string_capacity();
}
constexpr bool is_small_string() const {
return mBuf == mPtr;
}
constexpr size_t small_string_capacity() const {
return sizeof(mBuf) - 1; // -1 for '\0'
}
constexpr size_t size() const {
return mSize;
}
constexpr size_t capacity() const {
return is_small_string()? small_string_capacity(): mCapacity;
}
};
int main () {
std::cout << "gcc_string struct size: " << sizeof(GCC_String) << std::endl;
GCC_String small("short");
GCC_String large("this is a long string");
GCC_String empty(nullptr);
std::cout << "small: struct size: " << sizeof(small)
<< " capacity: " << small.capacity()
<< " size: " << small.size()
<< " -- ";
printf("%s\n", small.data());
std::cout << "large: struct size: " << sizeof(large)
<< " capacity: " << large.capacity()
<< " size: " << large.size()
<< " -- ";
printf("%s\n", large.data());
// gcc_string struct size: 24
// small: struct size: 24 capacity: 7 size: 5 -- short
// large: struct size: 24 capacity: 21 size: 21 -- this is a long string
return 0;
}
MS STL from MSVC
The MSVC implementation is similar to GCC’s, but it uses the union differently:
- Data Members:
  - A union of either:
    - A pointer to dynamically allocated memory (`char* mPtr`).
    - A static buffer (`char mBuf[8]`).
  - `size_t mSize`: Current size.
  - `size_t mCapacity`: Capacity indicator.
- SSO Logic:
  - If the string is small (at most 7 characters, which leaves one byte for the null terminator), the string data is stored in `mBuf`.
  - For longer strings, memory is dynamically allocated on the heap, and `mPtr` points to it.
  - MSVC uses a union to switch between the buffer and the pointer, so every access has to check which one is in use.
#include <algorithm> // std::copy_n
#include <cstddef>
#include <cstdio> // printf
#include <iostream>
#include <string>
struct MS_String {
union {
char* mPtr;
char mBuf[8];
};
size_t mSize{};
size_t mCapacity;
static_assert((sizeof(mBuf) + sizeof(mPtr) + sizeof(mSize)) == 24);
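// mCapacity must be set here as well; otherwise is_small_string() would
// read an uninitialized value for a default-constructed string.
constexpr MS_String(): mBuf{}, mCapacity{small_string_capacity()} {}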
constexpr MS_String(const char* data)
:MS_String(data, data? std::char_traits<char>::length(data): 0){}
constexpr MS_String(const char* data, size_t len)
: mBuf{},
mSize{len},
mCapacity{fits_into_small_string(len)? small_string_capacity(): len} {
if(!is_small_string()) {
mPtr = new char[len + 1]; // +1 for '\0'
std::copy_n(data, len, mPtr);
mPtr[len] = '\0'; // heap memory is not zeroed, so terminate explicitly
}
else {
std::copy_n(data, len, mBuf);
}
}
constexpr const char* data() const {
return is_small_string()? mBuf: mPtr;
}
constexpr bool fits_into_small_string(size_t len) {
return len <= small_string_capacity();
}
constexpr bool is_small_string() const {
return mCapacity <= small_string_capacity();
}
constexpr size_t small_string_capacity() const {
return sizeof(mBuf) - 1; // -1 for '\0'
}
constexpr size_t size() const {
return mSize;
}
constexpr size_t capacity() const {
return mCapacity;
}
};
int main () {
std::cout << "MS_string struct size: " << sizeof(MS_String) << std::endl;
MS_String small("short");
MS_String large("this is a long string");
MS_String empty(nullptr);
std::cout << "small: struct size: " << sizeof(small)
<< " capacity: " << small.capacity()
<< " size: " << small.size()
<< " -- ";
printf("%s\n", small.data());
std::cout << "large: struct size: " << sizeof(large)
<< " capacity: " << large.capacity()
<< " size: " << large.size()
<< " -- ";
printf("%s\n", large.data());
// MS_string struct size: 24
// small: struct size: 24 capacity: 7 size: 5 -- short
// large: struct size: 24 capacity: 21 size: 21 -- this is a long string
return 0;
}
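libc++ from Clang
The Clang implementation uses a single bit to switch between large and small strings. With a more compact layout, this sketch provides a larger buffer (16 bytes) than the GCC and MSVC sketches (8 bytes). The shipped libc++ goes further and drops the padding, so its inline buffer is larger still.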
- Data Field: Union of two structs:
  - `Large_String` struct (24 bytes total):
    - A 1-bit flag indicating if it’s a large string
    - 63 bits for capacity
    - Size field (8 bytes)
    - Pointer to data (8 bytes)
  - `Small_String` struct (24 bytes total):
    - A 1-bit flag indicating if it’s a large string
    - 7 bits for size
    - 7 bytes padding
    - 16 bytes inline buffer for string data

The SSO logic:
- Uses just 1 bit to distinguish between small/large strings
- Provides a 16-byte buffer for small strings (15 characters plus the null terminator)
- For small strings, stores data directly in the `mBuf` buffer
- For large strings, allocates memory and stores the pointer in `mData`
- Both structs are exactly 24 bytes, allowing them to be stored in a union
#include <algorithm> // std::copy_n
#include <cstddef>
#include <cstdint> // uint8_t
#include <cstdio> // printf
#include <iostream>
#include <string>
struct Clang_String {
struct Large_String {
size_t is_large : 1;
size_t mCapacity : 63;
size_t mSize;
char* mData;
};
struct Small_String {
uint8_t is_large : 1;
uint8_t mSize : 7;
uint8_t mPaddingBytes[7]; // sizeof(size_t) - sizeof(uint8_t), ensure alignment
char mBuf[16]; // sizeof(Large_String) - sizeof(size_t)
};
union {
Large_String large;
Small_String small;
} packed;
static_assert((sizeof(Large_String) == 24 && sizeof(Large_String) == sizeof(Small_String)));
constexpr Clang_String(): packed{} {}
constexpr Clang_String(const char* data)
:Clang_String(data, data? std::char_traits<char>::length(data): 0){}
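constexpr Clang_String(const char* data, size_t len)
: packed{} // start from an all-zero object so the flag bit is never left unset
{
if(fits_into_small_string(len)) {
packed.small.is_large = false;
packed.small.mSize = len;
std::copy_n(data, len, packed.small.mBuf);
packed.small.mBuf[len] = '\0';
} else {
packed.large.is_large = true;
packed.large.mSize = len;
packed.large.mCapacity = len;
packed.large.mData = new char[len + 1]; // +1 for '\0'
std::copy_n(data, len, packed.large.mData);
packed.large.mData[len] = '\0'; // heap memory is not zeroed
}
}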
constexpr const char* data() const {
return is_small_string()? packed.small.mBuf: packed.large.mData;
}
constexpr bool fits_into_small_string(size_t len) {
return len <= small_string_capacity();
}
constexpr bool is_small_string() const {
return !packed.large.is_large;
}
constexpr size_t small_string_capacity() const {
return sizeof(packed.small.mBuf) - 1;
}
constexpr size_t size() const {
return is_small_string()? packed.small.mSize: packed.large.mSize;
}
constexpr size_t capacity() const {
return is_small_string()? small_string_capacity(): packed.large.mCapacity;
}
};
int main () {
auto str = Clang_String{};
std::cout << "Clang_String struct size: " << sizeof(Clang_String) << std::endl;
Clang_String small("short");
Clang_String large("this is a long string");
Clang_String empty(nullptr);
std::cout << "small: struct size: " << sizeof(small)
<< ", capacity: " << small.capacity()
<< ", size: " << small.size()
<< " -- ";
printf("%s\n", small.data());
std::cout << "large: struct size: " << sizeof(large)
<< ", capacity: " << large.capacity()
<< ", size: " << large.size()
<< " -- ";
printf("%s\n", large.data());
// Clang_String struct size: 24
// small: struct size: 24, capacity: 15, size: 5 -- short
// large: struct size: 24, capacity: 21, size: 21 -- this is a long string
return 0;
}
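The whole scheme relies on the flag bit landing in the same place in both layouts, even though one declares it as a `size_t` bit-field and the other as a `uint8_t` bit-field. Bit-field layout is implementation-defined, so this is an assumption about the platform rather than a guarantee. The small check below is a sketch with hypothetical names (`LargeHeader`, `SmallHeader`, `flag_visible_through_small_view`); I expect it to hold on common little-endian ABIs, and it may not hold on big-endian targets.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// First word of the large layout (hypothetical stand-in for Large_String).
struct LargeHeader {
    std::size_t is_large : 1;
    std::size_t capacity : 63;
};

// First byte of the small layout (hypothetical stand-in for Small_String).
struct SmallHeader {
    std::uint8_t is_large : 1;
    std::uint8_t size : 7;
};

// Writes the flag through the large view, copies the first byte into the
// small view with memcpy, and reports what the small view sees.
inline bool flag_visible_through_small_view(bool flag) {
    LargeHeader large{};
    large.is_large = flag;
    large.capacity = 21;
    SmallHeader small{};
    std::memcpy(&small, &large, sizeof small);
    return small.is_large;
}
```

If the function returns its argument unchanged for both `true` and `false`, the two layouts agree on where the flag bit lives on your platform.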
“FB_String” from Facebook
Facebook’s implementation provides an even larger buffer for small strings (23 bytes) than Clang (16 bytes). It is not tied to a specific compiler; it is an STL-compatible library.
- Data Field: Union of two structs:
  - `Large` struct (24 bytes total):
    - Pointer to data (8 bytes)
    - Size field (8 bytes)
    - Capacity field (8 bytes, of which only 7 are usable because the last byte doubles as the mode byte)
  - `Small` struct (24 bytes total):
    - 23 bytes inline buffer for string data
    - 1 byte for mode/size info

The SSO logic:
- Uses the last byte of the buffer to store mode and size information
- For small strings: last byte = (23 - length), so a full 23-character string ends with 0, which doubles as the null terminator
- For large strings: last byte = 0x40 (64)
- If the last byte has the 0x40 bit set, it’s a large string (a small string’s last byte is at most 23)
- Small string length = 23 - last byte
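The last-byte arithmetic above can be checked in isolation before reading the full struct. This is just a worked example of the rules in the list; the constant and function names (`kSmallCapacity`, `kLargeFlag`, `mode_byte_for_small`, `is_large`, `small_length`) are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSmallCapacity = 23;  // bytes usable for characters
constexpr std::uint8_t kLargeFlag = 0x40;   // 0100 0000

// Mode byte of a small string: the number of unused bytes in the buffer.
constexpr std::uint8_t mode_byte_for_small(std::size_t length) {
    return static_cast<std::uint8_t>(kSmallCapacity - length);
}

// A string is large when the 0x40 bit of the mode byte is set.
constexpr bool is_large(std::uint8_t mode_byte) {
    return (mode_byte & kLargeFlag) != 0;
}

// Recover the length of a small string from its mode byte.
constexpr std::size_t small_length(std::uint8_t mode_byte) {
    return kSmallCapacity - mode_byte;
}

static_assert(mode_byte_for_small(5) == 18, "23 - 5 unused bytes");
static_assert(mode_byte_for_small(23) == 0, "a full buffer ends in '\\0'");
static_assert(!is_large(mode_byte_for_small(0)), "23 = 0001 0111, bit 0x40 clear");
static_assert(is_large(kLargeFlag), "0x40 marks a heap-allocated string");
static_assert(small_length(mode_byte_for_small(5)) == 5, "round trip");
```

Because the largest small-string value (23) never reaches 0x40, a single bit test is enough to tell the two modes apart.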
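#include <algorithm> // std::copy_n
#include <cstddef>
#include <cstdint>
#include <cstdio> // printf
#include <iostream>
#include <string> // std::char_traits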
struct FB_String {
struct Large {
char* mData; // mbuf[0-7]
size_t mSize; // mbuf[8-15]
size_t mCapacity; // mbuf[16-22], mbuf[23] is for mode_byte
};
struct Small {
char mBuf[sizeof(Large)];
};
union {
Small small;
Large large;
} packed;
static_assert(sizeof(Small) == sizeof(Large) && sizeof(Large)==24);
constexpr FB_String(): packed{} {}
constexpr FB_String(const char* data)
: FB_String(data, data? std::char_traits<char>::length(data): 0){}
constexpr FB_String(const char* data, size_t len): packed{} {
if(fits_into_small_string(len)) {
get_mode_byte() = small_string_capacity() - len;
std::copy_n(data, len, packed.small.mBuf);
} else {
packed.large.mSize = len;
packed.large.mData = new char[len + 1]; // +1 for '\0'
std::copy_n(data, len, packed.large.mData);
packed.large.mData[len] = '\0'; // heap memory is not zeroed
packed.large.mCapacity = len;
get_mode_byte() = 0x40; // 64 - 0100 0000
}
}
constexpr bool fits_into_small_string(size_t len) const {
return len <= small_string_capacity();
}
constexpr size_t small_string_capacity() const {
return sizeof(Small) - 1; // -1 for '\0'. 23 is 0001 0111 < 0x40 (64)
}
constexpr size_t large_string_capacity() const {
// Mask off the whole top byte: on a little-endian machine it is mBuf[23], the mode byte.
return packed.large.mCapacity & 0x00ffffffffffffff;
}
constexpr size_t capacity() const {
return is_small_string()? small_string_capacity(): large_string_capacity();
}
constexpr size_t size() const {
return is_small_string()? small_string_capacity() - get_mode_byte(): packed.large.mSize;
}
constexpr bool is_small_string() const {
return (get_mode_byte() & 0x40) == 0; // bit 0100 0000 is set only for large strings
}
constexpr char get_mode_byte() const {
return packed.small.mBuf[23]; // get last byte in the string buffer
}
constexpr char& get_mode_byte() {
return packed.small.mBuf[23];
}
constexpr const char* data() const {
return is_small_string()? packed.small.mBuf: packed.large.mData;
}
};
int main() {
std::cout << "FB_string struct size: " << sizeof(FB_String) << std::endl;
FB_String small("short");
FB_String large("this is a looooooong string");
FB_String empty(nullptr);
std::cout << "small: struct size: " << sizeof(small)
<< ", capacity: " << small.capacity()
<< ", size: " << small.size()
<< " -- ";
printf("%s\n", small.data());
std::cout << "large: struct size: " << sizeof(large)
<< ", capacity: " << large.capacity()
<< ", size: " << large.size()
<< " -- ";
printf("%s\n", large.data());
// FB_string struct size: 24
// small: struct size: 24, capacity: 23, size: 5 -- short
// large: struct size: 24, capacity: 27, size: 27 -- this is a looooooong string
return 0;
}