A basic string struct consists of three members:

struct string {
    char* mPtr;         // dynamically allocated memory
    size_t mSize;       // the length of the string
    size_t mCapacity;   // the size of allocated memory
};

Allocating memory for small strings (e.g., an empty string holding only the null character \0) is wasteful. To avoid this, most string implementations apply Small String Optimization (SSO): small strings are stored directly inside the string object itself rather than in dynamically allocated heap memory. I find this trick interesting because it shows how different C++ standard libraries implement the same concept in different ways.

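This post looks at four SSO implementations: GCC (libstdc++), MSVC (MS STL), Clang (libc++), and Facebook (Folly). The table below summarizes the simplified versions built in this post.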

| Implementation | Buffer Size | Key Design Choice | Trade-offs |
|----------------|-------------|-------------------|------------|
| GCC | 8 bytes | Consistent pointer approach | Simple, but smaller buffer |
| MSVC | 8 bytes | Union of pointer/buffer | Extra checks needed |
| Clang | 16 bytes | Single bit flag | Larger buffer, but wastes 7 bytes on padding |
| Facebook | 23 bytes | Last byte stores metadata | Largest buffer, but complex byte/bit metadata |

Let’s take a look at them one by one.

libstdc++ from GCC

The GCC_String class below is a minimal string with Small String Optimization (SSO), modeled on GCC's libstdc++. Note that the actual libstdc++ std::string is 32 bytes on 64-bit platforms: its internal buffer is 16 bytes (15 characters plus the null terminator), so the union is larger than the 8-byte mCapacity alone.

  1. Data Members:
    • char* mPtr: Points to the memory holding the string data.
    • size_t mSize: Stores the size (length) of the string.
    • union { size_t mCapacity; char mBuf[8]; }:
      • For small strings, mBuf acts as an internal buffer to store the string directly (SSO).
      • For larger strings, mCapacity holds the capacity of the dynamically allocated memory.
  2. SSO Logic:
    • If the string is small (at most 7 characters, leaving one byte for the null terminator), the string data is stored in mBuf.
    • For longer strings, memory is dynamically allocated on the heap, and mPtr points to it.
    • GCC keeps one consistent pointer: mPtr always points to the string data, whether that is the internal buffer or heap memory.
#include <algorithm> // std::copy_n
#include <cstddef>
#include <cstdio>
#include <iostream>
#include <string>

struct GCC_String {
    char* mPtr;
    size_t mSize{};
    union {
        size_t mCapacity;
        char mBuf[8];
    };
    static_assert((sizeof(mBuf) + sizeof(mPtr) + sizeof(mSize)) == 24);

    // The mPtr{mBuf} sets the mPtr to point to the internal buffer mBuf,
    // indicating that this string instance will use SSO.
    // mBuf{} zero-initializes mBuf, ensuring all characters in the buffer
    // are set to '\0' by default.
    constexpr GCC_String(): mPtr{mBuf}, mBuf{} {}

    constexpr GCC_String(const char* data)
        :GCC_String(data, data? std::char_traits<char>::length(data): 0){}

    constexpr GCC_String(const char* data, size_t len)
        : mPtr{fits_into_small_string(len)? mBuf: new char[len + 1]}, // +1 for '\0'
         mSize{len},
         mBuf{} {
        if(!is_small_string()) {
            mCapacity = len;
        }
        std::copy_n(data, len, mPtr);
        mPtr[len] = '\0'; // data() is printed with "%s", so it must be terminated
    }
    
    constexpr const char* data() const {
        return mPtr;
    }

    constexpr bool fits_into_small_string(size_t len) {
        return len <= small_string_capacity();
    }

    constexpr bool is_small_string() const {
        return mBuf == mPtr;
    }

    constexpr size_t small_string_capacity() const {
        return sizeof(mBuf) - 1; // -1 for '\0'
    }

    constexpr size_t size() const {
        return mSize;
    }

    constexpr size_t capacity() const {
        return is_small_string()? small_string_capacity(): mCapacity;
    }
};


int main () {
  std::cout << "gcc_string struct size: " << sizeof(GCC_String) << std::endl;
  GCC_String small("short");
  GCC_String large("this is a long string");
  GCC_String empty(nullptr);
  std::cout << "small: struct size: " << sizeof(small) 
    << " capacity: " << small.capacity() 
    << " size: " << small.size()
    << " -- ";
    printf("%s\n", small.data());

  std::cout << "large: struct size: " << sizeof(large) 
    << " capacity: " << large.capacity() 
    << " size: " << large.size()
    << " -- ";
    printf("%s\n", large.data());
// gcc_string struct size: 24
// small: struct size: 24 capacity: 7 size: 5 -- short
// large: struct size: 24 capacity: 21 size: 21 -- this is a long string
  return 0;
}

MS STL from MSVC

The implementation of MSVC is similar to GCC, but it uses a union differently:

  1. Data Members:
    • A union for either:
      • A pointer to dynamically allocated memory.
      • A static buffer (char mBuf[8]).
    • size_t mSize: Current size.
    • size_t mCapacity: Capacity indicator.
  2. SSO Logic:
    • If the string is small (at most 7 characters, leaving one byte for the null terminator), the string data is stored in mBuf.
    • For longer strings, memory is dynamically allocated on the heap, and mPtr points to it.
    • MSVC uses a union to switch between the buffer and the pointer, so every access must first check which one is active.
#include <algorithm> // std::copy_n
#include <cstddef>
#include <cstdio>    // printf
#include <iostream>
#include <string>

struct MS_String {
    union {
        char* mPtr;
        char mBuf[8];
    };
    size_t mSize{};
    size_t mCapacity;

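    // mPtr and mBuf share storage, so only one of them counts toward the size.
    static_assert((sizeof(mBuf) + sizeof(mSize) + sizeof(mCapacity)) == 24);

    // An empty string starts in small mode; mCapacity must be set so that
    // is_small_string() and capacity() work on a default-constructed object.
    constexpr MS_String(): mBuf{}, mCapacity{sizeof(mBuf) - 1} {}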

    constexpr MS_String(const char* data)
        :MS_String(data, data? std::char_traits<char>::length(data): 0){}

    constexpr MS_String(const char* data, size_t len)
        : mBuf{},
         mSize{len},
         mCapacity{fits_into_small_string(len)? small_string_capacity(): len} {
        if(!is_small_string()) {
            mPtr = new char[len + 1]; // +1 for '\0'
            std::copy_n(data, len, mPtr);
            mPtr[len] = '\0';
        }
        else {
            std::copy_n(data, len, mBuf); // mBuf{} already supplies the '\0'
        }
    }

    constexpr const char* data() const {
        return is_small_string()? mBuf: mPtr;
    }

    constexpr bool fits_into_small_string(size_t len) {
        return len <= small_string_capacity();
    }

    constexpr bool is_small_string() const {
        return mCapacity <= small_string_capacity();
    }

    constexpr size_t small_string_capacity() const {
        return sizeof(mBuf) - 1; // -1 for '\0'
    }

    constexpr size_t size() const {
        return mSize;
    }

    constexpr size_t capacity() const {
        return mCapacity;
    }
};


int main () {
  std::cout << "MS_string struct size: " << sizeof(MS_String) << std::endl;
  MS_String small("short");
  MS_String large("this is a long string");
  MS_String empty(nullptr);
  std::cout << "small: struct size: " << sizeof(small) 
    << " capacity: " << small.capacity() 
    << " size: " << small.size()
    << " -- ";
    printf("%s\n", small.data());


  std::cout << "large: struct size: " << sizeof(large) 
    << " capacity: " << large.capacity() 
    << " size: " << large.size()
    << " -- ";
    printf("%s\n", large.data());
// MS_string struct size: 24
// small: struct size: 24 capacity: 7 size: 5 -- short
// large: struct size: 24 capacity: 21 size: 21 -- this is a long string
  return 0;
}

libc++ from Clang

The implementation of Clang uses a single bit to switch between large and small strings. With a more compact layout, it provides a larger buffer (16 bytes) than GCC and MSVC (8 bytes).

  1. Data Members: a union of two structs:
    • Large_String struct (24 bytes total):
      • A 1-bit flag indicating whether it is a large string
      • 63 bits for capacity
      • Size field (8 bytes)
      • Pointer to data (8 bytes)
    • Small_String struct (24 bytes total):
      • A 1-bit flag indicating whether it is a large string
      • 7 bits for size
      • 7 bytes of padding
      • 16-byte inline buffer for string data
  2. SSO Logic:
    • Uses just 1 bit to distinguish between small and large strings.
    • Provides a 16-byte buffer for small strings (15 characters plus the null terminator).
    • For small strings, stores the data directly in the mBuf buffer.
    • For large strings, allocates memory and stores the pointer in mData.
    • Both structs are exactly 24 bytes, so they can share a union.
#include <algorithm> // std::copy_n
#include <cstddef>
#include <cstdint>   // uint8_t
#include <cstdio>    // printf
#include <iostream>
#include <string>

struct Clang_String {
    
    struct Large_String {
        size_t is_large  : 1;
        size_t mCapacity : 63;
        size_t mSize;
        char*  mData;
    };

    struct Small_String {
        uint8_t is_large : 1;
        uint8_t mSize    : 7; 
        uint8_t mPaddingBytes[7]; // sizeof(size_t) - sizeof(uint8_t), ensure alignment
        char    mBuf[16]; // sizeof(Large_String) - sizeof(size_t)
    };

    union {
        Large_String large;
        Small_String small;
    } packed;

    static_assert((sizeof(Large_String) == 24 && sizeof(Large_String) == sizeof(Small_String)));

    constexpr Clang_String(): packed{} {}

    constexpr Clang_String(const char* data)
        :Clang_String(data, data? std::char_traits<char>::length(data): 0){}

    constexpr Clang_String(const char* data, size_t len)
        : packed{} // zero all 24 bytes: clears is_large and pre-fills mBuf with '\0'
    {
        if(fits_into_small_string(len)) {
            packed.small.mSize = len;
            std::copy_n(data, len, packed.small.mBuf);
        } else {
            packed.large.is_large = true;
            packed.large.mSize = len;
            packed.large.mCapacity = len;
            packed.large.mData = new char[len + 1]; // +1 for '\0'
            std::copy_n(data, len, packed.large.mData);
            packed.large.mData[len] = '\0';
        }
    }

    constexpr const char* data() const {
        return is_small_string()? packed.small.mBuf: packed.large.mData;
    }

    constexpr bool fits_into_small_string(size_t len) {
        return len <= small_string_capacity();
    }

    constexpr bool is_small_string() const {
        return !packed.large.is_large;
    }

    constexpr size_t small_string_capacity() const {
        return sizeof(packed.small.mBuf) - 1;
    }

    constexpr size_t size() const {
        return is_small_string()? packed.small.mSize: packed.large.mSize;
    }

    constexpr size_t capacity() const {
        return is_small_string()? small_string_capacity(): packed.large.mCapacity;
    }
};


int main () {
    auto str = Clang_String{};
    std::cout << "Clang_String struct size: " << sizeof(Clang_String) << std::endl;
    Clang_String small("short");
    Clang_String large("this is a long string");
    Clang_String empty(nullptr);
    std::cout << "small: struct size: " << sizeof(small) 
    << ", capacity: " << small.capacity() 
    << ", size: " << small.size()
    << " -- ";
    printf("%s\n", small.data());


    std::cout << "large: struct size: " << sizeof(large) 
    << ", capacity: " << large.capacity() 
    << ", size: " << large.size()
    << " -- ";
    printf("%s\n", large.data());
// Clang_String struct size: 24
// small: struct size: 24, capacity: 15, size: 5 -- short
// large: struct size: 24, capacity: 21, size: 21 -- this is a long string
  return 0;
}

“FB_String” from Facebook

Facebook’s implementation provides an even larger buffer for small strings (23 bytes) than Clang (16 bytes). It is not tied to a specific compiler: it is the fbstring class from Folly, Facebook’s open-source C++ library, designed as a drop-in replacement for std::string.

  1. Data Members: a union of two structs:
    • Large struct (24 bytes total):
      • Pointer to data (8 bytes)
      • Size field (8 bytes)
      • Capacity field (8 bytes), whose last byte doubles as the mode byte, leaving 7 bytes for the capacity value
    • Small struct (24 bytes total):
      • 23-byte inline buffer for string data
      • 1 byte for mode/size info
  2. SSO Logic:
    • Uses the last byte of the object to store mode and size information.
    • For small strings: last byte = 23 - length, so a 23-character string leaves 0 there, which doubles as the null terminator.
    • For large strings: last byte = 0x40 (64).
    • If bit 0x40 of the last byte is set, it is a large string.
    • Small string length = 23 - last byte.
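#include <algorithm> // std::copy_n
#include <cstddef>
#include <cstdint>
#include <cstdio>    // printf
#include <iostream>
#include <string>    // std::char_traits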

struct FB_String {
    struct Large {
        char* mData;   // mbuf[0-7]
        size_t mSize;  // mbuf[8-15]
        size_t mCapacity; // mbuf[16-22], mbuf[23] is for mode_byte
    };

    struct Small {
        char mBuf[sizeof(Large)];
    };

    union {
        Small small;
        Large large;
    } packed;

    static_assert(sizeof(Small) == sizeof(Large) && sizeof(Large)==24);

    constexpr FB_String(): packed{} {}

    constexpr FB_String(const char* data)
        : FB_String(data, data? std::char_traits<char>::length(data): 0){}

    constexpr FB_String(const char* data, size_t len): packed{} {
        if(fits_into_small_string(len)) {
            get_mode_byte() = small_string_capacity() - len;
            std::copy_n(data, len, packed.small.mBuf);
        } else {
            packed.large.mSize = len;
            packed.large.mData = new char[len + 1]; // +1 for '\0'
            std::copy_n(data, len, packed.large.mData);
            packed.large.mData[len] = '\0';
            packed.large.mCapacity = len;
            get_mode_byte() = 0x40; // 64 = 0100 0000, written after mCapacity
        }
    }

    constexpr bool fits_into_small_string(size_t len) const {
        return len <= small_string_capacity();
    }

    constexpr size_t small_string_capacity() const {
        return sizeof(Small) - 1; // -1 for the mode byte. 23 is 0001 0111 < 0x40 (64)
    }

    constexpr size_t large_string_capacity() const {
        // Mask off the top byte, which holds the mode byte (little-endian layout).
        return packed.large.mCapacity & 0x00ffffffffffffff;
    }

    constexpr size_t capacity() const {
        return is_small_string()? small_string_capacity(): large_string_capacity();
    }

    constexpr size_t size() const {
        return is_small_string()? small_string_capacity() - get_mode_byte(): packed.large.mSize;
    }

    constexpr bool is_small_string() const {
        return (get_mode_byte() & 0x40) == 0; // 0x40 = 0100 0000 marks a large string
    }

    constexpr char get_mode_byte() const {
        return packed.small.mBuf[23]; // get last byte in the string buffer
    }

    constexpr char& get_mode_byte() {
        return packed.small.mBuf[23];
    }

    constexpr const char*  data() const {
        return is_small_string()?  packed.small.mBuf: packed.large.mData;
    }
};



int main() {
  std::cout << "FB_string struct size: " << sizeof(FB_String) << std::endl;
  FB_String small("short");
  FB_String large("this is a looooooong string");
  FB_String empty(nullptr);
  std::cout << "small: struct size: " << sizeof(small) 
    << ", capacity: " << small.capacity() 
    << ", size: " << small.size()
    << " -- ";
    printf("%s\n", small.data());


  std::cout << "large: struct size: " << sizeof(large) 
    << ", capacity: " << large.capacity() 
    << ", size: " << large.size()
    << " -- ";
    printf("%s\n", large.data());
// FB_string struct size: 24
// small: struct size: 24, capacity: 23, size: 5 -- short
// large: struct size: 24, capacity: 27, size: 27 -- this is a looooooong string
    return 0;
}