
Whole program lexical analysis

I was thinking about parsing and lexical analysis of source code recently (after all who doesn't... right??). Everywhere I've looked - which admittedly isn't in very many places - parsers still seem to treat input as a stream of tokens. The stream abstraction made sense in an era where memory was more limited. Does it still make sense now, when 8 GB of RAM is considered small?

What if, instead, we cache the entire token array for each file? So we mmap the file, lex it in place and store the tokens in an in-memory array. Does this make parsing any easier? Does it lead to any speed savings? Or does it just use an infeasible amount of memory?

Time for some back-of-the-napkin analysis. Let's say we're using data structures kind of like this to represent tokens:

  #include <string>
  #include <vector>

  enum TokenType { /* an entry for each distinct type of token */ };

  struct Token {
    TokenType type; // The type of token
    int start;      // Byte offset for the start of the token
    int end;        // Byte offset for the end of the token
  };

  struct TokenFile {
    std::string filename;
    std::vector<Token> tokens; // The tokens in order.
    std::vector<int> lines;    // Byte offsets for each line number.
  };
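
For a sense of how one of these might get built, here's a minimal sketch: mmap the file, walk the raw bytes recording line starts, and append tokens as we go. It assumes POSIX mmap, skips error handling, and uses a trivial whitespace-splitting "lexer" as a stand-in for a real C++ one; the function name is just made up for illustration.

  #include <cctype>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  TokenFile tokeniseFile(const std::string& filename) {
    TokenFile tf;
    tf.filename = filename;

    // Map the whole file into memory and lex it in place.
    int fd = open(filename.c_str(), O_RDONLY);
    struct stat st;
    fstat(fd, &st); // error handling and empty files ignored for brevity
    int size = static_cast<int>(st.st_size);
    const char* data = static_cast<const char*>(
        mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0));

    tf.lines.push_back(0); // line 1 starts at byte offset 0
    int pos = 0;
    while (pos < size) {
      char c = data[pos];
      if (c == '\n') {
        tf.lines.push_back(pos + 1); // the next line starts just after the newline
        ++pos;
      }
      else if (std::isspace(static_cast<unsigned char>(c))) {
        ++pos;
      }
      else {
        // Stand-in lexer: treat any run of non-whitespace as one token.
        int start = pos;
        while (pos < size && !std::isspace(static_cast<unsigned char>(data[pos]))) {
          ++pos;
        }
        tf.tokens.push_back(Token{TokenType{}, start, pos}); // a real lexer would set the type
      }
    }

    munmap(const_cast<char*>(data), size);
    close(fd);
    return tf;
  }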

So for each file we tokenise, we'll store it in memory as an array of tokens. The tokens themselves don't store a string or a value - they just identify which range of bytes in the file the token came from. We also store byte offsets for line numbers, so that if we need them for any reason - error reporting, or a __LINE__ macro, for example - we can find them using a binary search (these are low frequency uses, so it's wasteful to store a line number for every token).
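
That binary search is a one-liner with std::upper_bound; a quick sketch (the function name is mine, not from any existing library):

  #include <algorithm>

  // Map a byte offset back to a 1-based line number. tf.lines holds the starting
  // offset of each line in ascending order, so we binary search for the last
  // line that starts at or before the given offset.
  int lineNumberFor(const TokenFile& tf, int offset) {
    auto it = std::upper_bound(tf.lines.begin(), tf.lines.end(), offset);
    return static_cast<int>(it - tf.lines.begin());
  }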

Assuming a 64-bit system, we get the following sizes (there's a quick sanity check of these after the list):
  • std::string = 24 bytes (three machine words, as in libc++; GNU libstdc++'s current ABI is actually 32 bytes because of its small-string buffer, but we'll use 24 to keep the sums simple) + the size of the data
  • std::vector = 24 bytes (three pointers: begin, end and end-of-capacity) + the data size
  • Token = 12 bytes (4 bytes for each of type, start and end).
  • TokenFile = 72 bytes + 12 * T + 4 * L + F, where T is the number of tokens in the file, L is the number of lines in the file and F is the length of the filename.
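
Since the exact numbers depend on the compiler and standard library, it's worth printing them on your own toolchain; something as simple as this would do:

  #include <cstdio>

  // Sanity-check the sizes used in the estimates above. Expect sizeof(Token) == 12
  // pretty much everywhere, but sizeof(TokenFile) may be 72 or 80 depending on
  // whether your std::string is 24 or 32 bytes.
  int main() {
    std::printf("sizeof(Token)     = %zu\n", sizeof(Token));
    std::printf("sizeof(TokenFile) = %zu\n", sizeof(TokenFile));
    return 0;
  }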

Now let's make up a theoretical large C++ code base and some metrics for it to see how things stack up. Let's say our code base has the following characteristics:
  • 10,000,000 lines of code
  • 10,000 separate files
  • An average of 1000 lines per file
  • 10 tokens per line, on average
  • The average filename length is 100 bytes.

Running the numbers, this means we have:
  • 10,000 TokenFile objects at (on average) 72 + 12 * 10,000 + 4 * 1000 + 100 bytes
    = 10,000 * 124,172 bytes
    = roughly 1.24 GB.
Of that total:
  • 720 KB is for the TokenFile objects
  • 1.2 GB is for the Token data
  • 40 MB is for the per-file line numbers
  • 1 MB is for the file names

That doesn't seem too bad considering it's supposed to be a large code base. 1.24 GB of RAM won't even touch the sides on a modern desktop computer! We could make a few optimisations to get the size down even further, but let's stick with that for now.

This structure doesn't store the token contents. For example, you can't use it to distinguish between two identifiers unless you load up the files containing the tokens and look at the byte ranges for each. Doing that might actually be feasible on an OS like Linux where opening files is fast, but it would be pretty bad on Windows where that's a more expensive operation. So what if we store the token contents as well? Is that feasible?

Let's make a few more assumptions about our theoretical code base:
  • There are around 1000 unique token strings per text file, on average.
  • Across the whole project there are about 10,000,000 unique strings with an average length of 10 chars.

We'll keep a cache of interned strings for the tokens (we won't worry about converting literals into actual ints/floats/doubles/whatever for now), so that each unique token string is stored once and once only. We'll also replace the 'end' member of our Token struct with a 'pos' member, holding an index into the interned string list.
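
A sketch of what that interning cache might look like (the class and member names here are made up for illustration):

  #include <string>
  #include <unordered_map>
  #include <vector>

  // Each unique token string is stored exactly once, as a key in the map;
  // tokens refer to it by the index recorded in _byIndex.
  class StringInterner {
  public:
    // Return the index for 'text', adding it to the cache if it's new.
    int intern(const std::string& text) {
      auto [it, inserted] = _indices.try_emplace(text, 0);
      if (inserted) {
        it->second = static_cast<int>(_byIndex.size());
        _byIndex.push_back(&it->first); // map keys never move, so this pointer stays valid
      }
      return it->second;
    }

    const std::string& lookup(int index) const { return *_byIndex[index]; }

  private:
    std::unordered_map<std::string, int> _indices; // string -> index (owns the strings)
    std::vector<const std::string*> _byIndex;      // index -> string
  };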

The Token struct stays the same size, so the sums above don't change. For the data size, we get:
  • 10,000,000 unique strings * (11 bytes per string + 8 bytes for the pointer to it)
    = approximately 190 MB
Based on all that, storing the entire tokenised source code for a 10 million line C++ project would only take about 1.43 GB. That's way below the 8 GB of RAM you get in an average desktop these days, so it seems like it would be workable.

Does it actually lead to any efficiency gains? That remains to be seen. It should mean that the file access pattern interacts better with the disk cache, and being able to index into an array of tokens might help reduce the cost of lookahead and backtracking in the parser - but for now it's just conjecture. There's no GitHub repo to go with this post.
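
Roughly, the kind of thing I mean on the parser side is a cursor over the token array, where lookahead and backtracking become plain index arithmetic (again just a sketch, with made-up names):

  // A cursor over a TokenFile's token array. Lookahead is just indexing, and
  // backtracking is saving and restoring an integer. Bounds checks omitted.
  class TokenCursor {
  public:
    explicit TokenCursor(const TokenFile& file) : _file(file) {}

    const Token& peek(int ahead = 0) const { return _file.tokens[_pos + ahead]; }
    const Token& next()                    { return _file.tokens[_pos++]; }
    bool atEnd() const { return _pos >= static_cast<int>(_file.tokens.size()); }

    int save() const      { return _pos; } // remember where we are...
    void restore(int pos) { _pos = pos; }  // ...and rewind if a parse attempt fails

  private:
    const TokenFile& _file;
    int _pos = 0;
  };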

So it's over to you, readers. Do you know of anything that already does this? Any reasons why it'd be a crazy thing to do? If so, I'd love to hear about it!
