Ggthjy

Question

BrainFreeze

This is my implementation of a brainfuck compiler in C. The instruction to build/run the project (GitHub repo) can be found here.

I am first converting the brainfuck source to x86 assembly and then using ld and as to generate the executable file (currently generates a 32 bit executable for linux only). I was planning on having no dependencies, but the ELF format was too intimidating to learn at first, therefore; I am using as (the GNU assembler) and ld (the GNU linker) for now.

The size of the brain fuck array is currently set to 30,000 and is allocated on the heap and one byte is allocated to each cell. Negative cell index should be UB right now.

I have zero experience with making compilers and am pretty sure this is not the way to make a compiler. I haven't even implemented lexers/parsers. Also, I thought it would be easier to implement it in a switch case since the language is so simple.

That being said, I think I have done a good job with this compiler. The compiled program runs x2 as fast as this when I used it for mandelbrot generation.

There is some information about the functions in the header files. Hope that improves readability and understanding.

I would appreciate any improvements/suggestions.

style/convention/performance/memory and everything else.

I am not currently concerned with optimizing the compiler. I have got a few ideas. But if I missed some pretty obvious ones (very easy to implement), I would love if you could mention them.

bf_asm.h

#pragma once



/*

    Some general techniques used.

    1) edi always contains the address in memory of the current cell

    2) edx is always 1 (for read and write syscalls).

*/



/** The assembly instructions for allocating 30,000 bytes of memory.

    Uses the brk() system call without any library wrappers. 

    NOTE: brk() zeros out memory during first call. No need to initialize to 0.*/

const char ALLOCATE_MEMORY[] = ".intel_syntax noprefixn"

                         ".global _startn"

                         "_start:n"

                         "mov eax,0x2dn"

                         "xor ebx,ebxn"

                         "int 0x80n"

                         "add eax,30000n"

                         "mov ebx,eaxn"

                         "mov eax,0x2dn"

                         "int 0x80n"

                         "mov edi,eaxn"

                         "sub edi,30000n"

                         //Sets edx=1 for future read and write calls. This is never modified.

                         "xor edx,edxn"

                         "inc edxn";



/** The assembly instructions for increasing the current cell pointer. */

const char INCREMENT_POINTER[] = "add edi,0x%02xn";



/** The assembly instructions for decreasing the current cell pointer. */

const char DECREMENT_POINTER[] = "sub edi,0x%02xn";



/** The assembly instruction to increase the value at current cell */

const char INCREMENT[] = "add byte ptr [edi],0x%02xn";



/** The assembly instruction to decrease the value at current cell */

const char DECREMENT[] = "sub byte ptr [edi],0x%02xn";



/** The assembly instructions for starting a loop in bf. LABEL references are not present and are calculated at compile time. */

const char LOOP_START[] =

    //"_LABEL_XXXX:n"

    "cmp byte ptr [edi],0x00n"

    //"je _LABEL_XXXXn";

    "je ";



/** The assembly instructions for ending a loop in bf. LABEL references are not present and are calculated at compile time. */

const char LOOP_END[] =

    //"_LABELXXXX:n"

    "cmp byte ptr [edi],0x00n"

    //"jne _LABEL_XXXXn";

    "jne ";



/** The assembly instructions for the write syscall. Executes the raw syscall. The value is printed to stdout. */

const char WRITE_CHAR[] = "mov eax,0x04n"

                    "xor ebx,ebxn"

                    "inc ebxn"

                    "mov ecx,edin"

                    "int 0x80n";



/** The assembly instructions for the read syscall. Executes the raw syscall. The value is stored in the current cell. */

const char READ_CHAR[] = "mov eax,0x03n"

                   "xor ebx,ebxn"

                   "mov ecx,edin"

                   "int 0x80n";



/** The assembly instructions for the exit syscall. Executes the raw syscall.*/

const char SYS_EXIT_CALL[] = "mov eax,0x01n"

                       "xor ebx,ebxn"

                       "int 0x80n";

compiler.h

#pragma once

#include <stdio.h>



/** The structure which represents the parsed global args.*/

struct global_args_t

{

    /** The path to the source brainfuck code.*/

    char *input_file;

    /** The path to the output executable file.*/

    char *output_file;

} global_args; ///< The variable which is populated by the parse_args() function.



/**

 * @brief Print the help text

 */

void print_help();



/**

 * @brief Parse the args provided from the command line

 *

 * The global_args struct is populated from this function.

 * 

 * @param argc Number of arguments.

 * @param argv The list of arguments.

 */

void parse_args(int argc, char **argv);



/**

 * @brief Verifies if the brainfuck source file is valid

 *

 *  The file seek is set to 0 after the verfication of the source is complete.

 * 

 * @param input_fp The brainfuck input source file.

 */

void verify_syntax(FILE *input_fp);



/**

 * @brief Creates the .s assembly source file that contains the list of assembly instructions for the brainfuck program.

 * 

 * Calls assemble() after checking if the output file could be created.

 * 

 * @param input_fp The brainfuck input source file.

 * @param output_file The .s assembly file.

 */

void create_assembly_source(FILE *input_fp, const char *output_file);



/**

 * @brief Creates the .s assembly source file that contains the list of assembly instructions for the brainfuck program.

 *

 * Writes the assembly instructions by parsing the source and refering to the bf_asm.h constants.

 * 

 * @param input_fp The brainfuck input source file.

 * @param output_fp The .s assembly file.

 */

void assemble(FILE *input_fp, FILE *output_fp);



/**

 * @brief Creates the .o elf file from the .s assembly file.

 * 

 * Calls the GNU assembler 'as'.

 * 

 * @param input_file The file path to the .s assembly file.

 */

void execute_assembler(char *input_file);



/**

 * @brief Creates the executable file from the .o elf file.

 * 

 * Calls the GNU linker 'ld'.

 * 

 * @param input_file The file path to the .o elf file.

 */

void execute_linker(const char *input_file);



/**

 * @brief Creates the executable file from the .o elf file.

 * 

 * Compiles the source.

 * This function calls create_assembly_source(),execute_assembler() and execute_linker() with proper arguments.

 * 

 * @param input_fp The brainfuck source file.

 */

void compile(FILE *input_fp);



/**

 * @brief The main start point of the compiler.

 * 

 * @param argc The number of arguments.

 * @param argv Pointer to the argument list.

 */

int main(int argc, char **argv);

stack.h

#pragma once

#include <stdbool.h>

#include <stdlib.h>



/** A Node on the stack */

struct StackNode

{

    /** The address of loop start */

    unsigned int jmp_address;



    /** The address of loop end */

    unsigned int second_jmp_address;



    /** The address of the next node in stack */

    struct StackNode *next;

};



/**

 * @brief Creates a new node

 *

 * Creates a new node in the stack.

 * Memory is allocated on the heap. 

 * The next stackNode is automatically set to NULL.

 * 

 * @param jmp_addr The jump label index for the start of loop.

 * @param second_jmp_addr The jump label index for the end of loop.

 * @return The new node created.

 */

struct StackNode *newNode(int jmp_addr, int second_jmp_addr);



/**

 * @brief Checks if the stack is empty.

 *

 * @param root The current stack top.

 * @return If stack is empty

 */

bool isEmpty(struct StackNode *root);



/**

 * @brief Push a new node to top of stack

 *

 * @param root The current stack top.

 * @param jmp_addr The jump label index for the start of loop.

 * @param second_jmp_addr The jump label index for the end of loop.

 */

void push(struct StackNode **root, int jmp_addr, int second_jmp_addr);



/**

 * @brief Pop the top of stack and free memory allocated on the heap.

 * The next node becomes the top of the stack.

 *

 * @param root The current stack top.

 */

bool pop(struct StackNode **root);

compiler.c

#include <stdlib.h>

#include <getopt.h>

#include <string.h>

#include "bf_asm.h"

#include "stack.h"

#include "compiler.h"



void verify_syntax(FILE *inputfp)

{

    char c;

    unsigned int stack_count = 0;

    unsigned int line = 0;

    while ((c = fgetc(inputfp)) != EOF)

    {

        if (c == ']')

            stack_count--;

        if (c == '[')

            stack_count++;

        if (c == 'n')

            line++;

        if (stack_count < 0)

        {

            fprintf(stderr, "SYNTAX ERROR: Unexpected instruction '%c' in line %u. A corresponding bracket couldn't be found.n", c, line);

            exit(EXIT_FAILURE);

        }

    }

    if (stack_count > 0)

    {

        fprintf(stderr, "SYNTAX ERROR: Brackets are not balanced. Found %u stray brackets.n", stack_count);

        exit(EXIT_FAILURE);

    }

    fseek(inputfp, 0, SEEK_SET);

}



void assemble(FILE *inputfp, FILE *outputfp)

{

    char c;

    fputs(ALLOCATE_MEMORY, outputfp);

    unsigned int jmp_label = 0;

    struct StackNode *stack = NULL;

    while ((c = fgetc(inputfp)) != EOF)

    {

        switch (c)

        {

        case '+':

        case '-':

        {

            int counter = (c == '+') ? 1 : -1;

            while ((c = fgetc(inputfp)) != EOF) //Counts the total value to be put into the cell

            {

                if (c == '+')

                    counter++;

                if (c == '-')

                    counter--;

                if (c == '<' || c == '>' || c == '.' || c == ',' || c == ']' || c == '[')

                {

                    fseek(inputfp, -1, SEEK_CUR);

                    break;

                }

            }

            if (counter > 0)

                fprintf(outputfp, INCREMENT, counter);

            else if (counter < 0)

                fprintf(outputfp, DECREMENT, (-counter));

            break;

        }

        case '>':

        case '<':

        {

            int counter = (c == '>') ? 1 : -1;

            while ((c = fgetc(inputfp)) != EOF) //Counts the total value be increased to the pointer of cell.

            {

                if (c == '>')

                    counter++;

                if (c == '<')

                    counter--;

                if (c == '+' || c == '-' || c == '.' || c == ',' || c == ']' || c == '[')

                {

                    fseek(inputfp, -1, SEEK_CUR);

                    break;

                }

            }

            if (counter > 0)

                fprintf(outputfp, INCREMENT_POINTER, (unsigned int)(counter * sizeof(char) * 4));

            else if (counter < 0)

                fprintf(outputfp, DECREMENT_POINTER, (unsigned int)((-counter) * sizeof(char) * 4));

            break;

        }

        case '.':

            fputs(WRITE_CHAR, outputfp);

            break;

        case ',':

            fputs(READ_CHAR, outputfp);

            break;

        case '[':

        {

            push(&stack, jmp_label, jmp_label + 1);

            fprintf(outputfp, "_LABEL_%u:n", jmp_label);

            fputs(LOOP_START, outputfp);

            fprintf(outputfp, "_LABEL_%un", jmp_label + 1);

            jmp_label += 2;

            break;

        }

        case ']':

        {

            fprintf(outputfp, "_LABEL_%u:n", stack->second_jmp_address);

            fputs(LOOP_END, outputfp);

            fprintf(outputfp, "_LABEL_%un", stack->jmp_address);

            pop(&stack);

        }

        default:;

        }

    }

    fputs(SYS_EXIT_CALL, outputfp);

}



void print_help()

{

    puts("Usage: bf [OPTION] -f [INPUT FILE]n"

         "Compiles brainfuck code to 32-bit executable file.n"

         "nThe cell size is set to 30,000 and one byte is allocated to each cell.n"

         "nThe default output file is 'a.out'.n"

         "This compiler also uses the GNU linker 'ld' and the GNU assembler 'as'.n"

         "Make sure both of these are installed and in your path before running this program.n"

         "nOptions:n"

         " -f <INPUT FILE>  ttSpecifies the input file.n"

         " -o <OUTPUT FILE> ttSpecifies the output file.n"

         " -h               ttDisplay the help message.n");

}



void parse_args(int argc, char **argv)

{

    global_args.output_file = "a.out";

    global_args.input_file = NULL;

    int opt;

    while ((opt = getopt(argc, argv, "ho:f:")) != -1)

    {

        if (opt == 'h')

        {

            print_help();

            exit(EXIT_SUCCESS);

        }

        if (opt == 'o')

            global_args.output_file = optarg;

        if (opt == 'f')

            global_args.input_file = optarg;

    }

    if (global_args.input_file == NULL)

    {

        fputs("ERROR: No input file was specified.nn", stderr);

        print_help();

        exit(EXIT_FAILURE);

    }

}



void create_assembly_source(FILE *inputfp, const char *assembler_output_file)

{

    FILE *assembler_output_fp = fopen(assembler_output_file, "w");

    if (assembler_output_fp == NULL)

    {

        fprintf(stderr, "The file %s couldn't be opened while assembling code.", assembler_output_file);

        exit(EXIT_FAILURE);

    }

    assemble(inputfp, assembler_output_fp);

    fclose(assembler_output_fp);

}



void execute_assembler(char *assembler_input_file)

{

    const char *start_command = "as --32 ";

    char *command = (char *)malloc(sizeof(char) * (strlen(start_command) + strlen(assembler_input_file) * 2 + 5));

    if (command == NULL)

    {

        fputs("FATAL: Memory allocation failed.", stderr);

        exit(EXIT_FAILURE);

    }

    strcpy(command, start_command);

    strcat(command, assembler_input_file);

    strcat(command, " -o ");

    assembler_input_file[strlen(assembler_input_file) - 1] = 'o';

    strcat(command, assembler_input_file);

    assembler_input_file[strlen(assembler_input_file) - 1] = 's';

    system(command);

    remove(assembler_input_file);

    free(command);

}



void execute_linker(const char *linker_input_file)

{

    const char *start_command = "ld -m elf_i386 -s ";

    char *command = (char *)malloc(sizeof(char) * (strlen(start_command) + strlen(linker_input_file) * 2 + 4));

    if (command == NULL)

    {

        fputs("FATAL: Memory allocation failed.", stderr);

        exit(EXIT_FAILURE);

    }

    strcpy(command, start_command);

    strcat(command, linker_input_file);

    strcat(command, " -o ");

    strcat(command, global_args.output_file);

    system(command);

    remove(linker_input_file);

    free(command);

}



void compile(FILE *inputfp)

{

    int str_size = strlen(global_args.output_file) + 3;

    char *input_file = (char *)malloc(str_size * sizeof(char));

    if (input_file == NULL)

    {

        fputs("FATAL: Memory allocation failed.", stderr);

        exit(EXIT_FAILURE);

    }

    strcpy(input_file, global_args.output_file);

    strcat(input_file, ".s");

    create_assembly_source(inputfp, input_file);

    execute_assembler(input_file);

    input_file[strlen(input_file) - 1] = 'o'; //Converting input_file to .o which was created by the assembler.

    execute_linker(input_file);

    free(input_file);

}



int main(int argc, char **argv)

{

    parse_args(argc, argv);



    FILE *inputfp = fopen(global_args.input_file, "r");



    if (inputfp == NULL)

    {

        fputs("ERROR: The input file couldn't be opened.", stderr);

        exit(EXIT_FAILURE);

    }



    verify_syntax(inputfp);

    compile(inputfp);



    fclose(inputfp);

    return EXIT_SUCCESS;

}

stack.c

#include "stack.h"

#include <stddef.h>

#include <stdio.h>



struct StackNode *newNode(int jmp_addr, int second_jmp_addr)

{

    struct StackNode *stack_node =

        (struct StackNode *)malloc(sizeof(struct StackNode));

    if (stack_node == NULL)

    {

        fputs("FATAL: Memory allocation failed.", stderr);

        exit(EXIT_FAILURE);

    }

    stack_node->jmp_address = jmp_addr;

    stack_node->second_jmp_address = second_jmp_addr;

    stack_node->next = NULL;

    return stack_node;

}



bool isEmpty(struct StackNode *root)

{

    return !root;

}



void push(struct StackNode **root, int jmp_addr, int second_jmp_addr)

{

    struct StackNode *stack_node = newNode(jmp_addr, second_jmp_addr);

    stack_node->next = *root;

    *root = stack_node;

}



bool pop(struct StackNode **root)

{

    if (isEmpty(*root))

        return false;

    struct StackNode *temp = *root;

    *root = (*root)->next;

    free(temp);

    return true;

}

$begingroup$
Updated my answer after having more time.
$endgroup$
– Peter Cordes
Mar 21 at 5:57 — Mar 21 at 5:57

score 4 · Accepted Answer · 2019-03-21 06:03:06Z

I'm mostly reviewing the code-gen choices, not the style / implementation of the compiler itself.

The compiler itself mostly looks fine in the parts I've skimmed over (looking for how it uses the format strings). The design is pretty clear and easy to dive into, and the C is well-formatted. It could be simpler and/or more efficient in some cases. (e.g. an array instead of linked-list for the stack would be faster, but without a C++ std::vector to handle reallocation for you you'd have to pick an arbitrary large size. But that's normally fine because untouched virtual memory normally doesn't cost anything.)

Writing an optimizing compiler is a huge task. If you want efficient asm, by far your best bet is to take advantage of an existing optimizing compiler back-end like LLVM or GCC.

For example, compile to LLVM-IR instead of x86 asm directly, and let LLVM-end produce optimized asm for whatever target platform. (This also gets you portability to multiple platforms, like ARM, MIPS, and/or different OSes, except for your dependence on x86 Linux system calls.) LLVM-IR looks like an assembly language, and can be compiled from a stand-alone file by clang, similar to how as + ld can compile a .s

Or see https://llvm.org/docs/tutorial/index.html for a (possibly out-of-date) tutorial on implementing a language with LLVM as a back end, where you write the front-end. (Writing your compiler as a gcc front-end is also an option, using the GCC back-end instead of LLVM.)

It's kind of fun that BF is such a simple language that you can get non-terrible performance by transliterating it into x86 asm, though, with only local optimizations like collapsing a sequence if increments into a single add/sub.

Another option is to transliterate to C, and feed that to an optimizing C compiler. Let it optimize a few frequently-used memory cells (array elements) into registers. Or even auto-vectorize a sequence of INCREMENT and INCREMENT_POINTER into paddb instructions.

32-bit Linux `int 0x80` is low-performance

If you're going to make a lot of system calls, x86-64 syscall has lower overhead than 32-bit int 0x80, both intrinsically (the int instruction itself is slower than syscall) and a little bit inside the kernel with slightly more efficient dispatch for native system calls instead of compat 32-bit. However, this is pretty minor compared to system call overhead with Spectre + Meltdown mitigation enabled on current CPUs, though (that makes adds enough overhead to make simple system calls maybe 10x more expensive than before, dwarfing the difference between int 0x80 vs. syscall or sysenter)

If you do want to keep using 32-bit code, calling into the VDSO so it can use sysenter is faster than using the simple / slow 32-bit x86 int 0x80 ABI.

Or better, use libc's stdio to buffer I/O. emit code to call putchar_unlocked and getchar_unlocked. (The 32-bit ABI sucks, passing args on the stack. The x86-64 ABI uses register args. In x86-64 System V, rdi is a call-clobbered register so you'd probably want to use rbx for your pointer.)

You might have to call fflush_unlocked(stdout) before getchar, because (unlike C++ cin/cout), getchar doesn't automatically flush buffered output before blocking to read input, and a BF program might print a prompt that doesn't end with n before reading input. To keep the generated code compact, you might want to emit a definition (at the top of the asm) for a helper function that sets up the arg-passing and calls the real libc function.

In I/O intensive BF programs, buffered stdio could maybe give you a speedup by a factor of 100 or 1000, with getchar / putchar_unlocked being nearly free compared to a system call that's only needed much less frequently.

Code-gen choices

xor ebx,ebx / inc ebx saves 2 bytes but isn't generally faster than mov ebx, 1. But you're only doing this as part of a system-call sequence, and you're potentially compiling large programs into big straight-line sequences of code, so this is not bad. You actually have a register with a known value of 1 (EDX) at all times other than inside the ALLOC block, so mov ebx, edx is by far your best bet.

add eax,30000 / mov ebx,eax to EBX should be done with lea ebx, [eax + 30000] since you overwrite EAX afterwards anyway. The LEA instruction is good as a copy-and-add.

You can also use LEA to put small constants in registers, given a known base. You have EDX=1, so you can do lea eax, [edx-1 + 0x04] for __NR_write. This is a 3-byte instruction, same as xor-zero/inc or push imm8/pop, but almost as efficient as mov eax, 0x04. (Runs on fewer possible ports on many CPUs, but still only 1 single-uop instruction with single-cycle latency on all relevant CPUs. https://agner.org/optimize/)

ALLOCATE_MEMORY: use the BSS unless you want dynamic growth

You allocate this at program startup, and never dynamically allocate more. If you wanted to support that (e.g. handle SIGSEGV if the program uses more memory than had allocated), then yes, brk is a decent choice for allocating more contiguous memory. (mmap(MAP_ANONYMOUS) does let you give a hint address, which the kernel will use if there aren't any conflicting mappings, but brk leaves room to grow.)

But for your current simple design, where you allocate it all up front, you might as well use the BSS and let the kernel's ELF program loader do the allocation for you.

You're targeting Linux, which like other modern OSes does lazy allocation. Allocated virtual pages aren't backed by real physical pages until they're written. They're often not even wired into the hardware page table. Thus it costs you basically nothing to use a large-ish BSS array like 2MB, and won't waste physical RAM for programs that only touch the first 4k of it. 2MB is the size of an x86-64 hugepage, so a BF program that uses most of its 2MB might get fewer TLB misses if the kernel decides to use one hugepage instead of separate 4k pages. (Linux can use transparent hugepages for the BSS, I think.)

const char PROGRAM_INIT[] = ".intel_syntax noprefixn"

                         ".global _startn"

                         "_start:n"

                         ".lcomm bf_memory, 2 * 1024 * 1024n"

                         "mov    edx, 1n"

                         "mov    edi, OFFSET bf_memoryn"

Note I changed the variable name from ALLOCATE_MEMORY, because that doesn't clearly imply once-at-startup, and some of the stuff in that string is definitely only once.

.lcomm allocates space in the BSS without needing to use .section .bss. (And unlike .comm, the label is a file-scoped local label.)

I tested this, and it correctly assembles to a program that maps 0x200000 bytes of virtual memory. You can see the segment in the program headers with readelf -a.

ADD vs. SUB decision

        if (counter > 0)

            fprintf(outputfp, INCREMENT, counter);

        else if (counter < 0)

            fprintf(outputfp, DECREMENT, (-counter));

This is unnecessary: x86's ADD instruction works with negative immediates. For the plain INCREMENT (memory-destination), the operand-size is byte so it's always going to be an 8-bit immediate, even if it has to be truncated at assemble time. (Which is fine; it means the BF program wrapped its counter, and that truncation by the assembler does the same modulo-256 that you want to implement for BF, I assume.) So there's no saving from using sub -128 instead of add 128 to allow a narrower immediate. 8-bit 128 and -128 are the same number.

You might as well just ADD the signed count, unless you care about how CF and OF flags are set.

        if (counter & 0xFF != 0)

            fprintf(outputfp, INCREMENT, counter);

        // else optimize away cancelling / wrapping operations

For the INCREMENT_POINTER version of this, you want to use add edi, -128 instead of sub 128, because -128 can use a more compact signed 8-bit imm8 encoding. Immediates in the [-128, 127] range can use the add r/m32, sign_extended_imm8 encoding.

It's worth considering keeping your pointer in EAX so the compact add/sub eax, imm32 encoding is available (saving 1 byte vs. the generic add/sub r/m32, imm32 encoding that uses a ModRM to encode the register or addressing mode instead of implicit EAX). But EAX is used for system calls, so they'd need to mov eax, ecx after system calls to put the pointer back. Since pointer increments of more than 128 are probably rare in most BF programs, and it only saves 1 byte of code-size then, it's not worth it.

The Linux int 0x80 ABI does not clobber any registers other than the return value in EAX, so your best bet is to keep your pointer in ECX, not EDI. Then you don't need any mov ecx, edi instructions.

So for the pointer version, you might want to do

        if (counter > 0)

            fprintf(outputfp, DECREMENT_POINTER, -counter);  // Prefer negative immediates for code-size

        else if (counter < 0)

            fprintf(outputfp, INCREMENT_POINTER, counter);

        // else no net offset: optimize away

Passing signed negative counts to printf means that for correctness we should change the format strings to use decimal counts, %d, not 0x%02x. (And BTW, you can get printf to print the 0x part by using %#02x). GCC prints asm using signed decimal integers, so there's precedent there for simplicity of doing it this way, even though hex is a more efficient format to write and to parse.

(Your version with %02x was not strictly correct: counter is a signed int but you're passing it a format string that expects unsigned int. On a 2's complement system where they're the same width with no padding, that's equivalent, but you don't need to assume that if you make your types match your format string. Compile with -Wall to get gcc/clang to warn about format string mismatches. You might need to enable optimization to get warnings to work with global variables instead of string literals.)

Pointer scaling??

        if (counter > 0)

            fprintf(outputfp, INCREMENT_POINTER, (unsigned int)(counter * sizeof(char) * 4));

        else if (counter < 0)

            fprintf(outputfp, DECREMENT_POINTER, (unsigned int)((-counter) * sizeof(char) * 4));

        break;

Scaling the target count by the host sizeof(char) is a bug: it will compile differently depending on what platform you're cross-compiling from. Or it would if sizeof(char) wasn't fixed at 1 by ISO C.

It's still a logical error, unless your program is only supposed to work when running on the target machine (not cross-compiling). (If that was the case, you could avoid hard-coding the system-call numbers, and #include <unistd_32.h> and use __NR_write instead of 0x04. But there's no reason this compiler shouldn't itself be portable C and run on PowerPC Windows NT for example, still producing the same x86 asm.)

I don't understand why we're scaling by 4, though, because you're using add/sub byte ptr [edi], so your cells are only 1 byte wide in the target machine's memory. Maybe I missed something elsewhere, and you're using the other 3 bytes of each cell for something?

For most modern CPUs, inc/dec are not slower than add/sub on registers, so (if remove the scale factor) you could look for that peephole optimization on INCREMENT_POINTER / DECREMENT_POINTER. (INC instruction vs ADD 1: Does it matter?). This will save you 2 bytes in 32-bit mode, or 1 byte in 64-bit mode. (Although instruction-fetch in the front-end is probably not a bottleneck when executing code generated by this compiler.)

But on modern Intel CPUs inc/dec [mem] is 3 uops vs. 2 for add/sub 1, so you don't want to look for inc/dec as a peephole optimization for INCREMENT. That's why I didn't mention it in the previous section.

Thus inc edi/dec edi is a peephole optimization we can look for. I've renamed the format-strings to x86-specific instruction names. If you're planning ports to other targets, you might keep the ADD/SUB names and have a name like SUB1_POINTER for that special case. But some ISAs like MIPS for example don't have a sub-immediate instruction at all: just add-immediate with a sign-extended immediate. (Some MIPS assemblers support subi as a pseudo-instruction).

    // look for the inc / dec peephole optimization.

        if (counter > 0) {

            if (counter == 1) {       // && tune != silvermont or P4

                fprintf(outputfp, INC_POINTER);

            } else {

                fprintf(outputfp, SUB_POINTER, -counter);  // Prefer negative immediates for larger range with imm8

            }

        } else if (counter < 0) {

            if (counter == -1) {

                fprintf(outputfp, DEC_POINTER);

            } else {

                fprintf(outputfp, ADD_POINTER, counter);

            }

        }

        // else no net offset: optimize away

Using FLAGS results

LOOP_START / LOOP_END can omit the cmp byte ptr [edi],0x00 if ZF is already set according to the 0 / non-zero status of [edi]. This is the case after a non-zero counter for INCREMENT / DECREMENT, because add/sub set ZF according to the zero / non-zero status of the result.

Looking for this optimization might be as easy as keeping a flags_good local variable that's set to true inside the if(counter & 0xFF != 0) to emit a memory-destination add, and cleared by any ADD_POINTER/DEC_POINTER / etc.

Optimizing for multi-byte I/O with one system call

If we can look for patterns like inc / write / inc / write, we can turn that into an n-byte sys_write. (Or if you decide to use libc stdio, an fwrite function call, but with stdio buffering this optimization becomes much less valuable.)

Maybe after a '.', look-ahead for a sequence of +.. This might miss the optimization in some cases like -++ instead of +, but dumb source defeating optimizing compilers is not your problem if there's no reason anyone would ever write that in a BF program.

stdio lookahead in the compiler

Instead of using fseek to rewind, you can use ungetc(c, stdin); to put back a char you read. ISO C guarantees that at least 1 char of pushback is supported, which is all you need. I assume glibc usually supports more.

Another option might be to put reading the next character at the bottom of the loop, separate from the loop condition, so you could use continue;. That might be kind of ugly, though.

Or since it's not rare for your parser to want lookahead, bake that in to the main loop.

int c, next_c = fgetc(inputfp);

while (c = next_c, next_c = fgetc(inputfp), c != EOF) {

    ...

}

The loop body will run with c = last_char and next_c = EOF. I'm not sure if there's a better idiom for this; I just made this up and I'm not sure it's as clean as I'd like. Perhaps a do{}while() loop structure would be better, with the c = next_c stuff at the top.

score 4 · Accepted Answer · 2019-03-21 06:03:06Z

I'm mostly reviewing the code-gen choices, not the style / implementation of the compiler itself.

The compiler itself mostly looks fine in the parts I've skimmed over (looking for how it uses the format strings). The design is pretty clear and easy to dive into, and the C is well-formatted. It could be simpler and/or more efficient in some cases. (e.g. an array instead of linked-list for the stack would be faster, but without a C++ std::vector to handle reallocation for you you'd have to pick an arbitrary large size. But that's normally fine because untouched virtual memory normally doesn't cost anything.)

Writing an optimizing compiler is a huge task. If you want efficient asm, by far your best bet is to take advantage of an existing optimizing compiler back-end like LLVM or GCC.

For example, compile to LLVM-IR instead of x86 asm directly, and let LLVM-end produce optimized asm for whatever target platform. (This also gets you portability to multiple platforms, like ARM, MIPS, and/or different OSes, except for your dependence on x86 Linux system calls.) LLVM-IR looks like an assembly language, and can be compiled from a stand-alone file by clang, similar to how as + ld can compile a .s

Or see https://llvm.org/docs/tutorial/index.html for a (possibly out-of-date) tutorial on implementing a language with LLVM as a back end, where you write the front-end. (Writing your compiler as a gcc front-end is also an option, using the GCC back-end instead of LLVM.)

It's kind of fun that BF is such a simple language that you can get non-terrible performance by transliterating it into x86 asm, though, with only local optimizations like collapsing a sequence if increments into a single add/sub.

Another option is to transliterate to C, and feed that to an optimizing C compiler. Let it optimize a few frequently-used memory cells (array elements) into registers. Or even auto-vectorize a sequence of INCREMENT and INCREMENT_POINTER into paddb instructions.

32-bit Linux `int 0x80` is low-performance

If you're going to make a lot of system calls, x86-64 syscall has lower overhead than 32-bit int 0x80, both intrinsically (the int instruction itself is slower than syscall) and a little bit inside the kernel with slightly more efficient dispatch for native system calls instead of compat 32-bit. However, this is pretty minor compared to system call overhead with Spectre + Meltdown mitigation enabled on current CPUs, though (that makes adds enough overhead to make simple system calls maybe 10x more expensive than before, dwarfing the difference between int 0x80 vs. syscall or sysenter)

If you do want to keep using 32-bit code, calling into the VDSO so it can use sysenter is faster than using the simple / slow 32-bit x86 int 0x80 ABI.

Or better, use libc's stdio to buffer I/O. emit code to call putchar_unlocked and getchar_unlocked. (The 32-bit ABI sucks, passing args on the stack. The x86-64 ABI uses register args. In x86-64 System V, rdi is a call-clobbered register so you'd probably want to use rbx for your pointer.)

You might have to call fflush_unlocked(stdout) before getchar, because (unlike C++ cin/cout), getchar doesn't automatically flush buffered output before blocking to read input, and a BF program might print a prompt that doesn't end with n before reading input. To keep the generated code compact, you might want to emit a definition (at the top of the asm) for a helper function that sets up the arg-passing and calls the real libc function.

In I/O intensive BF programs, buffered stdio could maybe give you a speedup by a factor of 100 or 1000, with getchar / putchar_unlocked being nearly free compared to a system call that's only needed much less frequently.

Code-gen choices

xor ebx,ebx / inc ebx saves 2 bytes but isn't generally faster than mov ebx, 1. But you're only doing this as part of a system-call sequence, and you're potentially compiling large programs into big straight-line sequences of code, so this is not bad. You actually have a register with a known value of 1 (EDX) at all times other than inside the ALLOC block, so mov ebx, edx is by far your best bet.

add eax,30000 / mov ebx,eax to EBX should be done with lea ebx, [eax + 30000] since you overwrite EAX afterwards anyway. The LEA instruction is good as a copy-and-add.

You can also use LEA to put small constants in registers, given a known base. You have EDX=1, so you can do lea eax, [edx-1 + 0x04] for __NR_write. This is a 3-byte instruction, same as xor-zero/inc or push imm8/pop, but almost as efficient as mov eax, 0x04. (Runs on fewer possible ports on many CPUs, but still only 1 single-uop instruction with single-cycle latency on all relevant CPUs. https://agner.org/optimize/)

ALLOCATE_MEMORY: use the BSS unless you want dynamic growth

You allocate this at program startup, and never dynamically allocate more. If you wanted to support that (e.g. handle SIGSEGV if the program uses more memory than had allocated), then yes, brk is a decent choice for allocating more contiguous memory. (mmap(MAP_ANONYMOUS) does let you give a hint address, which the kernel will use if there aren't any conflicting mappings, but brk leaves room to grow.)

But for your current simple design, where you allocate it all up front, you might as well use the BSS and let the kernel's ELF program loader do the allocation for you.

You're targeting Linux, which like other modern OSes does lazy allocation. Allocated virtual pages aren't backed by real physical pages until they're written. They're often not even wired into the hardware page table. Thus it costs you basically nothing to use a large-ish BSS array like 2MB, and won't waste physical RAM for programs that only touch the first 4k of it. 2MB is the size of an x86-64 hugepage, so a BF program that uses most of its 2MB might get fewer TLB misses if the kernel decides to use one hugepage instead of separate 4k pages. (Linux can use transparent hugepages for the BSS, I think.)

const char PROGRAM_INIT[] = ".intel_syntax noprefixn"

                         ".global _startn"

                         "_start:n"

                         ".lcomm bf_memory, 2 * 1024 * 1024n"

                         "mov    edx, 1n"

                         "mov    edi, OFFSET bf_memoryn"

Note I changed the variable name from ALLOCATE_MEMORY, because that doesn't clearly imply once-at-startup, and some of the stuff in that string is definitely only once.

.lcomm allocates space in the BSS without needing to use .section .bss. (And unlike .comm, the label is a file-scoped local label.)

I tested this, and it correctly assembles to a program that maps 0x200000 bytes of virtual memory. You can see the segment in the program headers with readelf -a.

ADD vs. SUB decision

        if (counter > 0)

            fprintf(outputfp, INCREMENT, counter);

        else if (counter < 0)

            fprintf(outputfp, DECREMENT, (-counter));

This is unnecessary: x86's ADD instruction works with negative immediates. For the plain INCREMENT (memory-destination), the operand-size is byte so it's always going to be an 8-bit immediate, even if it has to be truncated at assemble time. (Which is fine; it means the BF program wrapped its counter, and that truncation by the assembler does the same modulo-256 that you want to implement for BF, I assume.) So there's no saving from using sub -128 instead of add 128 to allow a narrower immediate. 8-bit 128 and -128 are the same number.

You might as well just ADD the signed count, unless you care about how CF and OF flags are set.

        if (counter & 0xFF != 0)

            fprintf(outputfp, INCREMENT, counter);

        // else optimize away cancelling / wrapping operations

For the INCREMENT_POINTER version of this, you want to use add edi, -128 instead of sub 128, because -128 can use a more compact signed 8-bit imm8 encoding. Immediates in the [-128, 127] range can use the add r/m32, sign_extended_imm8 encoding.

It's worth considering keeping your pointer in EAX so the compact add/sub eax, imm32 encoding is available (saving 1 byte vs. the generic add/sub r/m32, imm32 encoding that uses a ModRM to encode the register or addressing mode instead of implicit EAX). But EAX is used for system calls, so they'd need to mov eax, ecx after system calls to put the pointer back. Since pointer increments of more than 128 are probably rare in most BF programs, and it only saves 1 byte of code-size then, it's not worth it.

The Linux int 0x80 ABI does not clobber any registers other than the return value in EAX, so your best bet is to keep your pointer in ECX, not EDI. Then you don't need any mov ecx, edi instructions.

So for the pointer version, you might want to do

        if (counter > 0)

            fprintf(outputfp, DECREMENT_POINTER, -counter);  // Prefer negative immediates for code-size

        else if (counter < 0)

            fprintf(outputfp, INCREMENT_POINTER, counter);

        // else no net offset: optimize away

Passing signed negative counts to printf means that for correctness we should change the format strings to use decimal counts, %d, not 0x%02x. (And BTW, you can get printf to print the 0x part by using %#02x). GCC prints asm using signed decimal integers, so there's precedent there for simplicity of doing it this way, even though hex is a more efficient format to write and to parse.

(Your version with %02x was not strictly correct: counter is a signed int but you're passing it a format string that expects unsigned int. On a 2's complement system where they're the same width with no padding, that's equivalent, but you don't need to assume that if you make your types match your format string. Compile with -Wall to get gcc/clang to warn about format string mismatches. You might need to enable optimization to get warnings to work with global variables instead of string literals.)

Pointer scaling??

        if (counter > 0)

            fprintf(outputfp, INCREMENT_POINTER, (unsigned int)(counter * sizeof(char) * 4));

        else if (counter < 0)

            fprintf(outputfp, DECREMENT_POINTER, (unsigned int)((-counter) * sizeof(char) * 4));

        break;

Scaling the target count by the host sizeof(char) is a bug: it will compile differently depending on what platform you're cross-compiling from. Or it would if sizeof(char) wasn't fixed at 1 by ISO C.

It's still a logical error, unless your program is only supposed to work when running on the target machine (not cross-compiling). (If that was the case, you could avoid hard-coding the system-call numbers, and #include <unistd_32.h> and use __NR_write instead of 0x04. But there's no reason this compiler shouldn't itself be portable C and run on PowerPC Windows NT for example, still producing the same x86 asm.)

I don't understand why we're scaling by 4, though, because you're using add/sub byte ptr [edi], so your cells are only 1 byte wide in the target machine's memory. Maybe I missed something elsewhere, and you're using the other 3 bytes of each cell for something?

For most modern CPUs, inc/dec are not slower than add/sub on registers, so (if remove the scale factor) you could look for that peephole optimization on INCREMENT_POINTER / DECREMENT_POINTER. (INC instruction vs ADD 1: Does it matter?). This will save you 2 bytes in 32-bit mode, or 1 byte in 64-bit mode. (Although instruction-fetch in the front-end is probably not a bottleneck when executing code generated by this compiler.)

But on modern Intel CPUs inc/dec [mem] is 3 uops vs. 2 for add/sub 1, so you don't want to look for inc/dec as a peephole optimization for INCREMENT. That's why I didn't mention it in the previous section.

Thus inc edi/dec edi is a peephole optimization we can look for. I've renamed the format-strings to x86-specific instruction names. If you're planning ports to other targets, you might keep the ADD/SUB names and have a name like SUB1_POINTER for that special case. But some ISAs like MIPS for example don't have a sub-immediate instruction at all: just add-immediate with a sign-extended immediate. (Some MIPS assemblers support subi as a pseudo-instruction).

    // look for the inc / dec peephole optimization.

        if (counter > 0) {

            if (counter == 1) {       // && tune != silvermont or P4

                fprintf(outputfp, INC_POINTER);

            } else {

                fprintf(outputfp, SUB_POINTER, -counter);  // Prefer negative immediates for larger range with imm8

            }

        } else if (counter < 0) {

            if (counter == -1) {

                fprintf(outputfp, DEC_POINTER);

            } else {

                fprintf(outputfp, ADD_POINTER, counter);

            }

        }

        // else no net offset: optimize away

Using FLAGS results

LOOP_START / LOOP_END can omit the cmp byte ptr [edi],0x00 if ZF is already set according to the 0 / non-zero status of [edi]. This is the case after a non-zero counter for INCREMENT / DECREMENT, because add/sub set ZF according to the zero / non-zero status of the result.

Looking for this optimization might be as easy as keeping a flags_good local variable that's set to true inside the if(counter & 0xFF != 0) to emit a memory-destination add, and cleared by any ADD_POINTER/DEC_POINTER / etc.

Optimizing for multi-byte I/O with one system call

If we can look for patterns like inc / write / inc / write, we can turn that into an n-byte sys_write. (Or if you decide to use libc stdio, an fwrite function call, but with stdio buffering this optimization becomes much less valuable.)

Maybe after a '.', look-ahead for a sequence of +.. This might miss the optimization in some cases like -++ instead of +, but dumb source defeating optimizing compilers is not your problem if there's no reason anyone would ever write that in a BF program.

stdio lookahead in the compiler

Instead of using fseek to rewind, you can use ungetc(c, stdin); to put back a char you read. ISO C guarantees that at least 1 char of pushback is supported, which is all you need. I assume glibc usually supports more.

Another option might be to put reading the next character at the bottom of the loop, separate from the loop condition, so you could use continue;. That might be kind of ugly, though.

Or since it's not rare for your parser to want lookahead, bake that in to the main loop.

int c, next_c = fgetc(inputfp);

while (c = next_c, next_c = fgetc(inputfp), c != EOF) {

    ...

}

The loop body will run with c = last_char and next_c = EOF. I'm not sure if there's a better idiom for this; I just made this up and I'm not sure it's as clean as I'd like. Perhaps a do{}while() loop structure would be better, with the c = next_c stuff at the top.

Brainfreeze: A Brainfuck compiler in CBrainfuck to x86 Assembly CompilerASCII table in...

BrainFreeze

bf_asm.h

compiler.h

stack.h

compiler.c

stack.c

BrainFreeze

bf_asm.h

compiler.h

stack.h

compiler.c

stack.c

BrainFreeze

bf_asm.h

compiler.h

stack.h

compiler.c

stack.c

BrainFreeze

bf_asm.h

compiler.h

stack.h

compiler.c

stack.c

1 Answer 1

32-bit Linux int 0x80 is low-performance

Code-gen choices

ALLOCATE_MEMORY: use the BSS unless you want dynamic growth

ADD vs. SUB decision

Pointer scaling??

Using FLAGS results

Optimizing for multi-byte I/O with one system call

stdio lookahead in the compiler

Your Answer

Sign up or log in

Post as a guest

Post as a guest

1 Answer 1

1 Answer 1

32-bit Linux int 0x80 is low-performance

Code-gen choices

ALLOCATE_MEMORY: use the BSS unless you want dynamic growth

ADD vs. SUB decision

Pointer scaling??

Using FLAGS results

Optimizing for multi-byte I/O with one system call

stdio lookahead in the compiler

32-bit Linux int 0x80 is low-performance

Code-gen choices

ALLOCATE_MEMORY: use the BSS unless you want dynamic growth

ADD vs. SUB decision

Pointer scaling??

Using FLAGS results

Optimizing for multi-byte I/O with one system call

stdio lookahead in the compiler

32-bit Linux int 0x80 is low-performance

Code-gen choices

ALLOCATE_MEMORY: use the BSS unless you want dynamic growth

ADD vs. SUB decision

Pointer scaling??

Using FLAGS results

Optimizing for multi-byte I/O with one system call

stdio lookahead in the compiler

32-bit Linux int 0x80 is low-performance

Code-gen choices

ALLOCATE_MEMORY: use the BSS unless you want dynamic growth

ADD vs. SUB decision

Pointer scaling??

Using FLAGS results

Optimizing for multi-byte I/O with one system call

stdio lookahead in the compiler

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

1 Answer
1

32-bit Linux `int 0x80` is low-performance

1 Answer
1

1 Answer
1

32-bit Linux `int 0x80` is low-performance

32-bit Linux `int 0x80` is low-performance

32-bit Linux `int 0x80` is low-performance

32-bit Linux `int 0x80` is low-performance