Corey's Projects: GetField



The getField C utility

In modern programming languages, splitting a string into parts is a simple operation. In the C world, however, this seemingly simple task is not part of the standard ANSI C library. If you use C++, there is a string class generally available that helps in this regard, but even its behavior may vary from compiler to compiler.

Towards this end, many years ago I wrote a function that I've used on practically every C programming job I've ever had. Initially, this was simply used to extract a field from a string and place it into a buffer provided by the caller. That is what the function getField does.
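As a rough illustration of the idea (this is not the actual getField source, which is linked further down), here is a minimal sketch of copying one delimited field into a caller-supplied buffer. The name `sketch_get_field`, its parameter order, the 0-based field index, and the 0 / -1 return convention are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not the author's getField.
 * Copies field number `index` (0-based) of the C string `src`, split on the
 * single character `delim`, into `out` (capacity `outSize` bytes).
 * Returns 0 on success, -1 if the field is missing or does not fit. */
int sketch_get_field(const char *src, char delim, size_t index,
                     char *out, size_t outSize)
{
    assert(src != NULL && out != NULL);
    assert(delim != '\0');          /* the terminator cannot be a delimiter */

    const char *start = src;

    /* Skip `index` delimiters to reach the start of the wanted field. */
    for (size_t i = 0; i < index; i++) {
        const char *next = strchr(start, delim);
        if (next == NULL)
            return -1;              /* fewer fields than requested */
        start = next + 1;
    }

    /* The field ends at the next delimiter or at the terminating NUL. */
    const char *end = strchr(start, delim);
    size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);

    if (len + 1 > outSize)
        return -1;                  /* caller's buffer is too small */

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

Called with the string "alpha,beta,gamma", a comma delimiter, and index 1, this sketch fills the caller's buffer with "beta". Note that it relies on the NUL terminator and a one-byte delimiter, which is exactly the limitation described next.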

This worked fine as long as all I had to deal with was English ASCII text, defined as a sequence of single-byte characters terminated by a NUL character (i.e., a C string). My first job dealing with internationalization changed this drastically.

ASCII was defined many decades ago by monolingual English speakers, and it didn't consider other languages. It defines 128 characters, half of what a byte can hold, and that seemed plenty: 26 for lower-case letters, 26 for upper-case letters, 10 for the digits, a few dozen for special characters (percent, ampersand, caret, etc.), and the rest for hidden control characters for computer talk (NUL, SOH, EOT, etc.).

But then the Europeans complained, "What about our umlauts and cedillas and other special characters?" So the powers that be decided to define "extended ASCII": this used the other 128 values in a byte to define all those extra characters that European languages use. Officially called ISO 8859-1, this is frequently called Latin-1, and it is often the default character set on computers sold in western countries. Many western web sites also use this encoding.

This website, however, uses what is fast becoming the best encoding for internationalization: UTF-8. UTF-8 is a type of Unicode encoding whereby all letters in the world, in any language, can be uniquely identified. So Japanese, Arabic, Hebrew, Mandarin, and any other language can be displayed properly.

I wanted my getField function to be as generic as possible: therefore, it shouldn't care about what kind of encoding is used for the data. Towards this end, I wrote another function, called getFieldMB, where MB stood for "Multi-Byte". This takes a data buffer (not a string) of any size, a delimiter of any size, and fills a buffer created by the caller with the desired field data. Using this, my function didn't have to know that the most commonly used delimiter character in Korean programming ends up being three bytes in length.
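To make the difference concrete, here is a minimal sketch of the multi-byte idea: both the data and the delimiter are treated as raw byte ranges with explicit lengths, so no NUL terminator and no particular encoding is assumed. The names `sketch_mem_find` and `sketch_get_field_mb`, the parameter order, and the return convention are hypothetical and are not taken from the real getFieldMB.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: first occurrence of the nLen-byte `needle` inside the
 * hLen-byte `hay`, or NULL if it does not occur. */
static const unsigned char *sketch_mem_find(const unsigned char *hay, size_t hLen,
                                            const unsigned char *needle, size_t nLen)
{
    if (nLen == 0 || nLen > hLen)
        return NULL;
    for (size_t i = 0; i + nLen <= hLen; i++) {
        if (memcmp(hay + i, needle, nLen) == 0)
            return hay + i;
    }
    return NULL;
}

/* Hypothetical sketch, not the author's getFieldMB.
 * Copies field `index` (0-based) of `data` into `out` and stores its byte
 * length in *outLen. No terminator is written: the result is raw bytes.
 * Returns 0 on success, -1 if the field is missing or does not fit. */
int sketch_get_field_mb(const unsigned char *data, size_t dataLen,
                        const unsigned char *delim, size_t delimLen,
                        size_t index,
                        unsigned char *out, size_t outSize, size_t *outLen)
{
    assert(data != NULL && delim != NULL && out != NULL && outLen != NULL);
    assert(delimLen > 0);

    const unsigned char *start = data;
    size_t remaining = dataLen;

    /* Step over `index` delimiters, each delimLen bytes wide. */
    for (size_t i = 0; i < index; i++) {
        const unsigned char *hit = sketch_mem_find(start, remaining, delim, delimLen);
        if (hit == NULL)
            return -1;
        hit += delimLen;
        remaining -= (size_t)(hit - start);
        start = hit;
    }

    /* The field runs to the next delimiter or to the end of the data. */
    const unsigned char *end = sketch_mem_find(start, remaining, delim, delimLen);
    size_t len = (end != NULL) ? (size_t)(end - start) : remaining;

    if (len > outSize)
        return -1;
    memcpy(out, start, len);
    *outLen = len;
    return 0;
}
```

With a three-byte delimiter such as the UTF-8 bytes E3 80 81, the sketch needs no special handling: the delimiter is just three bytes to match, and the fields are whatever bytes lie between matches.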

This solved all my internationalization problems, but then I ran into one regarding efficiency. I was designing a startup process whereby a large configuration file (with multi-lingual data) would be loaded into memory, an algorithm would be used to call getFieldMB many times to put the data into linked lists, and then the original memory would be freed. While this worked fine in test, our customer complained that the speed was unacceptably slow on the hardware they were using.

I determined that if I could convert my linked lists so that they held pointers to the data in the large config file (which was held in memory), that would save me a memory allocation operation for every piece of data, as well as the CPU cycles required to copy the data into a second buffer. Thus was born my third function in this suite, getLocationMB. Using this, there is no copying; it returns a pointer (not an offset) to where the requested field is located in the buffer that was provided.
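The zero-copy approach can be sketched roughly as follows. The function scans the buffer once and hands back a pointer into it plus a length, so nothing is allocated or copied. The name `sketch_get_location_mb`, its signature, and the use of a NULL return for "no such field" are assumptions for illustration and may differ from the real getLocationMB.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not the author's getLocationMB.
 * Finds field `index` (0-based) inside `data` without copying anything.
 * On success returns a pointer to the field's first byte (inside `data`)
 * and stores the field's byte length in *fieldLen; returns NULL if there
 * is no such field. */
const unsigned char *sketch_get_location_mb(const unsigned char *data, size_t dataLen,
                                            const unsigned char *delim, size_t delimLen,
                                            size_t index, size_t *fieldLen)
{
    assert(data != NULL && delim != NULL && fieldLen != NULL);
    assert(delimLen > 0);

    const unsigned char *stop = data + dataLen;  /* one past the last byte */
    const unsigned char *start = data;           /* start of current field */
    const unsigned char *p = data;               /* scan position */
    size_t current = 0;                          /* index of current field */

    while ((size_t)(stop - p) >= delimLen) {
        if (memcmp(p, delim, delimLen) == 0) {
            if (current == index) {
                *fieldLen = (size_t)(p - start);
                return start;
            }
            p += delimLen;                       /* step over the delimiter */
            start = p;
            current++;
        } else {
            p++;
        }
    }

    /* No more delimiters: the last field runs to the end of the buffer. */
    if (current == index) {
        *fieldLen = (size_t)(stop - start);
        return start;
    }
    return NULL;
}
```

A list node built on this would store the returned pointer and length instead of its own copy of the bytes. The trade-off is that the original buffer has to stay allocated for as long as any node refers to it.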

If you have some very low-level, multi-lingual data manipulation to do, these functions may be of some use. There is one very large assumption throughout this code that you must know about: I assume that the size of a char and an unsigned char is one byte. If your system defines these differently, then these functions may not work for you. I've included a test suite so you can ensure the functions work properly on your system; it also shows many examples of how the functions are used.

View the code
getField()	The `getField` function.
getFieldMB()	The `getFieldMB` function.
getLocationMB()	The `getLocationMB` function.
countFieldsMB()	The `countFieldsMB` function.
utl_memStr()	The `utl_memStr` function, used internally.
getField.c	All five functions in one file.
getField.h	Information required by calling programs
test_getField.c	Attempts to break the functions within `getField.c`

Download the code
getfield.tar.gz	7.95 KB	getfield.zip	9.73 KB
Note: Command to create test executable: `# cc -o test_getField getField.c test_getField.c`
Note: Conventions used in these C files are my own particular style, refined over many years to make code easier to deal with and easier to read, in my opinion. For details of these conventions as well as their rationale, click here.
