Archive

Archive for the ‘Uncategorized’ Category

Using numeric literals to identify application domains

February 28, 2010 1 comment

I regularly get to look at large quantities of source and need to quickly get some idea of the application domains used within the code; for instance, statistical calculations, atomic physics or astronomical calculations. What tool might I use to quickly provide a list of the application domains used within some large code base?

Within many domains some set of numbers are regularly encountered and for several years I have had the idea of extracting numeric literals from source and comparing them against a database of interesting numbers (broken down by application). I have finally gotten around to writing a program do this extraction and matching, its imaginatively called numbers. The hard part is creating an extensive database of interesting numbers and I’m hoping that that releasing everything under an open source license will encourage people to add their own lists of domain specific interesting numbers.

The initial release is limited to numeric literals containing a decimal point or exponent, i.e., floating-point literals. These tend to be much more domain specific than integer literals and should cut down on the noise, making it easier for me tune the matching algorithms without (currently numeric equality within some fuzz-factor and fuzzy-matching of digits in the mantissa).

Sometimes an algorithm uses a set of numbers (e.g., crc checking) and a match should only occur if all values from this set are encountered.

The larger the interesting number database the larger the probability of matching against a value from an unrelated domain. The list of atomic weights seem to be very susceptible to this problem. I am currently investigating whether the words that co-occur with an instance of a numeric literal can be used to reduce this problem, perhaps by requiring that at least one word from a provided list occur in the source before a match is flagged for some literal.

Some numbers frequently occur in several domains. I am hoping that the word analysis might also be used to help reduce the number of domains that need to be considered (ideally to one).

Another problem is how to handle conversion factors, that is the numeric constant used to convert one unit to another, e.g., meters to furlongs. What is needed is to calculate all unit conversion values from a small set of ‘basic’ units. It would probably be very interesting to find conversions between some rarely seen units occurring in source.

I have been a bit surprised by how many apparently non-interesting floating-point literals occur in source and am hoping that a larger database will turn them into interesting numbers (I suspect this will only occur for a small fraction of these literals).

Categories: Uncategorized Tags: ,

Designing a processor for increased source portability costs

February 9, 2010 4 comments

How might a vendor make it difficult for developers to port open source applications to their proprietary cpu? Keeping the instruction set secret is one technique, another is to design a cpu that breaks often relied upon assumptions that developers have about the characteristics of the architecture on which their code executes.

Of course breaking architectural assumptions does not prevent open source being ported to a platform, but could significantly slow down the migration; giving more time for customers to become locked into the software shipped with the product.

Which assumptions should be broken to have the maximum impact on porting open source? The major open source applications (e.g., Firefox, MySQL, etc) run on 32/64-bit architectures that have an unsigned address space, whose integer representation uses two’s complement arithmetic and arithmetic operations on these integer values wrap on over/underflow.

32/64-bit. There is plenty of experience showing that migrating code from 16-bit to 32-bit environments can involve a lot of effort (e.g., migrating Windows 286/386 code to the Intel 486) and plenty of companies are finding the migration from 32 to 64-bits costly.

Designing a 128-bit processor might not be cost effective, but what about a 40-bit processor, like a number of high end DSP chips? I suspect that there are many power-of-2 assumptions lurking in a lot of code. A 40-bit integer type could prove very expensive for ports of code written with a 32/64-bit mindset (dare I suggest a 20-bit short; DSP vendors have preferred 16-bits because it uses less storage?).

Unsigned address space (i.e., lowest address is zero). Some code assumes that addresses with the top bit set are at the top end of memory and not just below the middle (e.g., some garbage collectors). Processors having a signed address space (i.e., zero is in the middle of storage) are sufficiently rare (e.g., the Inmos Transputer) that source is unlikely to support a HAS_SIGNED_ADDRESS build option.

How much code might need to be rewritten? I have no idea. While the code is likely to be very important there might not be a lot of it.

Two’s complement. Developers are constantly told not to write code that relies on the internal representation of data types. However, they might be forgiven for thinking that nobody uses anything other than two’s complement to represent integer types these days (I suspect Univac does not have that much new code ported to it’s range of one’s complement machines).

How much code will break when ported to a one’s complement processor? The representation of negative numbers in one’s complement and two’s complement is different and the representation of positive numbers the same. In common usage positive values are significantly more common than negative values and many variables (having a signed type) never get to hold a negative value.

While I have no practical experience, or know of anybody who has, I suspect the use of one’s complement might not be that big a problem. If you have experience please comment.

Arithmetic that wraps (i.e., positive values overflow negative and negative values underflow positive). While expressions explicitly written to wrap might be rare, how many calculations contain intermediate values that have wrapped but deliver a correct final result because they are ‘unwrapped’ by a subsequent operation?

Arithmetic operation that saturate are needed in applications such as graphics where, for instance, increasing the brightness should not suddenly cause the darkest setting to occur. Some graphics processors include support for arithmetic operations that saturate.

The impact of saturation arithmetic on portability is difficult to judge. A lot of code contains variables having signed char and short types, but when they appear as the operand in a binary operation these are promoted to int in C/C++/etc which probably has sufficient range not to overflow (most values created during program execution are small). Again I am lacking in practical experience and comments are welcome.

Floating-point. Many programs do not make use of floating-point arithmetic and those that do rarely manipulate such values at the bit level. Using a non-IEEE 754 floating-point representation will probably have little impact on the portability of applications of interest to most users.

Update. Thanks to Cate for pointing out that I had forgotten to discuss why using non-8-bit chars does is not a worthwhile design decision.

Both POSIX and the C/C++ Standards require that the char type be represented in at least 8 bits. Computers supporting less than 8-bits were still being used in the early 80s (e.g., the much beloved ICL 1900 supported 6-bit characters). The C Standard also requires that char be the smallest unit of addressable storage, which means that it must be possible for a pointer to point at an object having a char type.

Designing a processor where the smallest unit of storage is greater than 8-bits but not a power-of-2 is likely to substantially increase all sorts of costs and complicate things enormously (e.g., interfaces to main memory which are designed to work with power of two interfaces). The purpose of this design is to increase other people’s cost, not the proprietary vendor’s cost.

What about that pointer requirement? Perhaps the smallest unit of storage that a pointer could address might be 16 or 40 bits? Such processors exist and compiler writers have used both solutions to the problems they present. One solution is for a pointer to contain the address of the storage location + offset of the byte within that storage (Cray used this approach on a processor whose pointers could only point at 64-bit chunks of storage, with the compiler generating the code to extract the appropriate byte), the other is to declare that the char type occupies 40-bits (several DSP compilers have taken this approach).

Having the compiler declare that char is not 8-bits wide would cause all sorts of grief, so lets not go there. What about the Cray compiler approach?

Some of the address bits on 64-bit processors are not used yet (because few customers need that amount of storage) so compiler writers could get around host-processor pointers not supporting the granularity needed to point at 8-bit objects by storing the extra information in ‘unused’ pointer bits (the compiler generating the appropriate insertion and extraction code). The end result is that the compiler can hide pointer addressability issues :-).

Minimum information needed for writing a code generator

January 29, 2010 No comments

If a compiler writer is faced with writing a back-end for an undocumented processor, what is the minimum amount of information that needs to be reverse engineered?

It is possible to implement a universal computer using a cpu that has a single instruction, subtract-and-branch-if-lessthan-or-equal-to-zero. This is all very well, but processors based on using a single instruction are a bit thin on the ground and the processor to hand is likely to support a larger number of simpler instructions.

A subtract-and-branch-if-lessthan-or-equal-to-zero instruction could be implemented on a register based machine using the appropriate sequence of load-from-memory, subtract-two-registers, store-register-to-memory and jump-if-subtract-lessthan-or-equal instructions. Information about other instructions, such as add and multiply, would be useful for code optimization. (The Turing machine model of computation is sufficiently far removed from how most programs and computers operate that it is not considered further.)

Are we done? In theory yes, in practice no. A couple pf practical problems have been glossed over; how do source literals (e.g., "Hello World") initially get written to storage, where does the storage used by the program come from and what is the file format of an executable?

Literals that are not created using an instruction (most processors have instructions for loading an integer constant into a register) are written to a part of the executable file that is read into storage by the loader on program startup. All well and good if we know enough about the format of an executable file to be able to correct generate this information and can get the operating system to put in the desired storage location. Otherwise we have to figure out some other solution.

If we know two storage locations containing values that differ by one a sequence of instructions could subtract one value from the other to eventually obtain any desired value. This bootstrap process would speed up as a wider range of know value/location pairs was built up.

How do we go about obtaining a chunk of storage? An executable file usually contains information on the initial amount of storage needed by a program when it is loaded. Calls to the heap manager are another way of obtaining storage. Again we need to know where to write the appropriate information in the executable file.

What minimum amount of storage might be expected to be available? A program executing within a stack/heap based memory model has a default amount of storage allocated for the stack (a minimum of 16k I believe under Mac OS X or iPhone OS). A program could treat the stack as its storage. Ideally what we need is the ability to access storage via an offset from the stack pointer, at worse we would have to adjust the stack pointer to the appropriate offset, pop the value into a register and then reset the stack pointer; storing would involve a push.

Having performed some calculation we probably want to communicate one or more values to the outside world. A call to a library routine, such as printf, needs information on the parameter passing conventions (e.g., which parameters get stored in which registers or have storage allocated for them {a function returning a structure type usually has the necessary storage allocated by the calling function with the address passed as an extra parameter}) and the location of any return value. If ABI information is not available a bit of lateral thinking might be needed to come up with an alternative output method.

I have not said anything about making use of signals and exception handling. These are hard enough to get right when documentation is available. The Turing machine theory folk usually skip over these real-world issues and I will join them for the time being.

Secret instruction sets about to make a come-back

January 28, 2010 4 comments

Now it has happened Apple’s launch of a new processor, the A4, seems such an obvious thing to do. As the manufacturer of propriety products Apple wants to have complete control over the software that customers run on those products. Using a processor whose instruction set and electrical signals are not publicly available goes a very long way to ensuring that somebody else’s BIOS/OS/drivers/etc do not replace Apple’s or that the distributed software is not ‘usefully’ patched (Apple have yet to reveal their intentions on publishing instruction set details, this is an off-the-cuff prediction on my part).

Why are Apple funding the development of the LLVM C/C++ compiler? Because it enables them to write a back-end for the A4 without revealing the instruction set (using gcc for this purpose would require that the source be distributed, revealing the instruction set). So much for my prediction that Apple will stop funding LLVM.

The landscape of compute intensive cpus used to be populated with a wide variety of offerings from many manufacturers. The very high price of riding the crest of Moore’s law is one of the reasons that Intel’s x86 acquired such a huge chunk of this market; one by one processor companies decided not to make the investment needed to keep their products competitive. Now that the applicability of Moore’s law is drawing to an end the treadmill of continued processor performance upgrades is starting to fading away. Apple are not looking at a processor upgrade cycle, that they need to keep up with to be competitive, that stretches very far into the future.

Isn’t keeping an instruction set secret commercial suicide? The way to sell hardware is to make sure that lots of software runs on it and software developers want instruction set details (well ok, a small number do, but I’m sure lots of others get a warm fuzzy feeling knowing that the information is available should they need it). This is very much a last century view. The world is awash with an enormous variety of software, including lot of it very useful open source, and there is a lot more information available about the set of applications many customers want to use most of the time. Other than existing practice there is little reason why a manufacturer of a proprietary product, that is not a processor, needs to release instruction set details.

In the early 1980s I was at the technical launch of the Inmos Transputer and was told that the instruction set would not be published because the company did not want to be tied down to backwards compatibility. Perish the thought that customers would write assembler because the Inmos supplied compilers produced poor quality code or that a third party compiler would good enough to gain a significant market share. In marketing ability Inmos was the polar opposite of Apple. For a while HP were not very forthcoming on the instruction set behind their PA-RISC processor.

Will the A4 instruction set remain secret? On the whole, perhaps it can. The software based approaches to reverse engineering an instruction set require access to a compiler that generates it. For instance, changing a plus to a minus in one expression and looking for the small change in the generated executable; figuring out which status flags are set under which conditions is harder, but given time it can be done. Why would Apple release executable compilers if it wants to control the software that runs on the A4?

If the instruction set were cracked where would a developer go to obtain a compiler targeting the A4?

Given the FSF‘s history with Apple I would not be surprised if there was a fatwa against support for a proprietary instruction set in gcc. I suspect Apple would frown very heavily on such support ever being in the standard llvm distribution. I could see Apple’s lawyers sending letters to anybody making available a compiler that generated A4 code.

In the past manufacturers have tried to keep processor instruction sets secret and found that commercial reality required them to make this information freely available. Perhaps in the long term, like Moore’s law, the publication of detailed instruction set information may just be a passing phase.

Assessing my predictions for 2009

January 24, 2010 No comments

I have been rather remiss in revisiting the predictions I made for 2009 to see how they fared. Only two out of the six predictions were sufficiently precise to enable an assessment one year later; the other four talking more about trends. What of these two predictions?

The LLVM project will die. Ok, the project is still going and at the end of December the compiler could build itself but the build is not yet in a state to self host (i.e., llvm compiler creates an executable of itself from its own source and this executable can build an executable from source that is identical to itself {modulo date/time stamps etc}). Self hosting is a major event and on some of the projects I have worked on it has been a contractual payment milestone.

Is llvm competition for gcc? While there might not be much commercial competition as such (would Apple be providing funding to gcc if it were not involved in llvm?), I’m sure that developers working on both projects want their respective compiler to be the better one. According to some llvm benchmarks they compile code twice as fast as gcc. If this performance difference holds up across lots of source how might the gcc folk respond? Is gcc compiler time performance an issue that developers care about or is quality of generated code more important? For me the latter is more important and I have not been able to find a reliable, recent, performance comparison. I understand that almost all gcc funding is targeted at code generation related development, so if compile time is an issue gcc may be in a hole.

I still don’t see the logic behind Apple funding this project and continue to think that this funding will be withdrawn sometime, perhaps I was just a year off. I do wish the llvm developers well and look forward to them celebrating a self hosted compiler.

Static analysis will go mainstream. Ok, I was overly optimistic here. There do not seem to have been any casualties in 2009 among the commercial vendors selling static analysis tools and the growth of PHP has resulted in a number of companies selling tools that scan for source security vulnerabilities (.e.g, SQL injection attacks).

I was hoping that the availability of Treehydra would spark the development of static analysis plugins for gcc. Perhaps the neatness of the idea blinded me to what I already knew; many developers dislike the warnings generated by the -Wall option and therefore might be expected to dislike any related kind of warning. One important usability issue is that Treehydra operates on the abstract syntax tree representation used internally by gcc, GIMPLE. Learning about this representation requires a big investment before a plugin developer becomes productive.

One tool that did experience a lot of growth in 2009 was Coccinelle, at least judged by the traffic on its mailing list. The Coccinelle developers continue to improve it and are responsive the questions. Don’t be put off by the low version number (currently 0.2), it is much better than most tools with larger version numbers.

Why does Intel sell compilers?

January 5, 2010 No comments

Intel is a commercial company and the obvious answer to the question of why it sells compilers is: to make money. Does a company that makes billions of dollars in profits really have any interest in making a few million, I’m guessing, from selling compilers? Of course not, Intel’s interest in compilers is as a means of helping them sell more hardware.

How can a compiler be used to increase computer hardware sales? One possibility is improved performance and another is customer perception of improved performance. My company’s first product was a code optimizer and I was always surprised by the number of customers who bought the product without ever performing any before/after timing benchmarks; I learned that engineers are seduced by the image of performance and only a few are ever forced to get involved in measuring it (having been backed into a corner because what they currently have is not good enough).

Intel are not the only company selling x86 chips, AMD and VIA have their own Intel compatible x86 chips. Intel compatible? Doesn’t that mean that programs compiled using the Intel compiler will execute just as quickly on the equivalent chip sold by competitors? Technically the answer is no, but the performance differences are likely to be small in most cases. However, I’m sure there are many developers who have been seduced by Intel’s marketing machine into believing that they need to purchase x86 chips from Intel to make sure they receive this ‘worthwhile’ benefit.

Where do manufacturer performance differences, for the same sequence of instructions, come from? They are caused by the often substantial internal architectural difference between the processors sold by different manufacturers, also Intel and its competitors are continually introducing new processor architectures and processors from the same company will have differences performance characteristics. It is possible for an optimizer to make use of information on different processor characteristics to tune the machine code generated for a particular high-level language construct, with the developer selecting the desired optimization target processor via compiler switches.

Optimizing code for one particular processor architecture is a specialist market. But let’s not forget all those customers who will be seduced by the image of performance and ignore details such as their programs being executed on a wide variety of architectures.

The quality of a compiler’s runtime library can have a significant impact on a program’s performance. The x86 instruction set has grown over time and large performance gains can be had by a library function that adapts to the instructions available on the processor it is currently executing on. The CPUID instruction provides all of the necessary information.

As well as providing information on the kind of processor and its architectural features the CPUID instruction can return information about the claimed manufacturer of the chip (some manufacturers provide a mechanism that allows users to change the character sequence returned by this instruction).

The behavior of some of the functions in Intel’s runtime library depends on the
character sequence returned by the CPUID instruction, producing better performance for the sequence “GenuineIntel”. The US Federal Trade Commission have filed a complaint alleging that this is anti-competitive (more details) and requested that this manufacturer dependency be removed.

I think that removing this manufacturer dependency will have little impact on sales. Any Intel compiler user who is not targeting Intel chips and who is has a real interest in performance can patch the runtime library, the Supercomputer crowd will want to talk to the kind of sophisticated processor/compiler engineers that Intel makes available and for everybody else it is about the perception of performance. In fact Intel ought to agree to a ‘manufacturer free’ runtime library pronto before too many developers have their delusions shattered.

The changing shape of code in the next decade

December 29, 2009 No comments

I think there are two forces that will have a major impact on the shape of code in the next decade:

  • Asian developers. China and India each have a population that is more than twice as large as Europe and the US combined, and software development has been kick-started in these countries by a significant amount of IT out sourcing. I have one comparative data point for software developers who might be of the hacker ilk. A discussion of my C book on a Chinese blog resulted in a download volume that was 50% of the size of the one that occurred when the book appeared as a news item on Slashdot.
  • Scripting languages. Software is written to solve a problem, and there are only so many packaged applications (COTS or bespoke) that can profitably be supported. Scripting languages are generally designed to operate within one application domain, e.g., Bash, numerical analysis languages such as R and graphical plotting languages such as gnuplot.

While markup languages are very widely used they tend to be read and written by programs not people.

Having to read code containing non-alphabetic characters is always a shock the first time. Simply having to compare two sequences of symbols for equality is hard work. My first experience of having to do this in real time was checking train station names once I had travelled outside central Tokyo and the names were no longer also given in Romaji.

其中,ul分别是bootmap_size(bit map的size),start_pfn(开始的页框)
                                max_low_pfn(被内核直接映射的最后一个页框的页框号) ;

Developers based in China and India have many different cultural conventions compared to the West (and each other) and I suspect that these will affect the code they write (my favorite potential effect involves treating time vertically rather than horizontally). Many coding conventions used by a given programming language community exist because of the habits adopted by early users of that language, these being passed on to subsequent users. How many Chinese and Indian developers are being taught to use these conventions, are the influential teachers spreading different conventions? I don’t have a problem with different conventions being adopted, other than that having different communities using different conventions increases the cost for one community to adopt another community’s source.

Programs written in a scripting language tend to be much shorter (often being contained within a single file) and make use of much more application knowledge than programs written in general purpose languages. Their data flow tends to be relatively simple (e.g., some values are read/calculated and passed to a function that has some external effect), while the relative complexity of the control flow seems to depend on the language (I only have a few data points for both assertions).

Because of their specialized nature, most scripting languages will not have enough users to support any kind of third party support tool market, e.g., testing tools. Does this mean that programs written in a scripting language will contain proportionally more faults? Perhaps their small size means that only a small number of execution paths are possible, and these are quickly exercised by everyday usage (I don’t know of any research on this topic).

The Met Office ‘climategate’ Perl code

December 25, 2009 3 comments

In response to the Climategate goings on the UK Meteorological Office has released a subset of its land surface climate station records and some code to process it. The code consists of 397 lines of Perl (station_gridder.perl and pretty printer and kind of implies more than one person doing the editing. And why are some variables names capitalized and other not (the names in subroutine read_station are all lower case, while the names in the surrounding subroutines are mostly upper case)? More than one author is the simplest answer.

One Perl usage caught my eye, the construct unless is rarely used and often recommended against. Without a lot more code being available for analysis there are no obviously conclusions to draw from this usage (apart from it being an indicator of somebody who knows Perl well, most mainstream languages do not support this construct and developers have to use a ‘positive’ construct containing a negated condition rather than a ‘negative’ construct containing a positive condition).

Compiler writing in the next decade

December 22, 2009 No comments

What will be the big issues in compiler writing in the next decade? Compilers sit between languages and hardware, with the hardware side usually providing the economic incentive.

Before we set off to follow the money, what about the side that developers prefer to talk about. The last decade has not seen any convergence to a very small number of commonly used languages, if anything there seems to have been a divergence with more languages in widespread use. I will not attempt to predict whether there will be a new (in the sense of previously limited to a few research projects) concept that is widely integrated and used in many languages (i.e., the integrating of object-oriented features into languages in the 90s).

Where is hardware going?

  • Moore’s law stops being followed. Moore’s law is an economic one that has a number of technical consequences (e.g., less power consumed and until recently increasing clock rates). Will the x86 architecture evolution dramatically slow down once processor manufacturers are no longer able to cram more transistors onto the same amount of chip real estate? Perhaps processor designers will start looking to compiler writers to suggest functionality that could be made use of by compilers to generate more optimal code. To date, my experience of processor designers is that they look to Moore’s law to get a ‘free’ performance boost.

    There are a number of things a compiler code tell the processor, such as when a read or write to a cache line is the last one that will occur for a long time (enabling that line to be moved to the top of the reuse list).

  • Not plugged into the mains. When I made a living writing optimizers, the only two optimizations choices were code size and performance. There are a surprising number of functional areas in which a compiler, given processor support, can potentially generate code that consumes less power. More on this issue in a later post.
  • More than one processor. Figuring out how to distribute a program across multiple, loosely coupled, processors remains a difficult research topic. If anybody ever comes up with a solution to this problem, it might make more commercial sense for them to keep it secret, selling a compiling service rather than selling compilers.
  • Application Specific Instruction-set Processors. Most processors in embedded systems only ever run a single program. The idea of each program being executed on a processor optimized to its requirements sounds seductive. At the moment the economics are such that it is cheaper to take an existing, very low cost, processor and shoe-horn the application onto it. If the economics change, the compiler used for each processor is likely to be automatically generated.

Enough of the hardware, this site is supposed to be about code:

  • New implementation techniques. These include GLR parsing and genetic algorithms to improve the generated code quality. The general availability of development machines containing more than 4G of memory will make it worthwhile for compiler writers to implement more whole program optimizations (which are currently being hemmed in by storage limits)
  • gcc will continue its rise to world domination. The main force at work here is not the quality of gcc, but the disappearance of the competition. Compiler writing is not a big bucks business, and compiler companies are regularly bought up by much larger hardware outfits looking to gain some edge. A few years go by, plans change, the compiler group are not making enough profit to warrant the time being spent on them by upper management, and they are closed down. One less compiler vendor and a bunch of developers are forced to migrate to another compiler, which may or may not be gcc.
  • Figuring out what the developer meant to write based on what they actually wrote, and some mental model of software developers, is my own research interest. This is somewhat leading edge stuff, in other words, nothing major has been achieved so far. Knowledge of developer intent looks like it will open the door to whole classes of new optimization techniques.

Parsing Fortran 95

December 20, 2009 No comments

I have been looking at doing some dimensional analysis of the Climategate code and so needed a Fortran parser.

The last time I used Fortran in anger the modern compilers were claiming conformance to the 1977 standard and since then we have had Fortran 90 (with a minor revision in 95) and Fortran 03. I decided to take the opportunity to learn something about the new features by writing a Fortran parser that did not require a symbol table.

The Eli project had a Fortran 90 grammar that was close to having a form acceptable to bison and a few hours editing and debugging got me a grammar containing 6 shift/reduce conflicts and 1 reduce/reduce conflict. These conflicts looked like they could all be handled using glr parsing. The grammar contained 922 productions, somewhat large, but I was only interested in actively making use of parts of it.

For my lexer I planned to cut and paste an existing C/C++/Java lexer I have used for many projects. Now this sounds like a fundamental mistake, these languages treat whitespace as being significant while Fortran does not. This important difference is illustrated by the well known situation where a Fortran lexer needs to lookahead in the character stream to decide whether the next token is the keyword do or the identifier do5i (if 1 is followed by a comma it must be a keyword):

      do 5 i = 1 , 10
      do 5 i = 1 . 10        ! assign 1.10 to do5i
5     continue

In my experience developers don’t break up literals or identifier names with whitespace, and so I planned to mostly ignore the whitespace issue (it would simplify things if some adjacent keywords were merged to create a single keyword).

In Fortran the I/O is specified in the language syntax, while in C like languages it is a runtime library call involving a string whose contents are interpreted at runtime. I decided to ignore I/O statements by skipping to the end of line (Fortran is line oriented).

Then the number of keywords hit me, around 190. Even with the simplifications, I had made writing a Fortran lexer look like it would be a lot of work; some of the keywords only had this status when followed by a = and I kept uncovering new issues. Cutting and pasting somebody else’s lexer would probably also involve a lot of work.

I went back and looked at some of the Fortran front ends I had found on the Internet. The GNU Fortran front-end was a huge beast and would need serious cutting back to be of use. moware was written in Fortran and used the traditional six character abbreviated names seen in ‘old-style’ Fortran source and not a lot of commenting. The Eli project seemed a lot more interested in the formalism side of things, and Fortran was just one of the languages they claimed to support.

The Open Fortran Parser looked very interesting. It was designed to be used as a parsing skeleton that could be used to produce tools that processed source, and already contained hooks that output diagnostic output when each language production was reduced during a parse. Tests showed that it did a good job of parsing the source I had, although there was one vendor extension used quite often (and not documented in their manual). The tool source, in Java, looked straightforward to follow, and it was obvious where my code needed to be added. This tool was exactly what I needed 🙂