Various authors of the guidebooks for writing reports (Bowden, 2002; Silverman et al., 2005; Smith, 1997) are unanimous that the Introduction is one of the most important parts of the research. As Bowden (1999) puts it, the Introduction sets the reader’s expectations and outlines the primary issues that will be covered by the report. Turley (2000) assumes that the Introduction and Conclusion are the parts of the report that are read through, whereas the other sections of the report might receive much less attention. A good introduction should comprise the following information:
- Definition of the key notions that will be covered throughout the report;
- Major set objectives and means of meeting them;
- Brief coverage of the key areas of focus and the methodology that will be employed for the report purposes;
- Reasons for choosing a particular company, industry or business area;
- In cases where the company/industry/area is predefined, the introduction should state the reasons for choosing particular models to study the issue of concern;
- The report sequence;
- Key assumptions. The assumptions are especially important in case the information from a case study or from accessible sources is not comprehensive and fails to cover all the dimensions of the area being studied.
When writing the report, one of the key challenges that arises from the very start is the access to publicly available information. This issue is especially acute when it comes to private companies that do not make their operations data publicly available. Therefore, if there is a possibility that a company has to be selected, it is highly recommended to focus on a public entity. Once a company or an industry is specified, a good starting point would be the review of annual reports of the selected company or of the key industry players. Sections such as the CEO statement and profit & loss data might provide highly valuable information about the sources of competitive advantage, key performance,criteria, industry trends and the impact of environmental factors. This data, when used along with the reports of market intelligence firms (Mintel, Datamonitor, Keynotes) is very likely to contribute to the effective usage of various models such as PESTLE, SWOT, Five Forces, Porter’s Diamond Model and OLE to name a few.
Building the discussion sequence
Defining a robust structure is one of the most critical stages in writing a good report. What is important is to remember that the structure of the report plays the role of mapping the overall discussion into a single entity. That is why it is very important to group similar sections or argument points (see Fry, 2000). It will then be easier for the reader to track the pace of analysis.
It is also important to define the role of the different sections in the report. Bowden suggests building a discussion or an analysis around the key areas and most significant findings that have a critical impact on the progress of the issue of concern. According to the principle of Pareto, of the different forces, only 20% produce 80% of the overall impact, whereas the effect produced by the remaining 80% of factors, if considered separately, is minor. Therefore, according to Bowden (2002) the critical forces should become the basis for the skeletal framework of the whole report, whereas the minor forces might be put into appendices. The order of headings and subheadings should be consistent with the proposed analytical statements. To assess the fit of the developed structure, Smith (1997) suggests using the following selection/exclusion criteria:
- Necessity test: Is each component necessary? For example: Is the section “Background information” necessary? The possible answer is “Yes” because it outlines the key facts about the company/industry/area which is under investigation. Is the section “key assumptions” necessary? The possible answer is “No”, as the key assumptions are typically included in the Introduction. In this case this section should be removed from the skeletal framework since it would serve no useful purpose.
- Inclusion test: Considering all the report requirements and learning outcomes, are all appropriate items included?
- Exclusion test: Considering all the report requirements and learning outcomes, are all appropriate items included?
- Hierarchy test: Are the issues in the sections hierarchically parallel? As mentioned above, in this case the use of the Pareto principle could be very effective. The key factors might be reflected by separate sections, whereas minor factors might be grouped into a single section.
Usage of critics and comparative arguments
What makes a good piece of research, is the weighted use of critics. The use of the arguments of different academics and industry experts indicates the student’s depth of theoretical knowledge of the research area. Turley (2000) claims that the failure to interact with the ideas proposed by other people makes reports look introspective and incestuous. However, the mere use of secondary arguments, without synthesising them in a logical manner is futile. Turley (2000) suggests that an author needs to be in control over the usage of the ideas of other people, not letting them put the author’s arguments in the shade. Smith (1997) puts forward the following ways of controlling critics:
- Using critical arguments to support the argument put forward by an author;
- Using critical arguments to contrast the ideas put forward by the author against the stance of other academics; and
- Using critical arguments to prepare the basis for the introduction of new ideas or coming to a particular inference.
The similar rules might be applied to the usage of statistical data or findings from case studies. Certain case studies might be used to indicate the positive consequences of the recommended course of action, whereas other cases might be used to indicate the negative outcomes if the area/strategy is not improved. According to Schrecengost (2002) the different use of data might be a very powerful explanatory tool when used with analytical models. The model predicts the possible outcomes, whereas the cases and data ground these predictions.
Use of statistical data, financial information and marketing data
When writing reports about a strategic course of action of a company or when analysing the development of a particular industry, an author is highly recommended to use a wide range of statistical data. When using statistical or marketing data it is important to be consistent with the sequence of the analysis and the way it explains the point which an author is trying to elaborate.
What needs to be taken into consideration is the load of data and its digestibility by a potential reader. A lot of students fall into a trap of assuming that tutors have time and enough efforts to read between the lines, or undertake additional guesswork into what a student intended to write. In reality, the process of reading the report is very likely to be the following: an average tutor scans the report through the keywords and then reads it through. Bearing in mind that usually tutors have to read 10 to 15 papers in one go, one should not expect that his/her paper will receive preferential treatment. So, to convince a tutor that his/her report provides comprehensive coverage of the research area, an author is highly recommended to use visual tools like tables, graphs and figures. These tools enable the potential reader to develop a perceptual map of what an author is trying to explain. Additionally, it can be an excellent “springboard” and linking chain for comparative analysis. The guidebooks (see reference list, available in Netlibrary) provide explicit examples of using various presentation tools.
It is important to consider the likely effect on potential readers of using graphical tools. To assess the fit of graphical tools they might be tested against the following criteria:
- The ease of comprehending the presented data. If left unexplained, certain tools might add complexity rather then simplicity;
- Graphical load – it is important to limit the number of tools used because excessive use might create a negative impression; and
- Attractive appearance – the report’s appearance might win up to 10% of the overall mark.
When presenting market trends, financial development, ratios, progress, etc the use of graphical tools might be the only way to maintain a report’s consistency and the strength of its explanatory power. Besides, figures and tables are more likely to attract attention and be retained in the tutor’s memory.
Report weaknesses and limitations
Bowden (1999) notes that the author’s recognition of a report’s limitations is a strong winning factor. He claims that “experienced writers routinely acknowledge ‘weaknesses’ in their argument because they know that, paradoxically, this helps to authorize their views”. In other words, the recognition of the limitations demonstrates that the author undertakes the comprehensive approach and seeks to explore the wide range of dimensions related to the area of concern.
According to Bowden (2002) and Smith (1997) the value of the Conclusion should not be underestimated. It might contribute between 5% and 10% to the overall assignment rating. Provided there is likelihood that a tutor might omit critical points of the report, the Conclusion could be used to highlight essential findings. It makes it easier for the tutor to navigate across the report and process the report information. According to Smith (1997), the Conclusion should:
- Pull all findings together;
- Indicate the way the set objectives were met;
- Summarise the key inferences; and
- Indicate areas of further research.
Bowden, J., (1999), Writing Good Reports, Oxford How To Books, Ltd.
Bowden, J., (2002), Writing a Report: How to Prepare, Write and Present Effective Reports, Oxford How To Books, Ltd.
Fry, Ronald. W. (2000), Improve Your Writing, Franklin Lakes, NJ TheCareer Press Schrecengost, M. (2002), Researching Issues, FortAtkinson: Wis. Highsmith, Inc.
Silverman, J., Hughes, E., Wienbroer, D. R. (2005), Shortcuts for the Student Writer, New York: McGraw-Hill Professional,
Smith, P. (1997), Writing an Assignment: How to Improve Your Research and Presentation Skills, Oxford, U.K. How to Books, Ltd.
Turley, R. M. (2000), Writing Essays: A Guide for Students in English and the Humanities, New York: Routledge
Copyright © Insta Research Ltd. All rights reserved. All forms of copying, distribution or reproduction are strictly prohibited and will be prosecuted to the Full Extent of Law.
Parallel Computer Systems
Blocked Matrix Multiply
Tutor: Krister Dackland (email@example.com)
Nico, C95 (firstname.lastname@example.org)
Selander, C93 (email@example.com)
17th September 1998
An implementation of a matrix multiplication subroutine, optimized for use on a IBM RS/6000 25T, is presented. The optimizations includes matrix multiplication with scalar temporary variable, cache blocking, register blocking by loop unrolling, block copying and function inlining.
Acquisition and user guidance
This report covers an assignment on implementing a fast matrix multiplication subroutine on the IBM RS/6000 25T. Aside from a variety of compiler optimizations, we have used matrix multiplication with scalar temporary variables, cache blocking, register blocking by loop unrolling, block copying and function inlining to improve performance.
This is a compulsory exercise in the course Parallel Computer Systems (Parallelldatorsystem), given by the Department of Computing Science at Ume� University in 1998.
Acquisition and user guidance
The source-code files are located in the directory on the computer system of the Department of Computing Science. The files are:
| Contains our highly tuned matrix multiplication subroutine: |
The given test program. It is most advisable to use a raw version of the test program. However, since we are to achieve more than 30 MFLOPS on matrices larger than 500 times 500, we have modified it to include some huge matrices as the original program never goes beyond 256 in size.
The Makefile is quite ordinary. The only thing worth mentioning is the compiler options: . That is, aggressive optimizations, inlining where possible, no use of , target architecture Power PC, target processor 601, do inter-procedural analysis and unroll loops.
To perform a test, simply run . First some random matrices are multiplied and the results are checked to be correct. Then the performance for some even quad-sizes matrices and some arbitrary matrices are measured. The output is on the form .
Example:anaris ~/src/pds/lab1>mm_contest Checking for correctness on sizes: 238 65 31 78 109 39 181 184 180 248 Checking quad-word aligned sizes 16 52.390698 32 58.099291 64 48.415417 128 39.099444 256 42.182715 512 41.943040 1024 41.690616 Checking arbitrary sizes 23 52.614054 43 47.545714 61 49.667615 79 48.423473 99 47.870539 119 49.772629 151 50.209702 255 38.118103 257 36.864861 501 49.802575 633 49.106706
The figure below pretty much shows how our program works:
This figure depicts the cache blocking. The three outermost loops steps with block-size and thus determine the darker sub-matrices that the innermost loops performs the actual matrix multiplication on. This way, each C sub-matrix is used only once and the sub-matrices of A and B is visited less than in the case of the na�ve matrix multiplication algorithm. We have used a block-size of 32. This size was chosen after lots of empirical testing of different sizes around the size suggested by the formula in [IBM93] (see the section Cache Blocking below).
This figure also depicts the data copying. This is used to boost the performance of the otherwise problematic matrices even multiples of 32 in size. Each A and B sub-matrices are copied into a temporary array that also has room for the resulting C sub-matrix (the C sub-matrix is not copied into the array, just zeroed). The blocks of this temporary array is then fed to the same general function that handles all other sizes of matrices. After the current C sub-matrix is completely computed, it is copied into the corresponding place in the C matrix. See also the section Data Copying below.
We have used the same block-size on all levels of the program (the register blocking is achieved by loop unrolling to a depth of four, not by explicit loops). However, our code supports one block-size for the cache blocking in the general function and another for the data-copying. That is, they are determined by two separate defines in the source code, BLOCKSIZE and BSIZEPRIM.
We have been a bit troubled over how to present the algorithm we have used. We first tried to follow the examples found in the references, but that quickly turned into something more like the source code than a presentation of an algorithm. Instead we chose to cover each optimization trick independent of the others. Hopefully, this will make this report a lot more comprehensible.
Common Sense Optimizations
This is a small, but very important section. The optimizations of this section is not found in fancy papers on fast computations. Instead, they are sometimes found in the curriculum of good and thorough educational programmes, and most often only after long experience of programming. We are referring to small details like moving invariants out of loops, always precalculate loop conditions and such. Easily overlooked, they can give you that extra MFLOP you were looking for. ;-)
By replacing the expression C[i, j] with a scalar temporary variable the memory traffic may be reduced since the scalar is placed in a register and hopefully stays there.
Example:for i := 0 to N for j := 0 to N s := C[i, j] for k := 0 to N s += A[i, k] * B[k, j] end C[i, j] := s end end
The above example is a bit simple but carefully used together with loop unrolling you can achieve register blocking (good register reuse) which can give good performance, especially for small matrices.
In the innermost main-loop, we use 16 scalar temporary variables to hold values for a 4 by 4 sub-matrix of C (we unroll both i and j with a depth of 4). Whenever i or j is not divisible by 4 special cases arise were we use less registers (in the case of both i and j being indivisible by 4, our code resembles that of the example above).
The scheme of cache blocking is to keep blocks of the matrices in the cache for as long as possible, to maximize cache reuse and minimize memory traffic.
Example:# Cache blocking: for ib := 0 to N by BLOCKSIZE for jb := 0 to N by BLOCKSIZE for kb := 0 to N by BLOCKSIZE # Computations local to cache for i := ib to MIN(ib + BLOCKSIZE, N) for i := ib to MIN(ib + BLOCKSIZE, N) for i := ib to MIN(ib + BLOCKSIZE, N) C[i, j] += A[i, k] * B[k, j] end end end end end end
[IBM93] uses this formula to compute the block-size:
in our case these figures apply:
However, some of the cache should be saved for access of scalars. Empirically, we found 34 to be a good block-size and used for some time but now we get the best overall performance with a block-size of 32. We have also lost about one MFLOP of our peak performance and we cannot find out why. :-( (Now we think we have figured it out - see the trailing comment in the test section.)
Register blocking follows the same principle as cache blocking, except that the register is much smaller than the cache. The scheme is to maximize register reuse and minimize memory traffic.
There are two ways to achieve register blocking. You use loops as in cache blocking and a block-size small enough to fit a mini-block in the registers. You can also unroll the inner loops of the matrix multiplication and load scalar temporary variables by hand to get register blocking.
We have used unrolling to a depth of four and 16 temporary doubles to store a 4 by 4 sub-matrix of C. That is, we have used half of the 32 8-byte floating point registers of the IBM RS/6000 25T. We also tried to use an unrolling depth of 5 and 25 temporary doubles, but this did not give any extra performance.
By unrolling loops you can minimize loop overhead and create larger basics blocks for the compiler optimizer to work on. Example:for i := ... for j := 0 to BLOCKSIZE by DEPTH s0 := C[i, j ] s1 := C[i, j + 1] s2 := C[i, j + 2] s3 := C[i, j + 3] for k := ... ... end C[i, j ] := s0 C[i, j + 1] := s1 C[i, j + 2] := s2 C[i, j + 3] := s3 end end
This is a one-dimensional loop unrolling. We have used a two-dimensional unrolling. That is, we have also unrolled the outer i loop and used 16 temporary doubles to store a 4 by 4 sub-matrix of C. We have tried to also unroll the innermost k loop but this did not give any extra performance. Probably the compiler does a better job of unrolling innermost loops.
For some matrices ordinary cache blocking do not work so well. The case might be that the elements of the different matrices are stored in memory such that they map onto the same cache sets. This can cause performance to drop considerably. Another problem is that you do not get good cache line reuse when whole cache lines are exchanged between cache hits. One solution to these problems is to copy chunks of large matrices into temporary arrays that will fit into the cache. Carefully used, the overhead of block copying can be less than the performance drop of cache set interference.
We have found that the performance of our program drops for even multiples of 32 larger than 64, but for 32 by 32 matrices we get our peak performance. Therefore we use a block copying scheme in these cases were we copy 32 by 32 blocks out of A and B into a 3 * 32 * 32 temporary array. The remaining room is zeroed and reserved for the computed C matrix. Then we call our general function (redundantly inlined, of course) with local A, B and C pointing to parts of this array.
Example:for ib := 0 to N by 32 for jb := 0 to N by 32 zero_local_C for kb := 0 to N by 32 copy_block_of_A copy_block_of_B Call_general_function end copy_local_C_back_to_block_of_C end end
Invocation of sub-functions costs. They introduce a delay as current environment and parameters are pushed onto the stack, local variables created and flow of control is shifted to the new piece of code. Therefore you can gain performance when inlining functions. For small functions that do not call other functions, this can be done by the compiler. For more complex functions, you have to do it by hand. For the prize of very non-beautified code you can get a little extra performance.
Testsanaris ~/src/pds/lab1>mm_contest Checking for correctness on sizes: 238 65 31 78 109 39 181 184 180 248 Checking quad-word aligned sizes 16 52.390698 32 58.099291 64 48.415417 128 39.099444 256 42.182715 512 41.943040 1024 41.690616 Checking arbitrary sizes 23 52.614054 43 47.545714 61 49.667615 79 48.423473 99 47.870539 119 49.772629 151 50.209702 255 38.118103 257 36.864861 501 49.802575 633 49.106706
All tests are performed on the machine . Since this workstation is residing in the master thesis lab, our fellow students in the course have not found it and thus we have had it for ourselves! ;-)
Since the time of a floating point operation depends on the actual values of the operands, since these are randomized in the test program and since the current time is used as seed to the random-function of the test program, the output of the test vary up to one MFLOPS depending on when the test is run and which values happens to be used in the A and B matrices. This can really make a serious student feeling very sick as earlier achieved peak performances suddenly are impossible to duplicate...
- Many thanks to hamlet and Psycho (Fredrik Augustsson and Tomas Halvarsson) for pointing out to us our far from optimal wrapper condition.
- Thanks to the ma446-community for providing a nutritious environment for creative assignment solving.
- [IBM93] IBM manuals, Optimization and Tuning Guide for Fortran, C, and C++ (SC09-1705-00), IBM 1993
- [LAM91] Lam, M S, Rothberg, E E, Wolf, M E, The Cache Performance and Optimizations of Blocked Algorithms , Stanford University, 1991
This assignment was made with , , and .
I think we better leave the source out, don't you think?