Tips and Tricks
updated
Sunday, February 13, 2011 11:17 PM
Normal Quantile Plots in Excel (reference contributed by Rachael Goldenberg)
Making Histograms with Excel for the PC - by Robert Winters (added Feb 13, 2011)
(yes, it seems long, but it seems to work and cover all the bases)
There are (at least) two ways to do this, depending on whether your version has the Data Analysis toolpack (if so it will appear under the Tools menu). Recent versions of Excel inexplicably no longer have this, though this may change.
If you do have the Data Analysis toolpak, then simply list your data in a vertical array and your classes (bins) in another vertical array (making sure the values are evenly spaced). Then do the following:
1) Place the cursor in a free cell separate from your data and bins. Then from the top menu choose Tools/Data Analysis/Histogram. A dialog box will open.
2) In this dialog box, place the cursor in the "Input Range" box, then drag/select the data array. When you let go, the range of cells will appear in the "Input Range" box. If you included the label (field name), make sure you check the "Labels" option.
3) Now do the same things in the "Bin Range" box choosing the cells specifying the bin range.
4) For "Output Range", you can just select the cell you want at the upper left corner of the output.
5) Click OK and you should get two columns - one with the bins and the other with the number of data values in each bin (between the specified values. An additional row is automatically added giving the number of values, if any, above the highest value specified in your bins.
Now for the chart:
6) Highlight these two columns of data. You can include the last "More" line if you wish, but you probably will want to exclude it.
7) Now choose the Chart Wizard (or go to the "Insert" tab in more recent versions. Choose the simple column style bar graph option. A dialog box should appear with tabs for "Data Range" and "Series". A preview of your graph should appear. Don't worry if it looks wrong since we're about to fix it.
8) If your graph shows two sets of bars, probably with different colors, choose the "Series" tab, then click on the series name for your bins and click "Remove". This should leave you with just the frequencies. You should also then click in the box for "Category (X) axis labels" and the highlight the data array of your bins. This should insert these values along the horizontal axis. Then click "Next".
9) You should now see a dialog box with about 6 tabs for such things as "Titles", "Axes", "Gridlines", "Legend", "Data Labels", and "Data Table". Try some of the options such as typing in a Chart Title and putting names on the horizontal and vertical axes. You may wish to uncheck the "Show legend" in the "Legend" tab if it's unnecessary (which it most likely is). Then click "Next".
10) Choose whether you want your graph to appear in the current window or in its own sheet, then click "Finish".
11) Next, you'll probably want to change the formatting of your histogram. If you prefer to not have the background color, right-click anywhere in the background and choose "Format Plot Area..." Here you can play around with the border or click "None" under "Area" to eliminate the background. The click "OK".
12) If you don't want the excessive spacing between the bars in the bar graph that appear by default, right click on one of the bars and choose "Format Data Series...". Under "Patterns" you can change the color of the bars. Under "Options", you can change the "Gap Width" to a low value (such as 2 or 5) or 0. Many people prefer to leave a narrow gap between the bars of the histogram to distinguish the bars.
Now, if your version of Excel does not have the Data Analysis option, you can get the same results in steps 1 to 5 using the built in FREQUENCY function in Excel. The steps are as follows:
1) To the immediate right of your bins array (labeled at the top with perhaps "bins" or "classes"), select an array of cells to the right of the bin values (not including the label) with one extra cell at the bottom. For example, if you had 10 bins located in cells D2 through D11, you should select cells E2 through E12. The extra cell is for values greater than your bins. You may or may not need this.
2) With these cells still highlighted, click on the function fx button (or however you access the statistical functions) and navigate your way to the FREQUENCY function and click OK in the dialog box. You'll probably get an error message, but don't worry about it. The area you highlighted should still be highlighted, so place your cursor in the entry box between the parentheses of =FREQUENCY(). Alternatively, you can just type this in.
3) Now select the data values (not including the label) and let go, type a comma, then select the bin values (not including the label) and let go. You should see something like this: =FREQUENCY(A2:A30,D2:D8), though your cells will be different.
4) Finally, simultaneously press Control-Shift-Enter. The frequencies should appear in the area you originally selected. If you click in any of the cells of this output you should see something like: {=FREQUENCY(A2:A30,D2:D8)}. The curly brackets that have appeared indicate the fact that the whole range is governed by this function and individual cells cannot be edited (don't try it!).
5) Then use your bins column and this newly created frequencies column to make your bar graph as described in steps 6 through 12 above.
It's unfortunate that this is so indirect and requires so many steps, but it should work. There may be some simplification or alternatives. - RW
Making Histograms with Excel 2008 for the Mac - by Sohini Pillai
After playing around with Excel 2008 on a Mac for a bit I was able to make histograms. I am an including an explanation of how I made them.
1) Enter your data into one column of excel and enter the bins (classes) you want into the column next to it.
2) In the first cell of the column next to your "bins" column use the FREQUENCY function to enter your data array (all the numbers in your data column) and your bin array (all the numbers in your bin column).
3) A number will come up in this cell. Then highlight that cell and the number of cells that is equal to the number of bins you have below it.
4) Press CONTROL+U, and then press COMMAND+Z+RETURN. The frequencies you need should come up in the entire column.
5) Highlight the "bins" column and the "frequencies" columns and go to INSERT and select CHART and chose a COLUMN chart which is essentially a bar graph.
6) On your chart two overlapping bar graphs will show up. It should be obvious which bar graph is the histogram and you can delete the other wrong bar graph.
7) The numbers on your X axis most likely wont correspond correctly to your bins but you can fix that by clicking the X axis and pressing CONTROL and then selecting SELECT DATA. Then you can highlight the values in your "bins" column and make them the values of your X axis.
8) If you want to get rid of the spaces between the bars of your histogram you can just double click the bars and then select OPTIONS and make the gap width equal to 0.
This system worked very well for me and hopefully it can help other people in our class too.
Sohini Pillai, Class of 2012
Random Sampling of Large Datasets
What do you do when you have a very large text file with many records?
Often you can open the text file in a spreadsheet program like Excel (or its OpenOffice equivalent Calc) and get the data fields to open nicely in separate columns. Exactly how this is done depends on whether your data is "tab-separated", "comma-separated", or something else. However, most versions of spreadsheet programs have a limit to the number of records (rows or lines) they can handle (mine stops at 65536 rows). So what do you do when you have 71,076 records plus the first row with field names?
The ideal approach is to open the file in a database program such as Access (or its OpenOffice equivalent Base). In either Access or Base, the easiest way to do this is to use the "Open With ..." option in Windows or just try dragging the file icon to the program icon on a Mac (which hopefully will work in much the same way). You'll likely have to go through a brief dialog specifying whether your data is tab-separated, comma-separated, or something else, and you may be asked if the first row of the file has the field names, but you will then link to the original file and you won't have to repeat this protocol. Once you have the data in the database program, you can easily (and very quickly) select any record determined by your randomization scheme. [By the way, it's always a good idea to save a copy of your original data file using a different name - just in case you accidentally (or intentionally) edit the data.]
If you don't want to go the database software route, you can actually select the records within Microsoft Word, though it may be a bit less efficient than doing it as described above. First, open the text file in Word (either with the "Open With ..." option or by dragging the file icon to the Word program icon). You may have to answer a quick question, but just click on OK and you'll be fine. You'll then be looking at a very long text file spanning many pages. The software is capable of recognizing that the file may actually be a data file. To select specific records, choose View/Toolbars/Database from the menu. A toolbar should appear with several buttons. The first one is likely called "Data Form" and can be used to find individual records. At the bottom of the Data Form window, there's a bar that displays what record is shown in the window. You can type in the number of the record you want, then it "Enter," and the data should appear in the form. (This is very fast on a PC, but slower on a Mac.) On newer versions of Mac Excel, there are other, quicker options in the Data Form for finding a specific record.
To summarize, the best way to do this is in a database program, but Excel is good for datasets that are not excessively large and you can also do it with Word by using the Data Form option.
Normal Quantile Plot in Excel
A "normal quantile plot" also goes by the name of "normal probability plot" in DDXL. It's explained in detail on pages 80-83 of the text. Basically, it's this:
1) order the data in increasing order.
2) Determine the percentiles of each value in the list, i.e. what percentage of the way down the list is this value. For example, the median would be at the 50% mark.
3) Determine the corresponding z-values for these percentages from Table A (or using the Excel function NORMINV).
4) Graph (as is scatterplot) the original data values vs. the z-values you just found from the percentiles.
If the data were normally distributed, the values on the graph would all lie on a straight line. How close they are to a straight line is a measure of how close to normally distributed the data is. This is very simple to do in DDXL (Charts and Plots/Normal Probability Plot), but you can also do it using the Excel chart options by following the outline above.
The 68-95-99.7 Rule
In case this wasn't given sufficient air time in class, this rule basically says that if data of one quantitative variable is normally distributed, then approximately 68% of all values will lie within one standard deviation of the mean, approximately 95% of all values will lie within two standard deviations of the mean, and approximately 99.7% of all values will lie within three standard deviations of the mean. The actual values are 68.27%-95.45%-99.73%. Of course most actual data is at best only approximately normal, so there's not that much need for the extra precision.
You can easily derive this rule from Table A by taking the difference of the percentages associated with the appropriate z-values. Specifically, for z = +1 and z = –1, the difference of the corresponding percentages is 0.8413 – 0.1587 = 0.6826, or approximately 68%. For z = +2 and z = –2, the difference of the corresponding percentages is 0.9772 – 0.0228 = 0.9544, or approximately 95%. For z = +3 and z = –3, the difference of the corresponding percentages is 0.9987 – 0.0013 = 0.9974, or approximately 99.7%.
Open Office
You might try downloading and installing OpenOffice, a free alternative to Word, Excel, etc. The current version is OpenOffice 3.0.1 and if you try their spreadsheet program (called Calc), you'll likely find that it's very similar to Excel in almost every way. I tried doing a scatterplot with data from two separate categories (as in Exercise 2.8) and found it to be simpler than Excel with a lot more control of the output. You can't use the DDXL add-in with OpenOffice (though I won't be surprised if someone comes up with an extension that does some it what DDXL does) and there's no separate "Data Analysis" menu, but it has all of the statistical functions and graphing functions.
Making a Scatterplot with Data from Two Different Categories - Melanie Kaplan
Steps for Exercise 2.8
1) Copy the data and paste it somewhere else on the spread sheet, so you can delete the outlier
2) make a graph of the T's (without highlight the T column) and manually re-label the series (you do that by going to select data and then editing the label)
3) make a separate graph of the M's (without highlight the M column) and manually re-label the series
4) copy the second-now-labeled graph onto the first one
5) you did it !!
Got a tip or trick for your fellow Math 101 students? Let me know and I'll (check it out and then) post it here.