Homework 2,3,4,5

Take a working system (written in LUA) and write it in any other language.

A Small Task

The task is to write some code that reads a CSV file and generates summaries of its columns (median and standard deviation for numeric columns; mode and entropy for symbolic columns).

E.g. when run on our test file, this code reports the following (see the test case eg.stats):

xmid	{:Clndrs 4.0 :Model 76.0 :Volume 146.0 :origin 1.0}
xdiv	{:Clndrs 1.55 :Model 3.876 :Volume 100.775 :origin 0.775}
ymid	{:Acc+ 15.5 :Lbs- 2800.0 :Mpg+ 20.0}
ydiv	{:Acc+ 2.713 :Lbs- 887.209 :Mpg+ 7.752}

The task is very simple (lots of very small moving parts).

The real goal here is to get your group to practice working together.
You have five people in your group. Time to divide and conquer.

Note: write as much as you can from scratch. So, in Python, no Pandas or scikit-learn or argparse or optparse (or their equivalents in other languages).

| Homework | Task |
|----------|------|
| HW2 | Get this going for the `Num` and `Sym` classes (below) and the test cases `the`, `sym`, `num`, `bignum`. |
| HW3 | Get this going for the `Cols`, `Row`, `Data` classes and the test cases `eg.csv, eg.data, eg.stats`. |
| HW4 | Add all the bling from HW1. Also, add post-commit hooks to auto run all the test cases, the code coverage checks (if your language supports them), and the documentation generators. For more on what kinds of documentation are acceptable, see the web site from lecture1. |
| HW5 | For five other groups from cs510 (picked at random), apply the Project1 rubric. Important note: whatever scores you offer, these will not change the other group's scores. |

Where to submit:

All submissions will be a repo url, posted to Moodle.
If you have outputs, or the text file with the rubrics, add them to /docs/out (e.g. as out1.txt).

Resources

Example implementation: source code
quick tutorial on LUA.
Sample input csv file

First steps

Read the quick tutorial on LUA (but you only have to read it, not write it)
- Three ideas cover much of LUA
  - In LUA, names are global by default (unless marked with local).
    - Or unless you add them as extra variables in the function header.
  - LUA is a language with only one associative array data structure
    - If the keys are all consecutive integers, then these are simple lists (with indexes 1..n)
      - Contents are accessed via list[1].
    - If the keys are symbols, then we use javascript/python style access:
      - Contents are accessed via list.slotName
  - LUA is a minimal "batteries not included" language. So the code given to you starts with a bunch of functions that are built-in to other languages.
Read the source code. It is written in LUA. You must not write in LUA. Note that:

Column1 of that pdf has a header string showing help, from which I build the settings object `the`.
Column2 of that pdf = some misc utils
Column3 of that pdf = classes/methods.
Last column of that pdf = test cases for this system
Second-last column of that pdf = code that exits to the operating system with the number of test failures
- No failures = return zero.
HINT: when reading strange source code:
- FIRST look for the data structures (see column2)
- SECOND look at the tests (last column)
- THIRD look at the details.

Find some way to divide the functionality across many small files

e.g. one file per class
e.g. anything to do with command-line options goes into its own file
e.g. misc utilities into its own file.
e.g. test cases in a separate file

Theory

The function Y=F(X) computes dependent variables Y from independent variables X. Some variables are Numeric (which we denote with a leading upper case letter) and some are Symbolic. Some dependent variables need to be minimized (denoted with a trailing - sign) and others need to be maximized (denoted with a trailing +).

We say a CSV file contains lots of X,Y examples. Line one of the file names each column, and each name shows that column's type. E.g. in our test file, the dependent variables are columns 4, 5 and 8.

Clndrs,Volume,Hp:,Lbs-,Acc+,Model,origin,Mpg+
8,304.0,193,4732,18.5,70,1,10
8,360,215,4615,14,70,1,10
8,307,200,4376,15,70,1,10
8,318,210,4382,13.5,70,1,10
8,429,208,4633,11,72,1,10

Note also the ":" header on Hp: (above). This denotes a column to skip.
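
The naming rules above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: the function name `parse_header` and the dictionary keys are hypothetical, and the weight `w` (-1 for minimize, 1 otherwise) is an assumption modelled on the `:w` slot in the sample output.

```python
def parse_header(names):
    """Classify each column name using the naming rules described above."""
    cols = []
    for at, name in enumerate(names, start=1):  # 1-based column positions
        last = name[-1]
        cols.append({
            "at": at,
            "name": name,
            "is_num": name[0].isupper(),    # leading upper case = numeric
            "skip": last == ":",            # trailing ":" = column to skip
            "is_goal": last in "+-",        # trailing "+" or "-" = dependent variable
            "w": -1 if last == "-" else 1,  # assumed: -1 = minimize, 1 = maximize
        })
    return cols


header = "Clndrs,Volume,Hp:,Lbs-,Acc+,Model,origin,Mpg+".split(",")
cols = parse_header(header)
ys = [c["at"] for c in cols if c["is_goal"]]
xs = [c["name"] for c in cols if not c["is_goal"] and not c["skip"]]
print(ys)  # [4, 5, 8]
print(xs)  # ['Clndrs', 'Volume', 'Model', 'origin']
```

Keeping the classification in one place means the rest of the code never has to look at the punctuation in a column name again.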

You have to report the middle and the diversity of each non-skipped column.

For numbers:

mid = median = sort numbers seen so far, return the middle value
div = standard deviation = sort numbers, find 90th, 10th percentile, return (90th-10th)/2.56
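- Why? Well, you know that plus or minus 1 or 2 standard deviations captures 68 to 95% of the mass of a normal distribution. Somewhere between 1 and 2 is the point below which 90% of the mass falls: that point is 1.28 standard deviations above the mean. By symmetry, the 10th percentile is 1.28 standard deviations below the mean, so the gap between the 90th and 10th percentiles spans 2*1.28 = 2.56 standard deviations.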
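
As a rough sketch (not the reference code), the numeric summaries above might look like this in Python. It also caps storage with reservoir sampling, as the Functionality section describes for Num. The slot names, the percentile indexing convention, and the defaults `cap=512` and `seed=10019` (borrowed from the settings shown in the sample output) are assumptions made for illustration.

```python
import random


class Num:
    """Summarizes a stream of numbers, keeping at most `cap` of them."""

    def __init__(self, cap=512, seed=10019):
        self.n = 0             # how many numbers have been seen
        self.has = []          # the numbers currently kept
        self.cap = cap         # maximum number of values to keep
        self.is_sorted = True
        self.rand = random.Random(seed)

    def add(self, x):
        self.n += 1
        if len(self.has) < self.cap:                  # room left: just keep it
            self.has.append(x)
            self.is_sorted = False
        elif self.rand.random() < self.cap / self.n:  # else keep with prob cap/n
            self.has[self.rand.randrange(self.cap)] = x  # replacing a random old value
            self.is_sorted = False

    def nums(self):
        """Return the kept numbers, sorting only when needed."""
        if not self.is_sorted:
            self.has.sort()
            self.is_sorted = True
        return self.has

    def per(self, p):
        """Value at percentile p (0..1) of the kept numbers; assumes at least one."""
        a = self.nums()
        return a[min(len(a) - 1, int(p * len(a)))]

    def mid(self):
        return self.per(0.5)

    def div(self):
        return (self.per(0.9) - self.per(0.1)) / 2.56


num = Num()
for x in range(1, 101):
    num.add(x)
print(num.mid(), num.div())  # 51 31.25
```

With a different indexing convention (for example 1-based lists), the middle value and the percentiles can shift by one position, so small differences from other implementations are expected.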

For symbols:

mid = mode = most common symbol
div = entropy = for symbols occurring at probability p1,p2,... then
entropy= ∑ -p_i * log2(p_i).
- Why? Well, entropy can be viewed as the effort required to recreate a signal.
- If a signal has parts that occur at probabilities p1,p2,...
- then the probability that we need to search for part i is, wait for it, p_i.
- And the effort to find that part is log2(p_i) (assuming a binary chop).
- So the expected search effort for part i is -p_i * log2(p_i) (the minus sign keeps the result positive, since log2 of a probability below 1 is negative), and entropy sums this over all the parts.
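
The symbolic summaries can be sketched the same way. This is a minimal illustration under assumed names (`Sym`, `has`, `add`); the seven symbols in the usage lines are made up for the example.

```python
import math


class Sym:
    """Summarizes a stream of symbols by counting how often each one occurs."""

    def __init__(self):
        self.n = 0     # how many symbols have been seen
        self.has = {}  # symbol -> count

    def add(self, x):
        self.n += 1
        self.has[x] = self.has.get(x, 0) + 1

    def mid(self):
        """Mode: the most common symbol."""
        return max(self.has, key=self.has.get)

    def div(self):
        """Entropy: sum of -p * log2(p) over the observed symbols."""
        return -sum(c / self.n * math.log2(c / self.n)
                    for c in self.has.values())


sym = Sym()
for x in ["a", "a", "a", "a", "b", "b", "c"]:
    sym.add(x)
print(sym.mid(), round(sym.div(), 3))  # a 1.379
```

Only symbols that were actually seen appear in `has`, so the sum never has to evaluate log2(0).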

Functionality

Your code must support the following (and outside of these bounds, feel free to NOT do things like csv):

Five classes (or more): Data, Cols, Sym, Num, Row. For notes on these, see column2 of the source code.
- Note that Num is a reservoir sampler;
  i.e. it keeps N numbers and if you see more than N, some old number is replaced at random.
A help string that can be displayed with -h.
A cli function that can update the slots of `the` from the command line
- e.g. if `the` is {name="Tim", age=20} then -n X and -a X can update name and age.
A csv function that reads each line of a text file, divides on some operator (e.g. comma), removes leading and trailing white space, then coerces the cells to ints, floats, booleans or (failing the rest) to strings.
A set of demo tasks that can be called with -e task (and -e ALL runs all the tasks).
- The tests of the source code (and perhaps your own, if you want).
- On our system, -e ALL returns the following (but [YMMV](https://dictionary.cambridge.org/us/dictionary/english/ymmv)).
  - The following output is in alphabetical order of the tests. A better ordering would be simpler to more complex. To see that ordering, look at the last column of the source code.
  - Note that our tests can handle a crash (see "BAD"). To actually debug that one, you need the stack dump info that is hidden by the usual test output. So we have a


-----------------------------------
!!!!!!	CRASH	BAD	false

-----------------------------------
!!!!!!	FAIL	LIST	true

-----------------------------------

Examples lua csv -e ...
	ALL
	BAD
	LIST
	LS
	bignum
	csv
	data
	num
	stats
	sym
	the
!!!!!!	PASS	LS	true

-----------------------------------
{1 28 49 50 56 85 86 156 208 237 294 327 444 459 461 485 490 503 
 546 618 653 712 723 727 770 801 849 915 928 941 967 987}
!!!!!!	PASS	bignum	true

-----------------------------------
{Clndrs Volume Hp: Lbs- Acc+ Model origin Mpg+}
{8 304 193 4732 18.5 70 1 10}
{8 360 215 4615 14 70 1 10}
{8 307 200 4376 15 70 1 10}
{8 318 210 4382 13.5 70 1 10}
{8 429 208 4633 11 72 1 10}
{8 400 150 4997 14 73 1 10}
{8 350 180 3664 11 73 1 10}
{8 383 180 4955 11.5 71 1 10}
{8 350 160 4456 13.5 72 1 10}
!!!!!!	PASS	csv	true

-----------------------------------
{:at 4 :hi 5140 :isSorted false :lo 1613 :n 398 :name Lbs- :w -1}
{:at 5 :hi 24.8 :isSorted false :lo 8 :n 398 :name Acc+ :w 1}
{:at 8 :hi 50 :isSorted false :lo 10 :n 398 :name Mpg+ :w 1}
!!!!!!	PASS	data	true

-----------------------------------
50	31.007751937984
!!!!!!	PASS	num	true

-----------------------------------
xmid	{:Clndrs 4.0 :Model 76.0 :Volume 146.0 :origin 1.0}
xdiv	{:Clndrs 1.55 :Model 3.876 :Volume 100.775 :origin 0.775}
ymid	{:Acc+ 15.5 :Lbs- 2800.0 :Mpg+ 20.0}
ydiv	{:Acc+ 2.713 :Lbs- 887.209 :Mpg+ 7.752}
!!!!!!	PASS	stats	true

-----------------------------------
{:div 1.378 :mid a}
!!!!!!	PASS	sym	true

-----------------------------------
{:dump false :eg ALL :file ../data/auto93.csv :help false :nums 512 :seed 10019 :seperator ,}
!!!!!!	PASS	the	true
!!!!!!	PASS	ALL	true
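
The csv and cli functions listed under Functionality share one helper: something that coerces a string to the most specific type it can. The following is one possible sketch, not the reference code. The names `coerce`, `csv` and `cli` follow the text, but their signatures are assumptions, as is the choice to flip boolean settings when their flag appears.

```python
import sys


def coerce(s):
    """Turn a string into a bool, int or float; failing those, keep the string."""
    s = s.strip()
    if s == "true":
        return True
    if s == "false":
        return False
    for fn in (int, float):
        try:
            return fn(s)
        except ValueError:
            pass
    return s


def csv(path, sep=","):
    """Yield each non-empty line of a text file as a list of coerced cells."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield [coerce(cell) for cell in line.split(sep)]


def cli(the, args=None):
    """Update settings from flags built from each key's first letter (-n, -a, ...)."""
    args = sys.argv[1:] if args is None else args
    for key, old in the.items():
        flag = "-" + key[0]
        for i, arg in enumerate(args):
            if arg == flag:
                if isinstance(old, bool):
                    the[key] = not old          # assumed: boolean flags take no value
                elif i + 1 < len(args):
                    the[key] = coerce(args[i + 1])
    return the


the = {"name": "Tim", "age": 20}
print(cli(the, ["-n", "Jane", "-a", "21"]))  # {'name': 'Jane', 'age': 21}
print([coerce(x) for x in "8,304.0,Hp:,true".split(",")])  # [8, 304.0, 'Hp:', True]
```

Two settings whose names start with the same letter would collide under this scheme, so a fuller version needs some rule for telling them apart.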

Checklist

The task is done when we can see in your repo:

Your repo shows a record of 5 people doing (approx) equal work.
Something that can display a help string and where internal settings can be updated from the command line.
Something that can read csv files whose first line has ":+-" and upper and lower case in column names.
Output like the above (and YMMV)
For hw2, you need classes and methods for Sym, Num
For hw2, you need to show output from eg.the, eg.sym, eg.num, eg.bignum
For hw3, you need classes and methods also for Data, Cols, Row
For hw3, you need to show output from eg.csv, eg.data, eg.stats
For hw3, you need a test engine that can call one, or all, tests and which returns the number of failed tests to the operating system (so returning zero means "no errors")
- the test engine should be able to handle crashing tests (and keep running tests after the crash). For Python people, see try: ... except:.
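
A test engine of the kind asked for above could be sketched as follows. Everything here is illustrative: the names `run_one`, `run_all`, `show_trace` and the three demo tests are hypothetical, and the convention that a test fails by returning False is an assumption.

```python
import traceback


def run_one(name, fn, show_trace=False):
    """Run one test; report PASS, FAIL or CRASH; return True only on a pass."""
    try:
        ok = fn() is not False         # assumed: a test fails by returning False
        status = "PASS" if ok else "FAIL"
    except Exception:
        ok, status = False, "CRASH"    # a crash counts as a failure
        if show_trace:
            traceback.print_exc()      # show the stack dump when debugging
    print(f"!!!!!!\t{status}\t{name}")
    return ok


def run_all(egs, show_trace=False):
    """Run every test in alphabetical order; return the number of failures."""
    fails = 0
    for name in sorted(egs):
        if not run_one(name, egs[name], show_trace):
            fails += 1
    return fails


def eg_ok():
    return 1 + 1 == 2


def eg_fail():
    return 1 + 1 == 3


def eg_bad():
    return {}["missing"]  # raises KeyError, to show that a crash is survived


fails = run_all({"ok": eg_ok, "fail": eg_fail, "bad": eg_bad})
print(fails)  # 2
```

In a real script the last step would be `sys.exit(fails)`, so that the operating system sees zero only when nothing failed.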
