 home   |   syllabus   |   groups   |   moodle   |   video   |   review   |   © 2022


Homework 2,3,4,5

Take a working system (written in LUA) and write it in any other language.

A Small Task

The task is to write some code to read a CSV file and generate summaries of its columns (median and standard deviation for numeric columns; mode and entropy for symbolic columns).

E.g. when run on our test file, this code reports the following (see the test case eg.stats):

xmid	{:Clndrs 4.0 :Model 76.0 :Volume 146.0 :origin 1.0}
xdiv	{:Clndrs 1.55 :Model 3.876 :Volume 100.775 :origin 0.775}
ymid	{:Acc+ 15.5 :Lbs- 2800.0 :Mpg+ 20.0}
ydiv	{:Acc+ 2.713 :Lbs- 887.209 :Mpg+ 7.752}

The task is very simple (lots of very small moving parts).

  • The real goal here is to get your group to practice working together.
  • You have five people in your group. Time to divide and conquer.

Note: write as much as you can from scratch. So, in Python, no Pandas or scikit-learn or argparse or optparse (or their equivalents in other languages).

Homework Task
HW2 Get this going for the Num and Sym classes (below) and the test cases the, sym, num, bignum.
HW3 Get this going for the Cols, Row, Data classes and the test cases eg.csv, eg.data, eg.stats.
HW4 Add all the bling from HW1. Also, add post-commit hooks to auto run all the test cases, the code coverage checks (if your language supports them), and the documentation generators. For more on what kinds of documentation are acceptable, see the web site from lecture1.
HW5 For five other groups from cs510 (picked at random), apply the Project1 rubric. Important note: whatever scores you offer, these will not change the other group's scores.

Where to submit:

  • All submissions will be a repo url, posted to Moodle.
  • If you have outputs, or the text file with the rubrics, add them (e.g. as out1.txt) to /docs/out.

Resources

  1. Example implementation: source code
  2. quick tutorial on LUA.
  3. Sample input csv file

First steps

  1. Read the quick tutorial on LUA (but you only have to read it, not write it)
    • Three ideas cover much of LUA
      • In LUA, names are global by default (unless marked with local).
        • Or, to make a name local, you can add it as an extra variable in the function header.
      • LUA is a language with only one associative array data structure
        • If the keys are all consecutive integers, then these are simple lists (with indexes 1..n)
          • Contents are accessed via list[1].
        • If the keys are symbols, then we use javascript/python style access:
          • Contents are accessed via list.slotName
      • LUA is a minimal "batteries not included" language. So the code given to you starts with a bunch of functions that are built-in to other languages.
  2. Read the source code. It is written in LUA. You must not write in LUA. Note that:
  • Column1 of that pdf has a header string showing help, from which I build the settings object (called the).
  • Column2 of that pdf = some misc utils
  • Column3 of that pdf = classes/methods.
  • Last column of that pdf = test cases for this system
  • Second last column of that pdf = code that exits to the operating system with the number of test failures
    • No failures = return zero.
  • HINT: when reading strange source code:
    • FIRST look for the data structures (see column2)
    • SECOND look at the tests (last column)
    • THIRD look at the details.
  3. Find some way to divide the functionality across many small files
  • e.g. one file per class
  • e.g. anything to do with command-line options goes into its own file
  • e.g. misc utilities into its own file.
  • e.g. test cases in a separate file

Theory

The function Y=F(X) computes dependent variables Y from independent variables X. Some variables are Numeric (which we denote with a leading upper case letter) and some are Symbolic. Some dependent variables need to be minimized (denoted with a trailing - sign) and others need to be maximized (denoted with a trailing +).

We say a CSV file contains lots of X,Y examples. Line one of the file names each column and shows its type. E.g. in our test file, the dependent variables are columns 4, 5 and 8.

Clndrs,Volume,Hp:,Lbs-,Acc+,Model,origin,Mpg+
8,304.0,193,4732,18.5,70,1,10
8,360,215,4615,14,70,1,10
8,307,200,4376,15,70,1,10
8,318,210,4382,13.5,70,1,10
8,429,208,4633,11,72,1,10

Note also the trailing ":" on the header Hp: (above). This denotes a column to skip.
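The naming rules above can be sketched in Python as follows. This is only an illustration, not the reference implementation: the function name describe and the keys it returns are made up for this example.

```python
def describe(name: str) -> dict:
    """Classify one CSV column header using the naming rules above."""
    return {
        "name": name,
        "is_num": name[0].isupper(),                # leading upper case = numeric
        "skip": name.endswith(":"),                 # trailing ':' = ignore this column
        "is_goal": name[-1] in "+-",                # trailing '+' or '-' = dependent (Y)
        "weight": -1 if name.endswith("-") else 1,  # -1 = minimize, 1 = maximize
    }


header = "Clndrs,Volume,Hp:,Lbs-,Acc+,Model,origin,Mpg+".split(",")
cols = [describe(name) for name in header]

# Report the dependent columns using 1-based positions.
goals = [i + 1 for i, col in enumerate(cols) if col["is_goal"]]
print(goals)  # [4, 5, 8]
```

For non-goal columns the weight is never used, so leaving it at 1 is harmless.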

You have to report middle and diversity of each non-skipped column.

For numbers:

  • mid = median = sort numbers seen so far, return the middle value
  • div = standard deviation = sort numbers, find 90th, 10th percentile, return (90th-10th)/2.56
    • why? well, for a normal distribution, plus or minus 1 or 2 standard deviations captures about 68% or 95% of the mass. The 90th percentile sits 1.28 standard deviations above the mean and the 10th percentile sits 1.28 below it, so the gap between them spans 2*1.28=2.56 standard deviations.
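Here is a small Python sketch of mid and div for numbers, written from the description above. The helper per and its rounding rule (round to the nearest rank, then clamp to the list) are assumptions of this sketch; your own percentile convention may differ slightly.

```python
def per(sorted_xs, p=0.5):
    """Return the value roughly p of the way through an already-sorted list."""
    n = len(sorted_xs)
    i = int(p * n + 0.5) - 1                 # nearest rank, shifted to 0-based
    return sorted_xs[max(0, min(n - 1, i))]  # clamp so tiny lists still work


def mid(xs):
    """Median: sort, then take the middle value."""
    return per(sorted(xs), 0.5)


def div(xs):
    """Standard deviation estimate: (90th - 10th percentile) / 2.56."""
    s = sorted(xs)
    return (per(s, 0.9) - per(s, 0.1)) / 2.56


nums = list(range(1, 101))
print(mid(nums), div(nums))  # 50 31.25
```

Sorting on every call is wasteful. A real Num class would probably sort once and remember that it did (note the isSorted slot in the eg.data output).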

For symbols:

  • mid = mode = most common symbol
  • div = entropy = for symbols occurring at probability p1,p2,... then
    entropy= ∑ -pi * log2(pi).
    • Why? well, entropy can be viewed as the effort required to recreate a signal.
    • If a signal has parts that occur at probability p1,p2,...
    • then the probability that we need to search for part i is, wait for it, pi.
    • And the effort to find that part is log2(pi) (assuming a binary chop)
    • So the expected search effort sums pi * log2(pi) over all the parts (and the minus sign is added, by convention, since log2 of a probability is never positive).
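The two symbolic summaries can be sketched in a few lines of Python. The function names here are invented for this example, and the counts are kept in a plain dictionary to stay in the "from scratch" spirit of the task.

```python
import math


def counts_of(symbols):
    """Count how often each symbol occurs."""
    counts = {}
    for s in symbols:
        counts[s] = counts.get(s, 0) + 1
    return counts


def mode(symbols):
    """mid for symbols: the most common symbol."""
    counts = counts_of(symbols)
    return max(counts, key=counts.get)


def entropy(symbols):
    """div for symbols: the sum of -p * log2(p) over each symbol's probability p."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts_of(symbols).values())


syms = ["a", "a", "a", "a", "b", "b", "c"]
print(mode(syms), round(entropy(syms), 3))  # a 1.379
```

If every symbol is the same, the entropy is zero (there is nothing to search for).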

Functionality

Your code must support the following (and outside of these bounds, feel free to NOT do things like csv):

  • Five classes (or more): Data, Cols, Sym, Num, Row. For notes on these, see column2 of the source code.
    • Note that Num is a reservoir sampler;
      i.e. it keeps N numbers and if you see more than N, some old number is replaced at random.
  • A help string that can be displayed with -h.
  • A cli function that can update the settings object the (i.e. its slots) from the command-line
    • e.g. if the is {name="Tim", age=20} then -n X and -a X can update name and age.
  • A csv function that reads each line of a text file, divides on some operator (e.g. comma), removes leading and trailing white space, then coerces the cells to ints, floats, booleans or (failing the rest) to strings.
  • A set of demo tasks that can be called with -e task (and -e ALL runs all tests).
    • The tests from the source code (and perhaps your own, if you want).
    • On our system, -e ALL returns the following (but YMMV (https://dictionary.cambridge.org/us/dictionary/english/ymmv)).
      • The following output is in alphabetical order of the tests. A better ordering would be simpler to more complex. To see that ordering, look at the last column of the source code.
      • Note that our tests can handle a crash (see "BAD"). To actually debug that one, you need the stack dump info that is hidden by the usual test output. So we have a

-----------------------------------
!!!!!!	CRASH	BAD	false

-----------------------------------
!!!!!!	FAIL	LIST	true

-----------------------------------

Examples lua csv -e ...
	ALL
	BAD
	LIST
	LS
	bignum
	csv
	data
	num
	stats
	sym
	the
!!!!!!	PASS	LS	true

-----------------------------------
{1 28 49 50 56 85 86 156 208 237 294 327 444 459 461 485 490 503 
 546 618 653 712 723 727 770 801 849 915 928 941 967 987}
!!!!!!	PASS	bignum	true

-----------------------------------
{Clndrs Volume Hp: Lbs- Acc+ Model origin Mpg+}
{8 304 193 4732 18.5 70 1 10}
{8 360 215 4615 14 70 1 10}
{8 307 200 4376 15 70 1 10}
{8 318 210 4382 13.5 70 1 10}
{8 429 208 4633 11 72 1 10}
{8 400 150 4997 14 73 1 10}
{8 350 180 3664 11 73 1 10}
{8 383 180 4955 11.5 71 1 10}
{8 350 160 4456 13.5 72 1 10}
!!!!!!	PASS	csv	true

-----------------------------------
{:at 4 :hi 5140 :isSorted false :lo 1613 :n 398 :name Lbs- :w -1}
{:at 5 :hi 24.8 :isSorted false :lo 8 :n 398 :name Acc+ :w 1}
{:at 8 :hi 50 :isSorted false :lo 10 :n 398 :name Mpg+ :w 1}
!!!!!!	PASS	data	true

-----------------------------------
50	31.007751937984
!!!!!!	PASS	num	true

-----------------------------------
xmid	{:Clndrs 4.0 :Model 76.0 :Volume 146.0 :origin 1.0}
xdiv	{:Clndrs 1.55 :Model 3.876 :Volume 100.775 :origin 0.775}
ymid	{:Acc+ 15.5 :Lbs- 2800.0 :Mpg+ 20.0}
ydiv	{:Acc+ 2.713 :Lbs- 887.209 :Mpg+ 7.752}
!!!!!!	PASS	stats	true

-----------------------------------
{:div 1.378 :mid a}
!!!!!!	PASS	sym	true

-----------------------------------
{:dump false :eg ALL :file ../data/auto93.csv :help false :nums 512 :seed 10019 :seperator ,}
!!!!!!	PASS	the	true
!!!!!!	PASS	ALL	true
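The csv and cli requirements listed in this section could be sketched in Python like this. It is a simplified illustration, not a port of the LUA source: the names coerce, cells and cli are assumptions of this sketch, flags are matched on the first letter of each setting, and flags without a following value are ignored.

```python
def coerce(s):
    """Turn a string into an int, float, bool or (failing those) a trimmed string."""
    s = s.strip()
    if s in ("true", "false"):
        return s == "true"
    for cast in (int, float):
        try:
            return cast(s)
        except ValueError:
            pass
    return s


def cells(line, sep=","):
    """Split one line of text on sep and coerce every cell."""
    return [coerce(cell) for cell in line.split(sep)]


def cli(the, argv):
    """Update the settings in `the` from flags like -n X (first letter of a key)."""
    for key in the:
        flag = "-" + key[0]
        if flag in argv:
            i = argv.index(flag)
            if i + 1 < len(argv):        # only update when a value follows the flag
                the[key] = coerce(argv[i + 1])
    return the


print(cells(" 8, 304.0 ,Hp:,true"))  # [8, 304.0, 'Hp:', True]
print(cli({"name": "Tim", "age": 20}, ["-n", "Jane", "-a", "21"]))
```

In a real script, argv would be sys.argv[1:], and two settings that share a first letter would need a longer flag.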

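The reservoir sampling mentioned for Num can be sketched as below. This is one common way to do it, offered as an assumption rather than as what the LUA source does; the class name Reservoir and its slots are hypothetical.

```python
import random


class Reservoir:
    """Keep at most `cap` numbers; once full, arrivals may replace a random old one."""

    def __init__(self, cap=512):
        self.cap = cap    # maximum number of values to keep
        self.n = 0        # how many values have been seen in total
        self.kept = []    # the current sample

    def add(self, x):
        self.n += 1
        if len(self.kept) < self.cap:
            self.kept.append(x)                          # still room: always keep
        elif random.random() < self.cap / self.n:
            self.kept[random.randrange(self.cap)] = x    # overwrite a random slot


random.seed(1)
sample = Reservoir(cap=32)
for i in range(1000):
    sample.add(i)
print(len(sample.kept), sample.n)  # 32 1000
```

Keeping a new value with probability cap/n means that, in the long run, every value seen has about the same chance of being in the sample.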
Checklist

The task is done when we can see in your repo:

  • Your repo shows a record of 5 people doing (approx) equal work.
  • Something that can display a help string and where internal settings can be updated from the command-line.
  • Something that can read csv files whose first line has ":+-" and upper and lower case in column names.
  • Output like the above (and YMMV)
  • For hw2, you need classes and methods for Sym, Num
  • For hw2, you need to show output from eg.the, eg.sym, eg.num, eg.bignum
  • For hw3, you need classes and methods also for Data, Cols, Row
  • For hw3, you need to show output from eg.csv, eg.data, eg.stats
  • For hw3, you need a test engine that can call one, or all, tests and which returns the number of failed tests to the operating system (so returning zero means "no errors")
    • the test engine should be able to handle crashing tests (and keep running tests after the crash). For Python people, see try: ... except:.
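A test engine of the kind described in the last two checklist items might look like this in Python. Treat it as a sketch: the function run, its dump parameter and the two toy tests are all made up for illustration.

```python
import traceback


def run(tests, which="ALL", dump=False):
    """Run one named test (or all of them) and return the number of failures."""
    fails = 0
    for name in sorted(tests):
        if which not in ("ALL", name):
            continue
        try:
            status = "PASS" if tests[name]() else "FAIL"
        except Exception:
            status = "CRASH"             # a crashing test counts as a failure...
            if dump:
                traceback.print_exc()    # ...and can show its stack if asked
        if status != "PASS":
            fails += 1
        print("!!!!!!", status, name, sep="\t")
    return fails


def eg_ok():
    return 1 + 1 == 2


def eg_bad():
    return {}["missing"]    # deliberately crashes with a KeyError


tests = {"ok": eg_ok, "bad": eg_bad}
fails = run(tests)
print(fails)  # 1
```

In a real script the last step would be sys.exit(fails) (after import sys), so that the operating system sees zero only when every test passed.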