Docs for data oriented schema (#290)
* Started doc on schemas
* First draft of schema doc
* Proofread data-oriented schema doc
* Made type-shorthand more consistent
1 parent 80f7835, commit bd71983
Showing 5 changed files with 78 additions and 31 deletions.
@@ -0,0 +1,57 @@
# Creating Spark Schemas

Schema creation is typically required for [manual Dataset creation](manual_dataset_creation.md) and for finer control when loading a Dataset from a file.

One way to create a Spark schema is to use the part of the Geni API that closely mimics the original Scala Spark API, built on **Spark DataTypes**. That is, the following Scala version:

```scala
import org.apache.spark.sql.types._

StructType(Array(
  StructField("a", IntegerType, true),
  StructField("b", StringType, true),
  StructField("c", ArrayType(ShortType, true), true),
  StructField("d", MapType(StringType, IntegerType, true), true),
  StructField(
    "e",
    StructType(Array(
      StructField("x", FloatType, true),
      StructField("y", DoubleType, true)
    )),
    true
  )
))
```

gets translated into:

```clojure
(require '[zero-one.geni.core :as g])

(g/struct-type
  (g/struct-field :a :int true)
  (g/struct-field :b :str true)
  (g/struct-field :c (g/array-type :short true) true)
  (g/struct-field :d (g/map-type :str :int) true)
  (g/struct-field :e
    (g/struct-type
      (g/struct-field :x :float true)
      ;; :double, to match DoubleType in the Scala version
      (g/struct-field :y :double true))
    true))
```

Whilst the Clojure version may look cleaner than the original Scala version, Geni offers an even more concise way to specify complex schemas such as the example above and to cut through the boilerplate. In particular, we can use Geni's **data-oriented schemas**:

```clojure
{:a :int
 :b :str
 :c [:short]
 :d [:str :int]
 :e {:x :float :y :double}}
```

The conversion rules are simple:

* all fields and types default to nullable;
* a vector of count one is interpreted as an `ArrayType`;
* a vector of count two is interpreted as a `MapType` (key type first, then value type);
* a map is interpreted as a nested `StructType`; and
* everything else is left as is.

In particular, the last rule allows us to mix and match the data-oriented style with the Spark DataType style for specifying nested types.
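The conversion rules above can be sketched in plain Clojure. This is only an illustration, not Geni's implementation: `expand-type` is a hypothetical helper, and the tagged maps it returns (`:type`, `:element`, `:key`, `:value`, `:fields`, `:nullable`) are an assumed representation chosen to make each rule visible without needing Spark on the classpath.

```clojure
(defn expand-type
  "Expands a data-oriented type into an explicit, tagged description.
  Hypothetical helper for illustration only; not part of Geni."
  [t]
  (cond
    ;; a map is interpreted as a nested StructType, one nullable field per entry
    (map? t)
    {:type   :struct
     :fields (mapv (fn [[k v]]
                     {:name (name k) :type (expand-type v) :nullable true})
                   t)}

    ;; a vector of count one is interpreted as an ArrayType
    (and (vector? t) (= 1 (count t)))
    {:type :array :element (expand-type (first t)) :nullable true}

    ;; a vector of count two is interpreted as a MapType (key, then value)
    (and (vector? t) (= 2 (count t)))
    {:type     :map
     :key      (expand-type (first t))
     :value    (expand-type (second t))
     :nullable true}

    ;; everything else (keywords, Spark DataType objects, ...) is left as is
    :else t))

;; Usage: expand part of the schema from the example above.
(expand-type {:c [:short] :d [:str :int]})
```

Because the final branch returns its argument untouched, anything that is neither a map nor a one- or two-element vector passes straight through. That is what makes mixing the two styles possible in this sketch.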