# lsql-csv

`lsql-csv` is a tool for querying CSV file data from the shell using short queries. It makes it possible to work with small CSV files as with a read-only relational database.

The tool implements a new language, LSQL, similar to SQL and specifically designed for working with CSV files in a shell.

## Installation

You need GHC (`>=8 <9.29`) and the Haskell packages Parsec (`>=3.1 <3.2`), Glob (`>=0.10 <0.11`), base (`>=4.9 <4.20`), text (`>=1.2 <1.3`) and containers (`>=0.5 <0.7`) installed. (The package boundaries given are identical to the cabal boundaries.)

Then run:

```
make
sudo make install
```

Now lsql-csv is installed in `/usr/local/bin`. If you want, you can specify `INSTALL_DIR`:

```
sudo make INSTALL_DIR=/custom/install-folder install
```

This will install the package into `INSTALL_DIR`.

If you have `cabal` installed, you can alternatively run:

```
cabal install
```

It will also install the dependencies for you.

The package is also published at https://hackage.haskell.org/package/lsql-csv in the Hackage public repository. You can therefore also install it directly, without cloning the repository, with:

```
cabal install lsql-csv
```

### Running the unit tests

If you want to verify that the package has been compiled correctly, you can test it by running:

```
make test
```

This will run all the unit tests for you.

## lsql-csv - quick introduction

LSQL, the language of `lsql-csv`, aims to be a more concise language than SQL. Its design purpose is to enable its user to quickly write simple queries directly in the terminal - its design purpose therefore differs from SQL, where the readability of queries is given more weight than in LSQL.

### Examples

One way to learn a new programming language is by understanding concrete examples of its usage. The following examples are written to teach the reader how to use `lsql-csv` by showing many examples of its usage.

The examples might not be enough for readers who do not know enough Unix/Linux scripting. If this is the case, please consider learning Unix/Linux scripting first before LSQL. It is also advantageous to know SQL.

The examples are mainly about parsing `/etc/passwd` and `/etc/group`. To make the examples more comfortable to read, the `/etc/passwd` and `/etc/group` column descriptions from the man pages are included here.

File `/etc/passwd` has the following columns:

* login name;
* optional encrypted password;
* numerical user ID;
* numerical group ID;
* user name or comment field;
* user home directory;
* optional user command interpreter.

File `/etc/group` has the following columns:

* group name;
* password;
* numerical group ID;
* user list.

#### Hello World

```
lsql-csv '-, &1.2 &1.1'
```

This will print the second (`&1.2`) and the first column (`&1.1`) of the CSV file on stdin. If you know SQL, you can read it as `from stdio S select S.second, S.first`.

Commands are split by commas into blocks. The first block is (*and always is*) the from block. It contains file names or `-` (stdin) separated by spaces. The second block is the select block, also separated by spaces.

For example:

```
lsql-csv '-, &1.2 &1.1' <<- EOF
World,Hello
EOF
```

It returns:

```
Hello,World
```

#### Simple filtering

```
lsql-csv -d: '-, &1.*, if &1.3>=1000' < /etc/passwd
```

This will print the lines of users whose UID is >= 1000.

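Conditions in the if block can also be combined with the usual logical operators. As a small sketch building on this example (the extra condition and the cut-off value are only illustrative):

```
# Sketch: keep rows with UID >= 1000 whose login name is not "nobody".
# Both && and != are documented operators; the exact condition is illustrative.
lsql-csv -d: '-, &1.*, if &1.3 >= 1000 && &1.1 != "nobody"' < /etc/passwd
```
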
The original filtering command can also be written as:

```
lsql-csv -d: 'p=/etc/passwd, p.*, if p.3 >= 1000'
lsql-csv -d: 'p=/etc/passwd, &1.*, if &1.3 >= 1000'
lsql-csv -d: '/etc/passwd, &1.*, if &1.3 >= 1000'
```

The `-d:` optional argument means that the primary delimiter is `:`. In the first two variants we used overnaming, which allows us to give the data source file `/etc/passwd` the name `p`.

If you know SQL, you can read it as `from /etc/passwd P select * where P.UID >= 1000`. As you can see, the LSQL style is more compressed than standard SQL.

The output might be:

```
nobody:x:65534:65534:nobody:/var/empty:/bin/false
me:x:1000:1000::/home/me:/bin/bash
```

If you specify the delimiter only for `/etc/passwd`, the output will be comma-delimited:

```
lsql-csv '/etc/passwd -d:, &1.*, if &1.3 >= 1000'
```

It might return:

```
nobody,x,65534,65534,nobody,/var/empty,/bin/false
me,x,1000,1000,,/home/me,/bin/bash
```

#### Named columns

Let's suppose we have a file `people.csv`:

```
name,age
Adam,21
Petra,23
Karel,25
```

Now, let's get all the names of the people in `people.csv` using the `-n` named switch:

```
lsql-csv -n 'people.csv, &1.name'
```

The output will be:

```
Adam
Petra
Karel
```

As you can see, we can reference named columns by name. The named switch `-n` enables first-line headers. If named columns are enabled, each column has two names under `&X` - the number name `&X.Y` and the actual name `&X.NAME`.

Now, we can select all columns with the wildcard `&1.*`:

```
lsql-csv -n 'people.csv, &1.*'
```

As the output, we get:

```
Adam,21,21,Adam
Petra,23,23,Petra
Karel,25,25,Karel
```

The output contains each column twice because the wildcard `&1.*` was expanded to `&1.1, &1.2, &1.age, &1.name`. How can we fix it?

```
lsql-csv -n 'people.csv, &1.[1-9]*'
```

The output is now:

```
Adam,21
Petra,23
Karel,25
```

The command can also be written as:

```
lsql-csv -n 'people.csv, &1.{1,2}'
lsql-csv -n 'people.csv, &1.{1..2}'
```

The output will still be the same in both cases.

#### Simple join

Let's say I am interested in the default group names of users. We need to join the tables `/etc/passwd` and `/etc/group`. Let's do it.

```
lsql-csv -d: '/etc/{passwd,group}, &1.1 &2.1, if &1.4 == &2.3'
```

What does `/etc/{passwd,group}` mean? Basically, there are three types of expressions: select, from, and arithmetic expressions. In all select and from expressions, you can use expansions and wildcards just like in bash.

Finally, the output can be something like this:

```
root:root
bin:bin
daemon:daemon
me:me
```

The first column is the name of a user and the second column is the name of its default group.

#### Basic grouping

Let's say I want to count the users using the same shell.

```
lsql-csv -d: 'p=/etc/passwd, p.7 count(p.3), by p.7'
```

And the output?

```
/bin/bash:7
/bin/false:7
/bin/sh:1
/bin/sync:1
/sbin/halt:1
/sbin/nologin:46
/sbin/shutdown:1
```

You can see here the first usage of the by block, which is equivalent to `group by` in SQL.

#### Basic sorting

Let's say you want to sort the users with UID greater than or equal to 1000 by UID, in ascending order.

```
lsql-csv -d: '/etc/passwd, &1.*, if &1.3 >= 1000, sort &1.3'
```

The output might look like:

```
me1:x:1000:1000::/home/me1:/bin/bash
me2:x:1001:1001::/home/me2:/bin/bash
me3:x:1002:1002::/home/me3:/bin/bash
nobody:x:65534:65534:nobody:/var/empty:/bin/false
```

The sort block is the equivalent of `order by` in SQL.

If we wanted the output sorted descendingly, we could pipe it to the `tac` command - `tac` prints the lines in reverse order:

```
lsql-csv -d: '/etc/passwd, &1.*, if &1.3 >= 1000, sort &1.3' | tac
```

#### About nice outputs

There is a trick for concatenating two values in a select expression: write them without a space between them.

But how does the interpreter know where the value name or value expression ends? You must use quotes for this - the quotes themselves can't be part of a value name. As an example, let's try to format our basic grouping example:

```
lsql-csv -d: 'p=/etc/passwd, "The number of users of "p.7" is "count(p.3)".", by p.7'
```

The output might be:

```
The number of users of /bin/bash is 7.
The number of users of /bin/false is 7.
The number of users of /bin/sh is 1.
The number of users of /bin/sync is 1.
The number of users of /sbin/halt is 1.
The number of users of /sbin/nologin is 46.
The number of users of /sbin/shutdown is 1.
```

As you can see, string formatting is sometimes very simple with LSQL.

#### Arithmetic expression

So far, we have met all the kinds of blocks. Only the if block accepts an arithmetic expression; the others accept a select expression. What if we need to run an arithmetic expression inside a select expression? There is a special syntax `$(...)` for it. For example:

```
lsql-csv -d: '/etc/passwd, $(sin(&1.3)^2 + cos(&1.3)^2)'
```

It returns something like:

```
1.0
1.0
1.0
0.9999999999999999
...
1.0
```

If we run:

```
lsql-csv -d: '/etc/passwd, $(&1.3 >= 1000), sort $(&1.3 >= 1000)'
```

We get something like:

```
false
false
...
false
true
true
...
true
```

#### More complicated join

Let's see a more complicated example.

```
lsql-csv -d: 'p=/etc/passwd g=/etc/group, p.1 g.1, if p.1 in g.4'
```

This will print all pairs of users and their groups, excluding the default groups. If you know SQL, you can read it as `from /etc/passwd P, /etc/group G select P.1, G.1 where P.1 in G.4`.

How does `in` work? It is a basic string-level "contains": if A is a substring of B, then `A in B` is true. Otherwise, it is false.

And the output?

```
root:root
root:wheel
root:floppy
root:tape
lp:lp
halt:root
halt:wheel
```

#### More complicated...

The previous example doesn't give a very readable output. We can use `group by` (shortened to `by`) to improve it.

```
lsql-csv -d: 'p=/etc/passwd g=/etc/group, p.1 cat(g.1","), if p.1 in g.4, by p.1'
```

The output will be something like:

```
adm:adm,disk,sys,
bin:bin,daemon,sys,
daemon:adm,bin,daemon,
lp:lp,
mythtv:audio,cdrom,tty,video,
news:news,
```

It groups all non-default groups of a user onto one line and concatenates them, delimited by ",".

How can we add the default groups too?

```
lsql-csv -d: 'p=/etc/passwd g=/etc/group, p.1 cat(g.1","), if p.1 in g.4, by p.1' | lsql-csv -d: '- /etc/passwd /etc/group, &1.1 &1.2""&3.1, if &1.1 == &2.1 && &2.4 == &3.3'
```

This will output something like:

```
adm:adm,disk,sys,adm
bin:bin,daemon,sys,bin
daemon:adm,bin,daemon,daemon
lp:lp,lp
mythtv:audio,cdrom,tty,video,mythtv
news:news,news
```

The first part of the command is the same as in the previous example. The second part inner-joins the output of the first part with `/etc/passwd` on the username and with `/etc/group` on the default GID number, and prints the output of the first part with the default group name appended.

## Usage

Now, if you understand the examples, it is time to move on to a more abstract description of the language and tool usage.

### Options

* `-h`, `--help` - Shows short command-line help and exits before doing anything else.
* `-n`, `--named` - Enables the first-line naming convention in CSV files. This works only on input files. Output is always without first-line column names.
* `-dCHAR`, `--delimiter=CHAR` - Changes the default primary delimiter. The default value is `,`.
* `-sCHAR`, `--secondary-delimiter=CHAR` - Changes the default quote char (secondary delimiter). The default value is `"`.

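To show how these switches combine, here is a small sketch (a semicolon-delimited variant of the earlier `people.csv` file is only an assumption made for this example):

```
# Sketch: read a semicolon-delimited file with a header line.
# -d';' overrides the primary delimiter, -n enables first-line column names.
lsql-csv -d';' -n 'people.csv, &1.name, if &1.age >= 23'
```

With the `people.csv` data from the earlier example (but `;`-delimited), this would be expected to print the names of people aged 23 or older.
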
### Datatypes

There are 4 datatypes: `Bool`, `Int`, `Double` and `String`. `Bool` is either true or false, `Int` is an at least 30-bit integer, `Double` is a double-precision floating point number, and `String` is an ordinary char string.

During CSV data parsing, the following datatype-selection logic is used:

* `Bool`, if `true` or `false`;
* `Int`, if `[0-9]+` matches;
* `Double`, if `[0-9]+.[0-9]+(e[0-9]+)?` matches;
* `String`, if none of the above matches.

### Joins

A join means that you put multiple input files into the from block. Joins always have time complexity O(nm), where n and m are the row counts of the joined files. No optimization based on if conditions is performed when you put multiple files into the from block.

### Documentation of language

```
lsql-csv [OPTIONS] COMMAND
```

Description of the grammar:

```
COMMAND -> FROM_BLOCK, REST

REST -> SELECT_BLOCK, REST
REST -> BY_BLOCK, REST
REST -> SORT_BLOCK, REST
REST -> IF_BLOCK, REST
REST ->

FROM_BLOCK -> FROM_SELECTOR FROM_BLOCK
FROM_SELECTOR ~~> FROM ... FROM    //Wildcard and brace expansion

FROM -> FROM_NAME=FROM_FILE OPTIONS
FROM -> FROM_FILE OPTIONS

OPTIONS -> -dCHAR OPTIONS
OPTIONS -> --delimiter=CHAR OPTIONS
OPTIONS -> -sCHAR OPTIONS
OPTIONS -> --secondary-delimiter=CHAR OPTIONS
OPTIONS -> -n OPTIONS
OPTIONS -> --named OPTIONS
OPTIONS -> -N OPTIONS
OPTIONS -> --not-named OPTIONS
OPTIONS ->

SELECT_BLOCK -> SELECT_EXPR
BY_BLOCK -> by SELECT_EXPR
SORT_BLOCK -> sort SELECT_EXPR
IF_BLOCK -> if ARITHMETIC_EXPR

ARITHMETIC_EXPR -> ATOM
ARITHMETIC_EXPR -> ONEARG_FUNCTION(ARITHMETIC_EXPR)
ARITHMETIC_EXPR -> ARITHMETIC_EXPR OPERATOR ARITHMETIC_EXPR
ARITHMETIC_EXPR -> (ARITHMETIC_EXPR)

// Logical negation
ARITHMETIC_EXPR -> ! ARITHMETIC_EXPR

ARITHMETIC_EXPR -> - ARITHMETIC_EXPR

SELECT_EXPR -> ATOM_SELECTOR SELECT_EXPR
SELECT_EXPR ->
ATOM_SELECTOR ~~> ATOM ... ATOM    //Wildcard and brace expansion

// e.g. 1.0, "text", 'text', 1
ATOM -> CONSTANT

// e.g. &1.1
ATOM -> COLUMN_NAME

ATOM -> pi
ATOM -> e
ATOM -> true
ATOM -> false
ATOM -> $(ARITHMETIC_EXPR)
ATOM -> AGGREGATE_FUNCTION(SELECT_EXPR)
ATOM -> ONEARG_FUNCTION(ARITHMETIC_EXPR)

// # is not really a char:
// two atoms can be written without space
// and will be (string) appended,
// if they are separated using quote chars:
// the left atom must end or the right atom must begin with a quote char.
// This rule doesn't apply inside ARITHMETIC_EXPR
ATOM ~~> ATOM#ATOM

AGGREGATE_FUNCTION -> cat
AGGREGATE_FUNCTION -> sum
AGGREGATE_FUNCTION -> count
AGGREGATE_FUNCTION -> max
AGGREGATE_FUNCTION -> min
AGGREGATE_FUNCTION -> avg

//All trigonometric functions in radians
ONEARG_FUNCTION -> sin
ONEARG_FUNCTION -> cos
ONEARG_FUNCTION -> tan
ONEARG_FUNCTION -> asin
ONEARG_FUNCTION -> acos
ONEARG_FUNCTION -> atan
ONEARG_FUNCTION -> sinh
ONEARG_FUNCTION -> cosh
ONEARG_FUNCTION -> tanh
ONEARG_FUNCTION -> asinh
ONEARG_FUNCTION -> acosh
ONEARG_FUNCTION -> atanh
ONEARG_FUNCTION -> exp
ONEARG_FUNCTION -> sqrt

//The length of the string
ONEARG_FUNCTION -> size

ONEARG_FUNCTION -> to_string
ONEARG_FUNCTION -> negate
ONEARG_FUNCTION -> abs
ONEARG_FUNCTION -> signum
ONEARG_FUNCTION -> truncate
ONEARG_FUNCTION -> ceiling
ONEARG_FUNCTION -> floor
ONEARG_FUNCTION -> even
ONEARG_FUNCTION -> odd

// A in B means A is a substring of B
OPERATOR -> in

OPERATOR -> *
OPERATOR -> **    //general power
OPERATOR -> ^     //natural power
OPERATOR -> /

// Integer division truncated towards minus infinity
// (x div y)*y + (x mod y) == x
OPERATOR -> div
OPERATOR -> mod

//Integer division truncated towards 0
// (x quot y)*y + (x rem y) == x
OPERATOR -> quot
OPERATOR -> rem

// greatest common divisor
OPERATOR -> gcd

// least common multiple
OPERATOR -> lcm

OPERATOR -> ++    //append
OPERATOR -> +
OPERATOR -> -

OPERATOR -> <=
OPERATOR -> >=
OPERATOR -> <
OPERATOR -> >
OPERATOR -> !=
OPERATOR -> ==

OPERATOR -> ||
OPERATOR -> &&
```

Each command is made of blocks separated by commas. There are these types of blocks:

* From block
* Select block
* If block
* By block
* Sort block

The first block is always the from block. If the block after the first block has no specifier (`if`, `by`, or `sort`), then it is a select block. Otherwise, it is the block given by the specifier.

The from block accepts its own specific grammar (as specified in the grammar description); the select, by, and sort blocks accept a select expression (`SELECT_EXPR` in the grammar); and the if block accepts an arithmetic expression (`ARITHMETIC_EXPR` in the grammar).

Every source file has a number and may have multiple names - an assign name, which is the name given to the source file by the `ASSIGN_NAME=FILE_PATH` syntax in the from block, and a default name, which is the path to the file (or `-` in the case of stdin) given in the from block.

Each column of a source file has a number and may have a name (if the named option is enabled for the given source file). If the source file with index M (numbering input files from 1) has been given a name XXX, its columns can be addressed as &M.N or XXX.N, where N is the index of the column (numbering columns from 1). If the named option is enabled and a column has the name NAME, it can also be addressed as &M.NAME or XXX.NAME.

If there is a collision in naming (two source files have the same name, or two columns in the same source file have the same name), the behavior is undefined.

#### Exotic chars

Some chars cannot appear in symbol names (column names). For simplicity, we can suppose these are all chars except alphanumerical chars and `-`, `.`, `&`, and `_`.

Also, the first char of a symbol name must be non-numerical; otherwise it is considered an exotic char.

Referencing names containing exotic chars without quotes is unsupported. It is possible to reference columns whose names contain exotic chars using the \` quote - like \`EXOTIC NAME\`. The source file name is always part of the column name from the syntax perspective of the language - it must be inside the quotes.

#### Quote chars

There are 3 quote chars (\`, " and ') used in LSQL. " and ' always quote a string. The \` quote is used for quoting symbol names.

These chars can be used for fast appending. If two atoms inside a SELECT_EXPR are written without a space and are separated by quotes, they will be appended. For example, `abc"abc"` appends the string abc to the value of the column abc.

#### Constants

There are 3 types of constants: String, Double, and Int. Everything quoted in " or ' is always a String constant. Numbers matching `[0-9]+` are considered Int constants and numbers matching `[0-9]+.[0-9]+` Double constants.

#### Operator precedence and associativity

The following list gives the precedence and associativity of lsql-csv infix operators. The lower the precedence number, the higher the priority.

* 1: `in`, `**`, `^`
* 2: `*`, `/`, `div`, `quot`, `rem`, `mod`, `gcd`, `lcm`
* 3: `++`, `+`, `-`
* 4: `<=`, `>=`, `<`, `>`, `!=`, `==`
* 5: `||`, `&&`

All operators are right-to-left associative.

#### Select expression

Select expressions are similar to bash expressions. They are made of atom selector expressions separated by whitespace. These expressions are expanded, evaluated, and matched to column names, constants, aggregate functions, or arithmetic expressions.

Every atom selector expression can consist of:

* wildcards (each wildcard will be expanded to multiple statements during processing);
* bash brace expansion (e.g. {22..25} -> 22 23 24 25);
* arithmetic expressions in the `$(expr)` format;
* \` quotes (\`anything\`) to prevent wildcards and expansions;
* " or ' quotes to insert a string;
* calls of aggregate functions `AGGREGATE_FUNCTION(next select block)` - there cannot be any space after the function name;
* calls of one-argument functions `ONEARG_FUNCTION(arithmetic expression)` - there cannot be any space after the function name;
* constants;
* references to a column name.

If you want to concatenate strings without the `++` operator, you can write: `a.1","a.2`. Please keep in mind that operators must be put inside arithmetic expressions, or they will be matched to a column name or an aggregate function.

#### Arithmetic expression

Arithmetic expressions mostly follow classical awk logic. You can use the keywords `>`, `<`, `<=`, `>=`, `==`, `||`, `&&`, `+`, `-`, `*`, `/`...

#### Select blocks

These blocks determine the output. They accept select expressions, which are evaluated and printed in a delimited format.

Every select block must contain at least one reference to a column name; otherwise, the behavior is undefined.

Examples of select blocks:

```
&1.[3-6]
```

This will print columns 3, 4, 5, and 6 of the first file.

```
ax*.{6..4}
```

This will print the 6th, 5th, and 4th columns of all files whose name begins with ax.

#### From blocks

There must be exactly one from block, at the beginning of the command. The block can contain any files (`-` specifies standard input). You can use any syntax you would otherwise use in bash to select these files (wildcards, expansion...). You can also overname a file using `NAME=stmt`. If stmt matches more than one file, the files will be named `NAME`, `NAME1`, `NAME2`...

Example:

```
/etc/{passwd,group}
```

This will select the `/etc/passwd` and `/etc/group` files.

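The same two files could also be given explicit names; a small sketch reusing the join condition from the earlier examples:

```
# Sketch: name the files p and g in the from block and reference
# their columns through those names instead of &1/&2.
lsql-csv -d: 'p=/etc/passwd g=/etc/group, p.1 g.1, if p.4 == g.3'
```
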
With the brace-expansion form above, the files can be addressed either as `&1` or `/etc/passwd`, and as `&2` or `/etc/group`.

If a filename is put inside \` quotes, no wildcard or expansion logic will be applied to it.

You can also add custom attributes to files in the format `FILE -aX --attribute=X -b`. The attributes will be applied to all files matched by the `FILE` bash expression.

##### Possible attributes

* `-n`, `--named` - The CSV file has a first line with the names of the columns.
* `-N`, `--not-named` - Sets the exact opposite. This can be useful if you have changed the default behavior.
* `-dCHAR`, `--delimiter=CHAR` - Changes the primary delimiter.
* `-sCHAR`, `--secondary-delimiter=CHAR` - Changes the secondary delimiter char.

Example:

```
/etc/passwd -d:
```

Currently, commas and chars which are also quote chars in LSQL are not supported as delimiters.

#### If block

This block always begins with `if`. It accepts an arithmetic expression, which should be convertible to Bool - either a String `false`/`true`, an Int (`0` is `false`, anything else is `true`), or a Bool. Rows evaluating to true are printed or aggregated; rows evaluating to false are skipped. Filtering is done before aggregation.

You can think of the if block as the where clause in SQL.

#### By block

This block always begins with `by`, and the rest of the block is a select expression. There can be only one by block in the whole command.

The by block is used to group the result set by the given atoms. You can think of the by block as the group by clause in SQL.

There must be at least one aggregate function in the select block if a by block is present; otherwise, the behavior is undefined. If an aggregate function is present without a by block, the aggregation runs over all rows at once.

#### Sort block

This block can be at the end of the command. It begins with the `sort` keyword, and the rest is a select expression. The sort block determines the order of the final output - the given atoms are sorted in ascending order. You can think of the sort block as the order by clause in SQL.

There can be only one sort block in the whole command.

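To tie the block types together, here is a sketch of one command that uses all of them (built only from constructs shown above; the actual output depends on the contents of `/etc/passwd`):

```
# Sketch: for users with UID >= 1000, count the logins per shell
# and sort the result by the shell path.
lsql-csv -d: 'p=/etc/passwd, p.7 count(p.1), if p.3 >= 1000, by p.7, sort p.7'
```

Read as SQL, this is roughly `from /etc/passwd P select P.7, count(P.1) where P.3 >= 1000 group by P.7 order by P.7`.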