Init

These are my notes on using SQL and PostgresSQL?.

TODO SQL Overview

This will be a collection of notes on best practices and summaries of content found will be added along with references as I find them.

Extracts from reference : <https://data36.com/sql-best-practices-data-analysts/>

Order in which the queries are processed
FROM
WHERE
GROUP BY
SELECT
ORDER BY
LIMIT
Spaces and line breaks do not affect the output of a query. However, it is good to stick to a convention throughout the program and use the minimal breaks necessary to enhance readability.
Always try to use the actual column name rather than the column number in ORDER BY and GROUP BY as the column order can change down the line, and this point of failure will not be easy to troubleshoot in a longer query.
Avoid SELECT * , or atleast use the LIMIT for rows greater than 5000 entries. This is almost never used in practice as it is inefficient especially for large databases.
Additional Reference SQL Style Guide
Employ google data studio or tableau or other tools to visualise data.
Break down the query into smaller steps and go at it in portions.
After the join statements - it is helpful to view the table as a 'joined table, i.e all further where, group and order commands act on the joined table, however, it is good practice to have the columns specified as from a particular table. Only the select portions
Use the table-name.column-name format for specifying columns explicitly belonging to a particular table.

Notes on Postgres setup

Installing postgres on ubuntu / debian

sudo apt-get install postgres postgresql-contrib

When installed, a superuser named postgres is automatically created. Use the following to login

sudo -i -u postgres

This will login using the postgres user, and then sql at the command line will be accessible. Note: the latter can be directly accessed using the command psql to log into a created user and database. Therefore it is necessary to create a user first.

CREATE USER shrysr WITH PASSWORD abcd

Exiting the postgres user is done withy \q.

Now directly login with the created user and to the specified database

psql -U shrysr -d postgres

SQL commands are now available at the command line.

pgadmin4 and sqlworkbench are admin tools to connect to databases. sqlworkbench will require the specific jdbc driver installed (example postgres

In general, pgdmin is less user friendly and has higher tendency to crash in general. There does not appear to be any major difference in terms of capability of the applications.

Licensing note of sql-workbench: Refer <http://sql-workbench.eu/manual/license.html> This explicitly mentions that the Governments of several countries and organisations related to them are not allowed to use this software. Canada is included in this list. I am surprised to see such a restriction and it is actually not mentioned anywhere in the data36 course. I would actually go to the extent of saying that the data36 courses are more suitable for beginners in data science and it is not truly suitable to even intermediate, let alone advanced users.

Creating a table and adding content

CREATE TABLE test(col1 TEXT, col2 INT);

Here test is the table name, and col1 and col2 are the column names, with the datatype being specified as text and integer.

The types could be for example text, int, bool, date, timestamp, decimal. Once the datatype is specified, this has to be adhered to.

There are additional parameters which can be specified while creating columns. Example from <https://data36.com/create-table-sql/> :

CREATE TABLE test_results
(
  name         TEXT,
  student_id   INTEGER   PRIMARY KEY,
  birth_date   DATE,
  test_result  DECIMAL   NOT NULL,
  grade        TEXT      NOT NULL,
  passed       BOOLEAN   NOT NULL
);

Inserting values into the created table:

INSERT INTO test VALUES ('hello', 001);
INSERT INTO test VALUES ('olo', 002);

Tables can be assigned nicknames when referenced (for convenience, and for tables with similarly named columns). Syntax would be FROM tableX tx, FROM tableY ty. The nickname is placed after a space. Specific columns from each table is referenced using the dot operators, i.e tx.column_id.

Similarly, columns can also assigned a nickname when called SELECT student_id as "Stu ID". Note that if there are no spaces in the desired name then the double quote is not required.

The available tables can be viewed using

SHOW TABLE

Commit if changes are made using pgadmin4 or sqlworkbench

If any changes to the table or data is being made via pgadmin4 or workbench, then the commit command has to be used to have the changes applied or formally published to the database. There are options to set an 'autocommit' in both pgadmin4 and sqlworkbench.

COMMIT;

Inserting data into a table from other sources

After creating a table with the necessary headers and data types specified, the following command can be used to copy data from a csv/excel/txt file into the table.

COPY accidents FROM '/home/shrysr/sql_practice/accidents.csv' WITH CSV HEADER DELIMITER ',';

Here 'accidents' is the name of a created table, the path is the location of hte csv file, and the delimiter being commas is specified.

The copy command cannot be used without superuser privileges.

The user can be converted to a superuser after logging in as the superuser and converting the desired user into a superuser. Alternately, /copy can be used in place of copy if the user is not a superuser.

Converting the user to a superuser is done by logging in as super user:

sudo -u postgres -i

And then, on psql - use the sql command:

ALTER USER shrysr WITH superuser;

Alternately, it may be necessary to populate a new table using the query results from another table. Then the INSERT INTO command is used, and the SQL query within parenthesis.

INSERT INTO accidents
(SELECT name FROM test_results);

Emptying table without deleting the table

TRUNCATE TABLE accidents;

The truncate command combined with the table keyword will empty the table accidents of all the rows, but retain the table itself.

Deleting a table : DROP

The DROP command can be used to delete a table.

DROP TABLE accidents;

Listing all the tables and relations `\dt`

Quit the psql interface : `\q`

Distinct is not a function

DISTINCT is a method of selection and is not a function that transforms the data, like count or sum.

Max function

Max function can be run on only one column at a time and the column has to be specified.

Example of median with subquery

The main SELECT function cannot be used as a calculation for a subsequent subquery. Each subquery will need it's own calculation as shown below.

SELECT damage
FROM accidents
WHERE damage > 0
ORDER BY damage
LIMIT 2
OFFSET (SELECT COUNT(damage) / 2
        FROM accidents
        WHERE damage > 0);

Appears not possible to sum up a boolean datatype in SQL

Refer to Ex 29 and 30 (7 da SQL course) in counting the number of insurance paid entries. Using a count just counts all the entries. And using a sum on a boolean throws up an error.

Joins

Joins can be a basic join, full join, left join and right join.

SQL