ray.voelker@gmail.com
https://docs.python.org/3/library/sqlite3.html
SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. Some applications can use SQLite for internal data storage. It’s also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle.
...
To use the module, you must first create a Connection object that represents the database. Here the data will be stored in the example.db file:
import sqlite3
conn = sqlite3.connect('example.db')
You can also supply the special name :memory: to create a database in RAM.
conn = sqlite3.connect(":memory:")
# or, to enable shared cache, connect to the database like this...
conn = sqlite3.connect("file::memory:?cache=shared")
# https://sqlite.org/sharedcache.html
c = conn.cursor()
# Create table
c.execute('''CREATE TABLE stocks
(date text, trans text, symbol text, qty real, price real)''')
# Insert a row of data
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
# Save (commit) the changes
conn.commit()
# We can also close the connection if we are done with it.
# Just be sure any changes have been committed or they will be lost.
conn.close()
Usually your SQL operations will need to use values from Python variables. You shouldn’t assemble your query using Python’s string operations because doing so is insecure; it makes your program vulnerable to an SQL injection attack (see https://xkcd.com/327/ for humorous example of what can go wrong).
Instead, use the DB-API’s parameter substitution. Put ? as a placeholder wherever you want to use a value, and then provide a tuple of values as the second argument to the cursor’s execute() method. (Other database modules may use a different placeholder, such as %s or :1.) For example:
# Never do this -- insecure!
symbol = 'RHAT'
c.execute("SELECT * FROM stocks WHERE symbol = '%s'" % symbol)
# Do this instead
t = ('RHAT',)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)
print(c.fetchone())
# Larger example that inserts many records at a time
purchases = [('2006-03-28', 'BUY', 'IBM', 1000, 45.00),
('2006-04-05', 'BUY', 'MSFT', 1000, 72.00),
('2006-04-06', 'SELL', 'IBM', 500, 53.00),
]
c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', purchases)
Log file analysis:
...web proxy server used by libraries to give access from outside the library's computer network to restricted-access websites that authenticate users by IP address. This allows library patrons at home or elsewhere to log in through their library's EZproxy server and gain access to resources to which their library subscribes, such as bibliographic databases. wikipedia)
Objective: Convert the (apache style) ezproxy log file into .csv files for easier analysis. Current EZproxy config for log output: (https://help.oclc.org/Library_Management/EZproxy/Configure_resources/LogFormat)
LogFormat %h %{ezproxy-session}i %u %t "%r" %s %b
EZproxy Log Directive | Description |
---|---|
%h | The IP address of the Host accessing EZproxy |
%{ezproxy-session}i | EZproxy identifier for the user's current session |
%u | Username used to log into EZproxy if Option LogUser appears in config.txt; session identifier if Option LogSession appears in config.txt; - otherwise. To log both username and session, add only Option LogUser to config.txt, then use %u for the username and %{ezproxy-session}i for the session identifier. |
%t | Date/time of request; may also appear as %{format}t to specify a strftime time format. |
%r | Complete request (e.g. GET http://www.somedb.com HTTP/1.0) |
%s | HTTP numeric status of request |
%b | Number of bytes transferred |
Example line:
66.42.221.177 xiPT7OYWGCpRcAS 12345678 [20/May/2019:00:08:24 -0400] "GET https://research.cincinnatilibrary.org:443 HTTP/1.1" 302 0
Description | Value |
---|---|
IP addresss | "66.42.221.177" |
Users session id | "xiPT7OYWGCpRcAS" |
Username used to log into EZproxy | "12345678" |
Date/time of request | "2019-05-20T00:08:24-04:00" |
Complete request | "GET https://research.cincinnatilibrary.org:443 HTTP/1.1" |
HTTP numeric status | 302 |
Number of bytes transferred | 0 |
import re
import sqlite3
import csv
from dateutil import parser
# data file to analyze
data_file = 'ezproxy.log.20190805.000001'
ezproxy_regex = re.compile(
r"^([^\s]{1,})\s{1}" # first capture group: %h IP address
r"([^\s]{1,})\s{1}" # second capture group: %{ezproxy-session}i users current session
r"([^\s]{1,})\s{1}" # third capture group: %u Username used to log into EZproxy
r"(?:\[)(.*?)(?:\])\s{1}" # fourth capture group: %t Date/time of request
r"(?:\")(.*?)(?:\")\s{1}" # fifth capture group: %r Complete request
r"([^\s]{1,})\s{1}" # sixth capture group: %s HTTP numeric status
r"([^\s]{1,})" # seventh capture group:%b Number of bytes transferred
)
# db = sqlite3.connect('ezproxy.db')
db = sqlite3.connect("file::memory:?cache=shared")
sql = """
-- note: SQLite does not have a storage class set aside for storing dates and/or times.
-- https://www.sqlite.org/datatype3.html
CREATE TABLE IF NOT EXISTS
`data` (
`ip` TEXT,
`ezproxy_session` TEXT,
`patron_username` TEXT,
`iso_timestamp` TEXT,
`completed_request` TEXT,
`http_status` INTEGER,
`num_bytes` INTEGER
);
CREATE INDEX IF NOT EXISTS
`idx_patron_search` ON `data`(
`ezproxy_session`,
`patron_username`,
`iso_timestamp`
);
CREATE TABLE IF NOT EXISTS
`patron_nicknames` (
`patron_username` TEXT,
`nickname` TEXT
);
CREATE INDEX IF NOT EXISTS
`idx_patron_nickname` ON `patron_nicknames`(
`patron_username`
);
"""
db.executescript(sql)
# insert sql command
sql = """
INSERT INTO data(
ip,
ezproxy_session,
patron_username,
iso_timestamp,
completed_request,
http_status,
num_bytes
)
VALUES(?, ?, ?, ?, ?, ?, ?)
"""
cursor = db.cursor()
# text file contains the exproxy-formatted log data as defined above
count = 0
with open(data_file, 'rt') as data_input:
for line in data_input:
# add a counter and output per 10,000 lines processed
count += 1
if (count % 10000 == 1):
print('.', end='', flush=True)
x = ezproxy_regex.search(line)
if (x != None):
x_groups = list(x.groups())
# convert the time string to ISO format (replace the first ':' with a ' ')
x_groups[3] = parser.parse(x_groups[3].replace(':', ' ', 1)).isoformat()
cursor.execute(sql, x_groups)
print("\nProcessed lines: {}".format(count))
cursor.close()
db.commit()
# if we want to store the logfile in the converted .csv then uncomment the following:
#
# # csv file for output
# with open('temp.csv', 'w') as csvfile:
# csv_writer = csv.writer(csvfile, quoting=csv.QUOTE_ALL, dialect='excel')
# # write the csv header
# csv_writer.writerow(('ip', 'ezproxy_session', 'patron_username', 'iso_timestamp', 'completed_request', 'http_status', 'num_bytes'))
# for line in data_input:
# x = ezproxy_regex.search(line)
# if (x != None):
# x_groups = list(x.groups())
# # convert the time string to ISO format (replace the first ':' with a ' ')
# x_groups[3] = parser.parse(x_groups[3].replace(':', ' ', 1)).isoformat()
# csv_writer.writerow(x_groups)
# this is just to obfuscate the usernames for display later
from random import sample, randint
import nouns
import adjectives
def gen_nickname():
name = ''.join(sample(adjectives.ADJECTIVES, 1)) + '-' + \
''.join(sample(nouns.ANIMALS, 1)).lower() + \
'-' + str(randint(1,100))
return name
cursor = db.cursor()
sql = """
SELECT
DISTINCT data.patron_username
FROM
data
"""
results = cursor.execute(sql)
sql = """
INSERT INTO patron_nicknames(`patron_username`, `nickname`) VALUES (?, ?)
"""
for result in results:
# insert_values = (result[0], gen_nickname() )
# print(insert_values)
db.execute(sql, (result[0], gen_nickname()))
cursor.close()
db.commit()
To start looking at the data and get a sense of what's going on, the following queries on the table are useful for determining usage during time periods / by user.
sql = """
SELECT
COUNT(*) AS count,
MIN(data.iso_timestamp) AS min_date,
MAX(data.iso_timestamp) AS max_date,
COUNT( DISTINCT data.patron_username) as count_users,
COUNT( DISTINCT data.ezproxy_session ) as count_sessions,
SUM(num_bytes) / 1048576 AS sum_mebibyte
FROM
data
"""
cursor = db.cursor()
cursor.execute(sql)
print(cursor.fetchone())
cursor.close()
# ... or, if we want to get fancy, we can do this to access the columns by name!
# https://docs.python.org/3.7/library/sqlite3.html#accessing-columns-by-name-instead-of-by-index
cursor = db.cursor()
cursor.execute(sql)
cursor.row_factory = sqlite3.Row
result = cursor.fetchone()
print('count rows:\t\t\t{}\n\
min_date:\t\t\t{}\n\
max_date:\t\t\t{}\n\
count distinct users:\t\t{}\n\
count distinct sessions:\t{} ({:.2f} sessions per user)\n\
sum mebibytes transferred:\t{} ({:.2f} Terabytes)'\
.format(result['count'], \
result['min_date'], \
result['max_date'], \
result['count_users'], \
result['count_sessions'],\
result['count_sessions']/result['count_users'],\
result['sum_mebibyte'], result['sum_mebibyte']/953674))
... BTW, a cool trick to print a float and limit to 2 digits: use the Python string format() method:
https://docs.python.org/3.7/library/string.html#formatspec
from math import pi
print(pi)
# 3.141592653589793
print('{:.2f}'.format(pi))
# 3.14
# -- find sum of data downloaded by date, ezproxy_session, and patron_username
# -- order the results by date, and amount downloaded per aggregated group
# -- filter out sessions with fewer than 10 mebibytes total
sql = """
SELECT
date(iso_timestamp) as date,
ezproxy_session,
-- patron_username,
nickname,
(SUM(num_bytes) / 1048576) AS sum_mebibyte
FROM
data
JOIN
patron_nicknames
ON
patron_nicknames.patron_username = data.patron_username
GROUP BY
date,
ezproxy_session,
-- patron_username
nickname
HAVING
sum_mebibyte >= 10
ORDER BY
date,
sum_mebibyte DESC
"""
results = db.execute(sql)
for result in results:
print(result)
# -- target a specific resource name and specific date range per user
# -- (top 10 users by amount of data transferred)
sql = """
SELECT
nickname,
(SUM(data.num_bytes) / 1048576) AS sum_mebibyte
FROM
data
JOIN
patron_nicknames
ON
patron_nicknames.patron_username = data.patron_username
WHERE
data.completed_request LIKE '%proquest.com%'
GROUP BY
nickname
ORDER BY
sum_mebibyte DESC
LIMIT 10
"""
results = db.execute(sql)
for result in results:
print(result)
# we'll store the first in that list as the `target` to use in the query below...
cursor = db.cursor()
cursor.execute(sql)
target = cursor.fetchone()[0]
cursor.close()
# how about we target the username over a 2 day span
sql = """
SELECT
-- strftime ('%Y-%m-%d %H', data.iso_timestamp) AS hour,
strftime ('%d %H', data.iso_timestamp) AS hour,
(SUM(data.num_bytes) / 1048576) AS sum_mebibyte
FROM
data
JOIN
patron_nicknames
ON
patron_nicknames.patron_username = data.patron_username
----
-- CHANGE THIS TO THE USER MATCHING THE TOP RESULT ABOVE
----
WHERE
nickname = ?
AND date(data.iso_timestamp) >= '2019-07-29'
AND date(data.iso_timestamp) <= '2019-07-30'
GROUP BY
hour
ORDER BY
hour
"""
print(target + '\n---')
results = db.execute(sql, [(target)])
for result in results:
print(result)
# let's read the results of this query into a pandas dataframe...
import pandas as pd
import numpy as np
# we can use the SQL query we previously had.
# Note: we set the `params` argument to to pass the paramaters to the SQL query ...
df = pd.read_sql_query(sql, db, params=[(target)])
df.describe
# if we found it useful, we could also do other dataframe like operations on the data
mean = df.loc[:,"sum_mebibyte"].mean()
median = df.loc[:,"sum_mebibyte"].median()
minimum = df.loc[:,"sum_mebibyte"].min()
maximum = df.loc[:,"sum_mebibyte"].max()
print('mean: {}\nmedian: {}\nminimum: {}\nmaximum: {}'.format(mean, median, minimum, maximum))
# Also, let's see what that looks like in a plot!
%matplotlib inline
import pandas as pd
import matplotlib
df = pd.read_sql_query(sql, db, params=[(target)])
df.plot(x='hour', y='sum_mebibyte', kind='line')
Advent of Code 2019, Day 3:
https://github.com/rayvoelker/adventofcode2019/blob/master/day_03_sql.py
Summary: Given two wires defined below laid out on an grid:
Question: What is the Manhattan distance from the central port to the closest intersection?
Part 2: minimize the signal delay (sum of both wires' steps is lowest)
Wire 1: R8,U5,L5,D3
Wire 2: U7,R6,D4,L4
...........
.+-----+...
.|.....|...
.|..+--x-+.
.|..|..|.|.
.|.-X--+.|.
.|..|....|.
.|.......|.
.o-------+.
...........
Answer: 6
# puzzle input:
input_1 = 'R1004,U518,R309,D991,R436,D360,L322,U627,R94,D636,L846,D385,R563,U220,L312,D605,L612,D843,R848,U193,L671,D852,L129,D680,L946,D261,L804,D482,R196,U960,L234,U577,R206,D973,R407,D400,R44,D103,R463,U907,L972,U628,L962,U856,L564,D25,L425,U332,R931,U837,R556,U435,R88,U860,L982,D393,R793,D86,R647,D337,R514,D361,L777,U640,R833,D674,L817,D260,R382,U168,R161,U449,L670,U814,L42,U461,R570,U855,L111,U734,L699,U602,R628,D79,L982,D494,L616,D484,R259,U429,L917,D321,R429,U854,R735,D373,L508,D59,L207,D192,L120,D943,R648,U245,L670,D571,L46,D195,L989,U589,L34,D177,L682,U468,L783,D143,L940,U412,R875,D604,R867,D951,L82,U851,L550,D21,L425,D81,L659,D231,R92,D232,R27,D269,L351,D369,R622,U737,R531,U693,R295,U217,R249,U994,R635,U267,L863,U690,L398,U576,R982,U252,L649,U321,L814,U516,R827,U74,L80,U624,L802,D620,L544,U249,R983,U424,R564,D217,R151,U8,L813,D311,R203,U478,R999,U495,R957,U641,R40,U431,L830,U67,L31,U532,R345,U878,L996,D223,L76,D264,R823,U27,L776,U936,L614,U421,L398,U168,L90,U525,R640,U95,L761,U938,R296,D463,L349,D709,R428,U818,L376,D444,L748,D527,L755,U750,R175,U495,R587,D767,L332,U665,L84,D747,L183,D969,R37,D514,R949,U985,R548,U939,L170,U415,R857,D480,R836,D363,R763,D997,R721,D140,R699,U673,L724,U375,R55,U758,R634,D590,L608,U674,R809,U308,L681,D957,R30,D913,L633,D939,L474,D567,R290,D615,L646,D478,L822,D471,L952,D937,R306,U380,R695,U788,R555,D64,R769,D785,R115,U474,R232,U353,R534,D268,L434,U790,L777,D223,L168,U21,L411,D524,R862,D43,L979,U65,R771,U872,L983,U765,R162'
input_2 = 'L998,U952,R204,U266,R353,U227,L209,D718,L28,D989,R535,U517,L934,D711,R878,U268,L895,D766,L423,U543,L636,D808,L176,U493,R22,D222,R956,U347,R953,U468,R657,D907,R464,U875,L162,U225,L410,U704,R76,D985,L711,U176,R496,D720,L395,U907,R223,D144,R292,D523,R514,D942,R838,U551,L487,D518,L159,D880,R53,D519,L173,D449,R525,U645,L65,D568,R327,U667,R790,U131,R402,U869,R287,D411,R576,D265,R639,D783,R629,U107,L571,D247,L61,D548,L916,D397,R715,U138,R399,D159,L523,U2,R794,U699,R854,U731,L234,D135,L98,U702,L179,D364,R123,D900,L548,U880,R560,D648,L701,D928,R256,D970,L396,U201,L47,U156,R723,D759,R663,D306,L436,U508,R371,D494,L147,U131,R946,D207,L516,U514,R992,D592,L356,D869,L299,U10,R744,D13,L52,U749,R400,D146,L193,U720,L226,U973,R971,U691,R657,D604,L984,U652,L378,D811,L325,D714,R131,D428,R418,U750,L706,D855,L947,U557,L985,D688,L615,D114,R202,D746,R987,U353,R268,U14,R709,U595,R982,U332,R84,D620,L75,D885,L269,D544,L137,U124,R361,U502,L290,D710,L108,D254,R278,U47,R74,U293,R237,U83,L80,U661,R550,U886,L201,D527,L351,U668,R366,D384,L937,D768,L906,D388,L604,U515,R632,D486,L404,D980,L652,U404,L224,U957,L197,D496,R690,U407,L448,U953,R391,U446,L964,U372,R351,D786,L187,D643,L911,D557,R254,D135,L150,U833,R876,U114,R688,D654,L991,U717,R649,U464,R551,U886,L780,U293,L656,U681,L532,U184,L903,D42,L417,D917,L8,U910,L600,D872,L632,D221,R980,U438,R183,D973,L321,D652,L540,D163,R796,U404,L507,D495,R707,U322,R16,U59,L421,D255,L463,U462,L524,D703,L702,D904,L597,D385,L374,U411,L702,U804,R706,D56,L288'
print(
len(input_1.split(',')),
len(input_2.split(','))
)
Visualization thanks to https://github.com/lindseymhensley/Advent_Of_Code_2019/blob/master/Day03.py
import matplotlib.pyplot as plt
line_one_path = input_1.split(",")
line_two_path = input_2.split(",")
class Wire:
def __init__(self):
self.x_points = [0]
self.y_points = [0]
def PlotPoints(self, command: list, origin=False):
y_plane = 0
x_plane = 0
for cm in command:
direction = cm[0]
distance = int(cm[1:])
if direction == "R":
x_plane += distance
elif direction == "U":
y_plane += distance
elif direction == "L":
x_plane += -distance
elif direction == "D":
y_plane += -distance
else:
raise Exception("Bad command given.")
self.y_points.append(y_plane)
self.x_points.append(x_plane)
if origin:
plt.plot(self.x_points, self.y_points, "bo")
else:
plt.plot(self.x_points, self.y_points)
wire = [Wire(), Wire()]
central_port = Wire()
wire[0].PlotPoints(line_one_path) # converts list commands to plotted points
wire[1].PlotPoints(line_two_path) # converts list commands to plotted points
central_port.PlotPoints([], True)
# creates graph window
plt.xlabel('x - axis')
plt.ylabel('y - axis')
plt.title('--- Crossed Wires ---')
plt.rcParams['figure.dpi'] = 300
plt.show()
import sqlite3
import re
# db = sqlite3.connect('data.db')
db = sqlite3.connect("file::memory:?cache=shared")
sql = """
BEGIN TRANSACTION;
DROP TABLE IF EXISTS `cords`;
CREATE TABLE IF NOT EXISTS `cords` (
`id` INTEGER PRIMARY KEY AUTOINCREMENT,
`wire` TEXT,
`x_cord` INTEGER,
`y_cord` INTEGER,
`running_distance` INTEGER
);
DROP INDEX IF EXISTS `idx_cords`;
CREATE INDEX IF NOT EXISTS `idx_cords` ON `cords` (
`wire`,
`x_cord`,
`y_cord`
);
DROP VIEW IF EXISTS `wire_one`;
DROP VIEW IF EXISTS `wire_two`;
CREATE VIEW `wire_one` AS select * from cords WHERE cords.wire = 'wire_one';
CREATE VIEW `wire_two` AS select * from cords WHERE cords.wire = 'wire_two';
COMMIT;
"""
db.executescript(sql)
def fill_wire_cords(input_directions, cord_name=None):
"""
will expand the directions into sets of coordinates in the
database table `cords` -- keeping track of the running distance
of the wire (cord_name)
"""
cursor = db.cursor()
sql = """INSERT INTO cords (wire, x_cord, y_cord, running_distance) VALUES (?,?,?,?)"""
current_x = 0
current_y = 0
running_distance = 0
for value in input_directions.split(','):
# right
if re.match('^R{1}', value):
distance = int(re.split('[A-Z]{1}([0-9]{1,}$)', value)[1])
for i in range(current_x+1, current_x+distance+1, 1):
running_distance += 1
cursor.execute(sql, (cord_name, i, current_y, running_distance))
current_x += distance
# left
if re.match('^L{1}', value):
distance = int(re.split('[A-Z]{1}([0-9]{1,}$)', value)[1])
for i in range(current_x-1, current_x-distance-1, -1):
running_distance += 1
cursor.execute(sql, (cord_name, i, current_y, running_distance))
current_x -= distance
# up
if re.match('^U{1}', value):
distance = int(re.split('[A-Z]{1}([0-9]{1,}$)', value)[1])
for i in range(current_y+1, current_y+distance+1, 1):
running_distance += 1
cursor.execute(sql, (cord_name, current_x, i, running_distance))
current_y += distance
# down
if re.match('^D{1}', value):
distance = int(re.split('[A-Z]{1}([0-9]{1,}$)', value)[1])
for i in range(current_y-1, current_y-distance-1, -1):
running_distance += 1
cursor.execute(sql, (cord_name, current_x, i, running_distance))
current_y -= distance
db.commit()
cursor.close()
# Use the function to insert the wire paths
fill_wire_cords(input_1, 'wire_one')
fill_wire_cords(input_2, 'wire_two')
cursor = db.cursor()
result = cursor.execute("""
SELECT
abs(wire_one.x_cord) + abs(wire_one.y_cord) as mh_dist,
wire_one.running_distance + wire_two.running_distance as total_dist
FROM
wire_one
JOIN
wire_two
ON
wire_two.x_cord = wire_one.x_cord
AND wire_two.y_cord = wire_one.y_cord
ORDER BY
total_dist,
mh_dist
""")
for mh_dist, total_dist in result:
print('total_dist: {}\tmh_distance: {}'.format(total_dist, mh_dist))