I write a lot of Telegram bots using the python-telegram-bot library. Writing Telegram bots is fun, but you also need somewhere to host them.
I personally like the new Google Cloud Run. It is perfect for this because it has a generous free quota that should be more than enough to host your bots, and it is super simple to deploy and get running.
To create Telegram bots, first, you need to talk to BotFather and get a TOKEN.
Second, you need to write some code. As I mentioned before, you can use python-telegram-bot to build your bots. Here is the documentation.
Here is the base code that you will need to run on Cloud Run.
main.py
import os
import http

from flask import Flask, request
from werkzeug.wrappers import Response

from telegram import Bot, Update
from telegram.ext import Dispatcher, Filters, MessageHandler, CallbackContext

app = Flask(__name__)


def echo(update: Update, context: CallbackContext) -> None:
    update.message.reply_text(update.message.text)


bot = Bot(token=os.environ["TOKEN"])

dispatcher = Dispatcher(bot=bot, update_queue=None, workers=0)
dispatcher.add_handler(MessageHandler(Filters.text & ~Filters.command, echo))


@app.route("/", methods=["POST"])
def index() -> Response:
    dispatcher.process_update(
        Update.de_json(request.get_json(force=True), bot))
    return "", http.HTTPStatus.NO_CONTENT
requirements.txt
flask==1.1.2
gunicorn==20.0.4
python-telegram-bot==13.1
Dockerfile
FROM python:3.8-slim
ENV PYTHONUNBUFFERED True
WORKDIR /app
COPY *.txt .
RUN pip install --no-cache-dir --upgrade pip -r requirements.txt
COPY . ./
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
Finally, you need to deploy. You can do it in a single step, but first let's (optionally) run the command below to set the default region:
gcloud config set run/region us-central1
Then deploy to Cloud Run:
gcloud beta run deploy your-bot-name \
--source . \
--set-env-vars TOKEN=your-telegram-bot-token \
--platform managed \
--allow-unauthenticated \
--project your-project-name
After this, you will receive the public URL of your service, and you will need to set the Telegram bot webhook using cURL:
curl "https://api.telegram.org/botYOUR-BOT:TOKEN/setWebhook?url=https://your-bot-name-uuid-uc.a.run.app"
Replace YOUR-BOT:TOKEN with your bot's token and the URL with the public URL of your Cloud Run service.
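If you prefer Python over cURL, a minimal sketch using requests does the same thing (the service URL below is a placeholder for the one Cloud Run gives you):
import os

import requests

token = os.environ["TOKEN"]  # the token from BotFather
service_url = "https://your-bot-name-uuid-uc.a.run.app"  # the URL Cloud Run gave you
resp = requests.get(
    f"https://api.telegram.org/bot{token}/setWebhook",
    params={"url": service_url},
)
resp.raise_for_status()
print(resp.json())  # should contain "ok": true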
This should be enough.
by Rodrigo Delduca (rodrigodelduca@gmail.com) on January 8, 2021 at 00:00
Google Cloud Storage is cheaper than Google One because you pay only for what you use. Also, if you delete a photo from Google Photos, you still keep a copy of it in the bucket.
Create a Compute Engine (a VM).
If you choose Ubuntu, first of all, remove snap
sudo apt autoremove --purge snapd
sudo rm -rf /var/cache/snapd/
rm -rf ~/snap
Install gcsfuse (or follow the official instructions):
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse
On the Google Cloud console, create a bucket of the Nearline storage class; in my case the bucket is named tank1. Then, back on your VM, create a directory with the same name as the bucket.
mkdir name-of-your-bucket
Now install gphotos-sync.
sudo apt install -y python3-pip
pip3 install gphotos-sync
I created a small Python script to deal with multiple Google accounts. I’ll explain later how it works.
cat <<EOF > /home/ubuntu/synchronize.py
#!/usr/bin/env python3
import os
import sys
import subprocess
from pathlib import Path

import requests

home = Path(os.path.expanduser("~")) / "tank1/photos"

args = [
    "--ntfs",
    "--retry-download",
    "--skip-albums",
    "--photos-path", ".",
    "--log-level", "DEBUG",
]

env = os.environ.copy()
env["LC_ALL"] = "en_US.UTF-8"

for p in home.glob("*/*"):
    subprocess.run(["/home/ubuntu/.local/bin/gphotos-sync", *args, str(p.relative_to(home))], check=True, cwd=home, env=env, stdout=sys.stdout, stderr=subprocess.STDOUT)

# I use healthchecks.io to alert me if the script stops working
url = "https://hc-ping.com/uuid4"
response = requests.get(url, timeout=60)
response.raise_for_status()
EOF
Give execute permission.
chmod u+x synchronize.py
Now let's create some systemd units.
sudo su
Let's create a service for gcsfuse, responsible for mounting the bucket locally using FUSE.
cat <<EOF >/etc/systemd/system/gcsfuse.service
# Script stolen from https://gist.github.com/craigafinch/292f98618f8eadc33e9633e6e3b54c05
[Unit]
Description=Google Cloud Storage FUSE mounter
After=local-fs.target network-online.target google.service sys-fs-fuse-connections.mount
Before=shutdown.target
[Service]
Type=forking
User=ubuntu
ExecStart=/bin/gcsfuse tank1 /home/ubuntu/tank1
ExecStop=/bin/fusermount -u /home/ubuntu/tank1
Restart=always
[Install]
WantedBy=multi-user.target
EOF
Enable and start the service:
systemctl enable gcsfuse.service
systemctl start gcsfuse.service
cat <<EOF >/etc/systemd/system/gphotos-sync.service
[Unit]
Description=Run gphotos-sync for each account
[Service]
User=ubuntu
ExecStart=/home/ubuntu/synchronize.py
EOF
And enable the service.
systemctl enable gphotos-sync.service
Now let's create a timer that runs gphotos-sync.service one minute after boot, with gcsfuse.service as a dependency.
cat <<EOF >/etc/systemd/system/gphotos-sync.timer
[Unit]
Description=Run gphotos sync service weekly
Requires=gcsfuse.service
[Timer]
OnBootSec=1min
Unit=gphotos-sync.service
[Install]
WantedBy=timers.target
EOF
systemctl enable gphotos-sync.timer
systemctl start gphotos-sync.timer
exit
(back to ubuntu user)
Now follow https://docs.google.com/document/d/1ck1679H8ifmZ_4eVbDeD_-jezIcZ-j6MlaNaeQiz7y0/edit to get a client_secret.json to use with gphotos-sync.
mkdir -p /home/ubuntu/.config/gphotos-sync/
# Copy the contents of the JSON to the file below
vim /home/ubuntu/.config/gphotos-sync/client_secret.json
Due to an issue with gcsfuse, I was unable to create the backup dir directly on the bucket. The workaround is to create a temp directory and start gphotos-sync manually first.
mkdir -p ~/temp/username/0
cd ~/temp
gphotos-sync --ntfs --skip-albums --photos-path . username/0
# gphotos-sync will ask for a token, paste it and CTRL-C to stop the download of photos.
cp -r ~/temp/username/ ~/tank1/photos/username
Verify if it is working.
./synchronize.py
After executing the command above, the script should start the backup. You can wait until it finishes or continue to the steps below.
The content below is based on (and is a simplified version of) Scheduling compute instances with Cloud Scheduler, by Google.
Go back to your VM and add the label runtime with the value weekly; this is needed by the functions below to know which instances should be started or shut down.
Create a new directory (in my case I will call it functions) and add two files:
index.js
const Compute = require('@google-cloud/compute');
const compute = new Compute();

exports.startInstancePubSub = async (event, context, callback) => {
  try {
    const payload = JSON.parse(Buffer.from(event.data, 'base64').toString());
    const options = {filter: `labels.${payload.label}`};
    const [vms] = await compute.getVMs(options);
    await Promise.all(
      vms.map(async instance => {
        if (payload.zone === instance.zone.id) {
          const [operation] = await compute
            .zone(payload.zone)
            .vm(instance.name)
            .start();
          return operation.promise();
        }
      })
    );
    const message = 'Successfully started instance(s)';
    console.log(message);
    callback(null, message);
  } catch (err) {
    console.log(err);
    callback(err);
  }
};

exports.stopInstancePubSub = async (event, context, callback) => {
  try {
    const payload = JSON.parse(Buffer.from(event.data, 'base64').toString());
    const options = {filter: `labels.${payload.label}`};
    const [vms] = await compute.getVMs(options);
    await Promise.all(
      vms.map(async instance => {
        if (payload.zone === instance.zone.id) {
          const [operation] = await compute
            .zone(payload.zone)
            .vm(instance.name)
            .stop();
          return operation.promise();
        } else {
          return Promise.resolve();
        }
      })
    );
    const message = 'Successfully stopped instance(s)';
    console.log(message);
    callback(null, message);
  } catch (err) {
    console.log(err);
    callback(err);
  }
};
And
package.json
{
  "main": "index.js",
  "private": true,
  "dependencies": {
    "@google-cloud/compute": "^2.4.1"
  }
}
Create a PubSub topic to start the instance.
gcloud pubsub topics create start-instance-event
Now deploy the startInstancePubSub function:
gcloud functions deploy startInstancePubSub \
--trigger-topic start-instance-event \
--runtime nodejs12 \
--allow-unauthenticated
And another PubSub topic to stop the instance.
gcloud pubsub topics create stop-instance-event
And the stopInstancePubSub function:
gcloud functions deploy stopInstancePubSub \
--trigger-topic stop-instance-event \
--runtime nodejs12 \
--allow-unauthenticated
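Before wiring up Cloud Scheduler, you can publish a test message to the start topic yourself and watch the function logs. A minimal sketch with the google-cloud-pubsub client (the project name is a placeholder):
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-project-name", "start-instance-event")
payload = {"zone": "us-central1-a", "label": "runtime=weekly"}
# Pub/Sub delivers the message to the function base64-encoded in event.data.
future = publisher.publish(topic_path, json.dumps(payload).encode("utf-8"))
print(future.result())  # message ID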
And finally, let's create two Cloud Scheduler jobs to publish to the topics on Sunday and Monday at midnight.
gcloud beta scheduler jobs create pubsub startup-weekly-instances \
--schedule '0 0 * * SUN' \
--topic start-instance-event \
--message-body '{"zone":"us-central1-a", "label":"runtime=weekly"}' \
--time-zone 'America/Sao_Paulo'
gcloud beta scheduler jobs create pubsub shutdown-weekly-instances \
--schedule '0 0 * * MON' \
--topic stop-instance-event \
--message-body '{"zone":"us-central1-a", "label":"runtime=weekly"}' \
--time-zone 'America/Sao_Paulo'
After this setup, your VM will start every Sunday, back up all photos from all your accounts, and shut down on Monday.
by Rodrigo Delduca (rodrigodelduca@gmail.com) on December 31, 2020 at 00:00
I was using Scrapy to crawl some websites and mirror their content into a new one, generating beautiful and unique URLs based on the title at the same time. But titles can repeat! So I added part of the original URL, encoded in base36, as a uniqueness guarantee.
In the URL I wanted the title without special symbols (ASCII only) and, at the end, a short unique identifier: part of the SHA-256 of the URL, encoded in base36.
import functools
import hashlib
import re
import sys
from unicodedata import normalize

import base36
from scrapy.exceptions import DropItem


class PreparePipeline():
    def process_item(self, item, spider):
        title = item.get("title")
        if title is None:
            raise DropItem(f"No title was found on item: {item}.")

        url = item["url"]
        N = 4
        sha256 = hashlib.sha256(url.encode()).digest()
        sliced = int.from_bytes(
            memoryview(sha256)[:N].tobytes(), byteorder=sys.byteorder)
        uid = base36.dumps(sliced)

        strip = str.strip
        lower = str.lower
        split = str.split
        deunicode = lambda n: normalize("NFD", n).encode("ascii", "ignore").decode("utf-8")
        trashout = lambda n: re.sub(r"[.,-@/\\|*]", " ", n)

        functions = [strip, deunicode, trashout, lower, split]

        fragments = [
            *functools.reduce(
                lambda x, f: f(x), functions, title),
            uid,
        ]

        item["uid"] = "-".join(fragments)
        return item
For example, the URL https://en.wikipedia.org/wiki/Déjà_vu with the title Déjà vu - Wikipedia results in deja-vu-wikipedia-1q9i86k, which is perfect for my use case.
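If you want to check just the uid part in isolation, here is a small standalone sketch (assuming the base36 package used above):
import hashlib
import sys

import base36

url = "https://en.wikipedia.org/wiki/Déjà_vu"
N = 4
sha256 = hashlib.sha256(url.encode()).digest()
sliced = int.from_bytes(memoryview(sha256)[:N].tobytes(), byteorder=sys.byteorder)
print(base36.dumps(sliced))  # the post reports 1q9i86k for this URL (byte order is platform dependent)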
by Rodrigo Delduca (rodrigodelduca@gmail.com) on December 14, 2020 at 00:00
I am starting a new series of small snippets of code that I think may be useful or inspiring to others.
Let's suppose you have a pandas DataFrame with a column named url pointing to files you want to download.
The code below takes advantage of concurrent I/O by using a ThreadPoolExecutor together with requests.
import multiprocessing
import concurrent.futures

import pandas as pd
from requests import Session
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

session = Session()
retry = Retry(connect=8, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)


def download(url):
    filename = "/".join(["subdir", url.split("/")[-1]])
    with session.get(url, stream=True) as r:
        if not r.ok:
            return
        with open(filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)


def run(df, processes=multiprocessing.cpu_count() * 2):
    with concurrent.futures.ThreadPoolExecutor(processes) as pool:
        list(pool.map(download, df["url"]))


if __name__ == '__main__':
    df = pd.read_csv("download.csv")
    run(df)
by Rodrigo Delduca (rodrigodelduca@gmail.com) on December 13, 2020 at 00:00
At some point during your development process with Django, you may need to create and restore the application's database. With that in mind, I decided to write a small, basic tutorial on how to perform this operation.
In this tutorial we will use django-dbbackup, a package developed specifically for this.
First, starting from the beginning, let's create a folder for our project and isolate our development environment in it using a virtualenv:
mkdir projeto_db && cd projeto_db # create our project folder
virtualenv -p python3.8 env && source env/bin/activate # create and activate our virtualenv
After that, with our environment active, let's run the following:
pip install -U pip # this upgrades the installed version of pip
Now let's install Django and the package we will use to make our backups.
pip install Django==3.1.2 # install Django
pip install django-dbbackup # install django-dbbackup
With our dependencies installed, let's create our project and configure the package in the Django settings.
django-admin startproject django_db . # inside our projeto_db folder, this creates a Django project named django_db.
With the project created, let's create and populate our database.
python manage.py migrate # this synchronizes the database state with the current set of models and migrations.
With the database created, let's create a superuser so we can access the admin panel of our project.
python manage.py createsuperuser
Perfect. We have everything we need to run our project. To run it, just do:
python manage.py runserver
You will see something like this for your project:
Inside your project, open the settings.py file, as shown below:
django_db/
├── settings.py
Inside this file we will first add django-dbbackup to the project's apps:
INSTALLED_APPS = (
    ...
    'dbbackup',  # adding django-dbbackup
)
With the app added, we tell Django what to save in the backup and then point to the folder where that file will go. This can be added at the end of settings.py:
DBBACKUP_STORAGE = 'django.core.files.storage.FileSystemStorage' # what to save
DBBACKUP_STORAGE_OPTIONS = {'location': 'backups/'} # where to save
Notice that we told Django to save the backup in the backups folder, but this folder does not exist in our project yet. So we need to create it [outside the project folder]:
mkdir backups
Everything is ready. Now let's create our first backup:
python manage.py dbbackup
After it runs, a file is created -- in our example, with a .dump extension -- and saved in the backups folder. This file contains the entire backup of our database.
To restore the database, let's suppose we migrated our system from an old server to a new one and, for some reason, our database was corrupted and became unusable. In other words, the system/project has no database -- so delete or move your .sqlite3 database for this example to be useful -- but we have the backups. Let's restore the database:
python manage.py dbrestore
Done, we have restored our database. Among other nice things, django-dbbackup generates backups with specific dates and times, which makes it easier to recover the most recent data.
That's it for today, folks. See you next time. ;)
(OK, the joke with "seqtembro" works better in the English version, seqtember, but let's go with it.)
By a big coincidence, the work of destiny, or none of that, September 2020 will be full of free, high-quality virtual events about Qt and KDE.
Starting from the 4th to the 11th of the month we have Akademy 2020, the big worldwide gathering of the KDE community, which this year, for reasons we all know, will happen virtually. The Akademy program brings talks, trainings, hacking sessions, discussions focused on specific KDE applications, and more, bringing together hackers, designers, project managers, translators, and contributors from many areas to discuss and plan KDE and its next steps.
And since we are talking about KDE, by extension we are also talking about Qt -- after all, a large part of the applications is written with that framework. So even if you work with Qt but use nothing from KDE, it is worth joining the event -- and also asking yourself "why the hell am I not using and developing KDE applications?".
An extra incentive is that during Akademy, between the 7th and the 11th, there will be Qt Desktop Days, a KDAB event focused on Qt on the desktop (surprise?). The preliminary program is already available, and it will be very interesting to see the advances of the technology in a field that may seem less sexy nowadays, given all the attention paid to mobile and embedded projects, but which, on the contrary, remains vibrant and keeps receiving a lot of investment.
After a quick pause to breathe, we have Maratona Qt. Our friend Sandro Andrade, professor at IFBA and long-time KDE contributor, decided to dedicate a whole week, from September 14 to 18, to presenting 5 topics about Qt, covering the fundamentals and first steps of each one. The program covers QML, C++ and Qt, Qt on Android, on iOS, on the web (yes!), computer graphics and even games! Highly recommended for everyone who knows, or wants to get to know, the framework.
Maratona Qt will serve as a warm-up for QtCon Brasil 2020, also virtual this year. On September 26 and 27 the qmob.solutions folks will bring together Qt developers from several countries to present, among other things, work with Wayland, computer vision and AI, data analysis, Python, containers, prototyping, embedded systems, and more, all involving Qt! There will also be a presentation about the next major version of the framework, Qt 6.
So, folks, set this month aside for a deep dive into the many aspects and possibilities offered by Qt.
Last time I described a way to search MAGs in metagenomes, and teased about interesting results. Let's dig in some of them!
I prepared a repo with the data and a notebook with the analysis I did in this
post.
You can also follow along in Binder,
as well as do your own analysis!
The supplemental materials for Tully et al include more details about each MAG, so let's download them. I prepared a small snakemake workflow to do that, and also to download information about the SRA datasets from Tara Oceans (the dataset used to generate the MAGs) and from Parks et al, which also generated MAGs from Tara Oceans. Feel free to include them in your analysis, but I was curious to find matches in other metagenomes.
The results from the MAG search are in a CSV file, with a column for the MAG name, another for the SRA dataset ID for the metagenome and a third column for the containment of the MAG in the metagenome. I also fixed the names to make it easier to query, and finally removed the Tara and Parks metagenomes (because we already knew they contained these MAGs).
This left us with 23,644 SRA metagenomes with matches, covering 2,291 of the 2,631 MAGs. These are results for a fairly low containment (10%), so if we limit to MAGs with more than 50% containment we still have 1,407 MAGs and 2,938 metagenomes left.
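The actual analysis lives in the notebook in the repo, but the filtering step boils down to something like this sketch (the column names here are assumptions about the cleaned CSV):
import pandas as pd

matches = pd.read_csv("mag_matches.csv")  # assumed columns: mag, metagenome, containment
high = matches[matches["containment"] >= 0.5]
print(high["mag"].nunique(), "MAGs found in", high["metagenome"].nunique(), "metagenomes")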
That's still a lot, so I decided to pick a candidate to check before doing any large scale analysis. I chose TOBG_NP-110 because there were many matches above 50% containment, and even some at 99%. Turns out it is also an Archaeal MAG that failed to be classified further than Phylum level (Euryarchaeota), with a 70.3% complete score in the original analysis. Oh, let me dissect the name a bit: TOBG is "Tara Ocean Binned Genome" and "NP" is North Pacific.
And so I went checking where the other metagenome matches came from. Five of the 12 matches above 50% containment come from one study, SRP044185, with samples collected from a water column at a station in Manzanillo, Mexico. Another 3 matches come from SRP003331, in the South Pacific ocean (off northern Chile). One more match, ERR3256923, also comes from the South Pacific.
I'm curious to follow the refining MAGs tutorial from the Meren Lab and see where this goes,
and especially in using spacegraphcats
to extract neighborhoods from the MAG and better evaluate what is missing or if there are other interesting bits that
the MAG generation methods ended up discarding.
So, for now that's it. But more important, I didn't want to sit on these results until there is a publication in press, especially when there are people that can do so much more with these, so I decided to make it all public. It is way more exciting to see this being used to know more about these organisms than me being the only one with access to this info.
And yesterday I saw this tweet by @DrJonathanRosa, saying:
I don’t know who told students that the goal of research is to find some previously undiscovered research topic, claim individual ownership over it, & fiercely protect it from theft, but that almost sounds like, well, colonialism, capitalism, & policing
Amen.
Next time. But we will have a discussion about scientific infrastructure and sustainability first =]
(or: Top-down and bottom-up approaches for working around sourmash limitations)
In the last month I updated wort,
the system I developed for computing sourmash signature for public genomic databases,
and started calculating signatures
for the metagenomes in the Sequence Read Archive.
This is a more challenging subset than the microbial datasets I was doing previously,
since there are around 534k datasets from metagenomic sources in the SRA,
totalling 447 TB of data.
Another problem is the size of the datasets,
ranging from a couple of MB to 170 GB.
Turns out that the workers I have in wort
are very good for small-ish datasets,
but I still need to figure out how to pull large datasets faster from the SRA,
because the large ones take forever to process...
The good news is that I managed to calculate signatures for almost 402k of them 1, which already let us work on some pretty exciting problems =]
Metagenome-assembled genomes are essential for studying organisms that are hard to isolate and culture in lab, especially for environmental metagenomes. Tully et al published 2,631 draft MAGs from 234 samples collected during the Tara Oceans expedition, and I wanted to check if they can also be found in other metagenomes besides the Tara Oceans ones. The idea is to extract the reads from these other matches and evaluate how the MAG can be improved, or at least evaluate what is missing in them. I choose to use environmental samples under the assumption they are easier to deposit on the SRA and have public access, but there are many human gut microbiomes in the SRA and this MAG search would work just fine with those too.
Moreover, I want to search for containment, and not similarity. The distinction is subtle, but similarity takes into account both datasets sizes (well, the size of the union of all elements in both datasets), while containment only considers the size of the query. This is relevant because the similarity of a MAG and a metagenome is going to be very small (and is symmetrical), but the containment of the MAG in the metagenome might be large (and is asymmetrical, since the containment of the metagenome in the MAG is likely very small because the metagenome is so much larger than the MAG).
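A toy example of the difference, using plain Python sets to stand in for the hashes of a MAG and a metagenome:
mag = {1, 2, 3, 4}               # a tiny "MAG"
metagenome = set(range(1, 101))  # a much larger "metagenome" that contains it

similarity = len(mag & metagenome) / len(mag | metagenome)  # Jaccard: 4/100 = 0.04
containment = len(mag & metagenome) / len(mag)              # 4/4 = 1.0
print(similarity, containment)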
sourmash signatures are a small fraction of the original size of the datasets,
but when you have hundreds of thousands of them the collection ends up being pretty large too.
More precisely, 825 GB large.
That is way bigger than any index I ever built for sourmash,
and it would also have pretty distinct characteristics than what we usually do:
we tend to index genomes and run search
(to find similar genomes) or gather
(to decompose metagenomes into their constituent genomes),
but for this MAG search I want to find which metagenomes have my MAG query above a certain containment threshold.
Sort of a sourmash search --containment, but over thousands of metagenome signatures.
The main benefit of an SBT index in this context is to avoid checking all signatures because we can prune the search early,
but currently SBT indices need to be totally loaded in memory during sourmash index.
I will have to do this in the medium term,
but I want a solution NOW! =]
sourmash 3.4.0 introduced --from-file in many commands,
and since I can't build an index I decided to use it to load signatures for the metagenomes.
But... sourmash search
tries to load all signatures in memory,
and while I might be able to find a cluster machine with hundreds of GBs of RAM available,
that's not very practical.
So, what to do?
I don't want to modify sourmash now,
so why not make a workflow and use snakemake to run one sourmash search --containment
for each metagenome?
That means 402k tasks,
but at least I can use batches and SLURM job arrays to submit reasonably-sized jobs to our HPC queue.
After running all batches I summarized results for each task,
and it worked well for a proof of concept.
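Ignoring the snakemake/SLURM batching details, each task in that workflow boils down to something like this sketch (paths are placeholders, and the MAG signature would vary per task):
import subprocess
from pathlib import Path

query = "mags/one_mag.sig"  # placeholder: one MAG signature per task
for metagenome in Path("sigs").glob("*.sig"):
    out = Path("outputs") / (metagenome.stem + ".csv")
    subprocess.run(
        ["sourmash", "search", "--containment", "-o", str(out), query, str(metagenome)],
        check=True,
    )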
But... it was still pretty resource intensive: each task was running one query MAG against one metagenome, and so each task needed to do all the overhead of starting the Python interpreter and parsing the query signature, which is exactly the same for all tasks. Extending it to support multiple queries to the same metagenome would involve duplicating tasks, and 402k metagenomes times 2,631 MAGs is... a very large number of jobs.
I also wanted to avoid clogging the job queues, which is not very nice to the other researchers using the cluster. This limited how many batches I could run in parallel...
Thinking a bit more about the problem, here is another solution: what if we load all the MAGs in memory (as they will be queried frequently and are not that large), and then for each metagenome signature load it, perform all MAG queries, and then unload the metagenome signature from memory? This way we can control memory consumption (it's going to be proportional to all the MAG sizes plus the size of the largest metagenome) and can also efficiently parallelize the code because each task/metagenome is independent and the MAG signatures can be shared freely (since they are read-only).
This could be done with the sourmash Python API plus multiprocessing or some other parallelization approach (maybe dask?),
but turns out that everything we need comes from the Rust API.
Why not enjoy a bit of the fearless concurrency that is one of the major Rust goals?
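For comparison, the Python API + multiprocessing route could look roughly like the sketch below (the sourmash function names are from memory, the paths are placeholders, and I'm assuming one signature per file, so treat this as an illustration rather than the code I ran); the Rust version I actually wrote is described next.
import multiprocessing
from pathlib import Path

import sourmash

# Load all MAG signatures once; they are shared (read-only) with the workers.
MAGS = [sourmash.load_one_signature(str(p)) for p in Path("mags").glob("*.sig")]

def search_metagenome(sig_path):
    metagenome = sourmash.load_one_signature(str(sig_path))
    rows = []
    for mag in MAGS:
        containment = mag.minhash.contained_by(metagenome.minhash)
        if containment >= 0.1:
            rows.append((mag.name(), sig_path.name, containment))
    return rows

if __name__ == "__main__":
    paths = list(Path("metagenomes").glob("*.sig"))
    with multiprocessing.Pool() as pool:
        for rows in pool.imap_unordered(search_metagenome, paths):
            for name, metagenome, containment in rows:
                print(f"{name},{metagenome},{containment}")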
The whole code ended up being 176 lines long, including command line parsing using structopt and parallelizing the search using rayon and a multiple-producer, single-consumer channel to write results to an output (either the terminal or a file). This version took 11 hours to run, using less than 5 GB of RAM and 32 processors, to search 2k MAGs against 402k metagenomes. And, bonus! It can also be parallelized again if you have multiple machines, so it potentially takes a bit more than an hour to run if you can allocate 10 batch jobs, with each batch processing 1/10 of the metagenome signatures.
I would like to answer "Yes!", but bioinformatics software tends to be organized as command line interfaces, not as libraries. Libraries also tend to have even less documentation than CLIs, and this particular case is not a fair comparison because... Well, I wrote most of the library, and the Rust API is not that well documented for general use.
But I'm pretty happy with how the sourmash CLI is viable both for the top-down approach (and whatever workflow software you want to use) as well as how the Rust core worked for the bottom-up approach. I think the most important is having the option to choose which way to go, especially because now I can use the bottom-up approach to make the sourmash CLI and Python API better. The top-down approach is also way more accessible in general, because you can pick your favorite workflow software and use all the tricks you're comfortable with.
Next time. But I did find MAGs with over 90% containment in very different locations, which is pretty exciting!
I also need to find a better way of distributing all these signatures, because storing 4 TB of data in S3 is somewhat cheap, but transferring data is very expensive. All signatures are also available on IPFS, but I need more people to host and share them. Get in contact if you're interested in helping =]
And while I'm asking for help, any tips on pulling data faster from the SRA are greatly appreciated!
pulling about 100 TB in 3 days, which was pretty fun to see because I ended up DDoSing myself: I couldn't download the generated sigs fast enough from the S3 bucket where they are temporarily stored =P ↩
sourmash 3.3 was released last week, and it is the first version supporting zipped databases. Here is my personal account of how that came to be =]
A sourmash database contains signatures (typically Scaled MinHash sketches built from genomic datasets) and an index for allowing efficient similarity and containment queries over these signatures. The two types of index are SBT, a hierarchical index that uses less memory by keeping data on disk, and LCA, an inverted index that uses more memory but is potentially faster. Indices are described as JSON files, with LCA storing all the data in one JSON file and SBT opting for saving a description of the index structure in JSON, and all the data into a hidden directory with many files.
We distribute some prepared databases (with SBT indices) for Genbank and RefSeq as compressed TAR files. The compressed file is ~8GB, but after decompressing it turns into almost 200k files in a hidden directory, using about 40 GB of disk space.
The initial issue in this saga is dib-lab/sourmash#490,
and the idea was to take the existing support for multiple data storages
(hidden dir,
TAR files,
IPFS and Redis) and save the index description in the storage,
allowing loading everything from the storage.
Since we already had the databases as TAR files,
the first test tried to use them but it didn't take long to see it was a doomed approach:
TAR files are terrible for random access
(or at least the tarfile
module in Python is).
Zip files showed up as a better alternative,
and it helps that Python has the zipfile
module already available in the
standard library.
Initial tests were promising,
and led to dib-lab/sourmash#648.
The main issue was performance:
compressing and decompressing was slow,
but there was also another limitation...
Another challenge was efficiently loading the data from a storage.
The two core methods in a storage are save(location, content), where content is a bytes buffer, and load(location), which returns a bytes buffer that was previously saved.
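In Python terms, the storage abstraction is essentially this (a simplified sketch, not the exact sourmash class):
from abc import ABC, abstractmethod

class Storage(ABC):
    @abstractmethod
    def save(self, location, content):
        "Store `content` (a bytes buffer) under `location`."

    @abstractmethod
    def load(self, location):
        "Return the bytes buffer previously saved under `location`."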
This didn't interact well with the khmer Nodegraphs (the Bloom Filter we use for SBTs), since khmer only loads data from files, not from memory buffers.
We ended up doing a temporary file dance,
which made things slower for the default storage (hidden dir),
where it could have been optimized to work directly with files,
and involved interacting with the filesystem for the other storages
(IPFS and Redis could be pulling data directly from the network,
for example).
This one could be fixed in khmer
by exposing C++ stream methods,
and I did a small PoC to test the idea.
While doable,
this is something that was happening while the sourmash conversion to Rust was underway,
and depending on khmer was a problem for my WebAssembly aspirations... so, having the Nodegraph implemented in Rust seemed like a better direction.
A Rust Nodegraph has actually been quietly living in the sourmash codebase for quite some time, but it was never exposed to Python (and it was also lacking more extensive tests).
After the release of sourmash 3 and the replacement of the C++ with the Rust implementation, all the pieces for exposing the Nodegraph were in place, so dib-lab/sourmash#799 was the next step. It wasn't a priority at first because other optimizations (that were released in 3.1 and 3.2) were more important, but then it was time to check how this would perform. And...
Turns out that my Nodegraph loading code was way slower than khmer's.
The Nodegraph binary format is well documented, and doing an initial implementation wasn't so hard: using the byteorder crate to read binary data with the right endianness, and then setting the appropriate bits in the internal fixedbitset in memory.
But the khmer code doesn't parse bit by bit: it reads a long char buffer directly, and that is many orders of magnitude faster than setting bit by bit. And there was no way to replicate this behavior directly with fixedbitset.
At this point I could either do bit-indexing into a large buffer and lose all the useful methods that fixedbitset provides, or try to find a way to support loading the data directly into fixedbitset and open a PR.
I chose the PR (and even got #42! =]).
It was more straightforward than I expected, but it did expose the internal representation of fixedbitset, so I was a bit nervous it wasn't going to be merged. But bluss was super nice, and his suggestions made the PR way better!
This simplified the final Nodegraph code, and it was actually more correct (because I was messing up a few corner cases when doing the bit-by-bit parsing before). Win-win!
Being able to save and load Nodegraphs in Rust allowed using memory buffers, but also opened the way to support other operations not supported by khmer Nodegraphs. One example is loading/saving compressed files, which is supported for Countgraph (another khmer data structure, based on the Count-Min Sketch) but not for Nodegraph.
If only there was an easy way to support working with compressed files...
Oh wait, there is! niffler is a crate that I made with Pierre Marijon based
on some functionality I saw in one of his projects,
and we iterated a bit on the API and documented everything to make it more
useful for a larger audience.
niffler
tries to be as transparent as possible,
with very little boilerplate when using it but with useful features nonetheless
(like auto detection of the compression format).
If you want more about the motivation and how it happened,
check this Twitter thread.
The cool thing is that adding compressed files support in sourmash
was mostly
one-line changes for loading
(and a bit more for saving,
but mostly because converting compression levels could use some refactoring).
With all these other pieces in place,
it's time to go back to dib-lab/sourmash#648.
Compressing and decompressing with the Python zipfile
module is slow,
but Zip files can also be used just for storage,
handing back the data without extracting it.
And since we have compression/decompression implemented in Rust with niffler
,
that's what the zipped sourmash databases are:
data is loaded and saved into the Zip file without using the Python module
compression/decompression,
and all the work is done before (or after) in the Rust side.
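In Python terms, the trick looks something like this sketch (file names are placeholders, and the real code does the gzip part in Rust via niffler): compress the bytes yourself, and ask zipfile to just store them.
import gzip
import zipfile

data = b"some signature data" * 1000

# Compress on our side (sourmash does this in Rust via niffler, gzip level 1)...
compressed = gzip.compress(data, compresslevel=1)

# ...and tell zipfile to store the bytes as-is, without compressing them again.
with zipfile.ZipFile("db.sbt.zip", "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("internal.node", compressed)

# Reading hands the stored bytes back; decompression is again on our side.
with zipfile.ZipFile("db.sbt.zip") as zf:
    assert gzip.decompress(zf.read("internal.node")) == data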
This allows keeping the Zip file with similar sizes to the original TAR files we started with, but with very low overhead for decompression. For compression we opted for using Gzip level 1, which doesn't compress perfectly but also doesn't take much longer to run:
Level | Size | Time |
---|---|---|
0 | 407 MB | 16s |
1 | 252 MB | 21s |
5 | 250 MB | 39s |
9 | 246 MB | 1m48s |
In this table, 0 is without compression, while 9 is the best compression. The size difference from 1 to 9 is only 6 MB (~2%), but level 1 runs 5x faster than level 9, and it's only 30% slower than saving the uncompressed data.
The last challenge was updating an existing Zip file.
It's easy to support appending new data,
but if any of the already existing data in the file changes
(which happens when internal nodes change in the SBT,
after a new dataset is inserted) then there is no easy way to replace the data in the Zip file.
Worse,
the Python zipfile
will add the new data while keeping the old one around,
leading to ginormous files over time. 1
So, what to do?
I ended up opting for dealing with the complexity and complicating the ZipStorage implementation a bit, by keeping a buffer for new data. If it's a new file or it already exists but there are no insertions the buffer is ignored and all works as before.
If the file exists and new data is inserted, then it is first stored in the buffer (where it might also replace a previous entry with the same name). In this case we also need to check the buffer when trying to load some data (because it might exist only in the buffer, and not in the original file).
Finally,
when the ZipStorage
is closed it needs to verify if there are new items in the buffer.
If not,
it is safe just to close the original file.
If there are new items but they were not present in the original file,
then we can append the new data to the original file.
The final case is if there are new items that were also in the original file,
and in this case a new Zip file is created and all the content from buffer and
original file are copied to it,
prioritizing items from the buffer.
The original file is replaced by the new Zip file.
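A condensed sketch of that buffering logic, assuming the Zip file already exists (the real ZipStorage in sourmash handles more cases than this):
import os
import zipfile

class BufferedZipStorage:
    def __init__(self, path):
        self.path = path
        self.buffer = {}  # new or replaced entries live here until close()

    def save(self, location, content):
        self.buffer[location] = content

    def load(self, location):
        if location in self.buffer:  # newest data wins
            return self.buffer[location]
        with zipfile.ZipFile(self.path) as zf:
            return zf.read(location)

    def close(self):
        if not self.buffer:
            return  # nothing new: just close the original file
        with zipfile.ZipFile(self.path) as zf:
            existing = set(zf.namelist())
        if existing.isdisjoint(self.buffer):
            # only brand new entries: append them to the original file
            with zipfile.ZipFile(self.path, "a") as zf:
                for name, content in self.buffer.items():
                    zf.writestr(name, content)
        else:
            # some entries replace existing ones: rewrite everything into a new
            # Zip file, preferring the buffer, then replace the original file
            tmp = self.path + ".new"
            with zipfile.ZipFile(self.path) as old, zipfile.ZipFile(tmp, "w") as new:
                for name in old.namelist():
                    if name not in self.buffer:
                        new.writestr(name, old.read(name))
                for name, content in self.buffer.items():
                    new.writestr(name, content)
            os.replace(tmp, self.path)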
Turns out this worked quite well! And so the PR was merged =]
Zipped databases open the possibility of distributing extra data that might be useful for some kinds of analysis. One thing we are already considering is adding taxonomy information, let's see what else shows up.
Having Nodegraph
in Rust is also pretty exciting,
because now we can change the internal representation for something that uses
less memory (maybe using RRR encoding?),
but more importantly:
now they can also be used with Webassembly,
which opens many possibilities for running not only signature computation but
also search
and gather
in the browser,
since now we have all the pieces to build it.
The zipfile module does throw a UserWarning pointing out that duplicated files were inserted, which is useful during development but generally doesn't show up during regular usage... ↩
Last November, Latin American KDE contributors landed in Salvador, Brazil to take part in another edition of LaKademy -- the Latin American Akademy. It was the seventh edition of the event (or the eighth, if you count Akademy-BR as the first LaKademy) and the second with Salvador as the host city. No problem for me: in fact, I would love to move and live at least a few years in Salvador, a city I like a lot.
LaKademy 2019 group photo
My main tasks during the event were on two projects: Cantor and Sprat, a "drafting editor for academic papers". Besides those, I also helped with promo tasks such as the LaKademy website.
On Cantor I focused on organizational work. For example, I asked the sysadmins to migrate the repository to KDE's Gitlab and created a specific website for Cantor at cantor.kde.org using the new Jekyll template for KDE projects.
The new website is a nice addition to Cantor because we want to communicate better and more directly with our user community. The site has its own blog and a changelog section, to make it easier for the community to follow news and the main changes in the software.
The migration to Gitlab allows us to use Gitlab CI as an alternative for continuous integration in Cantor. I mentored the work of Rafael Gomes (not merged yet) to make this available for the project.
Besides the Cantor work, I did some activities related to Sprat, a drafting editor for scientific papers in English. This software uses katepart to implement the methodology for writing scientific papers in English known as PROMETHEUS, as described in this book, in an attempt to help students and researchers in general with the task of writing scientific papers. During LaKademy I finished the Qt5 port and, hopefully, I will release the project this year.
On the more social side, I took part in the famous promo meeting, which discusses KDE's future actions in Latin America. Our main decision was to organize and attend more small events spread across several cities, marking KDE's presence at consolidated events such as FLISoL and Software Freedom Day, and more -- but now, in COVID-19 times, that is no longer viable. Another decision was to move the KDE Brasil organization from Phabricator to Gitlab.
KDE contributors working hard
Beyond the technical part, this LaKademy was an opportunity to meet old and new friends, drink some beers, taste the wonderful Bahian cuisine, and have fun between one commit and another.
I would like to thank KDE e.V. for supporting LaKademy, and Caio and Icaro for organizing this edition of the event. I can't wait for the next LaKademy, and I hope it happens as soon as possible!
In September 2019 the Italian city of Milan hosted the main worldwide meeting of KDE contributors -- Akademy, where members from different areas such as translators, developers, artists, promo people and more gather for a few days to think about and build the future of KDE's projects and community(ies).
Before getting to Akademy I took a flight from Brazil to Portugal to attend EPIA, a conference on artificial intelligence that took place in the small city of Vila Real, in the Porto region. After that academic activity, I flew from Porto to Milan and started my participation in Akademy 2019.
Unfortunately I landed at the end of the first morning of the event, which made me miss interesting presentations about Qt 6 and the new KDE Goals. In the afternoon I could attend some talks on topics that also interest me, such as Plasma for mobile devices, MyCroft for the automotive industry, the KDE e.V. report, and a showcase by Google Summer of Code and Season of KDE students -- really nice to see amazing projects developed by newcomers.
On the second day, the talks about KPublicTransportation, LibreOffice for Plasma, Get Hot New Stuff -- which I imagine I will use in a future project -- and Caio's presentation about kpmcore caught my attention.
After the event party (comments about it only in person), the following days were filled with BoFs, for me the most interesting part of Akademy.
The Gitlab workshop was interesting because we could discuss specific topics about KDE's migration to this tool. I am loving this move and I hope all KDE projects make the migration as soon as possible. Cantor has been there for a while now.
In the KDE websites BoF, I could understand a bit better the new Jekyll theme used by our sites. In addition, I hope that soon we can apply internationalization to these pages, making them translatable into any language. After attending and gathering information at this BoF, I created a new website for Cantor during LaKademy 2019.
The KDE Craft BoF was interesting to see how to build and distribute our software in the Windows store (yeah, look at what I am writing...). I hope to work on this during the year in order to make a Cantor package available in that store ((yeah)²).
I also took part in the QML and Kirigami workshop run by the Maui project folks. Kirigami is something I have been keeping an eye on for future projects.
Finally, I attended the "All About the Apps Kick Off" BoF. Personally, I think this is KDE's future: an international community that produces free, high-quality, secure software for different platforms, from desktop to mobile. In fact, this is how KDE is currently organized and working, but we don't communicate it very well to the public. Maybe, with the changes in our release approach, together with websites for specific projects and distribution in different app stores, we can change the way the public sees our community.
The Akademy 2019 day trip was to Lake Como, in the town of Varenna. A beautiful trip; I spent the whole time imagining that it could be a nice place for a honeymoon :D. I hope to return there in the near future and spend a few days traveling between towns like it.
Me in Varenna
I would like to thank the whole local team, Riccardo and his friends, for organizing this amazing edition of Akademy. Milan is a very beautiful city, with delicious food (carbonara!), historic places to visit and learn more about the Italians, and the sophisticated capital of high fashion.
Finally, my thanks to KDE e.V. for sponsoring my participation in Akademy.
Videos of the Akademy 2019 talks and BoFs are available at this link.
Hey folks, how is everybody doing?
In the video below I show how we can set up CI for a Django application using Github Actions.
sourmash 3 was released last week, finally landing the Rust backend. But, what changes when developing new features in sourmash? I was thinking about how to best document this process, and since PR #826 is a short example touching all the layers I decided to do a small walkthrough.
Shall we?
The first step is describing the problem, and trying to convince reviewers (and yourself) that the changes bring enough benefits to justify a merge. This is the description I put in the PR:
Calling .add_hash() on a MinHash sketch is fine, but if you're calling it all the time it's better to pass a list of hashes and call .add_many() instead. Before this PR add_many just called add_hash for each hash it was passed, but now it will pass the full list to Rust (and that's way faster).
No changes for public APIs, and I changed the _signatures method in LCA to accumulate hashes for each sig first, and then set them all at once. This is way faster, but might use more intermediate memory (I'll evaluate this now).
There are many details that sound like jargon for someone not familiar with the codebase, but if I write something too long I'll probably be wasting the reviewers time too. The benefit of a very detailed description is extending the knowledge for other people (not necessarily the maintainers), but that also takes effort that might be better allocated to solve other problems. Or, more realistically, putting out other fires =P
Nonetheless, some points I like to add in PR descriptions:
- why is there a problem with the current approach?
- is this the minimal viable change, or is it trying to change too many things at once? The former is way better, in general.
- what are the trade-offs? This PR is using more memory to lower the runtime, but I hadn't measured it yet when I opened it.
- Not changing public APIs is always good to convince reviewers. If the project follows a semantic versioning scheme, changes to the public APIs are major version bumps, and that can bring other consequences for users.
If this was a bug fix PR,
the first thing I would do is write a new test triggering the bug,
and then proceed to fix it in the code
(Hmm, maybe that would be another good walkthrough?).
But this PR is making performance claims ("it's going to be faster"),
and that's a bit hard to codify in tests. 1
Since it's also proposing to change a method (_signatures
in LCA indices) that is better to benchmark with a real index (and not a toy example),
I used the same data and command I run in sourmash_resources to check how memory consumption and runtime changed.
For reference, this is the command:
sourmash search -o out.csv --scaled 2000 -k 51 HSMA33OT.fastq.gz.sig genbank-k51.lca.json.gz
I'm using the benchmark
feature from snakemake in sourmash_resources to
track how much memory, runtime and I/O is used for each command (and version) of sourmash,
and generate the plots in the README in that repo.
That is fine for a high-level view ("what's the maximum memory used?"),
but not so useful for digging into details ("what method is consuming most memory?").
Another additional problem is the dual-language 2 nature of sourmash, where we have Python calling into Rust code (via CFFI). There are great tools for measuring and profiling Python code, but they tend to not work with extension code...
So, let's bring two of my favorite tools to help!
heaptrack is a heap profiler, and I first heard about it from Vincent Prouillet.
Its main advantage over other solutions (like valgrind's massif) is the low
overhead and... how easy it is to use:
just stick heaptrack
in front of your command,
and you're good to go!
Example output:
$ heaptrack sourmash search -o out.csv --scaled 2000 -k 51 HSMA33OT.fastq.gz.sig genbank-k51.lca.json.gz
heaptrack stats:
allocations: 1379353
leaked allocations: 1660
temporary allocations: 168984
Heaptrack finished! Now run the following to investigate the data:
heaptrack --analyze heaptrack.sourmash.66565.gz
heaptrack --analyze is a very nice graphical interface for analyzing the results,
but for this PR I'm mostly focusing on the Summary page (and overall memory consumption).
Tracking allocations in Python doesn't give many details,
because it shows the CPython functions being called,
but the ability to track into the extension code (Rust) allocations is amazing
for finding bottlenecks (and memory leaks =P). 3
Just as other solutions exist for profiling memory,
there are many for profiling CPU usage in Python,
including profile and cProfile in the standard library.
Again, the issue is being able to analyze extension code,
and bringing the cannon (the perf
command in Linux, for example) loses the
benefit of tracking Python code properly (because we get back the CPython
functions, not what you defined in your Python code).
Enter py-spy by Ben Frederickson, based on the rbspy project by Julia Evans. Both use a great idea: read the process maps for the interpreters and resolve the full stack trace information, with low overhead (because it uses sampling). py-spy also goes further and resolves native Python extension stack traces, meaning we can get the complete picture all the way from the Python CLI to the Rust core library! 4
py-spy is also easy to use: stick py-spy record --output search.svg -n -- in front of the command, and it will generate a flamegraph in search.svg. The full command for this PR is
py-spy record --output search.svg -n -- sourmash search -o out.csv --scaled 2000 -k 51 HSMA.fastq.sig genbank-k51.lca.json.gz
OK, OK, sheesh. But it's worth repeating: the code is important, but there are many other aspects that are just as important =]
Replacing add_hash calls with one add_many
Let's start at the _signatures()
method on LCA indices.
This is the original method:
@cached_property
def _signatures(self):
    "Create a _signatures member dictionary that contains {idx: minhash}."
    from .. import MinHash

    minhash = MinHash(n=0, ksize=self.ksize, scaled=self.scaled)

    debug('creating signatures for LCA DB...')
    sigd = defaultdict(minhash.copy_and_clear)

    for (k, v) in self.hashval_to_idx.items():
        for vv in v:
            sigd[vv].add_hash(k)

    debug('=> {} signatures!', len(sigd))
    return sigd
sigd[vv].add_hash(k) is the culprit. Each call to .add_hash has to go thru CFFI to reach the extension code, and the overhead is significant. It is a similar situation to accessing array elements in NumPy: it works, but it is way slower than using operations that avoid crossing from Python to the extension code.
What we want to do instead is call .add_many(hashes), which takes a list of hashes and processes it entirely in Rust (ideally. We will get there).
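To make the NumPy analogy concrete, a tiny illustration (not code from the PR):
import numpy as np

arr = np.arange(1_000_000)

total = 0
for x in arr:       # crosses the Python/extension boundary a million times
    total += x

total = arr.sum()   # one call; the loop happens inside the extension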
But, to have a list of hashes, there is another issue with this code.
for (k, v) in self.hashval_to_idx.items():
    for vv in v:
        sigd[vv].add_hash(k)
There are two nested for loops, and add_hash is being called with values from the inner loop. So... we don't have the list of hashes beforehand.
But we can change the code a bit to save the hashes for each signature
in a temporary list,
and then call add_many
on the temporary list.
Like this:
temp_vals = defaultdict(list)

for (k, v) in self.hashval_to_idx.items():
    for vv in v:
        temp_vals[vv].append(k)

for sig, vals in temp_vals.items():
    sigd[sig].add_many(vals)
There is a trade-off here: if we save the hashes in temporary lists, will the memory consumption be so high that it cancels out the runtime gains of calling add_many on those lists?
Time to measure it =]
version | mem | time |
---|---|---|
original | 1.5 GB | 160s |
list | 1.7 GB | 173s |
Wait, it got worse?!?! Building temporary lists only takes time and memory, and bring no benefits!
This mystery goes away when you look at the add_many method:
def add_many(self, hashes):
    "Add many hashes in at once."
    if isinstance(hashes, MinHash):
        self._methodcall(lib.kmerminhash_add_from, hashes._objptr)
    else:
        for hash in hashes:
            self._methodcall(lib.kmerminhash_add_hash, hash)
The first check in the if statement is a shortcut for adding hashes from another MinHash, so let's focus on the else part...
And it turns out that add_many is lying! It doesn't process the hashes in the Rust extension, but just loops and calls add_hash for each hash in the list. That's not going to be any faster than what we were doing in _signatures.
Time to fix add_many!
add_many
The idea is to change this loop in add_many:
for hash in hashes:
    self._methodcall(lib.kmerminhash_add_hash, hash)
with a call to a Rust extension function:
self._methodcall(lib.kmerminhash_add_many, list(hashes), len(hashes))
self._methodcall is a convenience method defined in RustObject which translates a method-like call into a function call, since our C layer only has functions.
This is the C prototype for this function:
void kmerminhash_add_many(
    KmerMinHash *ptr,
    const uint64_t *hashes_ptr,
    uintptr_t insize
);
You can almost read it as a Python method declaration, where KmerMinHash *ptr means the same as the self in Python methods. The other two arguments are a common idiom when passing pointers to data in C, with insize being how many elements we have in the list. 5
CFFI is very good at converting Python lists into pointers of a specific type, as long as the type is a primitive type (uint64_t in our case, since each hash is a 64-bit unsigned integer number).
And the Rust code with the implementation of the function:
ffi_fn! {
unsafe fn kmerminhash_add_many(
    ptr: *mut KmerMinHash,
    hashes_ptr: *const u64,
    insize: usize,
) -> Result<()> {
    let mh = {
        assert!(!ptr.is_null());
        &mut *ptr
    };

    let hashes = {
        assert!(!hashes_ptr.is_null());
        slice::from_raw_parts(hashes_ptr as *mut u64, insize)
    };

    for hash in hashes {
        mh.add_hash(*hash);
    }

    Ok(())
}
}
Let's break what's happening here into smaller pieces. Starting with the function signature:
ffi_fn! {
unsafe fn kmerminhash_add_many(
    ptr: *mut KmerMinHash,
    hashes_ptr: *const u64,
    insize: usize,
) -> Result<()>
The weird ffi_fn! {} syntax around the function is a macro in Rust: it changes the final generated code to convert the return value (Result<()>) into something that is valid C code (in this case, void).
What happens if there is an error, then? The Rust extension has code for passing back an error code and message to Python, as well as capturing panics (when things go horribly bad and the program can't recover) in a way that Python can then deal with (raising exceptions and cleaning up).
It also sets the #[no_mangle]
attribute in the function,
meaning that the final name of the function will follow C semantics (instead of Rust semantics),
and can be called more easily from C and other languages.
This ffi_fn!
macro comes from symbolic,
a big influence on the design of the Python/Rust bridge in sourmash.
unsafe is the keyword in Rust to disable some checks in the code to allow potentially dangerous things (like dereferencing a pointer), and it is required to interact with C code. unsafe doesn't mean that the code is always unsafe to use: it's up to whoever is calling this to verify that valid data is being passed and invariants are being preserved.
If we remove the ffi_fn! macro and the unsafe keyword, we have
fn kmerminhash_add_many(
    ptr: *mut KmerMinHash,
    hashes_ptr: *const u64,
    insize: usize
);
At this point we can pretty much map between Rust and the C function prototype:
void kmerminhash_add_many(
    KmerMinHash *ptr,
    const uint64_t *hashes_ptr,
    uintptr_t insize
);
Some interesting points:
- fn is used to declare a function in Rust. The return type goes after the arrow (here -> (), equivalent to a void return type in C).
- Mutability needs to be declared for the pointer to the KmerMinHash item (*mut KmerMinHash). In C everything is mutable by default.
- u64 in Rust -> uint64_t in C
- usize in Rust -> uintptr_t in C
Let's check the implementation of the function now.
We start by converting the ptr argument (a raw pointer to a KmerMinHash struct) into a regular Rust struct:
let mh = {
assert!(!ptr.is_null());
&mut *ptr
};
This block is asserting that ptr
is not a null pointer,
and if so it dereferences it and store in a mutable reference.
If it was a null pointer the assert!
would panic (which might sound extreme,
but is way better than continue running because dereferencing a null pointer is
BAD).
Note that functions always need all the types in arguments and return values,
but for variables in the body of the function
Rust can figure out types most of the time,
so no need to specify them.
The next block prepares our list of hashes for use:
let hashes = {
assert!(!hashes_ptr.is_null());
slice::from_raw_parts(hashes_ptr as *mut u64, insize)
};
We are again asserting that the hashes_ptr
is not a null pointer,
but instead of dereferencing the pointer like before we use it to create a slice
,
a dynamically-sized view into a contiguous sequence.
The list we got from Python is a contiguous sequence of size insize
,
and the slice::from_raw_parts
function creates a slice from a pointer to data and a size.
Oh, and can you spot the bug?
I created the slice using *mut u64
,
but the data is declared as *const u64
.
Because we are in an unsafe
block Rust let me change the mutability,
but I shouldn't be doing that,
since we don't need to mutate the slice.
Oops.
Finally, let's add hashes to our MinHash!
We need a for
loop, and call add_hash
for each hash
:
for hash in hashes {
mh.add_hash(*hash);
}
Ok(())
We finish the function with Ok(())
to indicate no errors occurred.
Why is calling add_hash
here faster than what we were doing before in Python?
Rust can optimize these calls and generate very efficient native code,
while Python is an interpreted language and most of the time don't have the same
guarantees that Rust can leverage to generate the code.
And, again,
calling add_hash
here doesn't need to cross FFI boundaries or,
in fact,
do any dynamic evaluation during runtime,
because it is all statically analyzed during compilation.
And... that's the PR code. There are some other unrelated changes that should have been in new PRs, but since they were so small it would be more work than necessary. OK, that's a lame excuse: it's confusing for reviewers to see these changes here, so avoid doing that if possible!
But, did it work?
version | mem | time |
---|---|---|
original | 1.5 GB | 160s |
list |
1.7GB | 73s |
We are using 200 MB of extra memory, but taking less than half the time it was taking before. I think this is a good trade-off, and so did the reviewer and the PR was approved.
Hopefully this was useful, 'til next time!
list
or set
?The first version of the PR used a set
instead of a list
to accumulate hashes.
Since a set
doesn't have repeated elements,
this could potentially use less memory.
The code:
temp_vals = defaultdict(set)
for (k, v) in self.hashval_to_idx.items():
for vv in v:
temp_vals[vv].add(k)
for sig, vals in temp_vals.items():
sigd[sig].add_many(vals)
The runtime was again half of the original, but...
version | mem | time |
---|---|---|
original | 1.5 GB | 160s |
set |
3.8GB | 80s |
list |
1.7GB | 73s |
... memory consumption was almost 2.5 times the original! WAT
The culprit this time? The new kmerminhash_add_many
call in the add_many
method.
This one:
self._methodcall(lib.kmerminhash_add_many, list(hashes), len(hashes))
CFFI
doesn't know how to convert a set
into something that C understands,
so we need to call list(hashes)
to convert it into a list.
Since Python (and CFFI
) can't know if the data is going to be used later
6
it needs to keep it around
(and be eventually deallocated by the garbage collector).
And that's how we get at least double the memory being allocated...
There is another lesson here.
If we look at the for
loop again:
for (k, v) in self.hashval_to_idx.items():
for vv in v:
temp_vals[vv].add(k)
each k
is already unique because they are keys in the hashval_to_idx
dictionary,
so the initial assumption
(that a set
might save memory because it doesn't have repeated elements)
is... irrelevant for the problem =]
We do have https://asv.readthedocs.io/ set up for micro-benchmarks,
and now that I think about it...
I could have started by writing a benchmark for add_many
,
and then showing that it is faster.
I will add this approach to the sourmash PR checklist =] ↩
or triple, if you count C ↩
It would be super cool to have the unwinding code from py-spy in heaptrack, and be able to see exactly what Python methods/lines of code were calling the Rust parts... ↩
Even if py-spy doesn't talk explicitly about Rust, it works very very well, woohoo! ↩
Let's not talk about lack of array bounds checks in C... ↩
something that the memory ownership model in Rust does, BTW ↩
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
/opt/nginx
, and use it with the nginx system user. So, let’s download and extract the latest stable version (1.0.9) from nginx website: Once you have extracted it, just configure, compile and install:
% curl -O http://nginx.org/download/nginx-1.0.9.tar.gz
% tar -xzf nginx-1.0.9.tar.gz
As you can see, we provided the
% ./configure --prefix=/opt/nginx --user=nginx --group=nginx
% make
% [sudo] make install
/opt/nginx
to configure, make sure the /opt
directory exists. Also, make sure that there is a user and a group called nginx, if they don’t exist, add them: % [sudo] adduser --system --no-create-home --disabled-login --disabled-password --group nginxAfter that, you can start nginx using the command line below:
% [sudo] /opt/nginx/sbin/nginx
Linode provides an init script that uses start-stop-daemon, you might want to use it.
nginx.conf
file, let’s change it to reflect the following configuration requirements: nginx
user/opt/nginx/log/nginx.pid
file/opt/nginx/logs/access.log
nginx.conf
file (assume that the library project is in the directory /opt/projects
).nginx.conf
for the requirements above: Now we just need to write the configuration for our Django project. I’m using an old sample project written while I was working at Giran: the name is lojas giranianas, a nonsense portuguese joke with a famous brazilian store. It’s an unfinished showcase of products, it’s like an e-commerce project, but it can’t sell, so it’s just a product catalog. The code is available at Github. The
user nginx;
worker_processes 2;
pid logs/nginx.pid;
events {
worker_connections 1024;
}
http {
include mime.types;
default_type application/octet-stream;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log logs/access.log main;
sendfile on;
keepalive_timeout 65;
include /opt/projects/showcase/nginx.conf;
}
nginx.conf
file for the repository is here: The server listens on port
server {
listen 80;
server_name localhost;
charset utf-8;
location / {
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Host $http_host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_pass http://localhost:8000;
}
location /static {
root /opt/projects/showcase/;
expires 1d;
}
}
80
, responds for the localhost
hostname (read more about the Host header). The location /static
directive says that nginx will serve the static files of the project. It also includes an expires
directive for caching control. The location /
directive makes a proxy_pass
, forwarding all requisitions to an upstream server listening on port 8000, this server is the subject of the next post of the series: the Green Unicorn (gunicorn) server. Host
header is forwarded so gunicorn can treat different requests for different hosts. Without this header, it will be impossible to Gunicorn to have these constraintsproxy_cache
directive and integrating Django, nginx and memcached). por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
As you can see it’s very simple. If you’re familiar with RST syntax, you can guess what landslide does: it converts the entire content to HTML and then split it by
Python
======
--------------
If
==
* Please don't use ()
* Never forget the ``:`` at the end of the line
Check this code:
.. sourcecode:: python
x, y = 1, 2
if x > y:
print 'x is greater'
--------------
For
===
* ``for`` iterates over a sequence
* Never forget the ``:`` at the end of the line
Check this code:
.. sourcecode:: python
numbers = [1, 2, 3, 4, 5,]
for number in numbers:
print number
--------------
While
=====
* ``while`` is like ``if``, but executes while the codition is ``True``
* please don't use ()
* never forget the ``:`` at the end of the line
Check this code:
.. sourcecode:: python
from random import randint
args = (1, 10,)
x = randint(*args)
while x != 6:
x = randint(*args)
--------------
Thank you!
==========
<hr />
tag. Each slide will contain two sections: a header and a body. The header contains only an <h1></h1>
element and the body contains everything. % landslide python.rstTo use
landslide
command, you need to install it. I suggest you do this via pip: % [sudo] pip install landslidelandslide supports theming, so you can customize it by creating your own theme. Your theme should contain two CSS files: screen.css (for the HTML version of slides) and print.css (for the PDF version of the slides). You might also customize the HTML (base.html) and JS files (slides.js), but you have to customize the CSS files in your theme. You specify the theme using the
--theme
directive. You might want to check all options available in the command line utility using --help
: % landslide --helpIt’s quite easy to extend landslide changing its theme or adding new macros. Check the official repository at Github. This example, and a markdown version for the same example are available in a repository in my github profile.
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
You can create the files jquery.zebrafy.js and ZebrafySpec.js, but remember: it is BDD, we need to describe the behavior first, then write the code. So let’s start writing the specs in ZebrafySpec.js file using Jasmine. If you are familiar with RSpec syntax, it’s easy to understand how to write spec withs Jasmine, if you aren’t, here is the clue: Jasmine is a lib with some functions used for writing tests in an easier way. I’m going to explain each function “on demmand”, when we need something, we learn how to use it! ;)
.
├── SpecRunner.html
├── lib
│ ├── jasmine-1.0.2
│ │ ├── MIT.LICENSE
│ │ ├── jasmine-html.js
│ │ ├── jasmine.css
│ │ └── jasmine.js
│ └── jquery-1.6.1.min.js
├── spec
│ └── ZebrafySpec.js
└── src
└── jquery.zebrafy.js
describe
function for that, this function receives a string and another function (a callback). The string describes the test suite and the function is a callback that delimites the scope of the test suite. Here is the Zebrafy
suite: Let’s start describing the behavior we want to get from the plugin. The most basic is: we want different CSS classes for odd an even lines in a table. Jasmine provides the
describe('Zebrafy', function () {
});
it
function for writing the tests. It also receives a string and a callback: the string is a description for the test and the callback is the function executed as test. Here is the very first test: Okay, here we go: in the first line of the callback, we are using jQuery to select a table using the
it('should apply classes zebrafy-odd and zebrafy-even to each other table lines', function () {
var table = $("#zebra-table");
table.zebrafy();
expect(table).toBeZebrafyied();
});
#zebra-table
selector, which will look up for a table with the ID attribute equals to “zebra-table”, but we don’t have this table in the DOM. What about add a new table to the DOM in a hook executed before the test run and remove the table in another hook that runs after the test? Jasmine provide two functions: beforeEach
and afterEach
. Both functions receive a callback function to be executed and, as the names suggest, the beforeEach
callback is called before each test run, and the afterEach
callback is called after the test run. Here are the hooks: The
beforeEach(function () {
$('<table id="zebra-table"></table>').appendTo('body');
for (var i=0; i < 10; i++) {
$('<tr></tr>').append('<td></td>').append('<td></td>').append('<td></td>').appendTo('#zebra-table');
};
});
afterEach(function () {
$("#zebra-table").remove();
});
beforeEach
callback uses jQuery to create a table with 10 rows and 3 columns and add it to the DOM. In afterEach
callback, we just remove that table using jQuery again. Okay, now the table exists, let’s go back to the test: In the second line, we call our plugin, that is not ready yet, so let’s forward to the next line, where we used the
it('should apply classes zebrafy-odd and zebrafy-even to each other table lines', function () {
var table = $("#zebra-table");
table.zebrafy();
expect(table).toBeZebrafyied();
});
expect
function. Jasmine provides this function, that receives an object and executes a matcher against it, there is a lot of built-in matchers on Jasmine, but toBeZebrafyied
is not a built-in matcher. Here is where we know another Jasmine feature: the capability to write custom matchers, but how to do this? You can call the beforeEach
again, and use the addMatcher
method of Jasmine object: The method
beforeEach(function () {
this.addMatchers({
toBeZebrafyied: function() {
var isZebrafyied = true;
this.actual.find("tr:even").each(function (index, tr) {
isZebrafyied = $(tr).hasClass('zebrafy-odd') === false && $(tr).hasClass('zebrafy-even');
if (!isZebrafyied) {
return;
};
});
this.actual.find("tr:odd").each(function (index, tr) {
isZebrafyied = $(tr).hasClass('zebrafy-odd') && $(tr).hasClass('zebrafy-even') === false;
if (!isZebrafyied) {
return;
};
});
return isZebrafyied;
}
});
});
addMatchers
receives an object where each property is a matcher. Your matcher can receive arguments if you want. The object being matched can be accessed using this.actual
, so here is what the method above does: it takes all odd <tr>
elements of the table (this.actual
) and check if them have the CSS class zebrafy-odd
and don’t have the CSS class zebrafy-even
, then do the same checking with even <tr>
lines. I’m not going to explain how to implement a jQuery plugin neither what are those brackets on function, this post aims to show how to use Jasmine to test jQuery plugins.
(function ($) {
$.fn.zebrafy = function () {
this.find("tr:even").addClass("zebrafy-even");
this.find("tr:odd").addClass("zebrafy-odd");
};
})(jQuery);
As you can see, we used the built-in matcher
it('zebrafy should be chainable', function() {
var table = $("#zebra-table");
table.zebrafy().addClass('black-bg');
expect(table.hasClass('black-bg')).toBeTruthy();
});
toBeTruthy
, which asserts that an object or expression is true
. All we need to do is return the jQuery object in the plugin and the test will pass: So, the plugin is tested and ready to release! :) You can check the entire code and test with more spec in a Github repository.
(function ($) {
$.fn.zebrafy = function () {
return this.each(function (index, table) {
$(table).find("tr:even").addClass("zebrafy-even");
$(table).find("tr:odd").addClass("zebrafy-odd");
});
};
})(jQuery);
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
browser = Browser("firefox")
Browser
is a class and its constructor receives the driver to be used with that instance. Nowadays, there are three drivers for Splinter: firefox, chrome and zope.testbrowser. We are using Firefox, and you can easily use Chrome by simply changing the driver from firefox
to chrome
. It’s also very simple to add another driver to Splinter, and I plan to cover how to do that in another blog post here. browser
object, and this is the object used for Firefox interactions. Let's start a new event on Facebook, the Splinter Sprint. First of all, we need to visit the Facebook homepage. There is a visit
method on Browser class, so we can use it: browser.visit("https://www.facebook.com")
visit
is a blocking operation: it waits for page to load, then we can navigate, click on links, fill forms, etc. Now we have Facebook homepage opened on browser, and you probably know that we need to login on Facebook page, but what if we are already logged in? So, let's create a method that login on Facebook with provided authentication data only the user is not logged in (imagine we are on a TestCase class): def do_login_if_need(self, username, password):What was made here? First of all, the method checks if there is an element present on the page, using a CSS selector. It checks for a
if self.browser.is_element_present_by_css('div.menu_login_container'):
self.browser.fill('email', username)
self.browser.fill('pass', password)
self.browser.find_by_css('div.menu_login_container input[type="submit"]').first.click()
assert self.browser.is_element_present_by_css('li#navAccount')
div
that contains the username and password fields. If that div is present, we tell the browser object to fill those fields, then find the submit
button and click on it. The last line is an assert to guarantee that the login was successful and the current page is the Facebook homepage (by checking the presence of “Account” li
). find
the link and click
on it: browser.find_by_css('li#navItem_events a').first.click()The
find_by_css
method takes a CSS selector and returns an ElementList. So, we get the first element of the list (even when the selector returns only an element, the return type is still a list) and click on it. Like visit
method, click
is a blocking operation: the driver will only listen for new actions when the request is finished (the page is loaded). browser.fill('event_startIntlDisplay', '5/21/2011')That is it: the event is going to happen on May 21th 2011, at 8:00 in the morning (480 minutes). As we know, the event name is Splinter sprint, and we are going to join some guys down here in Brazil. We filled out the form using
browser.select('start_time_min', '480')
browser.fill('name', 'Splinter sprint')
browser.fill('location', 'Rio de Janeiro, Brazil')
browser.fill('desc', 'For more info, check out the #cobratem channel on freenode!')
fill
and select
methods. fill
method is used to fill a "fillable" field (a textarea, an input, etc.). It receives two strings: the first is the name of the field to fill and the second is the value that will fill the field. select
is used to select an option in a select element (a “combo box”). It also receives two string parameters: the first is the name of the select element, and the second is the value of the option being selected. <select name="gender">To select “Male”, you would call the select method this way:
<option value="m">Male</option>
<option value="f">Female</option>
</select>
browser.select("gender", "m")The last action before click on “Create Event” button is upload a picture for the event. On new event page, Facebook loads the file field for picture uploading inside an
iframe
, so we need to switch to this frame and interact with the form present inside the frame. To show the frame, we need to click on “Add Event Photo” button and then switch to it, we already know how click on a link: browser.find_by_css('div.eventEditUpload a.uiButton').first.click()When we click this link, Facebook makes an asynchronous request, which means the driver does not stay blocked waiting the end of the request, so if we try to interact with the frame BEFORE it appears, we will get an
ElementDoesNotExist
exception. Splinter provides the is_element_present
method that receives an argument called wait_time
, which is the time Splinter will wait for the element to appear on the screen. If the element does not appear on screen, we can’t go on, so we can assume the test failed (remember we are testing a Facebook feature): if not browser.is_element_present_by_css('iframe#upload_pic_frame', wait_time=10):The
fail("The upload pic iframe did'n't appear :(")
is_element_present_by_css
method takes a CSS selector and tries to find an element using it. It also receives a wait_time
parameter that indicates a time out for the search of the element. So, if the iframe
element with ID=”upload_pic_frame” is not present or doesn’t appear in the screen after 10 seconds, the method returns False
, otherwise it returns True
. Important:Now we see thefail
is a pseudocode sample and doesn’t exist (if you’re usingunittest
library, you can invokeself.fail
in a TestCase, exactly what I did in complete snippet for this example, available at Github).
iframe
element on screen and we can finally upload the picture. Imagine we have a variable that contains the path of the picture (and not a file object, StringIO
, or something like this), and this variable name is picture_path
, this is the code we need: with browser.get_iframe('upload_pic_frame') as frame:Splinter provides the
frame.attach_file('pic', picture_path)
time.sleep(10)
get_iframe
method that changes the context and returns another objet to interact with the content of the frame. So we call the attach_file
method, who also receives two strings: the first is the name of the input element and the second is the absolute path to the file being sent. Facebook also uploads the picture asynchronously, but there’s no way to wait some element to appear on screen, so I just put Python to sleep 10 seconds on last line. browser.find_by_css('label.uiButton input[type="submit"]').first.click()After create an event, Facebook redirects the browser to the event page, so we can check if it really happened by asserting the header of the page. That’s what the code above does: in the new event page, it click on submit button, and after the redirect, get the text of a span element and asserts that this text equals to “Splinter sprint”.
title = browser.find_by_css('h1 span').first.text
assert title == 'Splinter sprint'
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
memc_module
source from Github and then built nginx with it. Here is the commands for compiling nginx with memcached module: % ./configure --prefix=/opt/nginx --user=nginx --group=nginx --with-http_ssl_module --add-module={your memc_module source path}After install nginx and create an init script for it, we can work on its settings for integration with Tomcat. Just for working with separate settings, we changed the nginx.conf file (located in /opt/nginx/conf directory), and it now looks like this:
% make
% sudo make install
user nginx;See the last line inside
worker_processes 1;
error_log logs/error.log;
events {
worker_connections 1024;
}
http {
include mime.types;
default_type application/octet-stream;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log logs/access.log main;
sendfile on;
#tcp_nopush on;
#keepalive_timeout 0;
keepalive_timeout 65;
#gzip on;
include /opt/nginx/sites-enabled/*;
}
http
section: this line tells nginx to include all settings present in the /opt/nginx/sites-enabled
directory. So, now, let’s create a default file in this directory, with this content: server {Some stuffs must be explained here: the
listen 80;
server_name localhost;
default_type text/html;
location / {
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Host $http_host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
if ($request_method = POST) {
proxy_pass http://localhost:8080;
break;
}
set $memcached_key "$uri";
memcached_pass 127.0.0.1:11211;
error_page 501 404 502 = /fallback$uri;
}
location /fallback/ {
internal;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Host $http_host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_redirect off;
proxy_pass http://localhost:8080;
}
}
default_type
directive is necessary for proper serving of cached responses (if you are cache other content types like application/json
or application/xml
, you should take a look at nginx documentation and deal conditionally with content types). The location /
scope defines some settings for proxy, like IP and host. We just did it because we need to pass the right information to our backend (Tomcat or memcached). See more about proxy_set_header
at nginx documentation. After that, there is a simple verification oF the request method. We don’t want to cache POST requests. $memcached_key
and then we use the memcached_pass
directive, the $memcached_key
is the URI. memcached_pass
is very similar to proxy_pass
, nginx “proxies” the request to memcached
, so we can get some HTTP status code, like 200, 404 or 502. We define error handlers for two status codes: fallback
, an internal location that builds a proxy between nginx and Tomcat (listening on port 8080). Everything is set up with nginx. As you can see in the picture or in the nginx configuration file, nginx doesn’t write anything to memcached, it only reads from memcached. The application should write to memcached. Let’s do it. First, the dependency: for memcached communication, we used spymemcached client. It is a simple and easy to use memcached library. I won’t explain all the code, line by line, but I can tell the idea behind the code: first, call
package com.franciscosouza.memcached.filter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.net.InetSocketAddress;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletOutputStream;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpServletResponseWrapper;
import net.spy.memcached.MemcachedClient;
/**
* Servlet Filter implementation class MemcachedFilter
*/
public class MemcachedFilter implements Filter {
private MemcachedClient mmc;
static class MemcachedHttpServletResponseWrapper extends HttpServletResponseWrapper {
private StringWriter sw = new StringWriter();
public MemcachedHttpServletResponseWrapper(HttpServletResponse response) {
super(response);
}
public PrintWriter getWriter() throws IOException {
return new PrintWriter(sw);
}
public ServletOutputStream getOutputStream() throws IOException {
throw new UnsupportedOperationException();
}
public String toString() {
return sw.toString();
}
}
/**
* Default constructor.
*/
public MemcachedFilter() {
}
/**
* @see Filter#destroy()
*/
public void destroy() {
}
/**
* @see Filter#doFilter(ServletRequest, ServletResponse, FilterChain)
*/
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
MemcachedHttpServletResponseWrapper wrapper = new MemcachedHttpServletResponseWrapper((HttpServletResponse) response);
chain.doFilter(request, wrapper);
HttpServletRequest inRequest = (HttpServletRequest) request;
HttpServletResponse inResponse = (HttpServletResponse) response;
String content = wrapper.toString();
PrintWriter out = inResponse.getWriter();
out.print(content);
if (!inRequest.getMethod().equals("POST")) {
String key = inRequest.getRequestURI();
mmc.set(key, 5, content);
}
}
/**
* @see Filter#init(FilterConfig)
*/
public void init(FilterConfig fConfig) throws ServletException {
try {
mmc = new MemcachedClient(new InetSocketAddress("localhost", 11211));
} catch (IOException e) {
e.printStackTrace();
throw new ServletException(e);
}
}
}
doFilter
method on FilterChain
, because we want to get the response and work with that. Take a look at the MemcachedHttpServletResponseWrapper
instance, it encapsulates the response and makes easier to play with response content. MemcachedClient
provided by spymemcached. The request URI is the key and timeout is 5 seconds. That is it! Now you can just run Tomcat on port
<?xml version="1.0" encoding="UTF-8"?>
<web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://java.sun.com/xml/ns/javaee" xmlns:web="http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" id="WebApp_ID" version="2.5">
<display-name>memcached sample</display-name>
<filter>
<filter-name>vraptor</filter-name>
<filter-class>br.com.caelum.vraptor.VRaptor</filter-class>
</filter>
<filter>
<filter-name>memcached</filter-name>
<filter-class>com.franciscosouza.memcached.filter.MemcachedFilter</filter-class>
</filter>
<filter-mapping>
<filter-name>memcached</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>
<filter-mapping>
<filter-name>vraptor</filter-name>
<url-pattern>/*</url-pattern>
<dispatcher>FORWARD</dispatcher>
<dispatcher>REQUEST</dispatcher>
</filter-mapping>
</web-app>
8080
and nginx on port 80
, and access http://localhost
on your browser. Try some it: raise up the cache timeout, navigate on application and turn off Tomcat. You will still be able to navigate on some pages that use GET request method (users list, home and users form). por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
% wget http://www.tipfy.org/tipfy.build.tar.gzAfter it, we go to the project folder and see the project structure provided by tipfy. There is a directory called "app", where the App Engine app is located. The app.yaml file is in the app directory, so we open that file and change the application id and the application version. Here is the app.yaml file:
% tar -xvzf tipfy.0.6.2.build.tar.gz
% mv project gaeseries
application: gaeseriesAfter this, we can start to code our application. tipfy deals with requests using handlers. A handler is a class that has methods to deal with different kinds of requests. That remember me a little the Strut Actions (blergh), but tipfy is a Python framework, what means that it is easier to build web application using it!
version: 4
runtime: python
api_version: 1
derived_file_type:
- python_precompiled
handlers:
- url: /(robots\.txt|favicon\.ico)
static_files: static/\1
upload: static/(.*)
- url: /remote_api
script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
login: admin
- url: /_ah/queue/deferred
script: main.py
login: admin
- url: /.*
script: main.py
% mkdir blogAfter create the application structure, we install it by putting the application inside the "apps_installed" list on config.py file:
% touch blog/__init__.py
# -*- coding: utf-8 -*-See the line 22. Inside the application folder, let’s create a Python module called models.py. This module is exactly the same of Flask post:
"""
config
~~~~~~
Configuration settings.
:copyright: 2009 by tipfy.org.
:license: BSD, see LICENSE for more details.
"""
config = {}
# Configurations for the 'tipfy' module.
config['tipfy'] = {
# Enable debugger. It will be loaded only in development.
'middleware': [
'tipfy.ext.debugger.DebuggerMiddleware',
],
# Enable the Hello, World! app example.
'apps_installed': [
'apps.hello_world',
'apps.blog',
],
}
from google.appengine.ext import dbAfter create the model, let’s start building the project by creating the post listing handler. The handlers will be in a module called handlers.py, inside the application folder. Here is the handlers.py code:
class Post(db.Model):
title = db.StringProperty(required = True)
content = db.TextProperty(required = True)
when = db.DateTimeProperty(auto_now_add = True)
author = db.UserProperty(required = True)
# -*- coding: utf-8 -*-See that we get a list containing all posts from the database and send it to the list_posts.html template. Like Flask, tipfy uses Jinja2 as template engine by default. Following the same way, let’s create a base.html file who represents the layout of the project. This file should be inside the templates folder and contains the following code:
from tipfy import RequestHandler
from tipfy.ext.jinja2 import render_response
from models import Post
class PostListingHandler(RequestHandler):
def get(self):
posts = Post.all()
return render_response('list_posts.html', posts=posts)
<html>And now we can create the list_posts.html template extending the base.html template:
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8"/>
<title>{% block title %}{% endblock %}</title>
</head>
<body id="">
{% block content %}{% endblock %}
</body>
</html>
{% extends "base.html" %}Can we access the list of posts now by the URL? No, we can’t yet. Now we have to map the handler to a URL, and we will be able to access the list of posts through the browser. On tipfy, all URL mappings of an application are located in a Python module called urls.py. Create it with the following code:
{% block title %}
Posts list
{% endblock %}
{% block content %}
Listing all posts:
<ul>
{% for post in posts %}
<li>
{{ post.title }} (written by {{ post.author.nickname() }})
{{ post.content }}
</li>
{% endfor %}
</ul>
{% endblock %}
from tipfy import RuleIt is very simple: a Python module containing a function called get_rules, that receives the app object as parameter and return a list containing the rules of the application (each rule is an instance of tipfy.Rule class). Now we can finally see the empty post list on the browser, by running the App Engine development server and touching the http://localhost:8080/posts URL on the browser. Run the following command on the project root:
def get_rules(app):
rules = [
Rule('/posts', endpoint='post-listing', handler='apps.blog.handlers.PostListingHandler'),
]
return rules
% /usr/local/google_appengine/dev_appserver.py appAnd check the browser at http://localhost:8080/posts. And we see the empty list. Now, let’s create the protected handler which will create a new post. tipfy has an auth extension, who makes very easy to deal with authentication using the native Google App Engine users API. To use that, we need to configure the session extension, changing the conf.py module, by adding the following code lines:
config['tipfy.ext.session'] = {Now we are ready to create the NewPostHandler. We will need to deal with forms, and tipfy has an extension for integration with WTForms, so we have to download and install WTForms and that extension in the project:
'secret_key' : 'just_dev_testH978DAGV9B9sha_W92S',
}
% wget http://bitbucket.org/simplecodes/wtforms/get/tip.tar.bz2Now we have WTForms extension installed and ready to be used. Let’s create the PostForm class, and then create the handler. I put both classes in the handlers.py file (yeah, including the form). Here is the PostForm class code:
% tar -xvf tip.tar.bz2
% cp -r wtforms/wtforms/ ~/Projetos/gaeseries/app/lib/
% wget http://pypi.python.org/packages/source/t/tipfy.ext.wtforms/tipfy.ext.wtforms-0.6.tar.gz
% tar -xvzf tipfy.ext.wtforms-0.6.tar.gz
% cp -r tipfy.ext.wtforms-0.6/tipfy ~/Projetos/gaeseries/app/distlib
class PostForm(Form):Add this class to the handlers.py module:
csrf_protection = True
title = fields.TextField('Title', validators=[validators.Required()])
content = fields.TextAreaField('Content', validators=[validators.Required()])
class NewPostHandler(RequestHandler, AppEngineAuthMixin, AllSessionMixins):A lot of news here: first, tipfy explores the multi-inheritance Python feature and if you will use the auth extension by the native App Engine users API, you have to create you handler class extending AppEngineAuthMixin and AllSessionMixins classes, and add to the middleware list the SessionMiddleware class. See more at the tipfy docs.
middleware = [SessionMiddleware]
@login_required
def get(self, **kwargs):
return render_response('new_post.html', form=self.form)
@login_required
def post(self, **kwargs):
if self.form.validate():
post = Post(
title = self.form.title.data,
content = self.form.content.data,
author = self.auth_session
)
post.put()
return redirect('/posts')
return self.get(**kwargs)
@cached_property
def form(self):
return PostForm(self.request)
{% extends "base.html" %}Now, we can deploy the application on Google App Engine by simply running this command:
{% block title %}
New post
{% endblock %}
{% block content %}
<form action="" method="post" accept-charset="utf-8">
<p>
<label for="title">{{ form.title.label }}</label>
{{ form.title|safe }}
{% if form.title.errors %}
<ul class="errors">
{% for error in form.title.errors %}
<li>{{ error }}</li>
{% endfor %}
</ul>
{% endif %}
</p>
<p>
<label for="content">{{ form.content.label }}</label>
{{ form.content|safe }}
{% if form.content.errors %}
<ul class="errors">
{% for error in form.content.errors %}
<li>{{ error }}</li>
{% endfor %}
</ul>
{% endif %}
</p>
<p><input type="submit" value="Save post"/></p>
</form>
{% endblock %}
% /usr/local/google_appengine/appcfg.py update appAnd you can check the deployed application live here: http://4.latest.gaeseries.appspot.com.
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
application: gaeseriesWe just set the application ID, the version and the URL handlers. We will handle all request in main.py file. Late on this post, I will show the main.py module, the script that handles Flask with Google App Engine. Now, let’s create the Flask application, and deal with App Engine later :)
version: 3
runtime: python
api_version: 1
handlers:
- url: .*
script: main.py
% wget http://github.com/mitsuhiko/flask/zipball/0.6On my computer, the project is under ~/Projetos/blog/gaeseries, put all downloaded tools on the root of your application. Now we have everything that we need to start to create our Flask application, so let’s create a Python package called blog, it will be the application directory:
% unzip mitsuhiko-flask-0.6-0-g5cadd9d.zip
% cp -r mitsuhiko-flask-5cadd9d/flask ~/Projetos/blog/gaeseries
% wget http://pypi.python.org/packages/source/W/Werkzeug/Werkzeug-0.6.2.tar.gz
% tar -xvzf Werkzeug-0.6.2.tar.gz
% cp -r Werkzeug-0.6.2/werkzeug ~/Projetos/blog/gaeseries/
% wget http://pypi.python.org/packages/source/J/Jinja2/Jinja2-2.5.tar.gz
% tar -xvzf Jinja2-2.5.tar.gz
% cp -r Jinja2-2.5/jinja2 ~/Projetos/blog/gaeseries/
% wget http://pypi.python.org/packages/source/s/simplejson/simplejson-2.1.1.tar.gz
% tar -xvzf simplejson-2.1.1.tar.gz
% cp -r simplejson-2.1.1/simplejson ~/Projetos/blog/gaeseries/
% mkdir blogInside the __init__.py module, we will create our Flask application and start to code. Here is the __init__.py code:
% touch blog/__init__.py
from flask import FlaskWe imported two modules: settings and views. So we should create the two modules, where we will put the application settings and the views of applications (look that Flask deals in the same way that Django, calling “views” functions that receives a request and returns a response, instead of call it “actions” (like web2py). Just create the files:
import settings
app = Flask('blog')
app.config.from_object('blog.settings')
import views
% touch blog/views.pyHere is the settings.py sample code:
% touch blog/settings.py
DEBUG=TrueNow is the time to define the model Post. We will define our models inside the application directory, in a module called models.py:
SECRET_KEY='dev_key_h8hfne89vm'
CSRF_ENABLED=True
CSRF_SESSION_LKEY='dev_key_h8asSNJ9s9=+'
from google.appengine.ext import dbThe last property is a UserProperty, a “foreign key” to a user. We will use the Google App Engine users API, so the datastore API provides this property to establish a relationship between custom models and the Google account model.
class Post(db.Model):
title = db.StringProperty(required = True)
content = db.TextProperty(required = True)
when = db.DateTimeProperty(auto_now_add = True)
author = db.UserProperty(required = True)
from blog import appOn the last line of the view, we called the function render_template, which renders a template. The first parameter of this function is the template to be rendered, we passed the list_posts.html, so let’s create it using the Jinja2 syntax, inspired by Django templates. Inside the application directory, create a subdirectory called templates and put inside it a HTML file called base.html. That file will be the application layout and here is its code:
from models import Post
from flask import render_template
@app.route('/posts')
def list_posts():
posts = Post.all()
return render_template('list_posts.html', posts=posts)
<html>And now create the list_posts.html template, with the following code:
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8"/>
<title>{% block title %}Blog{% endblock %}</title>
</head>
<body>
{% block content %}{% endblock %}
</body>
</html>
{% extends "base.html" %}Now, to test it, we need to run Google App Engine development server on localhost. The app.yaml file defined a main.py script as handler for all requests, so to use Google App Engine local development server, we need to create the main.py file that run our application. Every Flask application is a WSGI application, so we can use an App Engine tool for running WSGI application. In that way, the main.py script is really simple:
{% block content %}
<ul>
{% for post in posts %}
<li>
{{ post.title }} (written by {{ post.author.nickname() }})
{{ post.content }}
</li>
{% endfor %}
</ul>
{% endblock %}
from google.appengine.ext.webapp.util import run_wsgi_appThe script uses the run_wsgi_app function provided by webapp, the built-in Google Python web framework for App Engine. Now, we can run the application in the same way that we ran in the web2py post:
from blog import app
run_wsgi_app(app)
% /usr/local/google_appengine/dev_appserver.py .And if you access the URL http://localhost:8080/posts in your browser, you will see a blank page, just because there is no posts on the database. Now we will create a login protected view to write and save a post on the database. Google App Engine does not provide a decorator for validate when a user is logged, and Flask doesn’t provide it too. So, let’s create a function decorator called login_required and decorate the view new_post with that decorator. I created the decorator inside a decorators.py module and import it inside the views.py module. Here is the decorators.py code:
from functools import wrapsIn the new_post view we will deal with forms. IMO, WTForms is the best way to deal with forms in Flask. There is a Flask extension called Flask-WTF, and we can install it in our application for easy dealing with forms. Here is how can we install WTForms and Flask-WTF:
from google.appengine.api import users
from flask import redirect, request
def login_required(func):
@wraps(func)
def decorated_view(*args, **kwargs):
if not users.get_current_user():
return redirect(users.create_login_url(request.url))
return func(*args, **kwargs)
return decorated_view
% wget http://pypi.python.org/packages/source/W/WTForms/WTForms-0.6.zipNow we have installed WTForms and Flask-WTF, and we can create a new WTForm with two fields: title and content. Remember that the date and author will be filled automatically with the current datetime and current user. Here is the PostForm code (I put it inside the views.py file, but it is possible to put it in a separated forms.py file):
% unzip WTForms-0.6.zip
% cp -r WTForms-0.6/wtforms ~/Projetos/blog/gaeseries/
% wget http://pypi.python.org/packages/source/F/Flask-WTF/Flask-WTF-0.2.3.tar.gz
% tar -xvzf Flask-WTF-0.2.3.tar.gz
% cp -r Flask-WTF-0.2.3/flaskext ~/Projetos/blog/gaeseries/
from flaskext import wtfNow we can create the new_post view:
from flaskext.wtf import validators
class PostForm(wtf.Form):
title = wtf.TextField('Title', validators=[validators.Required()])
content = wtf.TextAreaField('Content', validators=[validators.Required()])
@app.route('/posts/new', methods = ['GET', 'POST'])Now, everything we need is to build the new_post.html template, here is the code for this template:
@login_required
def new_post():
form = PostForm()
if form.validate_on_submit():
post = Post(title = form.title.data,
content = form.content.data,
author = users.get_current_user())
post.put()
flash('Post saved on database.')
return redirect(url_for('list_posts'))
return render_template('new_post.html', form=form)
{% extends "base.html" %}Now everything is working. We can run Google App Engine local development server and access the URL http://localhost:8080/posts/new on the browser, then write a post and save it! Everything is ready to deploy, and the deploy process is the same of web2py, just run on terminal:
{% block content %}
<h1 id="">Write a post</h1>
<form action="{{ url_for('new_post') }}" method="post" accept-charset="utf-8">
{{ form.csrf_token }}
<p>
<label for="title">{{ form.title.label }}</label>
{{ form.title|safe }}
{% if form.title.errors %}
<ul class="errors">
{% for error in form.title.errors %}
<li>{{ error }}</li>
{% endfor %}
</ul>
{% endif %}
</p>
<p>
<label for="content">{{ form.content.label }}</label>
{{ form.content|safe }}
{% if form.content.errors %}
<ul class="errors">
{% for error in form.content.errors %}
<li>{{ error }}</li>
{% endfor %}
</ul>
{% endif %}
</p>
<p><input type="submit" value="Save post"/></p>
</form>
{% endblock %}
% /usr/local/google_appengine/appcfg.py update .And now the application is online :) Check this out: http://3.latest.gaeseries.appspot.com (use your Google Account to write posts).
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
application: gaeseriesI changed only the two first lines, everything else was provided by web2py. The web2py project contains a subdirectory called applications where the web2py applications are located. There is an application called welcome used as scaffold to build new applications. So, let’s copy this directory and rename it to blog. Now we can walk in the same way that we walked in the django post: we will use two actions on a controller: one protected by login, where we will save posts, and other public action, where we will list all posts.
version: 2
api_version: 1
runtime: python
handlers:
- url: /(?P<a>.+?)/static/(?P<b>.+)
static_files: applications/\1/static/\2
upload: applications/(.+?)/static/(.+)
secure: optional
expiration: "90d"
- url: /admin-gae/.*
script: $PYTHON_LIB/google/appengine/ext/admin
login: admin
- url: /_ah/queue/default
script: gaehandler.py
login: admin
- url: .*
script: gaehandler.py
secure: optional
skip_files: |
^(.*/)?(
(app\.yaml)|
(app\.yml)|
(index\.yaml)|
(index\.yml)|
(#.*#)|
(.*~)|
(.*\.py[co])|
(.*/RCS/.*)|
(\..*)|
((admin|examples|welcome)\.tar)|
(applications/(admin|examples)/.*)|
(applications/.*?/databases/.*) |
(applications/.*?/errors/.*)|
(applications/.*?/cache/.*)|
(applications/.*?/sessions/.*)|
)$
current_user_id = (auth.user and auth.user.id) or 0This code looks a little strange, but it is very simple: we define a database table called posts with four fields: title (a varchar – default type), content (a text), author (a
db.define_table('posts', db.Field('title'),
db.Field('content', 'text'),
db.Field('author', db.auth_user, default=current_user_id, writable=False),
db.Field('date', 'datetime', default=request.now, writable=False)
)
db.posts.title.requires = IS_NOT_EMPTY()
db.posts.content.requires = IS_NOT_EMPTY()
def index():As you can see, is just a few of code :) Now we need to make the posts/index.html view. The web2py views system allow the developer to use native Python code on templates, what means that the developer/designer has more power and possibilities. Here is the code of the view posts/index.html (it should be inside the views directory):
posts = db().select(db.posts.ALL)
return response.render('posts/index.html', locals())
{{extend 'layout.html'}}And now we can run the Google App Engine server locally by typing the following command inside the project root (I have the Google App Engine SDK extracted on my /usr/local/google_appengine):
<h1 id="">Listing all posts</h1>
<dl>
{{for post in posts:}}
<dt>{{=post.title}} (written by {{=post.author.first_name}})</dt>
<dd>{{=post.content}}</dd>
{{pass}}
</dl>
% /usr/local/google_appengine/dev_appserver.py .If you check the URL http://localhost:8080/blog/posts, then you will see that we have no posts in the database yet, so let’s create the login protected action that saves a post on the database. Here is the action code:
@auth.requires_login()Note that there is a decorator. web2py includes a complete authentication and authorization system, which includes an option for new users registries. So you can access the URL /blog/default/user/register and register yourself to write posts :) Here is the posts/new.html view code, that displays the form:
def new():
form = SQLFORM(db.posts, fields=['title','content'])
if form.accepts(request.vars, session):
response.flash = 'Post saved.'
redirect(URL('blog', 'posts', 'index'))
return response.render('posts/new.html', dict(form=form))
{{extend 'layout.html'}}After it the application is ready to the deploy. The way to do it is running the following command on the project root:
<h1 id="">
Save a new post</h1>
{{=form}}
% /usr/local/google_appengine/appcfg.py update .And see the magic! :) You can check this application live here: http://2.latest.gaeseries.appspot.com/ (you can login with the e-mail demo@demo.com and the password demo, you can also register yourself).
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
These commands will boostrap the environment, setting up a bootstrap machine which will manage your services; deploy mysql and wordpress instances; add a relation between them; and expose the wordpress port. The voilà, we have a wordpress deployed, and ready to serve our posts. Amazing, huh?
% juju bootstrap
% juju deploy mysql
% juju deploy wordpress
% juju add-relation wordpress mysql
% juju expose wordpress
juju
command line tool in almost any OS (including Mac OS), right now you are able do deploy only Ubuntu-based services (you must use an Ubuntu instance or container). yum
. cc_
and implement a `handle` function (for example, a module called "yum_packages" would be written to a file called cc_yum_packages.py
). So, here is the code for the module yum_packages
: The module installs all packages listed in cloud-init yaml file. If we want to install `emacs-nox` package, we would write this yaml file and use it as user data in the instance:
import subprocess
import traceback
from cloudinit import CloudConfig, util
frequency = CloudConfig.per_instance
def yum_install(packages):
cmd = ["yum", "--quiet", "--assumeyes", "install"]
cmd.extend(packages)
subprocess.check_call(cmd)
def handle(_name, cfg, _cloud, log, args):
pkglist = util.get_cfg_option_list_or_str(cfg, "packages", [])
if pkglist:
try:
yum_install(pkglist)
except subprocess.CalledProcessError:
log.warn("Failed to install yum packages: %s" % pkglist)
log.debug(traceback.format_exc())
raise
return True
cloud-init already works on Fedora, with Python 2.7, but to work on CentOS 6, with Python 2.6, it needs a patch:
#cloud-config
modules:
- yum_packages
packages: [emacs-nox]
I've packet up this module and this patch in a RPM package that must be pre-installed in the lxc template and AMI images. Now, we need to change Juju in order to make it use the
--- cloudinit/util.py 2012-05-22 12:18:21.000000000 -0300
+++ cloudinit/util.py 2012-05-31 12:44:24.000000000 -0300
@@ -227,7 +227,7 @@
stderr=subprocess.PIPE, stdin=subprocess.PIPE)
out, err = sp.communicate(input_)
if sp.returncode is not 0:
- raise subprocess.CalledProcessError(sp.returncode, args, (out, err))
+ raise subprocess.CalledProcessError(sp.returncode, args)
return(out, err)
yum_packages
module, and include all RPM packages that we need to install when the machine borns. _collect_packages
, that returns the list of packages that will be installed in the machine after it is spawned; and render
that returns the file itself. Here is our CentOSCloudInit
class (within the patch): The other change we need is in the
diff -u juju-0.5-bzr531.orig/juju/providers/common/cloudinit.py juju-0.5-bzr531/juju/providers/common/cloudinit.py
--- juju-0.5-bzr531.orig/juju/providers/common/cloudinit.py 2012-05-31 15:42:17.480769486 -0300
+++ juju-0.5-bzr531/juju/providers/common/cloudinit.py 2012-05-31 15:55:13.342884919 -0300
@@ -324,3 +324,32 @@
"machine-id": self._machine_id,
"juju-provider-type": self._provider_type,
"juju-zookeeper-hosts": self._join_zookeeper_hosts()}
+
+
+class CentOSCloudInit(CloudInit):
+
+ def _collect_packages(self):
+ packages = [
+ "bzr", "byobu", "tmux", "python-setuptools", "python-twisted",
+ "python-txaws", "python-zookeeper", "python-devel", "juju"]
+ if self._zookeeper:
+ packages.extend([
+ "zookeeper", "libzookeeper", "libzookeeper-devel"])
+ return packages
+
+ def render(self):
+ """Get content for a cloud-init file with appropriate specifications.
+
+ :rtype: str
+
+ :raises: :exc:`juju.errors.CloudInitError` if there isn't enough
+ information to create a useful cloud-init.
+ """
+ self._validate()
+ return format_cloud_init(
+ self._ssh_keys,
+ packages=self._collect_packages(),
+ repositories=self._collect_repositories(),
+ scripts=self._collect_scripts(),
+ data=self._collect_machine_data(),
+ modules=["ssh", "yum_packages", "runcmd"])
format_cloud_init
function, in order to make it recognize the modules
parameter that we used above, and tell cloud-init to not run apt-get
(update nor upgrade). Here is the patch: This patch is also packed up within juju-centos-6 repository, which provides sources for building RPM packages for juju, and also some pre-built RPM packages.
diff -ur juju-0.5-bzr531.orig/juju/providers/common/utils.py juju-0.5-bzr531/juju/providers/common/utils.py
--- juju-0.5-bzr531.orig/juju/providers/common/utils.py 2012-05-31 15:42:17.480769486 -0300
+++ juju-0.5-bzr531/juju/providers/common/utils.py 2012-05-31 15:44:06.605014021 -0300
@@ -85,7 +85,7 @@
def format_cloud_init(
- authorized_keys, packages=(), repositories=None, scripts=None, data=None):
+ authorized_keys, packages=(), repositories=None, scripts=None, data=None, modules=None):
"""Format a user-data cloud-init file.
This will enable package installation, and ssh access, and script
@@ -117,8 +117,8 @@
structure.
"""
cloud_config = {
- "apt-update": True,
- "apt-upgrade": True,
+ "apt-update": False,
+ "apt-upgrade": False,
"ssh_authorized_keys": authorized_keys,
"packages": [],
"output": {"all": "| tee -a /var/log/cloud-init-output.log"}}
@@ -136,6 +136,11 @@
if scripts:
cloud_config["runcmd"] = scripts
+ if modules:
+ cloud_config["modules"] = modules
+
output = safe_dump(cloud_config)
output = "#cloud-config\n%s" % (output)
return output
cloudinit
pre-installed, configure your juju environments.yaml
file to use this image in the environment and you are ready to deploy cloud services on CentOS machines using Juju! ubuntu
to interact with its machines, so you will need to create this user in your CentOS AMI/template.yum
repository (I haven't submitted them to any public repository): por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
package main
import (
"fmt"
"sync"
"time"
)
type Fork struct {
sync.Mutex
}
type Table struct {
philosophers chan Philosopher
forks []*Fork
}
func NewTable(forks int) *Table {
t := new(Table)
t.philosophers = make(chan Philosopher, forks - 1)
t.forks = make([]*Fork, forks)
for i := 0; i < forks; i++ {
t.forks[i] = new(Fork)
}
return t
}
func (t *Table) PushPhilosopher(p Philosopher) {
p.table = t
t.philosophers <- data-blogger-escaped-0="" data-blogger-escaped-1="" data-blogger-escaped-2="" data-blogger-escaped-3="" data-blogger-escaped-4="" data-blogger-escaped-:="range" data-blogger-escaped-_="" data-blogger-escaped-able="" data-blogger-escaped-anscombe="" data-blogger-escaped-artin="" data-blogger-escaped-chan="" data-blogger-escaped-e9="" data-blogger-escaped-eat="" data-blogger-escaped-eating...="" data-blogger-escaped-eter="" data-blogger-escaped-f="" data-blogger-escaped-fed.="" data-blogger-escaped-fed="" data-blogger-escaped-fmt.printf="" data-blogger-escaped-for="" data-blogger-escaped-func="" data-blogger-escaped-getforks="" data-blogger-escaped-go="" data-blogger-escaped-heidegger="" data-blogger-escaped-homas="" data-blogger-escaped-index="" data-blogger-escaped-int="" data-blogger-escaped-is="" data-blogger-escaped-leftfork.lock="" data-blogger-escaped-leftfork.unlock="" data-blogger-escaped-leftfork="" data-blogger-escaped-leibniz="" data-blogger-escaped-len="" data-blogger-escaped-lizabeth="" data-blogger-escaped-lombard="" data-blogger-escaped-main="" data-blogger-escaped-make="" data-blogger-escaped-n="" data-blogger-escaped-nagel="" data-blogger-escaped-name="" data-blogger-escaped-ork="" data-blogger-escaped-ottfried="" data-blogger-escaped-p.eat="" data-blogger-escaped-p.fed="" data-blogger-escaped-p.getforks="" data-blogger-escaped-p.name="" data-blogger-escaped-p.putforks="" data-blogger-escaped-p.table.popphilosopher="" data-blogger-escaped-p.table.pushphilosopher="" data-blogger-escaped-p.table="nil" data-blogger-escaped-p.think="" data-blogger-escaped-p="" data-blogger-escaped-philosopher="" data-blogger-escaped-philosopherindex="" data-blogger-escaped-philosophers="" data-blogger-escaped-popphilosopher="" data-blogger-escaped-pre="" data-blogger-escaped-putforks="" data-blogger-escaped-return="" data-blogger-escaped-rightfork.lock="" data-blogger-escaped-rightfork.unlock="" data-blogger-escaped-rightfork="" data-blogger-escaped-s="" data-blogger-escaped-string="" data-blogger-escaped-struct="" data-blogger-escaped-t.forks="" data-blogger-escaped-t="" data-blogger-escaped-table="" data-blogger-escaped-think="" data-blogger-escaped-thinking...="" data-blogger-escaped-time.sleep="" data-blogger-escaped-type="" data-blogger-escaped-was="">
Any feedback is very welcome.
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
application: gaeseries
version: 1
runtime: python
api_version: 1
default_expiration: '365d'
handlers:
- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin
- url: /_ah/queue/deferred
  script: djangoappengine/deferred/handler.py
  login: admin
- url: /media/admin
  static_dir: django/contrib/admin/media/
- url: /.*
  script: djangoappengine/main/main.py
I will use one version for each part of the series, so this is the first version because it is the first part =D In settings.py, we just uncomment the django.contrib.auth line inside the INSTALLED_APPS tuple, because we want to use the built-in auth application instead of the Google Accounts API provided by App Engine.
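For reference, the relevant part of settings.py would look roughly like this (a sketch only; the surrounding entries depend on the djangoappengine project template you started from):
INSTALLED_APPS = (
    'djangotoolbox',
    'django.contrib.auth',          # uncommented: use Django's built-in auth
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    # ... other apps from the project template ...
)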
% python manage.py startapp core
It is a famous Django command that creates the application structure: a Python package containing three modules (models, tests and views). Now we have to create the Post model. Here is the code of the models.py file:
from django.db import models
from django.contrib.auth.models import User
class Post(models.Model):
    title = models.CharField(max_length = 200)
    content = models.TextField()
    date = models.DateTimeField(auto_now_add = True)
    user = models.ForeignKey(User)
Now we just need to “install” the core application by putting it in the INSTALLED_APPS tuple in the settings.py file, and Django will be ready to play with BigTable. :)
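Again just as a sketch, installing the app is one more entry in the same tuple:
INSTALLED_APPS = (
    # ... the apps already listed above, including 'django.contrib.auth' ...
    'core',
)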
We will use the django.contrib.auth app, so let's run a manage command to create a superuser:
% python manage.py createsuperuser
After creating the superuser, we need to set up the login and logout URLs and make two templates. So, in the urls.py file, put two mappings to the login and logout views. The file will look like this:
from django.conf.urls.defaults import *
urlpatterns = patterns('',
    ('^$', 'django.views.generic.simple.direct_to_template',
        {'template': 'home.html'}),
    ('^login/$', 'django.contrib.auth.views.login'),
    ('^logout/$', 'django.contrib.auth.views.logout'),
)
Here is the registration/login.html template:
{% extends "base.html" %}
{% block content %}
<p>Fill the form below to login in the system ;)</p>
{% if form.errors %}
<p>Your username and password didn't match. Please try again.</p>
{% endif %}
<form method="post" action="{% url django.contrib.auth.views.login %}">{% csrf_token %}
<table>
<tr>
<td>{{ form.username.label_tag }}</td>
<td>{{ form.username }}</td>
</tr>
<tr>
<td>{{ form.password.label_tag }}</td>
<td>{{ form.password }}</td>
</tr>
</table>
<input type="submit" value="login" />
<input type="hidden" name="next" value="{{ next }}" />
</form>
{% endblock %}
And the registration/logged_out.html template:
{% extends "base.html" %}
{% block content %}
Bye :)
{% endblock %}
Note the two lines added to urls.py (the login and logout mappings). In the settings.py file, add three lines:
LOGIN_URL = '/login/'
LOGOUT_URL = '/logout/'
LOGIN_REDIRECT_URL = '/'
And we are ready to code =) Let's create the login-protected view, where we will write and save a new post. To do that, first we need to create a Django form to deal with the data. There are two fields in this form: title and content; when the form is submitted, the user property is filled with the currently logged-in user and the date property is filled with the current time. So, here is the code of the ModelForm:
from django import forms
from models import Post

class PostForm(forms.ModelForm):
    class Meta:
        model = Post
        exclude = ('user',)

    def save(self, user, commit = True):
        post = super(PostForm, self).save(commit = False)
        post.user = user
        if commit:
            post.save()
        return post
Here is the views.py file, with the two views (one "mocked up", with a simple redirect):
from django.contrib.auth.decorators import login_required
from django.shortcuts import render_to_response
from django.template import RequestContext
from django.http import HttpResponseRedirect
from django.core.urlresolvers import reverse
from forms import PostForm
@login_required
def new_post(request):
    form = PostForm()
    if request.method == 'POST':
        form = PostForm(request.POST)
        if form.is_valid():
            form.save(request.user)
            return HttpResponseRedirect(reverse('core.views.list_posts'))
    return render_to_response('new_post.html',
        locals(), context_instance=RequestContext(request)
    )

def list_posts(request):
    return HttpResponseRedirect('/')
There are only two steps left to finally save posts on BigTable: map URLs for the views above and create the new_post.html template. Here is the mapping code:
    ('^posts/new/$', 'core.views.new_post'),
    ('^posts/$', 'core.views.list_posts'),
And here is the template code:
{% extends "base.html" %}
{% block content %}
<form action="{% url core.views.new_post %}" method="post" accept-charset="utf-8">
{% csrf_token %}
{{ form.as_p }}
<p><input type="submit" value="Post!"/></p>
</form>
{% endblock %}
Now we can run ./manage.py runserver in the terminal, access the URL http://localhost:8000/posts/new in the browser, see the form, fill it in and save the post :D The last step is to list all posts at http://localhost:8000/posts/. The list_posts view is already mapped to the URL /posts/, so we just need to create the code of the view (remember to import Post from models in views.py) and a template to show the list of posts. Here is the view code:
def list_posts(request):
    posts = Post.objects.all()
    return render_to_response('list_posts.html',
        locals(), context_instance=RequestContext(request)
    )
And the list_posts.html template code:
{% extends "base.html" %}
{% block content %}
<dl>
{% for post in posts %}
<dt>{{ post.title }} (written by {{ post.user.username }})</dt>
<dd>{{ post.content }}</dd>
{% endfor %}
</dl>
{% endblock %}
Finished? Not yet :) The application is now ready to deploy. How do we deploy it? With just one command:
% python manage.py deploy
Done! Now, to use everything we have just created on the remote App Engine server, just create a superuser there and enjoy:
% python manage.py remote createsuperuser
You can check this application flying on Google App Engine: http://1.latest.gaeseries.appspot.com (use demo as both username and password on the login page).
por fsouza (noreply@blogger.com) em 10 de December de 2019 às 03:42
In January I wrote a post for the Rust 2019 call for blogs. The 2020 call is aiming for an RFC and roadmap earlier this time, so here is my 2020 post =]
#[wasm_bindgen], but for FFI
This sort of happened... because WebAssembly is growing =]
I was very excited when Interface Types showed up in August, and while it is still very experimental it is moving fast and bringing saner paths for interoperability than raw C FFIs. David Beazley even pointed this out at the end of his PyCon India keynote, talking about how easy it is to get information out of a WebAssembly module compared to what had to be done for SWIG.
This doesn't solve the problem where strict C compatibility is required, or for platforms where a WebAssembly runtime is not available, but I think it is a great solution for scientific software (or, at least, for my use cases =]).
I did some of those this year (bbhash-sys and mqf), and also found some great crates to use in my projects. Rust is picking up steam in bioinformatics, being used as the primary choice for high-quality software (like varlociraptor, or the many tools coming from 10X Genomics), but it is still somewhat hard to find more details (I mostly find them on Twitter, and sometimes through Google Scholar alerts). It would be great to start bringing this info together, which leads to...
Hey, this one happened! Luca Palmieri started a conversation on reddit and the #science-and-ai Discord channel on the Rust community server was born! I think it works pretty well, and Luca has also been doing a great job running workshops and guiding the conversation around rust-ml.
Rust is amazing because it is very good at bringing together concepts and ideas that seem contradictory at first but really shine when synthesized. But can we share this combined wisdom and also improve the situation in other places too? Despite the "Rewrite it in Rust" meme, increased interoperability is something that is already driving many of the best aspects of Rust:
Interoperability with other languages: as I said before, with WebAssembly (and Rust having the best toolchain for it) there is a clear route to achieve this, but it will not replace all the software that already exists and can benefit from FFI and C compatibility. Bringing together developers from the many language-specific binding generators (helix, neon, rustler, PyO3...) and figuring out what's missing from them (or which common parts can be shared) also seems productive.
Interoperability with new and unexplored domains: I think Rust benefits enormously from not focusing on only one domain, and choosing to prioritize CLI, WebAssembly, Networking and Embedded is a good subset to start tackling problems with. But how do we guide other domains to also use Rust, bring in new contributors, and expose missing pieces of the larger picture?
Another point extremely close to interoperability is training. A great way to interoperate with other languages and domains is having good documentation and material for transitioning into Rust without having to figure everything out at once. Rust documentation is already amazing, especially considering the many books published by each working group. But... there is a gap in the transitions, both from understanding the basics of the language to actually using it, and in the progression from beginner to intermediate and expert.
I see good resources for JavaScript and Python developers, but we are still covering a pretty small niche: programmers curious enough to go learn another language, or looking for solutions for problems in their current language.
Can we bring more people into Rust? RustBridge is obviously the reference here, but there is space for much, much more. Using Rust in The Carpentries lessons? Creating a RustOpenSci, mirroring the communities of practice of rOpenSci and pyOpenSci?
In this tutorial, we will go through the process of creating a dict (dictionary) from one or more other dicts in Python.
As is usual with the language, this can be done in several different ways.
To start, let's suppose we have the following dictionaries:
dict_1 = {
'a': 1,
'b': 2,
}
dict_2 = {
'b': 3,
'c': 4,
}
As an example, let's create a new dictionary called new_dict with the values of dict_1 and dict_2 above. A well-known approach is to use the update method.
new_dict = {}
new_dict.update(dict_1)
new_dict.update(dict_2)
With that, new_dict will be:
>> print(new_dict)
{
'a': 1,
'b': 3,
'c': 4,
}
This method works fine, but we have to call update once for every dict we want to merge into new_dict (a small helper can package that repetition, as sketched below). Still, wouldn't it be nice if we could pass all the required dicts right when new_dict is initialized?
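Just as an illustration (merge_dicts is a made-up name, not something from the standard library), such a helper could look like this:
def merge_dicts(*dicts):
    # later dicts win when keys collide, exactly like repeated update() calls
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged

new_dict = merge_dicts(dict_1, dict_2)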
Python 3 introduced a very nice way of doing exactly that, using the ** operator.
new_dict = {
**dict_1,
**dict_2,
}
So, similarly to the previous example, new_dict will be:
>> print(new_dict['a'])
1
>> print(new_dict['b'])
3
>> print(new_dict['c'])
4
When using the initialization approach above, we must take a few things into account: only the first-level values are actually duplicated in the new dictionary. As an example, let's change a key present in both dicts and check whether they still hold the same value:
>> dict_1['a'] = 10
>> new_dict['a'] = 11
>> print(dict_1['a'])
10
>> print(new_dict['a'])
11
However, this changes when one of the values is a list, another dict or some other complex object. For example:
dict_3 = {
'a': 1,
'b': 2,
'c': {
'd': 5,
}
}
and now let's create a new dict from it:
new_dict = {
**dict_3,
}
As in the previous example, we might imagine that a copy of every element of dict_3 was made, but that is not entirely true. What really happened is a shallow copy of the values of dict_3, that is, only the first-level values were duplicated. Look at what happens when we change a value of the dict stored under the key c.
>> new_dict['c']['d'] = 11
>> print(new_dict['c']['d'])
11
>> print(dict_3['c']['d'])
11
# the previous value was 5
In the case of the key c, it holds a reference to another data structure (a dict, in this case). When we change some value in dict_3['c'], the change is reflected in every dict that was initialized from dict_3. In other words, be careful when initializing a dict from other dicts that hold complex values, such as lists, dicts or other objects (the attributes of those objects will not be duplicated).
To work around this inconvenience, we can use the deepcopy method from the standard library module copy. Now, when initializing new_dict:
import copy
dict_3 = {
'a': 1,
'b': 2,
'c': {
'd': 5,
}
}
new_dict = copy.deepcopy(dict_3)
The deepcopy method makes a recursive copy of each element of dict_3, which solves our problem. Here is one more example:
>> new_dict['c']['d'] = 11
>> print(new_dict['c']['d'])
11
>> print(dict_3['c']['d'])
5
# the value was not changed
This article tries to show, in a simple way, how to create dicts using the various features the language offers, as well as the pros and cons of each approach.
For more details and other examples, take a look at this post on the Python Brasil forum here.
That's it, folks. Thanks for reading!
Professor Leonardo Cruz from the Faculdade de Ciências Sociais and I are working together on building the Laboratório Amazônico de Estudos Sociotécnicos at UFPA.
Our proposal is to hold readings and critical debates on the sociology of technology, produce theoretical and empirical research in the Amazon region on the relations between technology and society, and work with free technologies in communities near Belém.
At the moment we have a study group set up, with a schedule of texts and films to work through and debate critically. This group will be the seed for advising undergraduate and graduate students on topics such as the impact of artificial intelligence, computing and warfare, cybernetics, surveillance, platform capitalism, fake news, piracy, free software, and others.
For those interested, our study schedule is available at this link.
And if you use Telegram, you can join the discussion group here.
Any questions, just get in touch!