Friday, July 29, 2016

Share Code style schemes in IntelliJ

To share Code style schemes in IntelliJ, do as follows:

File -> Export Settings... -> Select None -> Code style schemes -> OK

Reference:
https://www.jetbrains.com/help/idea/2016.2/exporting-and-importing-settings.html

Install spaCy

To install spaCy, do as follows:

Johnnyui-MacBook-Pro:~ izeye$ python -m pip install -U pip virtualenv
...
Johnnyui-MacBook-Pro:~ izeye$ virtualenv .env -p python2
Running virtualenv with interpreter /Library/Frameworks/Python.framework/Versions/2.7/bin/python2
New python executable in /Users/izeye/.env/bin/python
Installing setuptools, pip, wheel...done.
Johnnyui-MacBook-Pro:~ izeye$ source .env/bin/activate
(.env) Johnnyui-MacBook-Pro:~ izeye$ pip install spacy
(.env) Johnnyui-MacBook-Pro:~ izeye$ python
Python 2.7.11 (v2.7.11:6d1b6a68f775, Dec  5 2015, 12:54:16)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>>
(.env) Johnnyui-MacBook-Pro:~ izeye$ python -m spacy.en.download
Downloading...
Downloaded 532.28MB 100.00% 0.24MB/s eta 0s            
archive.gz checksum/md5 OK
Model successfully installed.
(.env) Johnnyui-MacBook-Pro:~ izeye$ python -c "import spacy; spacy.load('en'); print('OK')"
OK
(.env) Johnnyui-MacBook-Pro:~ izeye$ python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
/Users/izeye/.env/lib/python2.7/site-packages/spacy

(.env) Johnnyui-MacBook-Pro:~ izeye$ python -m pip install -U pytest
...

(.env) Johnnyui-MacBook-Pro:~ izeye$ python -m pytest /Users/izeye/.env/lib/python2.7/site-packages/spacy --vectors --model --slow
...
(.env) Johnnyui-MacBook-Pro:~ izeye$

Reference:
https://spacy.io/docs#getting-started

Checkstyle RightCurly alone with IntelliJ

To satisfy Checkstyle's `RightCurly` check (with the `alone` option) in IntelliJ, do as follows:

File -> Settings... -> Code Style -> Java -> Wrapping and Braces

* 'if()' statement
'else' on new line -> true

* 'try' statement
'catch' on new line -> true
'finally' on new line -> true

Finally, run `Reformat Code...` to apply the new settings.

ERROR: virtualenv is not compatible with this system or executable

I got the following errors:

$ virtualenv .env
Using base prefix '/Users/izeye/anaconda'
New python executable in /Users/izeye/.env/bin/python
ERROR: The executable /Users/izeye/.env/bin/python is not functioning
ERROR: It thinks sys.prefix is '/Users/izeye' (should be '/Users/izeye/.env')
ERROR: virtualenv is not compatible with this system or executable
$

I gave up on using Python 3 and worked around the problem with Python 2 as follows:

$ virtualenv .env -p python2
Running virtualenv with interpreter /Library/Frameworks/Python.framework/Versions/2.7/bin/python2
New python executable in /Users/izeye/.env/bin/python
Installing setuptools, pip, wheel...done.
$

Add @author tags for Javadoc comments in IntelliJ

To add @author tags for Javadoc comments in IntelliJ, do as follows:

Preferences... -> File and Code Templates -> Includes -> File Header

/**
 * Fill me.
 *
 * @author Johnny Lim
 */

Wednesday, July 27, 2016

Use CheckStyle in IntelliJ

To use CheckStyle in IntelliJ, do as follows:

File -> Settings... -> CheckStyle

Add a CheckStyle configuration file and activate it.

Open the `CheckStyle` tool window and click `Check Project`.

Apply a Copyright comment to all Java source files in IntelliJ

To apply a Copyright comment to all Java source files in IntelliJ, do as follows:

IntelliJ IDEA -> Preferences...

Copyright -> Copyright Profiles

Add a profile as follows:

```
Copyright 2016 the original author or authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

In `Copyright`, select the new profile for `Default project copyright`.

Add a scope.

Finally, apply the Copyright as follows:

`src/main/java` -> Update Copyright...
`src/test/java` -> Update Copyright...

Reference:
https://www.jetbrains.com/help/idea/2016.1/generating-and-updating-copyright-notice.html

Monday, July 25, 2016

IllegalArgumentException[No custom metadata prototype registered for type [licenses], node like missing plugins]

If you encounter the following error:

[2016-07-25 16:23:24,384][INFO ][discovery.zen            ] [Alex Wilder] failed to send join request to master [{Surtur}{8l2V-7MmSvKyC4oChA1gPA}{1.2.3.4}{1.2.3.4:9300}], reason [RemoteTransportException[[Surtur][1.2.3.4:9300][internal:discovery/zen/join]]; nested: IllegalStateException[failure when sending a validation request to node]; nested: RemoteTransportException[[Alex Wilder][1.2.3.5:9300][internal:discovery/zen/join/validate]]; nested: IllegalArgumentException[No custom metadata prototype registered for type [licenses], node like missing plugins]; ]

Install the missing plugins as follows:

./bin/plugin install license
./bin/plugin install marvel-agent

Monday, July 18, 2016

Disable replicas of a new index in Elasticsearch

To disable replicas of a new index in Elasticsearch, do as follows:

curl -XPUT 'localhost:9200/_template/logstash_template' -d '
{
  "template" : "logstash-*",
  "settings" : {
    "number_of_replicas" : 0
  }
}'
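To verify the template was stored, you can fetch it back; this assumes Elasticsearch is reachable on `localhost:9200` as above:

```shell
# Fetch the stored template; the response should contain the
# "logstash-*" pattern and "number_of_replicas" : "0".
curl -XGET 'localhost:9200/_template/logstash_template?pretty'
```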

Reference:
http://stackoverflow.com/questions/24553718/updating-the-default-index-number-of-replicas-setting-for-new-indices

Disable replicas of an existing index in Elasticsearch

To disable replicas of an existing index in Elasticsearch, do as follows:

curl -XPUT 'localhost:9200/logstash-2016.07.18/_settings' -d '
{
  "index" : {
    "number_of_replicas" : 0
  }
}'

Reference:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html

Setup Elasticsearch cluster

Add the following configuration to `config/elasticsearch.yml` in each instance of Elasticsearch:

cluster:
  name: some-log

network:
  host:
    - _eth1_
    - _local_

discovery.zen.ping.unicast.hosts: ["1.2.3.4", "1.2.3.5", "1.2.3.6", "1.2.3.7", "1.2.3.8"]
discovery.zen.minimum_master_nodes: 1

Note that the value of `discovery.zen.minimum_master_nodes` above is set to 1 for simplicity. Based on the recommendation in the default configuration, it should be 3 for this five-node cluster:

# Prevent the "split brain" by configuring the majority of nodes (total number of nodes / 2 + 1):
#
# discovery.zen.minimum_master_nodes: 3
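The majority formula from the comment can be checked with shell arithmetic; for the five-node cluster listed in `discovery.zen.ping.unicast.hosts` above:

```shell
# Majority of a 5-node cluster: total number of nodes / 2 + 1
NODES=5
echo $((NODES / 2 + 1))
# prints 3
```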

Thursday, July 14, 2016

Change Elasticsearch heap size

To change Elasticsearch heap size, use the `ES_HEAP_SIZE` environment variable as follows:

ES_HEAP_SIZE=8g ./bin/elasticsearch

Reference:
https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html

Wednesday, July 13, 2016

Install Marvel

Install Marvel into Elasticsearch and Kibana as follows:

cd programs/elasticsearch-2.3.3
./bin/plugin install license
./bin/plugin install marvel-agent

cd ../kibana-4.5.1-linux-x64
./bin/kibana plugin --install elasticsearch/marvel/latest

Restart Elasticsearch and Kibana.

Check the following URL:

http://localhost:5601/app/marvel

Reference:
https://www.elastic.co/kr/downloads/marvel

Show all documents in an index in Elasticsearch

To show all documents in an index in Elasticsearch, do as follows:

$ curl 'localhost:9200/logstash/_search?pretty=true&q=*:*'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "logstash",
      "_type" : "logstash",
      "_id" : "AVXjVaB4eRCf5XO_Qkwg",
      "_score" : 1.0,
      "_source" : {
        "firstName" : "Johnny",
        "lastName" : "Lim"
      }
    } ]
  }
}
$

Reference:
http://stackoverflow.com/questions/8829468/elasticsearch-query-to-return-all-records

Monday, July 11, 2016

ZooKeeper Hello, world!

Install ZooKeeper as follows:

tar zxvf zookeeper-3.4.8.tar.gz

Setup and run ZooKeeper as follows:

cd zookeeper-3.4.8

conf/zoo.cfg

tickTime=2000
dataDir=/Users/izeye/zookeeper-data
clientPort=2181

./bin/zkServer.sh start

Test ZooKeeper as follows:

./bin/zkCli.sh

[zk: localhost:2181(CONNECTED) 0] help
ZooKeeper -server host:port cmd args
stat path [watch]
set path data [version]
ls path [watch]
delquota [-n|-b] path
ls2 path [watch]
setAcl path acl
setquota -n|-b val path
history
redo cmdno
printwatches on|off
delete path [version]
sync path
listquota path
rmr path
get path [watch]
create [-s] [-e] path data acl
addauth scheme auth
quit
getAcl path
close
connect host:port
[zk: localhost:2181(CONNECTED) 1] ls /
[zookeeper]
[zk: localhost:2181(CONNECTED) 2] create /zk_test my_data
Created /zk_test
[zk: localhost:2181(CONNECTED) 3] ls /
[zookeeper, zk_test]
[zk: localhost:2181(CONNECTED) 4] get /zk_test
my_data
cZxid = 0x11
ctime = Mon Jul 11 21:03:22 KST 2016
mZxid = 0x11
mtime = Mon Jul 11 21:03:22 KST 2016
pZxid = 0x11
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 7
numChildren = 0
[zk: localhost:2181(CONNECTED) 5] set /zk_test junk
cZxid = 0x11
ctime = Mon Jul 11 21:03:22 KST 2016
mZxid = 0x12
mtime = Mon Jul 11 21:05:11 KST 2016
pZxid = 0x11
cversion = 0
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 4
numChildren = 0
[zk: localhost:2181(CONNECTED) 6] get /zk_test
junk
cZxid = 0x11
ctime = Mon Jul 11 21:03:22 KST 2016
mZxid = 0x12
mtime = Mon Jul 11 21:05:11 KST 2016
pZxid = 0x11
cversion = 0
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 4
numChildren = 0
[zk: localhost:2181(CONNECTED) 7] delete /zk_test
[zk: localhost:2181(CONNECTED) 8] ls /
[zookeeper]
[zk: localhost:2181(CONNECTED) 9]

Reference:
https://zookeeper.apache.org/doc/r3.4.8/zookeeperStarted.html

Friday, July 8, 2016

How to change Logstash's default max heap size

To change Logstash's default max heap size, do as follows:

LS_HEAP_SIZE=4g ./bin/logstash -f generator.conf

You can check if it works with `jps -v` as follows:

$ jps -v
15582 Main -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Xmx4g -Xss2048k -Djffi.boot.library.path=/home/izeye/programs/logstash-2.3.4/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/izeye/programs/logstash-2.3.4/heapdump.hprof -Xbootclasspath/a:/home/izeye/programs/logstash-2.3.4/vendor/jruby/lib/jruby.jar -Djruby.home=/home/izeye/programs/logstash-2.3.4/vendor/jruby -Djruby.lib=/home/izeye/programs/logstash-2.3.4/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh
15646 Jps -Dapplication.home=/home/izeye/programs/jdk1.8.0_45 -Xms8m
$

You can see `-Xmx4g` (i.e. 4 GB).

Reference:
https://www.elastic.co/guide/en/logstash/current/command-line-flags.html

Logstash's default max heap size

To find Logstash's default max heap size, use `jps -v` as follows:

$ jps -v
15396 Main -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Xmx1g -Xss2048k -Djffi.boot.library.path=/home/izeye/programs/logstash-2.3.4/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/izeye/programs/logstash-2.3.4/heapdump.hprof -Xbootclasspath/a:/home/izeye/programs/logstash-2.3.4/vendor/jruby/lib/jruby.jar -Djruby.home=/home/izeye/programs/logstash-2.3.4/vendor/jruby -Djruby.lib=/home/izeye/programs/logstash-2.3.4/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh
15460 Jps -Dapplication.home=/home/izeye/programs/jdk1.8.0_45 -Xms8m
$

You can see `-Xmx1g` (i.e. 1 GB).

The result is from Logstash 2.3.4.

How to get the JVM's default max heap size

To get the JVM's default max heap size, use the following command:

$ java -XX:+PrintFlagsFinal -version | grep HeapSize
    uintx ErgoHeapSizeLimit                         = 0                                   {product}
    uintx HeapSizePerGCThread                       = 87241520                            {product}
    uintx InitialHeapSize                          := 262144000                           {product}
    uintx LargePageHeapSizeThreshold                = 134217728                           {product}
    uintx MaxHeapSize                              := 4179623936                          {product}
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
$

In this case, you can see it's roughly 4 GB.
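`MaxHeapSize` is printed in bytes; converting the value above shows it's just under 4 GB:

```shell
# Convert MaxHeapSize from bytes to MiB: 4179623936 / 1024 / 1024
echo $((4179623936 / 1024 / 1024))
# prints 3986, i.e. roughly 3.9 GB
```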

Reference:
http://stackoverflow.com/questions/12797560/command-line-tool-to-find-java-heap-size-and-memory-used-linux

How to get the VM parameters of a running Java process

To get the VM parameters of a running Java process, do as follows:

$ jps -v
15286 Jps -Dapplication.home=/home/izeye/programs/jdk1.8.0_45 -Xms8m
$

How to pass an inline environment variable to an application in Linux

To pass an inline environment variable to an application in Linux, do as follows:

$ LS_HEAP_SIZE=4g ./some-script.sh
4g
$ echo $LS_HEAP_SIZE

$

`some-script.sh` simply echoes the environment variable as follows:

echo $LS_HEAP_SIZE

Note that the environment variable is not available at the next prompt.

How to unset an environment variable set by `export` in Linux

To unset an environment variable set by `export` in Linux, use `unset` as follows:

$ export LS_HEAP_SIZE=4g
$ echo $LS_HEAP_SIZE
4g
$ unset LS_HEAP_SIZE
$ echo $LS_HEAP_SIZE

$

Reference:
http://stackoverflow.com/questions/6877727/how-do-i-delete-unset-an-exported-environment-variable

Benchmark Logstash Kafka input plugin with no-op output except metrics

The test environment is as follows:

```
CPU: Intel L5640 2.26 GHz 6 cores * 2 EA
Memory: SAMSUNG PC3-10600R 4 GB * 4 EA
HDD: TOSHIBA SAS 10,000 RPM 300 GB * 6 EA

OS: CentOS release 6.6 (Final)

Logstash 2.3.4
```

I used the following configuration:

```
input {
  kafka {
    zk_connect => '1.2.3.4:2181'
    topic_id => 'some-log'
    consumer_threads => 1
  }
}

filter {
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}

output {
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "Count: %{[events][count]}"
      }
    }
  }
}
```

I got the following result:

```
./bin/logstash -f some-log-kafka.conf                              
Settings: Default pipeline workers: 24
Pipeline main started
Count: 9614
Count: 23080
Count: 37087
Count: 50815
Count: 64517
Count: 78296
Count: 91977
Count: 105990
```

The default `flush_interval` is 5 seconds, so that's roughly 14K events per 5 seconds (2.8K per second).
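As a quick sketch, the per-second rate can be derived from any two consecutive counts (here the first two from the output above) divided by the 5-second flush interval:

```shell
# (second count - first count) / flush interval in seconds
echo $(( (23080 - 9614) / 5 ))
# prints 2693, i.e. roughly 2.7K events per second for this interval
```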

With `consumer_threads` set to 10, I got the following result:

```
./bin/logstash -f impression-log-kafka.conf
Settings: Default pipeline workers: 24
Pipeline main started
Count: 9599
Count: 23254
Count: 37253
Count: 51029
Count: 64881
Count: 78868
Count: 92663
Count: 106267
```

It looks like increasing `consumer_threads` doesn't make much difference.

Based on a benchmark of my simple no-op consumer built with the Kafka client Java library on the same machine, I expected around 30K (and at least 10K) events per second, but this is just 1/10 of the expected performance.

I'm not sure whether this could be improved by tuning the configuration.

As a baseline test, I tested with the `generator` input as follows:

```
input {
  generator { }
}

filter {
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}

output {
  #stdout { }

  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "Count: %{[events][count]}"
      }
    }
  }
}
```

I got the following result:

```
./bin/logstash -f generator.conf
Settings: Default pipeline workers: 24
Pipeline main started
Count: 200584
Count: 424425
Count: 651640
Count: 881605
Count: 1110150
```

It looks like roughly 220K events per 5 seconds (44K per second). That's not as good as I expected, since my simple no-op consumer built with the Kafka client Java library consumed from 30K to 50K events per second.

What am I missing here?

References:
https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html
http://izeye.blogspot.kr/2016/07/benchmark-simple-no-op-kafka-consumer.html

Benchmark a simple no-op Kafka consumer using Kafka client Java library

The test environment is as follows:

```
CPU: Intel L5640 2.26 GHz 6 cores * 2 EA
Memory: SAMSUNG PC3-10600R 4 GB * 4 EA
HDD: TOSHIBA SAS 10,000 RPM 300 GB * 6 EA

OS: CentOS release 6.6 (Final)

Kafka server 0.9.0.0
Kafka client Java library 0.9.0.1
```

I used a custom tool as follows:

```
git clone https://github.com/izeye/kafka-consumer.git
cd kafka-consumer/
./gradlew clean bootRepackage
java -jar build/libs/kafka-consumer-1.0.jar --spring.profiles.active=noop --kafka.consumer.bootstrap-servers=1.2.3.4:9092 --kafka.consumer.group-id=logstash --kafka.consumer.topic=some-log
```

I got the following result:

```
# of consumed logs per second: 29531
# of consumed logs per second: 38848
# of consumed logs per second: 28747
# of consumed logs per second: 49191
# of consumed logs per second: 28797
```

It consumed from 30K to 50K logs per second.

org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'topic_metadata': Error reading array of size 552313, only 36 bytes available

If you try to connect from Kafka client 0.10.0.0 to Kafka server 0.9.0.0, you will get the following exception:

Caused by: org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'topic_metadata': Error reading array of size 552313, only 36 bytes available
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73) ~[kafka-clients-0.10.0.0.jar:na]
at org.apache.kafka.clients.NetworkClient.parseResponse(NetworkClient.java:380) ~[kafka-clients-0.10.0.0.jar:na]
at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:449) ~[kafka-clients-0.10.0.0.jar:na]
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269) ~[kafka-clients-0.10.0.0.jar:na]

Changing the Kafka client version to 0.9.0.1 solves the problem.

Thursday, July 7, 2016

How to extract a range of lines in a text file to another file in Linux

To extract a range of lines in a text file to another file in Linux, use the following command:

sed -n '1000,2000p' some.log > new.log
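As a quick sanity check, here is a sketch with a generated file instead of a real log (`sample.log` and `extracted.log` are made-up names):

```shell
# Build a 10-line sample file, extract lines 3 through 5, and show the result
seq 1 10 > sample.log
sed -n '3,5p' sample.log > extracted.log
cat extracted.log
# prints 3, 4 and 5 on separate lines
```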

List Kafka consumer groups

To list Kafka consumer groups, use the following command:

./bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list
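The same script can also describe a single group, showing partition assignment, current offsets, and lag (the `logstash` group name below is just an example):

```shell
# Describe one consumer group by name
./bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group logstash
```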

Transfer logs from Kafka to Elasticsearch via Logstash

You can transfer logs from Kafka to Elasticsearch via Logstash with the following configuration:

input {
  kafka {
    topic_id => 'some_log'
  }
}

filter {
  grok {
    patterns_dir => ["./patterns"]
    match => { "message" => "%{INT:log_version}\t%{INT:some_id}\t%{DATA:some_field}\t%{GREEDYDATA:last_field}" }
  }

  if [some_id] not in ["1", "2", "3"]  {
    drop { }
  }
}

output {
  elasticsearch {
    hosts => [ "1.2.3.4:9200" ]
  }

  #stdout {
    #codec => json
  #  codec => rubydebug
  #}
}

Note that the last field can't use the `DATA` pattern. If you use `DATA`, the last field won't be parsed, so use `GREEDYDATA` as above.

Reference:
http://stackoverflow.com/questions/38240392/logstash-grok-filter-doesnt-work-for-the-last-field

How to insert a tab in Mac terminal

To insert a tab in the Mac terminal, do as follows:

Press control + `V`, then press the tab key.
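Alternatively, in a script (rather than interactively), bash can produce a literal tab with `printf` or `$'\t'` quoting:

```shell
# Both commands print two fields separated by a real tab character
printf 'field1\tfield2\n'
echo "field1"$'\t'"field2"
```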

Reference:
https://discussions.apple.com/thread/2225213?tstart=0