<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.sentric.ch/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>sentric</title>
	
	<link>http://www.sentric.ch</link>
	<description>We Make Sense of Your Data</description>
	<lastBuildDate>Wed, 15 May 2013 08:38:37 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.2</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.sentric.ch/sentric" /><feedburner:info uri="sentric" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Expansion of the portfolio: YMC repositions itself</title>
		<link>http://feeds.sentric.ch/~r/sentric/~3/q0kwtl9lUbI/expansion-of-the-portfolio-ymc-repositions-itself</link>
		<comments>http://www.sentric.ch/blog/expansion-of-the-portfolio-ymc-repositions-itself#comments</comments>
		<pubDate>Tue, 14 May 2013 13:42:14 +0000</pubDate>
		<dc:creator>Jean-Pierre König</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[News]]></category>

		<guid isPermaLink="false">http://www.sentric.ch/?p=2114</guid>
		<description><![CDATA[YMC strategically repositions itself and expands the existing service of Web Solutions by two complementary areas: Big Data Analytics and Mobile Applications. Big Data YMC has acquired the startup Sentric, which is also based in Kreuzlingen. Sentric has been the first Swiss provider of services in the field of Big Data. The team has already [...]]]></description>
			<content:encoded><![CDATA[<p dir="ltr"><a href="http://www.ymc.ch" target="_blank">YMC</a> strategically repositions itself and expands its existing Web Solutions service with two complementary areas: Big Data Analytics and Mobile Applications.</p>
<h1>Big Data</h1>
<p>YMC has acquired the startup Sentric, which is also based in Kreuzlingen. Sentric was the first Swiss provider of services in the field of Big Data. The team has already completed numerous successful projects for customers, including applications for the analysis of log files and user behavior, as well as recommendation systems. Sentric’s employees are frequently invited to speak at conferences; they are also internationally known as the developers of Hannibal, an open source solution for monitoring HBase.</p>
<p>Among Sentric’s personnel are Christian Gügi and Jean-Pierre König, the founders of the Swiss Big Data User Group. YMC was recently certified as the first German-speaking training partner of Cloudera, a leading distributor of the Hadoop platform.</p>
<p>&#8220;High traffic web applications automatically generate large amounts of data&#8221;, said Jean-Pierre König, who will lead the Big Data Analytics business unit from now on. &#8220;The fact that analysis has an intrinsic value is not yet widely recognized.&#8221; For large customers in particular, it is attractive to be able to purchase recommender systems and analytics from the same supplier as the core application.</p>
<h1>Mobile Applications</h1>
<p>YMC has been developing web projects for mobile devices for quite some time. Now, a specialized team for the development of mobile applications begins its work: With the acquisition of Sentric, several experts in the areas of interface design and user experience are joining YMC. André Bohna, a respected iOS expert and until recently technically responsible for the iOS development at HolidayCheck AG, takes the lead.</p>
<p>YMC already has extensive expertise in developing for the Android platform. &#8220;I am delighted to be able to lead an interdisciplinary team that is able to cover the entire value chain, from the first ideas to the implementation on the main mobile platforms,&#8221; says André Bohna.</p>
<h1>About YMC</h1>
<p>YMC AG was founded in 2001 as a specialized service provider for web technologies. The company employs 24 experts in Kreuzlingen with specializations such as software engineering, project management and development methodology, creation and design, as well as server administration and operation.</p>
<p>YMC provides consulting, creation, custom development and safe (highly-available) hosting as well as operation and maintenance for online projects. Reference customers include Swiss Radio and Television (SRF), ETH Zurich, SOS-Kinderdorf Germany, WWF Switzerland and the publishing group Georg von Holtzbrinck. The project references include web portals with a focus on publishing systems and e-commerce, mobile and big data applications. The service also includes the technical integration of project solutions into existing IT systems and business processes as well as long-term maintenance and support contracts.</p>
<p>YMC places a high value on standards compliance and usability and prefers to use agile development methods such as Scrum or Kanban. YMC also specializes in open source. Participating in relevant conferences and community events, as well as publishing articles and source code, is part of each employee’s responsibilities and thus of the working culture. YMC also acts as a sponsor and grants its employees the freedom to create their own open source initiatives.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.sentric.ch/blog/expansion-of-the-portfolio-ymc-repositions-itself/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.sentric.ch/blog/expansion-of-the-portfolio-ymc-repositions-itself?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=expansion-of-the-portfolio-ymc-repositions-itself</feedburner:origLink></item>
		<item>
		<title>Hannibal: New Features and the Future</title>
		<link>http://feeds.sentric.ch/~r/sentric/~3/ATAy8Tniwn4/hannibal-new-features-and-the-future</link>
		<comments>http://www.sentric.ch/blog/hannibal-new-features-and-the-future#comments</comments>
		<pubDate>Mon, 08 Apr 2013 06:46:11 +0000</pubDate>
		<dc:creator>Nils Kübler</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[hannibal]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Monitoring]]></category>

		<guid isPermaLink="false">http://www.sentric.ch/?p=2080</guid>
		<description><![CDATA[A few months have passed since I last worked on Hannibal, but last week I had the opportunity to do so. I worked on a few issues in GitHub and thought about new features for the tool. In this post I will demonstrate the main new features for Hannibal and in the end, will talk [...]]]></description>
			<content:encoded><![CDATA[<p>A few months have passed since I last worked on Hannibal, but last week I had the opportunity to do so. I worked on a few issues in GitHub and thought about new features for the tool. In this post I will demonstrate the main new features for Hannibal and in the end, will talk a little bit about the future.</p>
<h1>Newest Features</h1>
<h2>Easier Configuration</h2>
<p>Thanks to another contribution from <a href="https://github.com/alexandre-normand">Alexandre Normand</a>, Hannibal is now able to detect most logfile-patterns by itself. This means Hannibal will now display compaction-metrics without further configuration in most cases.</p>
<p>This feature has actually been included in Hannibal for two months, but I think it is well worth mentioning.</p>
<h2>Easy Deployment</h2>
<p>Thanks to <a href="https://github.com/sentric/hannibal/issues/17" target="_blank">GitHub-Issue 17</a>, Hannibal can now be built into a single <em>tgz</em> file. This way you are able to build Hannibal on a different machine from where you run it. See the <a href="https://github.com/sentric/hannibal/blob/master/README.markdown" target="_blank">README</a> for details.</p>
<p>Maybe we could also provide precompiled packages in the future. What do you think?</p>
<h2>Sort Options for the Table-View</h2>
<p>Thanks to <a href="https://github.com/sentric/hannibal/issues/12" target="_blank">GitHub-Issue 12</a>, the sort-order for the <em>Region Sizes</em> Chart can now be changed. This can be handy when you need to see the sizes of a region&#8217;s neighbours.</p>
<div id="attachment_2081" class="wp-caption aligncenter" style="width: 623px"><a href="http://www.sentric.ch/blog/hannibal-new-features-and-the-future/attachment/screen-shot-2013-04-04-at-5-54-53-pm" rel="attachment wp-att-2081"><img class="size-medium wp-image-2081" title="Table chart with new sort options" src="http://www.sentric.ch/wp-content/uploads/2013/04/Screen-Shot-2013-04-04-at-5.54.53-PM-613x329.png" alt="" width="613" height="329" /></a><p class="wp-caption-text">Table chart with new sort options</p></div>
<h2>Experimental Compactions Graph</h2>
<p>When I thought about new features for Hannibal, it struck me that one of the reasons to use Hannibal is to reduce the occurrence of compaction storms, yet there was no way to actually see them. So I created an experimental graph that visualizes the overall compactions across the cluster:</p>
<div id="attachment_2082" class="wp-caption aligncenter" style="width: 623px"><a href="http://www.sentric.ch/blog/hannibal-new-features-and-the-future/attachment/screen-shot-2013-04-04-at-5-46-11-pm" rel="attachment wp-att-2082"><img class="size-medium wp-image-2082" title="The new experimental Compaction History Chart" src="http://www.sentric.ch/wp-content/uploads/2013/04/Screen-Shot-2013-04-04-at-5.46.11-PM-613x235.png" alt="" width="613" height="235" /></a><p class="wp-caption-text">The new experimental Compaction History Chart</p></div>
<p>I don&#8217;t know whether that is really helpful though. What do you think about it? If you are keen, you can test the graph yourself (it&#8217;s available in the <a href="https://github.com/sentric/hannibal/tree/next" target="_blank"><em>next</em></a>-branch). <del>The chart is currently a bit hidden: it&#8217;s located right below the <em>Region Distribution</em> chart).</del> The chart is located on the page <em>Compactions</em>, which is available in the main menu.</p>
<p>I think the graph would be really great if you could see what impact the compactions have on your read/write performance.</p>
<h2>Future Development</h2>
<p>We keep asking ourselves: &#8220;What is the main goal for Hannibal?&#8221; The answer is simply that it provides tools that help with tuning an HBase setup and that don&#8217;t come out of the box with HBase.</p>
<p>After attending Lars George&#8217;s talk: <a href="http://www.slideshare.net/Hadoop_Summit/hbase-sizing-notes-17826645" target="_blank">HBase Sizing Notes</a> at <a href="http://hadoopsummit.org/amsterdam/" target="_blank">Hadoop Summit in Amsterdam,</a> I believe that we are on the right track. We already addressed things like region-splits and compaction-storms. But there is so much more that can be done wrong and many things can only be detected by looking at the logfile, nowhere else -  that&#8217;s where I believe Hannibal should jump in.</p>
<p>I am thinking about concentrating on getting more out of the logs. One great example is forced memstore flushes. These flushes occur when the number of HLogs is too high or when the region server is under memory pressure. In the first case, a log entry is written: &#8220;<em>Too many hlogs: logs=33, maxlogs=32; forcing flush of 1 regions(s)</em>&#8221;. For Hannibal, it should be easy to record warnings like these and put them on a graph.</p>
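<p>As a rough illustration of how such a warning could be picked out of a logfile, here is a minimal Python sketch. It is not Hannibal&#8217;s actual code: the function names <em>parse_forced_flush</em> and <em>count_forced_flushes</em> are hypothetical, and the pattern is derived only from the single message quoted above, so other HBase versions may word it differently.</p>

```python
import re

# Pattern for the warning quoted above (an assumption based on that one
# message). Capture groups: current log count, configured maximum,
# number of regions being flushed.
TOO_MANY_HLOGS = re.compile(
    r"Too many hlogs: logs=(\d+), maxlogs=(\d+); "
    r"forcing flush of (\d+) regions\(s\)"
)


def parse_forced_flush(line):
    """Return the numbers from a forced-flush warning, or None if no match."""
    match = TOO_MANY_HLOGS.search(line)
    if match is None:
        return None
    logs, maxlogs, regions = (int(value) for value in match.groups())
    return {"logs": logs, "maxlogs": maxlogs, "regions": regions}


def count_forced_flushes(lines):
    """Count how many lines in an iterable contain the warning."""
    return sum(1 for line in lines if parse_forced_flush(line) is not None)
```

<p>Because the pattern is applied with <em>search</em>, it does not depend on whatever timestamp or logger prefix precedes the message. Counting matches per time window would give the data points for a graph.</p>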
<p>I would love to hear from you what you think the right direction could be for Hannibal.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.sentric.ch/blog/hannibal-new-features-and-the-future/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.sentric.ch/blog/hannibal-new-features-and-the-future?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=hannibal-new-features-and-the-future</feedburner:origLink></item>
		<item>
		<title>Hello Europe! Hadoop has landed.</title>
		<link>http://feeds.sentric.ch/~r/sentric/~3/SLjA1X4igJw/hello-europe-hadoop-has-landed</link>
		<comments>http://www.sentric.ch/blog/hello-europe-hadoop-has-landed#comments</comments>
		<pubDate>Thu, 28 Mar 2013 12:28:39 +0000</pubDate>
		<dc:creator>Jean-Pierre König</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[amsterdam]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[dwh]]></category>
		<category><![CDATA[europe]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hadoopsummit]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[stinger]]></category>
		<category><![CDATA[tez]]></category>

		<guid isPermaLink="false">http://www.sentric.ch/?p=2047</guid>
		<description><![CDATA[Last week we were in Amsterdam at the Hadoop Summit 2013. This was the first Hadoop Summit in Europe, so things are picking up momentum over here too. #HadoopSummit great that #hadoop has landed on the european mainland. Thank you @hortonworks &#8212; Rob Dielemans (@robdielemans) March 21, 2013 This event was a great opportunity to [...]]]></description>
			<content:encoded><![CDATA[<p>Last week we were in Amsterdam at the Hadoop Summit 2013. This was the first Hadoop Summit in Europe, so things are picking up momentum over here too.</p>
<blockquote class="twitter-tweet"><p><a href="https://twitter.com/search/%23HadoopSummit">#HadoopSummit</a> great that <a href="https://twitter.com/search/%23hadoop">#hadoop</a> has landed on the european mainland. Thank you @<a href="https://twitter.com/hortonworks">hortonworks</a></p>
<p>&mdash; Rob Dielemans (@robdielemans) <a href="https://twitter.com/robdielemans/status/314641493845483520">March 21, 2013</a></p></blockquote>
<p>This event was a great opportunity to meet the leaders, share development and administrative experiences and finally to see great new stuff in action.</p>
<h1>From noSQL to proSQL</h1>
<p>The Hadoop ecosystem is now stronger than ever. We have seen more and more expansion in development tools and abstractions as well as administrative support over the last year. Nevertheless some parts are still missing or simply not usable for certain users. The Hadoop community is working hard on this.</p>
<blockquote class="twitter-tweet"><p>I can&#8217;t help noticing that the NoSQL movement is now the proSQL movement <a href="https://twitter.com/search/%23HadoopSummit">#HadoopSummit</a></p>
<p>— Steve Jones (@mosesjones) <a href="https://twitter.com/mosesjones/status/314654498981294080">March 21, 2013</a></p></blockquote>
<p>The major focus of the community is on fast query capabilities. We have seen different initiatives over the last year with Impala and Apache Drill. Now Hortonworks has introduced Stinger and Tez.</p>
<blockquote class="twitter-tweet"><p><a href="https://twitter.com/search/%23hadoopsummit">#hadoopsummit</a>Stinger is a community initiative to make SQL queries on Hadoop run faster, not a single project</p>
<p>— Bence Arato (@BenceArato) <a href="https://twitter.com/BenceArato/status/314654474998276096">March 21, 2013</a></p></blockquote>
<p>According to Hortonworks, 50% of Hadoop users depend on Hive as their structured query engine. Some of the major drivers behind the scenes are today’s data warehouse and analytics platforms. SQL is the de-facto standard for business intelligence use cases such as interactive data exploration, visualization and parameterized reporting. To ensure Hive remains the “de-facto standard” for SQL queries with Hadoop, Hive’s SQL capabilities and its query performance must be enhanced. That’s what Stinger/Tez is about.</p>
<blockquote class="twitter-tweet"><p><a href="https://twitter.com/search/%23hadoopsummit">#hadoopsummit</a> @<a href="https://twitter.com/t3rmin4t0r">t3rmin4t0r</a> on stinger: &#8220;Aim is to get interactive queries responsibly, not &#8216;performance at any cost&#8217;! <a title="http://twitter.com/acmurthy/status/314685130268631040/photo/1" href="http://t.co/hU6MtLyObM">twitter.com/acmurthy/statu…</a></p>
<p>— Arun C Murthy (@acmurthy) <a href="https://twitter.com/acmurthy/status/314685130268631040">March 21, 2013</a></p></blockquote>
<h1>Hadoop and Enterprise Data Warehouse</h1>
<p>We have seen many projects in the early stages, prototypes and proofs of concept, where companies learn and try Hadoop. Very often it starts with downloading a sandbox and leads to small or midsize Hadoop cluster deployments for ETL/ELT processing next to a data warehouse or analytics platform. But where does this lead us?</p>
<blockquote class="twitter-tweet"><p>I wonder when I will hear the first time that &#8220;<a href="https://twitter.com/search/%23Hadoop">#Hadoop</a> is no longer a addition to your traditional <a href="https://twitter.com/search/%23DWH">#DWH</a>, it&#8217;s an replacement&#8221; <a href="https://twitter.com/search/%23hadoopsummit">#hadoopsummit</a></p>
<p>— jpkoenig (@jpkoenig) <a href="https://twitter.com/jpkoenig/status/314363123899432960">March 20, 2013</a></p></blockquote>
<p>My question was answered by Patrick Angeles’s talk: “Hadoop and the Enterprise Data Warehouse”. He outlined that Hadoop will take an increasingly larger role in the enterprise data environment. We will live with data warehouses side-by-side with Hadoop for a long time to come. It&#8217;s equally likely that existing data warehouse products and solutions will evolve to become more Hadoop-ish.</p>
<blockquote class="twitter-tweet"><p><a href="https://twitter.com/search/%23hadoopsummit">#hadoopsummit</a> panel: the easiest way to find the money for a Hadoop cluster is cut back on oracle and netezza support</p>
<p>— Steve Loughran (@steveloughran) <a href="https://twitter.com/steveloughran/status/314667399305654272">March 21, 2013</a></p></blockquote>
<p>Try it, learn it, do it!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.sentric.ch/blog/hello-europe-hadoop-has-landed/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://www.sentric.ch/blog/hello-europe-hadoop-has-landed?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=hello-europe-hadoop-has-landed</feedburner:origLink></item>
		<item>
		<title>Build a Better Customer Experience Model with Big Data</title>
		<link>http://feeds.sentric.ch/~r/sentric/~3/nviq6z4lcrg/build-a-better-customer-experience-model-with-big-data</link>
		<comments>http://www.sentric.ch/blog/build-a-better-customer-experience-model-with-big-data#comments</comments>
		<pubDate>Tue, 26 Mar 2013 08:54:03 +0000</pubDate>
		<dc:creator>Jean-Pierre König</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[customer behaviour]]></category>
		<category><![CDATA[customer experience]]></category>
		<category><![CDATA[hollywood principle]]></category>
		<category><![CDATA[product development]]></category>
		<category><![CDATA[r&d]]></category>
		<category><![CDATA[tracking]]></category>

		<guid isPermaLink="false">http://www.sentric.ch/?p=2035</guid>
		<description><![CDATA[As a topic of digital marketing, customer experience management is the intersection of many different disciplines, including: design, marketing, branding and interactions. Besides the business and technology aspects of the customer experience management model, there are customers or real people who are the ultimate judges as to whether or not a product/service is desirable. Keeping your finger on the [...]]]></description>
			<content:encoded><![CDATA[<p>As a topic of digital marketing, customer experience management is the intersection of many different disciplines, including: design, marketing, branding and interactions. Besides the business and technology aspects of the customer experience management model, there are customers or real people who are the ultimate judges as to whether or not a product/service is desirable. Keeping your finger on the pulse of your customers is exceptionally important and something that Big Data can help with.</p>
<div id="attachment_2043" class="wp-caption alignnone" style="width: 623px"><a href="http://www.sentric.ch/blog/build-a-better-customer-experience-model-with-big-data/attachment/277676760_53776af94b_z" rel="attachment wp-att-2043"><img class="size-medium wp-image-2043" title="Hollywood" src="http://www.sentric.ch/wp-content/uploads/2013/03/277676760_53776af94b_z-613x458.jpg" alt="" width="613" height="458" /></a><p class="wp-caption-text">CC 2.0 by Chang&#8217;r | http://flic.kr/p/qxaCm</p></div>
<h1>Intersection of Business and People</h1>
<p>This is where emotional innovation occurs. Where there is a real connection between the business and the people, a two-way dialogue will occur. This will lead to the people becoming more and more loyal as the business listens and adapts its portfolio of products/services to the people. A great balance of desirability and viability is achieved and your customers become your greatest evangelists.</p>
<p>Marketing is an exercise in relationship building. Too often, the marketer gets caught up in the idea that the customer experience is just about customers interacting with a product or service. This is simply not the case. One of the chief opportunities frequently overlooked is the ability to experience customers. Most companies are operating blind at this point. They deliver products and services to a customer they know only as a stick figure. What do they really know about their customers? Are your customers your greatest evangelists?</p>
<h1>Big Data</h1>
<p>Today we have affordable solutions to store huge amounts of data from different sources. We even have the capabilities to process data in parallel in order to do analytics, fraud and security detection, mining, risk management, business intelligence and much more. But how could Big Data help an organisation build a better customer experience model?</p>
<p>Results-oriented customer experience marketing relies on effectively communicating how the product or service solves problems through use of the product or service. Turning your marketing efforts to addressing the problems and needs of customers is the foundational tenet of delivering incredible customer experience.</p>
<p>But what is the core of what you want to do with your business? You have to demonstrate value to produce loyal customers, to increase revenue by getting people to purchase more from you, and to attract new customers. You have to be committed to delivering value as you reduce your operating/production costs whilst keeping product quality high. Does the results-oriented customer experience model address this? In most cases, it does not. It focuses efforts on producing loyal customers that keep coming back. And here is one way to support this initiative.</p>
<h1>The Hollywood Principle</h1>
<p>Most of our digital devices nowadays are connected to the Internet, and in the future we will buy more and more devices with built-in Internet connectivity. One driver for this is the need for personalisation or customisation. As an example, think about a coffee machine: a housing hiding the internals, a few buttons, a water tank, a coffee container and a cable. As the manufacturer of this product, you have no details about your customers&#8217; interactions with the product, their preferred settings or the condition of the coffee machine itself. Now, imagine it&#8217;s equipped with LAN or WiFi hardware and connected to the Internet. Besides the fact that you are now able to experience customers&#8217; behaviour and usage patterns, the coffee machine is in a position to report its condition back to you. This information is particularly valuable for failure pattern detection. As a customer, I no longer have to call a service hotline. I would expect a pre-fail service for all my devices. This opens up a completely new business model for services and products and drives emotional innovation.</p>
<p>That&#8217;s what we call the <strong>Hollywood Principle</strong>: <em>Don&#8217;t call us. We&#8217;ll call you!</em></p>
<p>From a wider perspective, this model could be applied to almost any digital device in the consumer and business to business market. It doesn&#8217;t matter whether it&#8217;s a TV set-top box, a coffee machine, a TAN generator or a production line.</p>
<p>Big Data can help you build sustainable products with built-in pre-failure handling by collecting and analysing their runtime conditions. There is no question about whether manufacturers will do this, it&#8217;s just a question of time.</p>
<p>Think about your products and what you actually know about how they work for your customers&#8230;maybe it&#8217;s time to look for more information &#8211; before your competitors do!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.sentric.ch/blog/build-a-better-customer-experience-model-with-big-data/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.sentric.ch/blog/build-a-better-customer-experience-model-with-big-data?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=build-a-better-customer-experience-model-with-big-data</feedburner:origLink></item>
		<item>
		<title>Lambda Architecture, Part 1</title>
		<link>http://feeds.sentric.ch/~r/sentric/~3/AKf-H3PKle8/lambda-architecture-part-1</link>
		<comments>http://www.sentric.ch/blog/lambda-architecture-part-1#comments</comments>
		<pubDate>Fri, 08 Mar 2013 09:54:25 +0000</pubDate>
		<dc:creator>Christian Gügi</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[batch processing]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lambda architecture]]></category>
		<category><![CDATA[realtime]]></category>
		<category><![CDATA[Storm]]></category>
		<category><![CDATA[Stream processing]]></category>

		<guid isPermaLink="false">http://www.sentric.ch/?p=1967</guid>
		<description><![CDATA[We are witnessing a paradigm shift from batch based data processing to real-time data processing using the Hadoop framework. Despite this progress it is still a challenge to process web-scale data in real-time. A lot of technologies can be used to create such a complete data processing system &#8211; but to choose the right tools, [...]]]></description>
			<content:encoded><![CDATA[<p>We are witnessing a paradigm shift from batch-based data processing to real-time data processing using the Hadoop framework. Despite this progress it is still a challenge to process web-scale data in real-time. A lot of technologies can be used to create such a complete data processing system &#8211; but choosing the right tools, and incorporating and orchestrating them, is complex and daunting.</p>
<p>Nathan Marz defines the most general data system as a system that runs arbitrary functions on arbitrary data. This leads to the following equation &#8220;<strong><em>query = function(all data)</em></strong>&#8221; which is the basis of all data systems. The Lambda Architecture defines a clear set of architectural principles for building robust and scalable data systems that obey the equation above. Marz is also currently writing the book “<a href="http://www.manning.com/marz/" target="_blank"><em>Big Data &#8211; Principles and best practices of scalable realtime data systems</em></a>”.</p>
<p>The Lambda Architecture is based on three main design principles:</p>
<ul>
<li>human fault-tolerance &#8211; the system must not be susceptible to data loss or data corruption caused by human error, because at scale such damage could be irreparable.</li>
<li>data immutability &#8211; store data in its rawest form, immutably and in perpetuity.</li>
<li>recomputation &#8211; with the two principles above it is always possible to (re)compute results by running a function on the raw data.</li>
</ul>
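<p>To make the last two principles more concrete, here is a minimal Python sketch. It is an illustration only, not code from Nathan Marz or from a Sentric project: the in-memory list <em>master_dataset</em>, the (user, url) event shape and both function names are assumptions chosen for the example.</p>

```python
from collections import Counter

# Hypothetical master dataset: an append-only list of raw events.
# Each event is a (user, url) tuple, which cannot be modified in place.
master_dataset = []


def append_event(user, url):
    """Record a new raw event; existing events are never updated or deleted."""
    master_dataset.append((user, url))


def pageviews_per_url(events):
    """A query expressed as a pure function of all the raw data."""
    return Counter(url for _user, url in events)
```

<p>Because the raw events are only ever appended, a bug in <em>pageviews_per_url</em> can be fixed and the function simply run again over the same data, which is the recomputation principle in miniature.</p>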
<p>In general the Lambda Architecture is composed of three layers: the batch layer, the serving layer and the speed layer.</p>
<div><a href="http://www.sentric.ch/blog/lambda-architecture-part-1/attachment/lambda_arch-3" rel="attachment wp-att-1998"><img class="size-medium wp-image-1998" title="Lambda Architecture" src="http://www.sentric.ch/wp-content/uploads/2013/03/Lambda_Arch2-613x488.jpg" alt="" width="613" height="488" /></a></div>
<p>&nbsp;</p>
<p><strong>Batch layer</strong></p>
<p>The batch layer contains the immutable, constantly growing master dataset stored on a distributed file system like HDFS. With batch processing (MapReduce), arbitrary views &#8211; so-called batch views &#8211; are computed from this raw dataset. So Hadoop is a perfect fit for the concept of the batch layer.</p>
<p><strong>Serving layer</strong></p>
<p>The job of the serving layer is to load  and expose the batch views in a datastore so that they can be queried. This serving layer datastore does not require random writes &#8211; but must support batch updates and random reads &#8211; and can therefore be extraordinarily simple (candidates could be <a href="https://github.com/nathanmarz/elephantdb" target="_blank">ElephantDB</a> or <a href="http://www.project-voldemort.com/voldemort/" target="_blank">Voldemort</a>).</p>
<p><strong>Speed layer</strong></p>
<p>This layer deals only with new data and compensates for the high-latency updates of the serving layer. It leverages stream processing systems (<a href="http://storm-project.net/" target="_blank">Storm</a>, <a href="http://incubator.apache.org/s4/" target="_blank">S4</a>, <a href="http://spark-project.org/" target="_blank">Spark</a>) to compute the realtime views and random read/write datastores (<a href="http://hbase.apache.org/" target="_blank">HBase</a>) to store them. These views remain valid until the data has found its way through the batch and serving layers.</p>
<p>To get a complete result, both the batch and the realtime views must be queried and the results merged.</p>
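<p>The merge step can be sketched in a few lines of Python. This is a simplified illustration under two assumptions that are not stated in the text above: both views hold additive counts per key, and the realtime view covers only data that the batch view has not yet absorbed. The names <em>merge_views</em> and <em>query</em> are hypothetical.</p>

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine batch counts with counts for data not yet batch-processed."""
    merged = Counter(batch_view)
    # Counter.update adds the incoming counts instead of replacing them.
    merged.update(realtime_view)
    return merged


def query(key, batch_view, realtime_view):
    """Answer a query for one key from the merged views (0 if unknown)."""
    return merge_views(batch_view, realtime_view)[key]
```

<p>Views that are not simply additive, such as counts of unique visitors, need a more careful merge function than plain addition.</p>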
<p><strong>Conclusion</strong></p>
<p>The Lambda Architecture is the first approach that handles the complexity of Big Data systems by defining a clear set of principles. Sentric adopted these architectural principles (or at least part of them) for our customers, as they are a great approach that can be applied to any Big Data system. In particular, immutability, human fault-tolerance and recomputation are principles that can be easily adopted with the Hadoop platform.<br />
Depending on the realtime requirements, the speed layer is often not even needed. Omitting it makes the whole system less complex, and the beauty of the Lambda Architecture is that the speed layer can be integrated later on without a huge hassle.</p>
<p>In part II of our series we&#8217;ll write about the batch layer with an example case. So stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.sentric.ch/blog/lambda-architecture-part-1/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.sentric.ch/blog/lambda-architecture-part-1?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=lambda-architecture-part-1</feedburner:origLink></item>
		<item>
		<title>Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 4</title>
		<link>http://feeds.sentric.ch/~r/sentric/~3/dqynHzgyyLg/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4</link>
		<comments>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4#comments</comments>
		<pubDate>Fri, 08 Feb 2013 15:16:41 +0000</pubDate>
		<dc:creator>Jean-Pierre König</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Hackathon]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[impala]]></category>
		<category><![CDATA[in-store analysis]]></category>
		<category><![CDATA[log analysis]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[Visualization]]></category>
		<category><![CDATA[wifi signals]]></category>

		<guid isPermaLink="false">http://www.sentric.ch/?p=1922</guid>
		<description><![CDATA[In the previous article we explained how to parse, transform and finally load data into Hive’s warehouse. Now it’s time to talk about querying the data. Before we start, here is how a sample of the data looks like: [crayon-519351bf63ef0/] As you can see, there is still some noise in the last column. We are [...]]]></description>
			<content:encoded><![CDATA[<p>In the <a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3">previous article</a> we explained how to parse, transform and finally load data into Hive’s warehouse. Now it’s time to talk about querying the data. Before we start, here is what a sample of the data looks like:</p><pre class="crayon-plain-tag">[root@cdh-master ~]# hadoop fs -cat /user/hive/warehouse/routerlogs/part-00000
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-AUTHENTICATE.indication(98:0c:82:dc:8b:15, OPEN_SYSTEM)
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-DELETEKEYS.request(98:0c:82:dc:8b:15)
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,authenticated
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,association OK (aid 2)
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,associated (aid 2)
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-ASSOCIATE.indication(98:0c:82:dc:8b:15)
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-DELETEKEYS.request(98:0c:82:dc:8b:15)
1358757010,2013,1,21,9,30,10,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,deauthenticated</pre><p>As you can see, there is still some noise in the last column. We are interested in “<em>authentication OK</em>” and “<em>deauthenticated</em>” messages only. Unlike protocols such as TCP, the messages from the router are not standardized. We found that those two status messages come closest to our understanding of a &#8220;login&#8221;/&#8221;logout&#8221; on the router. Let’s reduce the data set to those lines. During this step we calculate the visit duration as well.</p>
<p>We used Cloudera&#8217;s real time query engine <a href="http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/cloudera-enterprise-RTQ.html">Impala</a> for this task. Here is what the query looks like (ts stands for timestamp):</p><pre class="crayon-plain-tag">SELECT A.ts, MIN(B.ts - A.ts), A.host, A.mac_address FROM routerlogs A, routerlogs B WHERE A.host = B.host AND A.mac_address = B.mac_address AND A.ts &lt;= B.ts AND A.message LIKE '%authentication OK%' AND B.message LIKE '%deauthenticated%' GROUP BY A.host, A.mac_address, A.ts;</pre><p>We already talked about Impala’s early state of development and that it lacks the ability to CREATE a table from a query output. At this point we copied and pasted the results into a CSV file, created a new Hive table called ‘<em>visit_duration</em>’ and loaded the CSV file into it. Here is how we did it:</p><pre class="crayon-plain-tag">create table visit_duration (
ts int,
duration_in_seconds int,
router string,
mac_address string)
row format delimited
fields terminated by ',';</pre><p>Now we have the data we need to answer the following questions:</p>
<ul>
<li>How many people visited the store (unique visitors)?<br />
Note: Unlike a traditional customer frequency counter at the door, our log files contain MAC addresses, which are unique to each mobile phone. Provided people do not change their mobile phones, we can recognize unique visitors and not just visits.</li>
<li>How many visits did we have?</li>
<li>What is the average visit duration?</li>
<li>How many people are new vs. returning?</li>
</ul>
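<p>To make the four questions above concrete, here is a minimal sketch on a handful of hypothetical rows shaped like the <em>visit_duration</em> table. The rows, the MAC addresses and the helper <em>store_metrics</em> are assumptions for illustration only. In this sketch a visitor counts as returning when the same MAC address has more than one visit in the same store.</p>

```python
from collections import Counter

# Hypothetical rows shaped like the visit_duration table:
# (ts, duration_in_seconds, router, mac_address)
rows = [
    (1000, 600, "buffalo", "aa:aa:aa:aa:aa:01"),
    (5000, 300, "buffalo", "aa:aa:aa:aa:aa:01"),
    (2000, 900, "buffalo", "aa:aa:aa:aa:aa:02"),
    (3000, 120, "fonera", "aa:aa:aa:aa:aa:03"),
]


def store_metrics(rows, router):
    """Visits, unique visitors, new/returning visitors and average duration."""
    visits = [r for r in rows if r[2] == router]
    per_mac = Counter(r[3] for r in visits)  # number of visits per MAC address
    returning = sum(1 for n in per_mac.values() if n != 1)
    return {
        "visits": len(visits),
        "unique_visitors": len(per_mac),
        "returning": returning,
        "new": len(per_mac) - returning,
        # Assumes the store has at least one visit.
        "avg_duration": sum(r[1] for r in visits) / len(visits),
    }


metrics = store_metrics(rows, "buffalo")
print(metrics)
```

<p>On real data these numbers come from queries against the warehouse. The sketch only pins down what each metric means.</p>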
<p>Although our setup had two WiFi routers to simulate different stores, we describe the process for just one of them, called “<em>buffalo</em>”, aka store number one.</p>
<p>Counting the visits for store number one is very simple:</p><pre class="crayon-plain-tag">Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Build version: Impala v0.3 (3cb725b) built on Fri Nov 23 13:51:59 PST 2012)
[localhost:21000] &gt; SELECT COUNT(*) FROM visit_duration WHERE router  = "buffalo";                    
135</pre><p>The plot (figure 1) indicates that about 85% of the visits were detected in store number one and about 15% in store number two. One might conclude that store number one is in a much better location that attracts more walk-in customers. But let’s gain more insight by analysing the number of unique visitors.</p>
<div id="attachment_1923" class="wp-caption alignnone" style="width: 623px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4/attachment/plot-visits" rel="attachment wp-att-1923"><img class="size-medium wp-image-1923  " title="Visits for stores number one &amp; two" src="http://www.sentric.ch/wp-content/uploads/2013/02/plot-visits-613x245.png" alt="Visits for stores number one &amp; two" width="613" height="245" /></a><p class="wp-caption-text">Figure 1 &#8211; Visits for stores number one &amp; two</p></div>
<p>Counting the unique visitors is just as simple, because each visitor is identified by a MAC address:</p><pre class="crayon-plain-tag">[localhost:21000] &gt; SELECT COUNT(DISTINCT(mac_address)) FROM visit_duration WHERE router = "buffalo";
9</pre><p>This plot (figure 2) gives us more details about the customers. It turns out that the 135 visits in store number one came from just 9 unique visitors, while store number two saw 5 unique visitors.</p>
<div id="attachment_1925" class="wp-caption alignnone" style="width: 623px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4/attachment/plot-unique-visitors" rel="attachment wp-att-1925"><img class="size-medium wp-image-1925 " title="Unique visitors" src="http://www.sentric.ch/wp-content/uploads/2013/02/plot-unique-visitors-613x245.png" alt="Unique visitors" width="613" height="245" /></a><p class="wp-caption-text">Figure 2 &#8211; Unique visitors</p></div>
<p>This shows how important it is to take a thorough look at your data. We now realize that the big difference shown in figure 1 is not that big anymore. It also shows that some customers must have returned to store number one. So let’s go into more detail here.</p>
<p>To determine the ratio of new to returning visitors we ran this query:</p><pre class="crayon-plain-tag">[localhost:21000] &gt; SELECT count(distinct(A.mac_address)) FROM visit_duration A, visit_duration B WHERE A.mac_address = B.mac_address AND A.ts != B.ts AND A.router = "buffalo";
7</pre><p>This result gives us the number of returning users. Since we already know the total number of visitors (which is 9 for store number one), we are able to calculate the proportion of new users and plot a graph (figure 3).</p>
<div id="attachment_1926" class="wp-caption alignnone" style="width: 623px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4/attachment/plot-new-vs-returning" rel="attachment wp-att-1926"><img class="size-medium wp-image-1926" title="Figure 3 - New vs. returning users" src="http://www.sentric.ch/wp-content/uploads/2013/02/plot-new-vs-returning-613x244.png" alt="Figure 3 - New vs. returning users" width="613" height="244" /></a><p class="wp-caption-text">Figure 3 &#8211; New vs. returning users</p></div>
<p>This plot (figure 3) indicates that we have more returning than new users in both stores. In store number two we didn’t see a single new user over the past 4 days. It’s probably a good idea to start a marketing campaign aimed at new customers, e.g. giving out vouchers for the first purchase.<br />
But maybe there are other reasons behind these figures. Store number one might be located in a shopping mall and store number two somewhere in town where people like to walk around when the sun is shining. Perhaps it was raining during the last 4 days and store number one saw some new customers because they chose the mall to go shopping and visited store number one for convenience, as they were “trapped” in the mall anyway. This assumption gives an idea of what is possible with our BigData approach: why don’t we include weather data and investigate its effect on our visitors?</p>
<p>To investigate whether a customer just popped into our store out of boredom, let’s look at how long they stayed. The question about visit duration is answered with the AVG aggregate function:</p><pre class="crayon-plain-tag">[localhost:21000] &gt; SELECT AVG(duration_in_seconds) FROM visit_duration WHERE router = "buffalo";
976.6666666666666</pre><p>The average visit duration in store number one was around 00:16:16h while the average visit duration in store number two was 00:06:06h.</p>
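<p>The conversion from seconds to the hh:mm:ss value quoted above can be sketched as follows. The helper <em>to_hms</em> is hypothetical, and it drops fractional seconds, which reproduces the 00:16:16 given for store number one:</p>

```python
def to_hms(seconds):
    """Format a duration in seconds as hh:mm:ss, dropping fractional seconds."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)


# Average visit duration of store number one, as returned by the query.
print(to_hms(976.6666666666666))  # prints 00:16:16
```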
<div id="attachment_1927" class="wp-caption alignnone" style="width: 623px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4/attachment/plot-avg-visit-duration" rel="attachment wp-att-1927"><img class="size-medium wp-image-1927" title="Figure 4 - Visit duration over the past 4 days" src="http://www.sentric.ch/wp-content/uploads/2013/02/plot-avg-visit-duration-613x245.png" alt="Figure 4 - Visit duration over the past 4 days" width="613" height="245" /></a><p class="wp-caption-text">Figure 4 &#8211; Visit duration over the past 4 days</p></div>
<p>The plot (figure 4) for the last 4 days vividly visualizes that the visit duration in store number one was evenly distributed while the distribution in store number two shows some peaks. We can also see that visitors tend to stay in shop number one much longer. Assuming that both shops sell the same product (which seems to need some consultation) one might think that store number two did not sell a single product. But maybe the customers just enjoyed consultation in store number one and then bought the product in store number two. We would need to include sales figures to investigate this.</p>
<p>During our work of writing queries we acquired a better understanding of the data and the information it carries. Unsurprisingly, we realized that we can answer a different question as well:</p>
<ul>
<li>What is the average length of time between two visits?</li>
</ul>
<p>And here is how it goes:</p><pre class="crayon-plain-tag">[localhost:21000] &gt; SELECT B.ts, MIN(B.ts - A.ts), A.router, A.mac_address FROM visit_duration A, visit_duration B WHERE A.router = B.router AND A.router = "buffalo" AND A.mac_address = B.mac_address AND A.ts + A.duration_in_seconds &lt;= B.ts GROUP BY A.router, A.mac_address, B.ts;
1358758959	5045	buffalo	10:68:3f:40:20:2d	
1358771917	754	buffalo	d8:d1:cb:e9:ed:6c	
1358766344	628	buffalo	d8:d1:cb:e9:ed:6c	
1358764299	47	buffalo	d8:d1:cb:e9:ed:6c	
1358771935	18	buffalo	d8:d1:cb:e9:ed:6c	
1358517188	400	buffalo	24:ab:81:91:c8:62	
1358764341	89	buffalo	d8:d1:cb:e9:ed:6c  
…</pre><p>Since aggregate functions cannot be nested, we calculated the average by hand: 7332.484127 seconds, which is around 02:02:12h, for store number one and 71053.64286 seconds, which is around 19:44:14h, for store number two.</p>
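<p>The manual averaging step can be sketched like this. Only the seven rows printed in the excerpt above are used. The full result set is elided, so the value computed here is not the 7332.48 seconds reported for the complete data.</p>

```python
from statistics import mean

# The seven rows visible in the query output excerpt:
# (ts, gap_in_seconds, router, mac_address)
rows = [
    (1358758959, 5045, "buffalo", "10:68:3f:40:20:2d"),
    (1358771917, 754, "buffalo", "d8:d1:cb:e9:ed:6c"),
    (1358766344, 628, "buffalo", "d8:d1:cb:e9:ed:6c"),
    (1358764299, 47, "buffalo", "d8:d1:cb:e9:ed:6c"),
    (1358771935, 18, "buffalo", "d8:d1:cb:e9:ed:6c"),
    (1358517188, 400, "buffalo", "24:ab:81:91:c8:62"),
    (1358764341, 89, "buffalo", "d8:d1:cb:e9:ed:6c"),
]

# Average of the second column, i.e. the time between two visits.
avg_gap = mean(gap for _, gap, _, _ in rows)
print(round(avg_gap, 2))
```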
<p>Now let&#8217;s analyse the behavior of one particular user over both stores:</p><pre class="crayon-plain-tag">SELECT B.ts, MIN(B.ts - A.ts), A.router, A.mac_address FROM visit_duration A, visit_duration B WHERE A.router = B.router AND A.mac_address = B.mac_address AND A.ts + A.duration_in_seconds &lt;= B.ts AND A.mac_address = "10:68:3f:40:20:2d" GROUP BY A.router, A.mac_address, B.ts ORDER BY B.ts LIMIT 100;
...
1358759467	11	buffalo	10:68:3f:40:20:2d
1358760760	1293	buffalo	10:68:3f:40:20:2d
1358760892	132	buffalo	10:68:3f:40:20:2d
1358761202	326626	fonera	10:68:3f:40:20:2d
1358761459	257	fonera	10:68:3f:40:20:2d
1358761492	33	fonera	10:68:3f:40:20:2d
1358761552	60	fonera	10:68:3f:40:20:2d
1358761596	704	buffalo	10:68:3f:40:20:2d
...</pre><p>The calculated average duration between visits across both stores for this particular user is 15009.77778 seconds, which is around 04:10:09h.</p>
<div id="attachment_1949" class="wp-caption alignnone" style="width: 623px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4/attachment/plot-avg-duration-between-visits" rel="attachment wp-att-1949"><img class="size-medium wp-image-1949" title="Figure 5 - Average Duration Between Visits of one particular user" src="http://www.sentric.ch/wp-content/uploads/2013/02/plot-avg-duration-between-visits-613x247.png" alt="Figure 5 - Average Duration Between Visits of one particular user" width="613" height="247" /></a><p class="wp-caption-text">Figure 5 &#8211; Average Duration Between Visits of one particular user</p></div>
<p>There is a lot of useful information that can be derived from this plot (figure 5). Firstly, there is a repeating pattern of step-ins and step-outs within a short period of time. Perhaps this user was having a mobile phone conversation somewhere around the door of this store. Secondly, there was a step-out of store number one and a step-in into store number two within just 28 seconds. Imagine for a moment that both routers were in the same store on different floors. This pattern would be a clear indicator that this particular user went from one floor to another, and it should be interpreted as one visit.</p>
<h2>Conclusion</h2>
<ul>
<li>Analysing WiFi router log files could be done with a traditional RDBMS approach as well. But one of the main benefits of this architecture is the ability to store a variety of semi-structured files and to apply a schema afterwards. As the raw data contains a lot of information beyond our questions, it’s easy to answer different questions ad hoc. This effect can be leveraged whenever new log data from other sources is processed and joined with the existing data.</li>
<li>Answering such questions based on WiFi router log files can be done without programming software by using graphical designers from existing BI/analysis and reporting tools with a BigData platform integration.</li>
<li>Given the fact that one can quickly ramp up a test cluster with a few nodes, similar problems can be solved within one day with a handful of engineers. The Cloudera Manager makes it very easy to install, maintain and monitor a Hadoop cluster and it can be used without profound understanding of the whole ecosystem.</li>
<li>Impala as a query engine is still in beta phase but querying massive amounts of data in real time is definitely the future. Hive does not support the implicit JOINs that we used here. Furthermore, we used a JOIN with the “=” condition where the left and right sides come from the same table, which is not supported in Hive.</li>
<li>It’s possible to track people’s paths based on WiFi router signals using triangulation. There are a few projects following this idea. You can find some links below.</li>
<li>Assuming that a retail store has several floors, each equipped with a WiFi router, interpreting each “login”/“logout” on a particular router as a visit is no longer correct. Additional data processing is required to identify visitors who walk through the store and visit different floors, as they “logout” on one level and “login” on another within a short period of time, e.g. within 30 seconds.</li>
</ul>
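<p>The additional processing mentioned in the last point could look roughly like the sketch below. It assumes the sessions of one MAC address are available as (login, logout) pairs from all routers of the same store. The function name, the sample sessions and the 30-second default are illustrative assumptions, not part of the hackathon code.</p>

```python
MAX_GAP_SECONDS = 30  # threshold suggested in the text; tune for a real store


def merge_sessions(sessions, max_gap=MAX_GAP_SECONDS):
    """Merge one visitor's per-router sessions into store visits.

    sessions: list of (login_ts, logout_ts) tuples, possibly from different
    routers of the same store. A session that starts at most max_gap seconds
    after the previous logout is treated as part of the same visit.
    """
    visits = []
    for login, logout in sorted(sessions):
        if visits and login - visits[-1][1] <= max_gap:
            # Close enough to the previous session: extend that visit.
            visits[-1][1] = max(visits[-1][1], logout)
        else:
            visits.append([login, logout])
    return [tuple(v) for v in visits]


# Hypothetical sessions of one MAC address on two floors of the same store.
sessions = [(100, 400), (428, 900), (5000, 5300)]
print(merge_sessions(sessions))  # prints [(100, 900), (5000, 5300)]
```

<p>The first two sessions are 28 seconds apart and collapse into one visit, while the third one remains a separate visit.</p>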
<h2>Similar projects/solutions</h2>
<ul>
<li><em>(German) Handysignale verraten Wege der HB-Passanten</em><br />
Where does it get crowded at the main station, where do people stream into the station from, where do they go and where do jams form? A new technology that locates freely available signals from mobile phones is meant to answer these questions. <em><a href="http://www.zol.ch/ueberregional/kanton-zuerich/Handysignale-verraten-Wege-der-HBPassanten/story/23499506">Zürcher Oberländer &#8211; 09.10.12</a></em></li>
<li><em>(German) MagicMap - Ein System zur kooperativen Positionsbestimmung über WLAN</em><br />
MagicMap is a pure software solution that requires no hardware on the mobile systems beyond conventional WLAN equipment. The WLAN access points can be distributed arbitrarily, and no changes to the AP hardware or software are necessary. <em><a href="http://www.magicmap.de/">magicmap.de</a> </em></li>
<li><em>(German) Euclid Zero: US-Startup überträgt Monitoring für Onlineshops auf lokalen Handel<br />
</em>The US startup Euclid Analytics uses the WLAN function of smartphones to track and evaluate the shopping behavior of customers in brick-and-mortar retail. This produces extensive statistics, comparable to the visitor statistics of an online shop from Google Analytics. <em><a href="http://t3n.de/news/euclid-zero-google-analytics-438080/">t3n &#8211; 22.01.2013</a></em></li>
<li><em>PlaceLab Geopositioning system</em><br />
The Place Lab software listens for wireless network base-stations, it can then look-up the coordinates of whatever networks it finds and will use triangulation to calculate its position. <em><a href="http://ntrg.cs.tcd.ie/undergrad/4ba2.05/group1/index.html">4BA2 Technology Survey</a></em></li>
<li><em>OsmocomBB Project<br />
</em>OsmocomBB is a Free Software / Open Source GSM Baseband software implementation. It intends to completely replace the need for a proprietary GSM baseband software, such as a) drivers for the GSM analog and digital baseband (integrated and external) peripherals b) the GSM phone-side protocol stack, from layer 1 up to layer 3. <em><a href="http://bb.osmocom.org/trac/">osmocom.org</a></em></li>
<li><em>Indoor positioning system (IPS)</em><br />
An indoor positioning system (IPS) is a network of devices used to wirelessly locate objects or people inside a building. <em><a href="http://en.wikipedia.org/wiki/Indoor_positioning_system ">Wikipedia</a></em></li>
</ul>
<h2>Our Moment</h2>
<p>And this is what a hackathon at <a href="http://www.ymc.ch">YMC</a>/Sentric looks like:</p>
<p><div class="videoContainer"><iframe width="560" height="315" src="http://www.youtube.com/embed/U8rXvPQFLvA" frameborder="0" allowfullscreen=""></iframe></div></p>
<div id="attachment_1942" class="wp-caption alignleft" style="width: 160px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4/attachment/jp_01_1000x1500-2" rel="attachment wp-att-1942"><img class="size-thumbnail wp-image-1942" title="JP_01_1000x1500" src="http://www.sentric.ch/wp-content/uploads/2013/02/JP_01_1000x1500-150x150.jpg" alt="Jean-Pierre König, CTO Sentric" width="150" height="150" /></a><p class="wp-caption-text">Jean-Pierre König, CTO Sentric</p></div>
<p>This is the final post of this series. If you have questions or feedback: jean(minus)pierre(dot)koenig(at)sentric(dot)ch.</p>
<p>Continue reading:</p>
<p><a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1<br />
</a><a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2<br />
</a><a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3 </a></p>
<img src="http://feeds.feedburner.com/~r/sentric/~4/dqynHzgyyLg" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4</feedburner:origLink></item>
		<item>
		<title>Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3</title>
		<link>http://feeds.sentric.ch/~r/sentric/~3/iEfAZOGqs0M/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3</link>
		<comments>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3#comments</comments>
		<pubDate>Fri, 01 Feb 2013 12:08:41 +0000</pubDate>
		<dc:creator>Jean-Pierre König</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[filtering]]></category>
		<category><![CDATA[flume]]></category>
		<category><![CDATA[hdfs]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[impala]]></category>
		<category><![CDATA[log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Oozie]]></category>
		<category><![CDATA[pentaho]]></category>
		<category><![CDATA[transformation]]></category>
		<category><![CDATA[Warehouse]]></category>
		<category><![CDATA[wifi signals]]></category>

		<guid isPermaLink="false">http://www.sentric.ch/?p=1889</guid>
		<description><![CDATA[In the previous article we described how to collect WiFi router logs with Flume to store in HDFS. This article will describe how we did the transformation, parsing, filtering and finally loading into Hive’s data warehouse. Let’s start by looking at the raw data sample on HDFS. [crayon-519351bf67f7a/] In order to import the raw data [...]]]></description>
			<content:encoded><![CDATA[<p>In the <a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2">previous article</a> we described how to collect WiFi router logs with Flume to store in HDFS. This article will describe how we did the transformation, parsing, filtering and finally loading into Hive’s data warehouse.</p>
<p>Let’s start by looking at the raw data sample on HDFS.</p><pre class="crayon-plain-tag">2013-01-17T15:50:41+01:00 192.168.201.197 dropbear[1172]: Child connection from 192.168.201.99:55001
2013-01-17T15:50:46+01:00 192.168.201.197 dropbear[1172]: Password auth succeeded for 'root' from 192.168.201.99:55001
2013-01-17T15:50:52+01:00 192.168.201.197 dropbear[1172]: Exit (root): Disconnect received
2013-01-17T15:52:14+01:00 fonera hostapd: wlan0: STA 8c:64:22:3a:74:1f IEEE 802.11: disassociated due to inactivity
2013-01-17T15:52:14+01:00 fonera hostapd: wlan0: STA 8c:64:22:3a:74:1f MLME: MLME-DISASSOCIATE.indication(8c:64:22:3a:74:1f, 4)
2013-01-17T15:52:14+01:00 fonera hostapd: wlan0: STA 8c:64:22:3a:74:1f MLME: MLME-DELETEKEYS.request(8c:64:22:3a:74:1f)</pre><p>In order to import the raw data into the Hive data warehouse we need to parse it into a comma-separated format. From a data scientist’s perspective we would like to accomplish this task with a proper tool. There are quite a few open-source BI tools on the market for this: Palo, SpagoBI, Pentaho, Talend and many more. We did a short evaluation and finally chose Pentaho Data Integration. Its support for the Cloudera distribution enabled us to design a MapReduce job for distributed processing across multiple nodes without any programming environment.</p>
<div id="attachment_1890" class="wp-caption alignnone" style="width: 615px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3/attachment/pentaho-data-integration-graphical-designer" rel="attachment wp-att-1890"><img class=" wp-image-1890      " title="Pentaho Data Integration’s Graphical Designer" src="http://www.sentric.ch/wp-content/uploads/2013/01/Pentaho-Data-Integration-Graphical-Designer.png" alt="Pentaho Data Integration’s Graphical Designer" width="605" height="255" /></a><p class="wp-caption-text">Pentaho Data Integration’s Graphical Designer</p></div>
<p>The map phase reads all the raw log files collected by Flume on HDFS. The input is interpreted as TextInputFormat, so every line goes through a regex evaluation during the map phase. This accomplishes two things in one step: filtering and transformation.</p>
<h2>Transformation</h2>
<p>By matching each line against a regular expression we can capture the pieces of information we are interested in as groups. We use this to split the line into fields that will be used as columns later on. This is the regex for the transformation:</p><pre class="crayon-plain-tag">^((\d{4})-(\d{2})-(\d{2})\w(\d{2}):(\d{2}):(\d{2})([+-]\d{2}:\d{2})) ([.a-zA-Z_0-9]*?) (.*?): (.*?): \w*? ([\w+:]{0,18}) (.*?): (.*)$</pre><p>Matching lines now have a pseudo schema like this:</p><pre class="crayon-plain-tag">iso_8601 String
year Integer
month Integer
day Integer
hour Integer
minute Integer
second Integer
timezone String
host String
facility_level String
service_level String
mac_address String
protocol String
message String</pre>
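<p>To see the transformation and the filtering effect of this pattern outside of Pentaho, here is a small sketch that applies the same regular expression to two of the raw sample lines. The helper <em>parse_line</em> and the list <em>FIELDS</em> are illustrative names. All captured values are returned as strings here, so the conversion to integers from the pseudo schema is left out.</p>

```python
import re

# The regular expression from the article, unchanged.
LOG_PATTERN = re.compile(
    r"^((\d{4})-(\d{2})-(\d{2})\w(\d{2}):(\d{2}):(\d{2})([+-]\d{2}:\d{2})) "
    r"([.a-zA-Z_0-9]*?) (.*?): (.*?): \w*? ([\w+:]{0,18}) (.*?): (.*)$"
)

# Field names in the order of the capture groups (see the pseudo schema).
FIELDS = [
    "iso_8601", "year", "month", "day", "hour", "minute", "second",
    "timezone", "host", "facility_level", "service_level",
    "mac_address", "protocol", "message",
]


def parse_line(line):
    """Return a dict of named fields, or None if the line is filtered out."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


hostapd_line = ("2013-01-17T15:52:14+01:00 fonera hostapd: wlan0: "
                "STA 8c:64:22:3a:74:1f IEEE 802.11: disassociated due to inactivity")
dropbear_line = ("2013-01-17T15:50:41+01:00 192.168.201.197 dropbear[1172]: "
                 "Child connection from 192.168.201.99:55001")

print(parse_line(hostapd_line))   # dict with host, mac_address, message, ...
print(parse_line(dropbear_line))  # prints None, the line does not match
```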
<h2>Filtering</h2>
<p>All lines that do not match the regular expression are filtered out. We are not interested in those lines because they carry no useful information for this case. Here are some examples:</p><pre class="crayon-plain-tag">2013-01-17T15:50:41+01:00 192.168.201.197 dropbear[1172]: Child connection from 192.168.201.99:55001
2013-01-17T15:50:46+01:00 192.168.201.197 dropbear[1172]: Password auth succeeded for 'root' from 192.168.201.99:55001
2013-01-17T15:50:52+01:00 192.168.201.197 dropbear[1172]: Exit (root): Disconnect received</pre><p>During the next step ‘Filter Rows’ we remove empty lines. This is to ensure there are no empty lines in the output after matching against the regular expression.</p>
<p>To produce a comma-separated file we used a ‘User Defined Java Expression’ that concatenates the emitted fields, delimited by ‘,’. At this point we applied one further, very important transformation: converting the ISO 8601 string to a Unix timestamp. To answer time-related questions, e.g. the average visit duration, we need values to calculate with, and the Unix timestamp is well suited for this.</p>
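<p>The ISO 8601 to Unix timestamp conversion can be checked independently of Pentaho with a few lines of Python. This is only a cross-check under the assumption that Python 3.7 or later is available. The helper <em>iso_to_unix</em> is hypothetical and is not what the job itself runs.</p>

```python
from datetime import datetime


def iso_to_unix(iso_8601):
    """Convert an ISO 8601 timestamp with UTC offset to Unix seconds."""
    return int(datetime.fromisoformat(iso_8601).timestamp())


# 21 January 2013, 11:47:47 at UTC+01:00
print(iso_to_unix("2013-01-21T11:47:47+01:00"))  # prints 1358765267
```

<p>Because the offset is part of the string, the result does not depend on the time zone of the machine that runs the conversion.</p>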
<p>Here is the ‘User Defined Java Expression’:</p><pre class="crayon-plain-tag">(javax.xml.bind.DatatypeConverter.parseDateTime(iso_8601).getTimeInMillis()/1000) + "," + year + "," + month + "," + day + "," + hour + "," + minute + "," + second + "," + timezone + "," + host + "," + facility_level + "," + service_level + "," + mac_address + "," + protocol + "," + message</pre><p>At the very end, the transformed and parsed raw data lands in HDFS once the MapReduce job has finished. Let’s look at a random sample:</p><pre class="crayon-plain-tag">1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,IEEE 802.1X,authorizing port
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,WPA,pairwise key handshake completed (RSN)</pre><p>Now we have parsed log files on HDFS. We used Pentaho Data Integration once again to import the data to Hive’s warehouse. Before doing so we created a table that matches the previously defined schema with the query editor of the beeswax user interface:</p>
<div id="attachment_1891" class="wp-caption alignnone" style="width: 658px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3/attachment/create-table-in-hive" rel="attachment wp-att-1891"><img class="size-full wp-image-1891" title="Create the routerlogs table in Hive" src="http://www.sentric.ch/wp-content/uploads/2013/01/Create-table-in-Hive.png" alt="Create the routerlogs table in Hive" width="648" height="618" /></a><p class="wp-caption-text">Create the &#8216;routerlogs&#8217; table in Hive</p></div>
<p>Loading data into the table is basically done by copying files on HDFS from <em>hdfs://cdh-master.cdh-cluster:8020/user/jpkoenig/routerlogs/parse</em> to <em>hdfs://cdh-master.cdh-cluster:8020/user/hive/warehouse/routerlogs</em> with a wildcard <em>part.*</em> for the file mask.*</p>
<p>* We configured our Pentaho MapReduce job to clean the output path before execution. Every time the job is executed it produces new files on HDFS with the same file mask. Here is what the output folder looks like:</p><pre class="crayon-plain-tag">[root@cdh-master ~]# hadoop fs -ls /user/jpkoenig/routerlogs/parse
Found 91 items
drwxrwxrwx   - jpkoenig jpkoenig          0 2013-01-21 15:24 /user/jpkoenig/routerlogs/parse/_logs
-rw-r--r--   3 jpkoenig jpkoenig     118963 2013-01-21 15:25 /user/jpkoenig/routerlogs/parse/part-00000
-rw-r--r--   3 jpkoenig jpkoenig     100500 2013-01-21 15:25 /user/jpkoenig/routerlogs/parse/part-00001
-rw-r--r--   3 jpkoenig jpkoenig      11826 2013-01-21 15:25 /user/jpkoenig/routerlogs/parse/part-00002
…</pre><p>Importing this into the Hive warehouse by copying files on HDFS is not suitable for incremental updates. This procedure will replace the table data every time. For a production system you should consider the following things:</p>
<ul>
<li>automating the MapReduce job on a schedule, e.g. with Oozie</li>
<li>ensuring incremental updates on the Hive table by using partitioned tables or unique output file names</li>
</ul>
<p>That&#8217;s it for this step. You can download the Pentaho jobs here:</p>
<p><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3/attachment/pentaho-project-parse-transform-filter-wifi-router-logs-with-mapreduce-and-hive" rel="attachment wp-att-1909">Pentaho project files: Parse Mapper, MapReduce Job, Load to Hive Job</a></p>
<div id="attachment_1845" class="wp-caption alignleft" style="width: 160px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1/attachment/jp_01_1000x1500" rel="attachment wp-att-1845"><img class="size-thumbnail wp-image-1845" title="Jean-Pierre Koenig, CTO at Sentric" src="http://www.sentric.ch/wp-content/uploads/2013/01/JP_01_1000x1500-150x150.jpg" alt="" width="150" height="150" /></a><p class="wp-caption-text">Jean-Pierre Koenig, CTO at Sentric</p></div>
<p>Now we have everything in place. It’s time to write queries! We will write about querying the data with Impala in the next post. Stay tuned! And, again, contact me if you have questions or feedback: jean(minus)pierre(dot)koenig(at)sentric(dot)ch.</p>
<p>Continue reading:</p>
<p><a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1</a><br />
<a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2</a><br />
<a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 4" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 4</a></p>
<img src="http://feeds.feedburner.com/~r/sentric/~4/iEfAZOGqs0M" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3</feedburner:origLink></item>
		<item>
		<title>Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2</title>
		<link>http://feeds.sentric.ch/~r/sentric/~3/h8Gbzt9rW3A/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2</link>
		<comments>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2#comments</comments>
		<pubDate>Tue, 29 Jan 2013 11:05:26 +0000</pubDate>
		<dc:creator>Gerd Koenig</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[configuration]]></category>
		<category><![CDATA[data ingestion]]></category>
		<category><![CDATA[flume]]></category>
		<category><![CDATA[Hackathon]]></category>
		<category><![CDATA[hdfs]]></category>
		<category><![CDATA[log analysis]]></category>
		<category><![CDATA[openWRT]]></category>
		<category><![CDATA[syslog]]></category>
		<category><![CDATA[wifi]]></category>

		<guid isPermaLink="false">http://www.sentric.ch/?p=1856</guid>
		<description><![CDATA[Following on from Jean-Pierre’s introduction to this experiment in part 1, I will now expand on the technical details of the data ingestion process using Flume. As you can see in figure 2 from the previous post, first of all we had to collect log data as a data source to be read by Flume [...]]]></description>
			<content:encoded><![CDATA[<p>Following on from Jean-Pierre’s introduction to this experiment in <a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1">part 1</a>, I will now expand on the technical details of the data ingestion process using Flume.</p>
<p><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2/attachment/wifi-based-in-store-analysis-with-hadoop-and-impala-hackathon2" rel="attachment wp-att-1872"><img class="alignnone size-full wp-image-1872" title="WiFi based In-Store analysis with Hadoop-and-Impala, Hackathon" src="http://www.sentric.ch/wp-content/uploads/2013/01/WiFi-based-In-Store-analysis-with-Hadoop-and-Impala-Hackathon2.jpg" alt="WiFi based In-Store analysis with Hadoop-and-Impala, Hackathon" width="500" height="281" /></a></p>
<p>As you can see in figure 2 of the previous post, we first had to collect log data as a data source for Flume to read. Two WiFi access points were available, a Buffalo WZR-HP-G300NH2 and a Fonera, both running <a href="https://openwrt.org/" target="_blank">OpenWRT</a> as their operating system.<br />
We configured the WiFi access points to send all their local syslog messages to a central syslog server. This can easily be done via OpenWRT’s Unified Configuration Interface (UCI). Assuming your syslog server listens on address 192.168.0.1 and you want the most detailed log output, the configuration looks like this:</p><pre class="crayon-plain-tag">#&gt;uci set system.@system[0].log_ip=192.168.0.1
#&gt;uci set system.@system[0].conloglevel=8
#&gt;uci commit
#&gt;reboot</pre><p>Additionally, the syslog server needs to be configured to accept messages from remote hosts and told where to write them. In our scenario we just wanted to write them to a text file. Since we were using Rsyslog as the syslog server, the corresponding configuration file is /etc/rsyslog.conf.</p>
<p>The messages are sent as UDP packets, so we had to enable syslog reception via UDP on port 514. Assuming that each of the WiFi access points has an IP address starting with 192.168.0, the configuration settings are:</p><pre class="crayon-plain-tag">$ModLoad imudp
$UDPServerRun 514
if $fromhost-ip startswith '192.168.0.' then /var/log/hackathon-logs.log
&amp; ~</pre><p>Restart the syslog daemon by executing /etc/init.d/rsyslog restart to apply the changes. Here is an excerpt of the resulting log file:</p><pre class="crayon-plain-tag">2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: start authentication
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy IEEE 802.1X: unauthorizing port
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: sending 1/4 msg of 4-Way Handshake
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: received EAPOL-Key frame (2/4 Pairwise)
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: sending 3/4 msg of 4-Way Handshake
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: received EAPOL-Key frame (4/4 Pairwise)
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy IEEE 802.1X: authorizing port
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: pairwise key handshake completed (RSN)
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: authentication OK (open system)
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy MLME: MLME-AUTHENTICATE.indication(24:ab:81:91:c8:62, OPEN_SYSTEM)
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy MLME: MLME-DELETEKEYS.request(24:ab:81:91:c8:62)
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: authenticated
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: association OK (aid 1)
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: associated (aid 1)</pre><p>Now that we have a log file as a data source, we set up <a href="https://cwiki.apache.org/FLUME/home.html" target="_blank">Flume</a> to stream the incoming content to HDFS. In Flume terminology, we had the following components:</p>
<ul>
<li>data from the log-file as <em>source</em></li>
<li>HDFS folder /user/flume/hackathon-datastream as <em>sink</em><br />
(we preferred a flat directory layout to simplify accessing and processing the files later on)</li>
<li>a <em>channel</em>, c1, to connect the source to the sink</li>
</ul>
<p>The easiest way to stream the incoming log messages to the channel (while also keeping the log file on the local hard disk) is to configure a source of type <em>exec</em>. You only have to specify a Linux command that “listens” on that file, in our case a simple <em>tail</em> command. Additionally, we wanted to have the data as plain data stream files in HDFS, not as so-called sequence files, because processing those files afterwards was much easier in our particular hackathon environment. Here is what the overall Flume configuration looks like:</p><pre class="crayon-plain-tag">a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# get data from exec command
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/hackathon-logs.log
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# define hdfs sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://cdh-master.cdh-cluster:8020/user/flume/hackathon-datastream
a1.sinks.k1.hdfs.rollInterval = 120
a1.sinks.k1.hdfs.rollCount = 100
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.fileType = DataStream
# bind source and sink to channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1</pre><p>The lines containing “interceptors” are needed to put additional header information in the data stream, in our case the hostname and timestamp.</p>
<p>Finally, we started Flume manually by executing:</p><pre class="crayon-plain-tag">flume-ng agent --conf-file ./flume-datastream.conf --name a1 -Dflume.root.logger=INFO,console</pre><p>Since Flume is able to collect data from various sources, it is also possible to configure Flume itself as the “syslog server”. The WiFi access points would then send their log messages directly to the Flume agent. The advantage of this configuration is that you don’t need additional software (the syslog server). We chose the method described above because we already had a running syslog server and our first attempt was to use that.</p>
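<p>For comparison, here is a minimal sketch of that alternative: the exec source is replaced by Flume’s built-in syslog UDP source, while the channel and sink definitions stay as shown above. Port 5140 is an assumption chosen for illustration (an unprivileged port, so the agent does not need root privileges); the access points would then have to send their messages to the Flume host on that port. Note that this variant no longer keeps a local copy of the log file, which the tail-based setup does. Treat it as an untested starting point rather than the configuration we ran.</p><pre class="crayon-plain-tag"># hypothetical variant: Flume itself receives the syslog messages
a1.sources.r1.type = syslogudp
# listen on all interfaces; restrict this in a real setup
a1.sources.r1.host = 0.0.0.0
# assumed port, not taken from our setup
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1</pre>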
<div id="attachment_1860" class="wp-caption alignleft" style="width: 95px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2/attachment/gerd_01_sw" rel="attachment wp-att-1860"><img class=" wp-image-1860       " style="font-size: 8pt;" title="Gerd Koenig, Cloudera Certified Administrator for Apache Hadoop " src="http://www.sentric.ch/wp-content/uploads/2013/01/Gerd_01_sw.jpg" alt="Gerd Koenig, Cloudera Certified Administrator for Apache Hadoop " width="85" height="128" /></a><p class="wp-caption-text">Gerd Koenig, Certified Hadoop Administrator</p></div>
<p>Now that we have the log data in HDFS, the next article in this series will proceed with transforming, parsing and filtering it. Stay tuned for part 3 and feel free to send me your feedback: gerd(dot)koenig(at)sentric(dot)ch.</p>
<p>Continue reading:</p>
<p><a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1<br />
</a><a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3<br />
</a><a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 4" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-4">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 4</a></p>
<img src="http://feeds.feedburner.com/~r/sentric/~4/h8Gbzt9rW3A" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2</feedburner:origLink></item>
		<item>
		<title>Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1</title>
		<link>http://feeds.sentric.ch/~r/sentric/~3/LB7q2tNnFv4/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1</link>
		<comments>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1#comments</comments>
		<pubDate>Fri, 25 Jan 2013 09:35:17 +0000</pubDate>
		<dc:creator>Jean-Pierre König</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[bi]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[data management system]]></category>
		<category><![CDATA[data scientist]]></category>
		<category><![CDATA[data warehouse]]></category>
		<category><![CDATA[flume]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[impala]]></category>
		<category><![CDATA[in-store analysis]]></category>
		<category><![CDATA[log analysis]]></category>
		<category><![CDATA[openWRT]]></category>
		<category><![CDATA[pentaho]]></category>
		<category><![CDATA[retail]]></category>
		<category><![CDATA[tracking]]></category>
		<category><![CDATA[wifi signals]]></category>

		<guid isPermaLink="false">http://www.sentric.ch/?p=1834</guid>
		<description><![CDATA[This week we were inspired to do some research, driven by an idea: It must be possible to bring the concepts of tracking users in the online world to retail stores. We are not the experts in retail but we know that one of the most important key performance indicators is revenue per square metre. [...]]]></description>
			<content:encoded><![CDATA[<p>This week we were inspired to do some research, driven by an idea: it must be possible to bring the concepts of tracking users in the online world to retail stores. We are not experts in retail, but we know that one of the most important key performance indicators is revenue per square metre. We thought about bringing in some new metrics. From a wider perspective, data is produced by various sensors. With a real store in mind we listed the sensors a store could use &#8211; customer frequency counters at the doors, the cashier system, free WiFi access points, video capturing, temperature, background music, smells and many more. While many of those sensors need additional hardware and software, solutions already exist for a few of them, e.g. video capturing with face or even eye recognition. We talked about our ideas with executives and consultants from the retail industry and they confirmed that our idea is interesting to pursue.</p>
<p><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1/attachment/wifi-based-in-store-analysis-with-hadoop-and-impala-hackathon" rel="attachment wp-att-1835"><img class="alignnone size-full wp-image-1835" title="WiFi based In-Store analysis with Hadoop-and-Impala, Hackathon" src="http://www.sentric.ch/wp-content/uploads/2013/01/WiFi-based-In-Store-analysis-with-hadoop-and-impala-hackathon.jpg" alt="WiFi based In-Store analysis with Hadoop-and-Impala, Hackathon" width="500" height="281" /></a></p>
<p>We thought the most interesting sensor data (that doesn’t require additional hardware/software) could come from the WiFi access points, especially given that many visitors will have WiFi-enabled mobile phones. With their log files we should be able to answer at least the following questions for a particular store:</p>
<ul>
<li>How many people visited the store (unique visits)?</li>
<li>How many visits did we have in total?</li>
<li>What is the average visit duration?</li>
<li>How many people are new vs. returning?</li>
</ul>
<h1>How do we answer these questions?</h1>
<p>Before we started designing a blueprint solution we first of all asked ourselves:</p>
<ul>
<li>Who would be asked to answer questions like this?</li>
<li>Who is this person?</li>
<li>What tools does this person expect to use?</li>
<li>And what is a typical skill set of this person?</li>
<li>How do they work?</li>
</ul>
<p>From an interview with an industry-leading company we knew that these questions will be answered by analysts. They use data warehouses and they typically have a business intelligence (BI), analysis and reporting tool with access to the data warehouse. They are used to using SQL to answer questions.</p>
<p>With our experience at Sentric, we knew that solving the problem with a Big Data approach will introduce a new person &#8211; the “<a href="http://en.wikipedia.org/wiki/Data_science" target="_blank">Data Scientist</a>”. Right, at that point we slightly adjusted our mission.</p>
<h1>So, how do we answer these questions as a Data Scientist?</h1>
<p>From a high level of abstraction the answer is simple. We need a data management system with three pieces: ingest, store and process.</p>
<div id="attachment_1837" class="wp-caption alignnone" style="width: 623px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1/attachment/traditional-approach" rel="attachment wp-att-1837"><img class="size-medium wp-image-1837" title="Traditional Data Management System Approach" src="http://www.sentric.ch/wp-content/uploads/2013/01/traditional-approach-613x91.png" alt="Traditional Data Management System Approach" width="613" height="91" /></a><p class="wp-caption-text">Traditional Data Management System Approach</p></div>
<p>We take this basic architecture and replace the generic terms while mapping it onto the Hadoop ecosystem.</p>
<div id="attachment_1838" class="wp-caption alignnone" style="width: 623px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1/attachment/bigdata-approach" rel="attachment wp-att-1838"><img class="size-medium wp-image-1838 " title="Blueprint for a Data Management System with Hadoop" src="http://www.sentric.ch/wp-content/uploads/2013/01/bigdata-approach-613x156.png" alt="Blueprint for a Data Management System with Hadoop" width="613" height="156" /></a><p class="wp-caption-text">Blueprint for a Data Management System with Hadoop</p></div>
<p>With this Hadoop architecture a Data Scientist should be able to answer the questions without any programming environment. He/she can also use familiar BI, analysis and reporting tools.</p>
<h1>Setup</h1>
<p>We planned a hackathon together with our partner company <a href="http://www.ymc.ch" target="_blank">YMC</a> to prove this concept. Here are the ingredients:</p>
<ul>
<li>2 WiFi access points to simulate two different stores, both with <a href="https://openwrt.org/" target="_blank">OpenWRT</a>, a Linux-based firmware for routers, installed *</li>
<li>A virtual machine acting as central syslog daemon collecting all log messages from the WiFi routers</li>
<li><a href="https://ccp.cloudera.com/display/CDH4DOC/Flume+Installation" target="_blank">Flume</a> to move all log messages to HDFS, without any manual intervention (no transformation, no filtering)</li>
<li>A 4 node CDH4 cluster running on virtual machines (CentOS, 2 GB RAM, 100 GB HDD), installed and monitored with <a href="https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Downloads" target="_blank">Cloudera Manager</a></li>
<li><a href="http://www.pentaho.com/explore/pentaho-data-integration/" target="_blank">Pentaho Data Integration</a>&#8216;s graphical designer for data transformation, parsing, filtering and loading to the warehouse (Hive)</li>
<li><a href="https://ccp.cloudera.com/display/CDH4DOC/Hive+Installation" target="_blank">Hive</a> as data warehouse system on top of Hadoop to project structure onto data</li>
<li><a href="https://ccp.cloudera.com/display/IMPALA10BETADOC/Installing+and+Using+Cloudera+Impala" target="_blank">Impala</a> for querying data from Hive in real time</li>
<li>Microsoft Excel to visualize results **</li>
</ul>
<p>* We actually fired up the two WiFi routers before the hackathon to collect some data for a period of around 4 days.<br />
** Since Impala is still beta it only supports SELECT statements. Therefore it’s not able to CREATE new tables from query results in Hive’s warehouse. With this restriction we decided to copy &amp; paste query results into MS Excel for further analysis and visualization. Once Impala can CREATE tables a Data Scientist can access that data from their BI, analysis and reporting tools.</p>
<div id="attachment_1845" class="wp-caption alignleft" style="width: 160px"><a href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1/attachment/jp_01_1000x1500" rel="attachment wp-att-1845"><img class="size-thumbnail wp-image-1845 " title="Jean-Pierre Koenig, CTO at Sentric" src="http://www.sentric.ch/wp-content/uploads/2013/01/JP_01_1000x1500-150x150.jpg" alt="" width="150" height="150" /></a><p class="wp-caption-text">Jean-Pierre Koenig, CTO at Sentric</p></div>
<p>In part 2 of this series you will find the details of the data ingestion. If you would like to give us feedback or want more details, do not hesitate to contact me: jean(minus)pierre(dot)koenig(at)sentric.ch.</p>
<p>[Update]</p>
<p>Continue reading:</p>
<p><a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-2">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2<br />
</a><a title="Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3" href="http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-3">Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3</a></p>
<img src="http://feeds.feedburner.com/~r/sentric/~4/LB7q2tNnFv4" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1/feed</wfw:commentRss>
		<slash:comments>12</slash:comments>
		<feedburner:origLink>http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1</feedburner:origLink></item>
		<item>
		<title>Hadoop training by Cloudera</title>
		<link>http://feeds.sentric.ch/~r/sentric/~3/Lo9ql8T60Xs/hadoop-training-by-cloudera</link>
		<comments>http://www.sentric.ch/blog/hadoop-training-by-cloudera#comments</comments>
		<pubDate>Thu, 17 Jan 2013 09:09:20 +0000</pubDate>
		<dc:creator>Gerd Koenig</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.sentric.ch/?p=1825</guid>
		<description><![CDATA[Last week I attended an admin training about Hadoop, held by Cloudera in a comfortable and well prepared location in London. This 3-day course covers several topics of the Hadoop ecosystem, all within 500+ slides and some exercises. The range is from historical information, illustration of why Hadoop is needed, introduction to MapReduce and job [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I attended an <a href="http://university.cloudera.com/training/apache_hadoop/administrator.html" target="_blank">admin training</a> about Hadoop, held by <a href="http://www.cloudera.com/" target="_blank">Cloudera</a> in a comfortable and well prepared location in London. This 3-day course covers several topics of the <a href="http://en.wikipedia.org/wiki/Hadoop">Hadoop</a> ecosystem, all within 500+ slides and some exercises. It ranges from historical information, an illustration of why Hadoop is needed and an introduction to <a href="http://en.wikipedia.org/wiki/Mapreduce">MapReduce</a> and job scheduling up to planning, maintaining and troubleshooting a Hadoop cluster. Additional tools, e.g. Flume and Sqoop, are also discussed in an extra chapter.<br />
Even though the title suggests exclusively administration-related topics, from my point of view this training is more of a general introduction to Hadoop that conveys the “big picture” and its basic ideas. It is therefore not limited to system administrators; it suits developers, IT architects and simply anybody who wants to start diving into Hadoop. On the other hand, the training doesn’t explain specific operations-related tasks in detail; it is more of a high-level view of the system, focused on Cloudera’s recommendations.<br />
If you are responsible for maintaining a Hadoop cluster, understand how the involved daemons relate to each other, have collected some experience, configured some parameters and already run into trouble (and hopefully solved it afterwards), you will not benefit from this training. In that case you are outside the target audience of this training, so think twice in advance.</p>
<p>To put it in a nutshell:<br />
This course is perfectly suited to getting a basic understanding of the concepts behind Hadoop. HDFS and MapReduce are explained very well. I have benefited from it in that I now understand how the different daemons relate to each other, what they are for, and what to do in case of a (Hadoop-related) emergency.<br />
Now it’s time to gain experience and share it with the community.</p>
<p>If you have any feedback, feel free to contact me by <a href="javascript:DeCryptX('hfse/lpfojhAtfousjd/di')" target="_blank">mail</a> or <a href="https://twitter.com/gerd_koenig" target="_blank">twitter</a>.</p>
<img src="http://feeds.feedburner.com/~r/sentric/~4/Lo9ql8T60Xs" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.sentric.ch/blog/hadoop-training-by-cloudera/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.sentric.ch/blog/hadoop-training-by-cloudera?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=hadoop-training-by-cloudera</feedburner:origLink></item>
	</channel>
</rss>
