-
Trendulo - A Twitter Analytics Demo on Accumulo
Jared winick 2012-04-24, 13:35

I gave an Introduction to Apache Accumulo <http://www.slideshare.net/jaredwinick/introduction-to-apache-accumulo> presentation last month at the Boulder/Denver Meetup <http://www.meetup.com/Boulder-Denver-Big-Data/events/55277392/>, where I demoed an application that uses Accumulo to provide real-time and historical access to words/phrases seen in Twitter messages, as well as daily trend analysis. I finally got the demo polished up a bit and running on Amazon EC2, where it can be found at http://trendulo.com.

Trendulo is still pretty alpha at this point, so please feel free to add to the existing documented issues at https://github.com/jaredwinick/trendulo where you can also find the source.

As an example, the following link shows the launch of Instagram's Android client, followed by Facebook's purchase and then a small increase in general "chatter" about the product: http://goo.gl/XcCG8

Let me know if anyone has any questions or comments. Feel free to tweet @trendulo any interesting searches and I can retweet them out.

Jared
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Keith Turner 2012-04-24, 14:54

Jared,

That's awesome! What happened on Mar 19 and 20?

Keith
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Keith Turner 2012-04-24, 14:57

Jared,

Searching for the word "school" is neat; you can clearly see the weekends. The domain name is cool.

Keith
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Eric Newton 2012-04-24, 15:10

Aw, man, I'm not going to get anything done today! This is fun!

-Eric
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Billie J Rinaldi 2012-04-24, 17:40

That's so cool that I'm creating a new section for it on our page of links:
http://accumulo.apache.org/papers.html

Billie
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Jared winick 2012-04-25, 04:17

Thanks for the kind words, I appreciate it. Keith, my ingest process was down on Mar 19-20, which is why I am missing data for that period.

For those who are curious, I am receiving about 1.2 million tweets a day and have about 3 billion entries in my main table. I am actually getting by with everything running on an EC2 medium instance, which is obviously very far from ideal, but I am trying to stay on a budget.

I hope to add new features as time allows, things like near real-time trending and geospatial analytics. If anyone has ideas for features they think would be interesting, just let me know or add them as issues on the GitHub page.
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Eric Newton 2012-04-25, 12:52

How many key-values does a single tweet become, on average? What's the storage size per tweet?
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Aaron Cordova 2012-04-25, 13:43

Speaking of storage - are you using EBS or local instance storage?
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Jared winick 2012-04-25, 19:10

So it is pretty brute force at ingest time to enable queries to be fast and efficient. For each tweet it builds all 1-, 2-, and 3-grams from the message in the tweet. So an example message of:

"i can has cheezburger"

would be translated into the following n-grams:

"i", "can", "has", "cheezburger", "i can", "can has", "has cheezburger", "i can has", "can has cheezburger"

Then for each n-gram, it keeps a daily and hourly counter using a SummingCombiner. The data model looks like:

rowId: n-gram
cf: DAY or HOUR
cq: date value (ex. 20120425)
value: counter

So a single tweet turns into many key-values, one for each n-gram/time period. I would have to verify, but on average I think it works out to about 60 key-values per tweet. I end up seeing anywhere from a few hundred entries/sec inserted in the middle of the night to about 2000 entries/sec during peak evening times.

I am not exactly sure how to answer the question about storage size per tweet, as I am not actually storing the original tweet, and if a counter already exists for an n-gram/time period, then incrementing that counter doesn't increase the storage size. I can follow up with the current storage I am using though.

Aaron, I am using EBS now and I haven't seen any problems; that said, my load is obviously not extreme. When I initially moved things from my home workstation to EC2, I had a few months of tweets to ingest. For that initial ingest I did run with local instance storage, as I saw extremely variable performance when I first tried EBS. The instance storage was better, though not as good as what I see on bare metal.

Jared
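The ingest scheme described in this message (1-, 2-, and 3-grams, each with a DAY and an HOUR counter) can be sketched roughly as follows. This is a minimal illustration, not Trendulo's actual code: the function names are hypothetical, whitespace tokenization is assumed, a `Counter` stands in for an Accumulo table with a SummingCombiner, and the hourly qualifier format (`YYYYMMDDHH`) is an assumption, since the message only gives the daily example `20120425`.

```python
from collections import Counter
from datetime import datetime


def ngrams(message, max_n=3):
    """Return all 1..max_n-grams of a message, shortest n-grams first."""
    tokens = message.split()  # assumption: simple whitespace tokenization
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tweet_to_entries(message, when):
    """Expand one tweet into ((row, cf, cq), increment) pairs."""
    day = when.strftime("%Y%m%d")     # e.g. 20120425, as in the message
    hour = when.strftime("%Y%m%d%H")  # assumed format for the HOUR qualifier
    entries = []
    for gram in ngrams(message):
        entries.append(((gram, "DAY", day), 1))
        entries.append(((gram, "HOUR", hour), 1))
    return entries


def ingest(table, message, when):
    """Add the increments to the table; Counter emulates summing on write."""
    for key, increment in tweet_to_entries(message, when):
        table[key] += increment


table = Counter()
ingest(table, "i can has cheezburger", datetime(2012, 4, 25, 19, 10))
print(len(table))  # 9 n-grams x 2 time periods
```

Ingesting the same n-gram again within the same day only bumps an existing counter, which is why the number of stored entries grows more slowly than the number of key-values written.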
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Aaron Cordova 2012-04-26, 02:19

On Apr 25, 2012, at 3:10 PM, Jared winick wrote:
> Aaron, I am using EBS now and I haven't seen any problems, that said my load is obviously not extreme. When I initially moved things from my home workstation to EC2, I had a few months of tweets to ingest. For that initial ingest I did run with local instance storage as I saw extremely variable performance when I first tried EBS. The instance storage was better, though not as good as what I see on bare metal.

Thanks for the info. I get the sense that you can scale up a single server more easily using EBS, since you can attach something like 10 volumes and RAID them together. More volumes might mean less variability too, depending on how you configure RAID.
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Jason Trost 2012-04-26, 10:49

This is awesome, Jared. Thanks for sharing.
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Eric Newton 2012-04-27, 19:09

On Wed, Apr 25, 2012 at 3:10 PM, Jared winick <[EMAIL PROTECTED]> wrote:
> I am not exactly sure how to answer the question about storage size per
> tweet as I am not actually storing the original tweet and if a counter
> already exists for an n-gram/time period, then incrementing that counter
> doesn't increase the storage size. I can follow up with the current storage
> I am using though.

I see I can make some estimates based on the information in your talk. The slides are awesome, btw.

Using the information you provided: Dec 24 - March 12... that's 88 days. 2.6e9 entries, 3 million-ish tweets per day:

2.6e9 / (3e6 * 88)

~10 entries per tweet.

Also, you report disk usage of 72G, which I will interpret as 72 * (1024 ** 3) bytes.

So, each tweet, on average, occupies:

72G / (88 * 3e6)

Or, ~300 bytes.

-Eric
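The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name is hypothetical, and the inputs are the figures quoted in the message (2.6e9 entries, about 3 million tweets per day, 88 days, 72 GiB of disk).

```python
def per_tweet_estimates(total_entries, tweets_per_day, days, disk_bytes):
    """Return (entries per tweet, bytes per tweet) for an ingest period."""
    total_tweets = tweets_per_day * days
    return total_entries / total_tweets, disk_bytes / total_tweets


# Figures taken from the message above; 72G is read as 72 * 1024**3 bytes.
entries_per_tweet, bytes_per_tweet = per_tweet_estimates(
    total_entries=2.6e9,
    tweets_per_day=3e6,
    days=88,
    disk_bytes=72 * 1024**3,
)
print(f"{entries_per_tweet:.1f} entries/tweet, {bytes_per_tweet:.0f} bytes/tweet")
```

Both results depend directly on the assumed tweets-per-day figure, so a lower daily volume over the same period raises the per-tweet numbers proportionally.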
-
Re: Trendulo - A Twitter Analytics Demo on Accumulo
Jared winick 2012-04-30, 13:33

Here is an up-to-date estimate. I naively reported disk usage as the "Disk Used" field under the Accumulo Master section of the monitor. Currently it appears I am only actually using ~26 GB of storage for my Accumulo tables. This is based on the "% Used" * "Unreplicated Capacity" fields in the NameNode section of the monitor, which is also corroborated by looking at the file system usage for the HDFS data directories. I have no other data in HDFS.

Dec 24 - Apr 30 = 128 days
3.0 billion entries / 128 days = 23.4 million entries/day
23.4 million entries/day / 1.2 million tweets/day ~ 20 entries/tweet
(not sure if I misrepresented the number of tweets per day as 3 million before, but it is about 1.2)
26GB / ( 128 * 1.2e6 ) ~ 182 bytes/tweet

I am using the VARLEN encoding for the SummingCombiner, which probably helps save a lot of space, as I would imagine there are a lot of entries with a very small count because the language used on Twitter is far from normal.
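To illustrate why a variable-length encoding helps when most counters are small, here is a rough sketch of one such scheme, modeled on the variable-length long format used by Hadoop's `WritableUtils.writeVLong`. It is an assumption that Trendulo's VARLEN setting produces exactly these bytes; the function name `encode_vlong` is hypothetical, and the point is only the size comparison against a fixed 8-byte long.

```python
import struct


def encode_vlong(value):
    """Encode a signed 64-bit integer in a Hadoop-style variable-length form."""
    # Small values fit in a single byte.
    if -112 <= value <= 127:
        return bytes([value & 0xFF])
    marker = -112
    if value < 0:
        value = ~value  # store the one's complement of negative values
        marker = -120
    # Lower the marker once per byte needed to hold the magnitude.
    tmp = value
    while tmp != 0:
        tmp >>= 8
        marker -= 1
    n_bytes = -(marker + 120) if marker < -120 else -(marker + 112)
    # One marker byte (sign and length), then the magnitude big-endian.
    return bytes([marker & 0xFF]) + value.to_bytes(n_bytes, "big")


for count in (1, 20, 127, 128, 1000, 1_000_000):
    fixed = struct.pack(">q", count)  # fixed-width 8-byte long for comparison
    print(count, len(encode_vlong(count)), len(fixed))
```

Under this scheme any count up to 127 takes one byte instead of eight, so a table dominated by rarely seen n-grams spends far less space on values.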