We're currently using Apache Kafka, which is working out great, though it does put a broker in the middle. That's actually good for us: the broker is designed to accumulate unconsumed data (and even let consumers rewind and replay it) while still providing good performance, rather than pushing that burden onto the producers or consumers.
It would be cool to see some stats on throughput and latency relative to the number of producers/consumers and the amount of data currently accumulated in the producers (since there is no middleman).
I can't afford a $2000 unit now or a $500 unit a year or so from now, but the possibility of having something like this available for $100-$200 3 years from now is super exciting.
I just bought my first touring bike and plan on really getting into it over the next couple of years. Having reliable internet access while on a long trip would be great.
I'm curious because right after I registered, I received an email with the subject "Scouts: Please confirm subscription" which goes to a listserv and has a reply-to of will -at- vinova.sg.
Yes. That seemed pretty deceptive to me too. Nowhere did I agree, or appear to agree, to sign up for a listserv or newsletter. I probably would have, but the surprise email in my inbox makes me not trust this group now.
We've been using Pig extensively for our MR queries and we're very happy with it. While I haven't used Cascading, I do like the pipelining approach of Pig more than the SQL-influenced approach of Hive. It lends itself better to our needs.
Testing is a good point. If you're running Pig through its APIs, it is definitely easier to test than running scripts from the command line. We've written test code that reads and runs Pig scripts through the API against fixed sample data (stored in HDFS for easy access), reads the results, and compares them to expected results (also stored in HDFS). Honestly, you don't need much input data to prove the correctness of a query.
Remember that Pig also supports placeholders in your scripts, so that you can set them at run time to define input/output paths, etc. This makes testing easier.
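A rough sketch of what that looks like (script name, paths, and parameter names here are hypothetical, just to show the `$param` substitution):

```pig
-- wordcount.pig: $input and $output are filled in at run time
lines  = LOAD '$input' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '$output';
```

A test run can then point the exact same script at fixture data, e.g. `pig -param input=/test/fixtures/in -param output=/test/out wordcount.pig`, and the comparison against expected output happens after the fact.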
Dependencies can also be stored in HDFS, which makes it simple to run your scripts without having to distribute jars around yourself.
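Concretely (jar path and UDF class name below are made up for illustration), REGISTER accepts an HDFS path, so the script picks up its dependencies with no manual copying:

```pig
-- jar lives in HDFS; every node fetches it from there
REGISTER 'hdfs:///libs/my-udfs.jar';
DEFINE Normalize com.example.pig.Normalize();  -- hypothetical UDF from that jar
```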